P-GRPO tries to keep personalized gradients intact instead of flattening feedback into one global average.
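
To make the contrast concrete, here is a minimal sketch, not P-GRPO itself. It compares a single global reward baseline with a separate baseline per preference group. The function names, the `groups` labels, the toy rewards, and the choice of z-score normalization are all assumptions made for illustration. The text above does not say how P-GRPO actually computes its training signal.

```python
from collections import defaultdict
from statistics import mean, pstdev


def global_advantages(rewards, eps=1e-8):
    """Normalize every reward against one baseline shared by all users."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_group_advantages(rewards, groups, eps=1e-8):
    """Normalize each reward against its own group's baseline.

    `groups` is a hypothetical list of preference-group labels,
    one per reward (an assumption, not taken from the source).
    """
    by_group = defaultdict(list)
    for r, g in zip(rewards, groups):
        by_group[g].append(r)
    # Per-group (mean, population std) used as the baseline.
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    return [
        (r - stats[g][0]) / (stats[g][1] + eps)
        for r, g in zip(rewards, groups)
    ]


# Toy data: group "b" systematically receives lower raw rewards.
rewards = [0.9, 0.8, 0.2, 0.1]
groups = ["a", "a", "b", "b"]

adv_global = global_advantages(rewards)
adv_grouped = per_group_advantages(rewards, groups)
```

With the global baseline, both of group `b`'s responses receive a negative advantage, even the one that group prefers. With per-group baselines, the better response in each group receives a positive advantage, so the minority group's preference signal is not averaged away.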