Spurious Rewards for RLVR review
Paper
I read this Notion site covering some wild findings about how certain models learn better from random or even incorrect rewards. Thanks for sending this my way, Harry! Dope find. Major takeaways:
- It was shocking to see that Qwen scored higher on math questions when its responses contained Python code (example here). Note—this isn't tool use; no code actually runs. The model seems to get the answer right by reasoning through the code as plain text.
- I don't have a perfect grasp of GRPO and its clipping term yet, but the authors claim the clipping bias is what allows even random rewards to improve the model's performance.
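For my own notes, here's a rough sketch of the clipping mechanism in question. This is a minimal, hedged reconstruction of a standard PPO-style clipped surrogate with group-normalized advantages (the usual GRPO setup), not code from the paper; the function names are mine.

```python
import numpy as np

def grpo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate as typically used in GRPO (sketch).

    ratio: pi_theta(o|q) / pi_old(o|q) for a sampled response
    advantage: group-normalized reward for that response

    The min() caps how much a single update can push probability
    toward (or away from) a response once the ratio drifts past
    1 +/- eps. The claimed effect: this asymmetric truncation can
    bias updates toward behaviors the model already favors, even
    when the reward signal itself is random.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

def group_advantages(rewards):
    """Group-relative advantages: normalize rewards within a sampled
    group of responses to the same prompt (mean 0, unit scale).
    Even random 0/1 rewards produce nonzero advantages here, so the
    policy still gets pushed somewhere."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

For example, with a positive advantage and a ratio of 1.5 (eps=0.2), the surrogate is capped at 1.2 × advantage instead of 1.5 × advantage—the update toward that response is truncated, while small ratio changes pass through untouched.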