Interesting article on using process reward learning rather than final outcome for #AI supervision. So instead of asking an AI a question and doing reward learning on the totality of the result, each step is evaluated. Good results for mathematics: https://openai.com/research/improving-mathematical-reasoning-with-process-supervision
Improving mathematical reasoning with process supervision
We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”).openai.com