AI Heap

Self-Generated Critiques Boost Reward Modeling for Language Models

arXiv:2411.16646 [arXiv, PDF]
Authors
  • Yue Yu
  • Zhengxing Chen
  • Aston Zhang
  • Liang Tan
  • Chenguang Zhu
  • Richard Yuanzhe Pang
  • Yundi Qian
  • Xuewei Wang
  • Suchin Gururangan
  • Chao Zhang
  • Melanie Kambadur
  • Dhruv Mahajan
  • Rui Hou
Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of the generated critiques in rectifying flawed reasoning steps, yielding 2.5%-3.2% gains in reasoning accuracy.
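
To make the two-stage idea concrete, here is a minimal sketch of what critique filtering and a joint reward-plus-critique training objective could look like. The abstract does not specify the filtering rule, loss form, or weighting, so everything below (the `filter_critiques` helper, the Bradley-Terry-style preference loss, and the `lam` coefficient) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch of a Critic-RM-style joint objective (assumptions:
# the exact filtering rule, loss form, and weighting are not given in the
# abstract; names below are hypothetical).
import torch
import torch.nn.functional as F


def filter_critiques(critiques, quality_scores, threshold=0.5):
    """Stage 1 (sketch): keep self-generated critiques whose quality score
    (e.g., agreement with the preference label) exceeds a threshold."""
    return [c for c, s in zip(critiques, quality_scores) if s >= threshold]


def joint_loss(chosen_reward, rejected_reward,
               critique_logits, critique_labels, lam=0.5):
    """Stage 2 (sketch): combine a Bradley-Terry preference loss on scalar
    rewards with a token-level cross-entropy loss on the critique text.
    `lam` is a hypothetical weighting coefficient."""
    # Preference loss: the chosen response should receive a higher reward.
    pref_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # Critique-generation loss: standard next-token prediction over critiques.
    gen_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,
    )
    return pref_loss + lam * gen_loss


if __name__ == "__main__":
    # Toy tensors standing in for model outputs on a batch of 4 preference pairs.
    chosen_reward = torch.randn(4)
    rejected_reward = torch.randn(4)
    vocab_size, seq_len = 32, 16
    critique_logits = torch.randn(4, seq_len, vocab_size)
    critique_labels = torch.randint(0, vocab_size, (4, seq_len))
    print(joint_loss(chosen_reward, rejected_reward,
                     critique_logits, critique_labels).item())
```

In this reading, the filtering step supplies the critique targets and the joint loss lets a single model produce both a natural-language critique and a scalar reward; the relative weight of the two terms is a free design choice left open by the abstract.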