Language Model Self-improvement by Reinforcement Learning Contemplation without External Supervision


Abstract

Language model self-improvement (LMSI) techniques have recently gained significant attention because they improve language models without requiring external supervision. A notable approach is reinforcement learning from AI feedback (RLAIF), which trains a reward model on AI preference data and then uses a reinforcement learning (RL) algorithm to train the language model. However, RLAIF rests on the heuristic assumption that the AI model can provide effective feedback, which in turn requires a sufficiently capable language model. In this paper, we present a novel LMSI method, Reinforcement Learning Contemplation (RLC). We show that it is simpler for language models to evaluate text than to generate it, even for small models with fewer than 1B parameters. Leveraging this gap between evaluation and generation, RLC has the language model evaluate its own generated answers and updates the model using RL to maximize the self-evaluation scores. We demonstrate the effectiveness of RLC on a wide range of challenging tasks, including reasoning, summarization, conditioned generation and emotion recognition. RLC increases answering accuracy on BigBench-hard reasoning tasks from 31.23% to 37.09% and raises BERTScore on CNN/Daily Mail summarization tasks. In addition, RLC can be applied to models of different sizes (80M to 3B parameters) and model structures (FLAN-T5, LLAMA-3.2 and QWEN-2.5), showcasing its broad applicability. We further verify that, when trained on a larger-scale dataset, RLC simultaneously improves the language model’s evaluation and generation abilities on unseen tasks, enabling a general capability improvement without external supervision.
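The loop described in the abstract (generate an answer, score it with the model's own evaluation, update the model with RL to raise that score) can be illustrated with a toy sketch. This is not the paper's implementation: the "policy" here is only a softmax over a fixed list of candidate answers, the update is plain REINFORCE with a running baseline (a swapped-in, simple RL rule), and `self_evaluate` is a hypothetical stand-in for prompting the language model to judge its own output.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_evaluate(answer):
    """Hypothetical self-evaluation score in [0, 1].

    Assumption: in a real setup the language model itself would be
    prompted to judge the answer. Here a fixed rule stands in for it.
    """
    return 1.0 if answer == "4" else 0.0


def rlc_toy(candidates, steps=500, lr=0.5, seed=0):
    """Toy generate -> self-evaluate -> RL-update loop.

    The policy is a categorical distribution over `candidates`,
    parameterized by one logit per candidate.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # running average of rewards, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # "Generate": sample one answer from the current policy.
        idx = rng.choices(range(len(candidates)), weights=probs)[0]
        # "Contemplate": score the sampled answer without external labels.
        reward = self_evaluate(candidates[idx])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # REINFORCE: d log pi(idx) / d logit_j = 1[j == idx] - probs[j]
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    answers = ["3", "4", "5"]  # candidate answers to "What is 2 + 2?"
    print(dict(zip(answers, rlc_toy(answers))))
```

In this sketch the probability mass shifts toward the answer that the stand-in evaluator rates highest. The quality of the result depends entirely on the evaluator, which mirrors the abstract's premise that evaluation is easier than generation.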
