Language Model Self-improvement by Reinforcement Learning Contemplation without External Supervision

Jing-Cheng Pang; Kaiyuan Li; Pengyuan Wang; Xiong-Hui Chen; Jiacheng Xu; Zongzhang Zhang; Yang Yu

doi:10.1613/jair.1.20097

Published: Jun 30, 2026

Keywords: reinforcement learning, natural language processing

Jing-Cheng Pang

Nanjing University

https://orcid.org/0000-0002-6086-6979

Kaiyuan Li

https://orcid.org/0009-0009-1331-9604

Pengyuan Wang

https://orcid.org/0009-0006-1545-9457

Xiong-Hui Chen

https://orcid.org/0000-0003-4911-3148

Jiacheng Xu

https://orcid.org/0009-0008-4315-6976

Zongzhang Zhang

https://orcid.org/0000-0002-9238-4747

Yang Yu

https://orcid.org/0000-0002-1732-9545

Abstract

Language model self-improvement (LMSI) techniques have recently gained significant attention because they improve language models without requiring external supervision. A notable approach is reinforcement learning from AI feedback (RLAIF), which trains a reward model on AI preference data and employs a reinforcement learning (RL) algorithm to train the language model. However, RLAIF relies on the heuristic assumption that the AI model can provide effective feedback, which in turn requires the language model to be sufficiently capable. In this paper, we present a novel LMSI method, Reinforcement Learning Contemplation (RLC). We show that it is simpler for language models to evaluate text than to generate it, even for small models with fewer than 1B parameters. Leveraging this gap between evaluation and generation, RLC evaluates the generated answers and updates the language model using RL to maximize the self-evaluation scores. We demonstrate the effectiveness of RLC on a wide range of challenging tasks, including reasoning, summarization, conditioned generation, and emotion recognition, obtaining an increase in answering accuracy (31.23% to 37.09%) on BigBench-hard reasoning tasks and a rise in BERTScore on CNN/Daily Mail summarization tasks. In addition, RLC can be applied to models of different sizes (80M to 3B) and model structures (FLAN-T5, LLAMA-3.2 and QWEN-2.5), showcasing its broad applicability. We further verify that, when trained on a larger-scale dataset, RLC simultaneously improves the language model's evaluation and generation ability on unseen tasks, enabling a general capability improvement without external supervision.
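
The loop described in the abstract (generate an answer, score it by self-evaluation, update the generator with RL to raise that score) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the "policy" is a softmax over a fixed set of candidate answers rather than a language model, the `scores` list is a hypothetical stand-in for the model's self-evaluation of each answer, and plain REINFORCE with a running baseline is used because the abstract does not name the RL algorithm.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_rlc(scores, steps=2000, lr=0.1, seed=0):
    """Toy self-improvement loop (hypothetical, not the paper's code).

    scores[i] stands in for the self-evaluation score the model would
    assign to candidate answer i. The policy is a softmax over candidates
    and is updated with REINFORCE to maximize the expected score.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(scores)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        # "Generate": sample one candidate answer from the current policy.
        action = rng.choices(range(len(scores)), weights=probs)[0]
        # "Evaluate": look up the stand-in self-evaluation score.
        reward = scores[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # "Update": gradient of log pi(action) w.r.t. logit i is
        # (1 if i == action else 0) - probs[i].
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_toy_rlc([0.1, 0.9, 0.3])
    print([round(p, 3) for p in final_probs])
```

In this sketch the policy shifts probability toward the candidate with the highest stand-in score. In a real setting the policy would be the language model itself and the reward would come from prompting the same model to judge its own output; those details are not specified in the abstract and are left out here.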

Issue

Vol. 86 (2026)

Section

Articles
