Label-Aware Pseudo-Training Sample Generation for Text Classification

Arash Yousefi Jordehi
Seyed Abolghasem Mirroshandel
Owen Rambow

Abstract

Deep learning models excel at various Natural Language Processing (NLP) tasks, but their performance (excluding approaches such as zero-shot and few-shot learning) depends on ample training data, which poses challenges in fields with limited datasets. To address the scarcity of training data, several approaches can be taken, such as multi-task learning and data augmentation. Aiming to leverage Large Language Models (LLMs), we propose a data augmentation algorithm. It subtly alters sentences by inserting random words and uses LLMs to find the most fitting replacements within their embedding space. Taking inspiration from Prompt Tuning, we shift the focus from optimizing the input prompt to updating the inserted tokens’ embedding vectors by maximizing the conditional generation probability. This allows for large-scale sample generation while implicitly benefiting from the knowledge encoded in LLMs. The results of our extensive experiments on various benchmark text classification tasks show a substantial improvement over the non-augmented baselines.
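
The abstract outlines the core loop: insert a few placeholder tokens into a training sentence, treat their embedding vectors as the only trainable parameters of a frozen LLM, optimize them so that the model assigns high probability to the surrounding original tokens, and finally snap each optimized vector to its nearest vocabulary embedding to obtain an augmented sentence. Below is a minimal, illustrative sketch of that idea, not the authors' implementation: the model choice (`gpt2`), the `augment` function, the hyperparameters, and the simplified objective (maximizing the likelihood of the original tokens rather than a label-aware objective) are all assumptions made for demonstration.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical sketch (not the authors' code): insert trainable "soft" tokens
# into a sentence, tune only their embeddings against a frozen causal LM, and
# snap them to the nearest vocabulary embeddings to form an augmented sentence.

model_name = "gpt2"                       # assumption: any causal LM could be used here
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)               # freeze the LM; only the inserted vectors are trained

embed = model.get_input_embeddings()      # vocabulary embedding matrix, shape (V, d)


def augment(sentence, n_insert=2, position=None, steps=200, lr=0.1):
    """Return one pseudo-training sample derived from `sentence` (illustrative)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    pos = position if position is not None else torch.randint(1, len(ids), (1,)).item()

    # Initialize the inserted vectors from random vocabulary embeddings.
    rand_ids = torch.randint(0, embed.num_embeddings, (n_insert,))
    soft = embed(rand_ids).clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([soft], lr=lr)

    left = embed(ids[:pos]).detach()
    right = embed(ids[pos:]).detach()
    # Ignore the inserted slots in the loss (-100); keep the original tokens as targets.
    labels = torch.cat([ids[:pos],
                        torch.full((n_insert,), -100, dtype=torch.long),
                        ids[pos:]])

    for _ in range(steps):
        inputs_embeds = torch.cat([left, soft, right]).unsqueeze(0)
        out = model(inputs_embeds=inputs_embeds, labels=labels.unsqueeze(0))
        out.loss.backward()               # minimizing NLL = maximizing P(original tokens | soft tokens)
        optimizer.step()
        optimizer.zero_grad()

    # Snap each tuned vector to its nearest vocabulary embedding (cosine similarity).
    sims = F.normalize(soft.detach(), dim=-1) @ F.normalize(embed.weight, dim=-1).T
    new_ids = sims.argmax(dim=-1)
    return tokenizer.decode(torch.cat([ids[:pos], new_ids, ids[pos:]]))


print(augment("the movie was surprisingly good"))
```

The nearest-neighbour projection at the end is what turns the continuous soft vectors into concrete words that can be inserted into the pseudo-training sample; the label-aware conditioning described in the paper would replace the simple likelihood objective used in this sketch.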
