Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Clovis Varangot-Reille; Christophe Bouvard; Mathieu Ciancone; Antoine Gourru; Marion Schaeffer; François Jacquenet

doi:10.1613/jair.1.19801

Published: Jun 30, 2026

Keywords:

intelligent query processing, natural language

Clovis Varangot-Reille

Wikit, Lyon, France; Laboratoire Hubert Curien, UMR CNRS 5516, Université Jean Monnet, Saint-Etienne, France

https://orcid.org/0000-0002-4188-8013

Christophe Bouvard

Wikit, Lyon, France

https://orcid.org/0009-0006-7962-0270

Mathieu Ciancone

Wikit, Lyon, France

https://orcid.org/0009-0003-6607-9583

Antoine Gourru

Laboratoire Hubert Curien, UMR CNRS 5516, Université Jean Monnet, Saint-Etienne, France

https://orcid.org/0000-0003-3571-2430

Marion Schaeffer

INSA Rouen Normandie, Rouen, France

https://orcid.org/0000-0001-9854-7482

François Jacquenet

Laboratoire Hubert Curien, UMR CNRS 5516, Université Jean Monnet, Saint-Etienne, France

https://orcid.org/0000-0002-0653-0710

Abstract

Background: Large Language Model (LLM)-based systems, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. However, these systems may be inefficient as different queries may require different levels of reasoning, domain knowledge or pre-processing. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption.

Objectives: This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance.

Methods: We structure the survey around three questions: which objectives to optimise, such as cost minimisation and performance maximisation; when routing occurs within the LLM workflow, that is, before or after generation; and how routing is implemented, including similarity-based, supervised, reinforcement learning-based, and generative methods. We also examine practical considerations, such as industrial applications, and current limitations, including the need to standardise routing experiments, account for non-financial costs, and design adaptive strategies.

Results: There is a wide range of routing strategies, from lightweight, similarity-based and supervised methods, to more complex approaches involving LLM fine-tuning and reinforcement learning. Most current strategies adopt a pre-generation approach, which is generally more resource-efficient. This survey demonstrates that some low-resource solutions can provide generalisation capabilities.

Conclusions: Routing offers a practical way to improve the efficiency of LLM-based systems. By formalising routing as a performance–cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.
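
As an illustration of the performance–cost framing mentioned in the conclusions, the following minimal Python sketch selects, for each query, the candidate model that maximises estimated quality minus a weighted cost. It is not the survey's formalisation: the `Candidate` class, the `route` function, the `trade_off` weight, the model names, and all numeric values are hypothetical and chosen only for illustration. In practice the quality estimates would come from a router (for example a similarity-based or supervised predictor) applied before generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A model the router may choose (all fields are illustrative assumptions)."""
    name: str
    predicted_quality: float  # estimated answer quality in [0, 1]
    cost: float               # cost per query, in arbitrary units


def route(candidates, trade_off):
    """Return the candidate maximising predicted_quality - trade_off * cost."""
    if not candidates:
        raise ValueError("at least one candidate model is required")
    return max(candidates, key=lambda c: c.predicted_quality - trade_off * c.cost)


# Hypothetical estimates: the small model is nearly as good on the easy query,
# but much worse on the hard one.
easy_query = [
    Candidate("small-model", predicted_quality=0.80, cost=1.0),
    Candidate("large-model", predicted_quality=0.85, cost=10.0),
]
hard_query = [
    Candidate("small-model", predicted_quality=0.30, cost=1.0),
    Candidate("large-model", predicted_quality=0.90, cost=10.0),
]

print(route(easy_query, trade_off=0.02).name)
print(route(hard_query, trade_off=0.02).name)
```

With these assumed numbers, the easy query goes to the small model and the hard query to the large one. Setting `trade_off` to 0 ignores cost entirely and always selects the model with the highest estimated quality.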

Issue

Vol. 86 (2026)

Section

Articles
