Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Clovis Varangot-Reille
Christophe Bouvard
Mathieu Ciancone
Antoine Gourru
Marion Schaeffer
François Jacquenet

Abstract

Background: Large Language Model (LLM)-based systems, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. Such systems can be inefficient, because queries differ in the level of reasoning, domain knowledge or pre-processing they require. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to direct queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption.


Objectives: This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance.


Methods: We structure the survey around three questions: which objectives to optimise, such as cost minimisation and performance maximisation; when routing takes place within the LLM workflow, namely before or after generation; and how routing is implemented, including similarity-based, supervised, reinforcement learning-based, and generative methods. We also examine practical considerations, including industrial applications, and current limitations, such as standardising routing experiments, accounting for non-financial costs, and designing adaptive strategies.


Results: There is a wide range of routing strategies, from lightweight, similarity-based and supervised methods, to more complex approaches involving LLM fine-tuning and reinforcement learning. Most current strategies adopt a pre-generation approach, which is generally more resource-efficient. This survey demonstrates that some low-resource solutions can provide generalisation capabilities.


Conclusions: Routing offers a practical way to improve the efficiency of LLM-based systems. By formalising routing as a performance–cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.
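
As an illustration of the performance–cost trade-off described above, the following is a minimal sketch of a pre-generation router. It is not the formalisation used in the survey: the `Candidate` class, the `route` function, the `cost_weight` parameter, the model names and every number are assumptions made for this example. It assumes that a quality estimate per candidate is already available for the incoming query, for instance from a similarity-based or supervised predictor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical component the router can choose from."""
    name: str
    predicted_quality: float  # assumed estimate in [0, 1] of answer quality for this query
    cost: float               # assumed cost per query, in arbitrary units


def route(candidates: list[Candidate], cost_weight: float) -> Candidate:
    """Pick the candidate with the best quality-minus-weighted-cost score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: c.predicted_quality - cost_weight * c.cost)


# Illustrative numbers only, not measurements.
simple_query = [
    Candidate("small-model", predicted_quality=0.85, cost=1.0),
    Candidate("large-model", predicted_quality=0.90, cost=10.0),
]
hard_query = [
    Candidate("small-model", predicted_quality=0.30, cost=1.0),
    Candidate("large-model", predicted_quality=0.90, cost=10.0),
]

chosen_simple = route(simple_query, cost_weight=0.02)
chosen_hard = route(hard_query, cost_weight=0.02)
```

With these illustrative values, the simple query goes to the small model, since the large model's small quality gain does not offset its cost, while the hard query goes to the large model. Setting `cost_weight` to 0 reduces the rule to pure performance maximisation.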
