Decision-Focused Learning: Foundations, State of the Art, Benchmark and Future Opportunities

Decision-focused learning (DFL) is an emerging paradigm that integrates machine learning (ML) and constrained optimization to enhance decision quality by training ML models in an end-to-end system. This approach shows significant potential to revolutionize combinatorial decision-making in real-world applications that operate under uncertainty, where estimating unknown parameters within decision models is a major challenge. This paper presents a comprehensive review of DFL, providing an in-depth analysis of both gradient-based and gradient-free techniques used to combine ML and constrained optimization. It evaluates the strengths and limitations of these techniques and includes an extensive empirical evaluation of eleven methods across seven problems. The survey also offers insights into recent advancements and future research directions in DFL. Code and benchmark: https://github.com/PredOpt/predopt-benchmarks


Introduction
Real-world applications frequently confront the task of decision-making under uncertainty, such as planning the shortest route in a city, determining optimal power generation schedules, or managing investment portfolios (Sahinidis, 2004; Liu and Liu, 2009; Kim et al., 2005; Hu et al., 2016; Delage and Ye, 2010; Garlappi et al., 2006). In such scenarios, estimating unknown parameters often poses a significant challenge.

Figure 1: Decision-making under uncertainty involves both predictive and prescriptive analytics. In the predictive stage, the uncertain parameters are predicted from the feature variables using an ML model. In the prescriptive stage, a decision is prescribed by solving a CO problem using the predicted parameters.
Machine Learning (ML) and Constrained Optimization (CO) serve as two key tools for these complex problems. ML models estimate uncertain quantities, while CO models optimize objectives within constrained spaces. This sequential process, commonly referred to as predictive and prescriptive modeling, as illustrated in Figure 1, is prevalent in fields like operations research and business analytics (den Hertog and Postek, 2016). For instance, in portfolio management, the prediction stage forecasts asset returns, while the prescriptive phase optimizes returns based on these predictions.
A commonly adopted approach involves handling these two stages, prediction and optimization, separately and independently. This "two-stage" process first trains an ML model to learn a mapping between observed features and the relevant parameters of a CO problem. Subsequently, and independently, a specialized optimization algorithm is used to solve the decision problem specified by the predicted parameters. The underlying assumption in this methodology is that superior predictions lead to precise models and, consequently, high-quality decisions. Indeed, if the parameter predictions were perfectly accurate, they would enable the correct specification of CO models, which could be solved to yield fully optimal decisions. However, ML models often fall short of perfect accuracy, and the resulting prediction errors propagate into suboptimal decisions. Thus, in many applications, predictive and prescriptive modeling are not isolated but deeply interconnected, and hence should ideally be modeled jointly. This is the goal of the decision-focused learning (DFL) paradigm, which directly trains the ML model to make predictions that lead to good decisions. In other words, DFL integrates prediction and optimization in an end-to-end system trained to optimize a criterion (i.e., a loss function) that is based on the resulting decisions.
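The two-stage pipeline described above can be sketched in a few lines. The following is a minimal illustration under assumed toy data: a linear ground-truth relation between features and parameters, and a unit-weight knapsack standing in for the CO problem; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def solve_knapsack(values, capacity):
    """Unit-weight knapsack: select the `capacity` items with the largest values."""
    x = np.zeros_like(values, dtype=float)
    x[np.argsort(values)[-capacity:]] = 1.0
    return x

# Stage 1 (prediction): fit an ML model mapping features z to parameters c.
n_items, n_feats, capacity = 5, 3, 2
W_true = rng.normal(size=(n_feats, n_items))
Z = rng.normal(size=(100, n_feats))                     # observed features z_i
C = Z @ W_true + 0.1 * rng.normal(size=(100, n_items))  # ground-truth parameters c_i
W_hat, *_ = np.linalg.lstsq(Z, C, rcond=None)           # least-squares regression

# Stage 2 (optimization): solve the CO problem with the predicted parameters.
z_new = rng.normal(size=n_feats)
c_hat = z_new @ W_hat                   # predicted item values
x_star = solve_knapsack(c_hat, capacity)  # prescriptive decision
```

Note that the regression in Stage 1 is trained with no knowledge of the knapsack in Stage 2; that separation is exactly what DFL removes.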
Since many ML models, including neural networks (NNs), are trained via gradient-based optimization, the gradients of the loss must be backpropagated through each constituent operation of the model. In DFL, the loss function depends on the solution of an optimization model, so the optimization solver is embedded as a component of the ML model. In this integration of prediction and optimization, a key challenge is differentiating through the optimization problem. An additional challenge arises from decision models operating on discrete variables, which produce discontinuous mappings and hinder gradient-based learning. Hence, examining smooth surrogate models for these discrete mappings, along with their differentiation, becomes crucial. These two challenges are a central focus of this survey.
This manuscript presents a comprehensive survey of decision-focused learning and makes several contributions. First, to navigate the complex methodologies developed in recent years, the paper proposes the first categorization of DFL methods into four distinct classes: (1) analytical differentiation of optimization mappings, (2) analytical smoothing of optimization mappings, (3) smoothing by random perturbations, and (4) differentiation of surrogate loss functions. This categorization, as illustrated in Figure 4, serves as a framework for comprehending and organizing various DFL methodologies. Next, the paper compiles a selection of problem-specific DFL models, making them publicly available to facilitate broader access and usage. An integral part of this paper involves benchmarking the performance of various available methodologies on seven distinct problems. This provides an opportunity for comparative understanding and assists in identifying the relative strengths and weaknesses of each approach. The code and data used in the benchmarking are accessible through https://github.com/PredOpt/predopt-benchmarks. Finally, this survey addresses the critical need to look forward, by discussing the outstanding challenges and offering an outlook on potential future directions in the field of DFL.
Paper organization.
Following this introduction, the paper is structured as follows. Preliminary concepts are discussed in Section 2, which introduces the problem setting and explicates the challenges in implementing DFL. The subsequent Section 3 offers a comprehensive review of recently proposed methodologies for handling these challenges, neatly organized into broad classes of related techniques. Section 5 brings forth seven benchmark DFL tasks from public datasets, with a comparative evaluation of eight DFL methodologies presented in the following section. The manuscript concludes by providing a discourse on the current challenges and possible future directions in DFL research.

Preliminaries
This section presents an overview of the problem setting, along with preliminary concepts and essential terminology. Then, the central modeling challenges are discussed, setting the stage for a review of current methodologies in the design and implementation of DFL solutions. Throughout the manuscript, vectors are denoted by boldface lowercase letters, such as x, while scalar components of the vector x are indicated with a subscript i, denoting the i-th item within x as x_i. Similarly, 1 and 0 denote the all-ones and all-zeros vectors, respectively.

Problem Setting
In operations research and business analytics, decisions are often quantitatively modeled using CO problems. These problems model various decision-making scenarios, but may not be efficiently solvable and often demand specialized solution algorithms that are tailored to their specific form. In many real-world applications, some parameters of the CO problems are uncertain and must be inferred from contextual data (hereafter referred to as features).
The settings considered in this manuscript involve estimating those parameters through predictive inferences made by ML models, and subsequently, the final decisions are modeled as the solution to the CO problems based on those inferences.
In this setting, the decision-making processes can be described by parametric CO problems, defined as:

x*(c) = argmin_x f(x, c)    (1a)
subject to g(x, c) ≤ 0    (1b)
           h(x, c) = 0    (1c)

The goal of the optimization problem above is to find a solution x*(c) ∈ R^n, a minimizer of the objective function f, satisfying a set g of inequality constraints and a set h of equality constraints. The parametric problem formulation defines x*(c) as a function of the parameters c ∈ R^k. In the present setting, this function can naturally be interpreted as part of an overall composite function that encompasses ML inference and decision-making, and returns optimal decisions given feature variables as input. CO problems can be categorized in terms of the forms taken by the functions defining their objective (1a) and constraints (1b-1c). These forms also determine important properties of the optimization mapping c → x*(c), viewed as a function from problem parameters to optimal solutions, such as its continuity, differentiability, and injectivity.
In this manuscript, it is assumed that the constraints are fully known prior to solving, i.e., h(x, c) = h(x) and g(x, c) = g(x), which restricts the dependence on c to the objective function only. This is the setting considered by almost all existing works surveyed. While it is also possible to consider uncertainty in the constraints, this leads to the possibility of predicting parameters that yield solutions that are infeasible with respect to the ground-truth parameters. The learning problem has not yet been well-defined in this setting, unless a recourse action to correct infeasible solutions is used (Hu et al., 2023b). For this reason, in the following sections, only f is assumed to depend on c, so that g(x) ≤ 0 and h(x) = 0 are satisfied for all outputs of the decision model. For notational convenience, the feasible region of the CO problem in (1) will be denoted by F (i.e., x ∈ F if and only if g(x) ≤ 0 and h(x) = 0).
If the true parameters c are known exactly, the corresponding 'true' optimal decisions may be computed by solving (1). In such scenarios, x*(c) will be referred to as the full-information optimal decisions (Bertsimas and Kallus, 2020). This paper, instead, considers problems where the parameters c are unknown but can be estimated as a function of empirically observed features z. The problem of estimating c falls under the category of supervised machine learning problems. In this setting, a set of past observation pairs {(z_i, c_i)}_{i=1}^N is available and used to train an ML model m_ω (with trainable parameters ω), so that parameter predictions take the form ĉ = m_ω(z). Then, a decision x*(ĉ) can be made based on the predicted parameters; x*(ĉ) is referred to as a prescriptive decision. The overall learning goal is to optimize the set of prescriptive decisions made over a distribution of feature variables z ∼ Z, with respect to some evaluation criterion on those decisions. Thus, while the machine learning model m_ω is trained to predict ĉ, its performance is evaluated on the basis of the corresponding optimal solutions x*(ĉ). This paper uses the terminology Predict-Then-Optimize problem to refer to the problem of predicting ĉ so as to improve the evaluation of x*(ĉ).

Learning Paradigms
The defining challenge of the Predict-Then-Optimize problem setting is the gap in modeling between the prediction and the optimization components: while m_ω is trained to predict ĉ, it is evaluated based on the subsequently computed x*(ĉ). Using standard ML approaches, learning of the predictions ĉ = m_ω(z) can only be supervised by the ground-truth c under standard loss functions L, such as mean squared error or cross-entropy. In principle, it is preferable to train m_ω to make predictions ĉ that optimize the evaluation criterion on x*(ĉ) directly. This distinction motivates the definition of two alternative learning paradigms for Predict-Then-Optimize problems.
Prediction-focused learning (PFL). A straightforward approach to this supervised ML problem is to train the model to generate accurate parameter predictions ĉ with respect to the ground-truth values c. This paper introduces the term prediction-focused learning to refer to this approach (also called two-stage learning (Wilder et al., 2019a)), because the model is trained with a focus on the accuracy of the parameter predictions preceding the decision model. Here, the training is agnostic of the downstream optimization problem. At the time of making the decision, the pre-trained model's predictions ĉ are passed to optimization routines which solve (1) to return x*(ĉ). Typical ML losses, such as the mean squared error (MSE) or binary cross entropy (BCE), are used to train the prediction model in this case.
Such loss functions, which measure the prediction error of ĉ with respect to c, are referred to as prediction losses. For example, the MSE prediction loss is

MSE(ĉ, c) = ||ĉ − c||².    (2)

Algorithm 1 illustrates prediction-focused learning using the MSE loss.
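Concretely, prediction-focused training with the MSE loss amounts to ordinary regression. Below is a minimal numpy sketch under assumed synthetic data (a noiseless linear ground truth) and a linear model m_ω; the data sizes and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_feats, n_params, lr = 4, 3, 0.1
W_true = rng.normal(size=(n_params, n_feats))
Z = rng.normal(size=(200, n_feats))      # features z_i
C = Z @ W_true.T                         # ground-truth parameters c_i (noiseless)

W = np.zeros((n_params, n_feats))        # trainable weights omega of the linear model
for epoch in range(200):
    C_hat = Z @ W.T                      # c_hat = m_omega(z), batched over the dataset
    grad = 2 * (C_hat - C).T @ Z / len(Z)  # gradient of the MSE loss (up to a constant)
    W -= lr * grad                       # gradient-descent update on omega

mse = float(np.mean((Z @ W.T - C) ** 2))
```

Nothing in this loop touches the downstream CO problem; that is exactly the property DFL changes.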
Decision-focused learning (DFL). By contrast, in decision-focused learning, the ML model is trained to optimize the evaluation criteria which measure the quality of the resulting decisions. As the decisions are realized only after the optimization stage, this requires integrating the prediction and optimization components into a composite model which produces full decisions. From this point of view, generating the predicted parameters ĉ is an intermediary step of the integrated approach, and the accuracy of ĉ is not the primary focus in training. The focus, rather, is on the error incurred after optimization. A measure of error with respect to the integrated model's prescriptive decisions, when used as a loss function for training, is henceforth referred to as a task loss. The essential difference from the aforementioned prediction loss is that it measures the error in x*(ĉ), rather than in ĉ. The objective value achieved by using the predicted x*(ĉ) is generally suboptimal with respect to the true objective parameters c. Often, the end goal is to generate predictions ĉ with an optimal solution x*(ĉ) whose objective value in practice (i.e., f(x*(ĉ), c)) comes close to the full-information optimal value f(x*(c), c). In such cases, a salient notion of task loss is the regret, defined as the difference between the full-information optimal objective value and the objective value realized by the prescriptive decision. Equivalently, it is the magnitude of suboptimality of the decision x*(ĉ) with respect to the optimal solution x*(c) under ground-truth parameters c:

Regret(x*(ĉ), c) = f(x*(ĉ), c) − f(x*(c), c).    (3)

Note that minimizing the regret is equivalent to minimizing the value of f(x*(ĉ), c), since the term f(x*(c), c) is constant with respect to the prediction model. While regret may be considered the quintessential example of a task loss, other task losses can arise in practice.
For example, when the ground-truth target data are observed in terms of decision values x*, rather than parameter values c, they may be targeted using typical training loss functions such as MSE(x*(ĉ), x*).
Relationship between prediction and task losses. As previously mentioned, in prediction-focused learning for Predict-Then-Optimize tasks, an ML model is trained without considering the downstream CO problem; still, the ML model is evaluated at test time on the basis of its resulting CO problem solutions. This rests on an underlying assumption that generating accurate predictions with respect to a standard regression loss will result in good prescriptive decisions. Note that zero prediction loss always implies zero task loss, since ĉ = c implies x*(ĉ) = x*(c). However, in practice, it is impossible to learn a model that makes no prediction error on any sample. The model error can only be minimized in one metric, and the minimization of the prediction error and of the resulting decision error do not in general coincide (Wilder et al., 2019a). Furthermore, the prediction error and the task loss are, in general, not continuously related. These principles are illustrated by the following example.

Example. The shortcomings of training with respect to prediction errors can be illustrated with a relatively simple CO problem: a knapsack problem (Pisinger and Toth, 1998). The objective of the knapsack problem is to select a subset of maximal value from an overall set of items, each having its own value and unit weight, subject to a capacity constraint. The capacity constraint imposes that the sum of the weights of the selected items cannot be higher than the capacity C. This knapsack problem with unit weights can be formulated as follows:

x*(c) = argmax_{x ∈ {0,1}^n} c^T x  subject to  Σ_{i=1}^n x_i ≤ C.    (4)

In a Predict-Then-Optimize variant of this knapsack problem, the item weights and knapsack capacity are known, but the item values are unknown and must be predicted using observed features. The ground-truth item values c imply the ground-truth solution x*(c).
Overestimating the values of the items that are chosen in x*(c) (or underestimating the values of the items that are not chosen) increases the prediction error. Note that these kinds of prediction errors, even if they are high, do not affect the solution, and thus do not affect the task loss either. On the other hand, even low prediction errors for some item values may change the solution, affecting the task loss. That is why, after a certain point, reducing prediction errors does not decrease the task loss, and sometimes may even increase it. DFL aims to address this shortcoming of PFL: by minimizing the task loss directly, prediction errors are implicitly traded off on the basis of how they affect the resulting decision errors.

Figure 2: An illustrative numerical example with a knapsack problem with two items to exemplify the discrepancy between prediction error and regret. The figure illustrates that two points can have the same prediction error but different regret. Furthermore, it demonstrates that overestimating the values of the selected items, or underestimating the values of the items that are left out, does not change the solution, and thus does not increase the regret, even though the prediction error does increase.
The discrepancy between the prediction loss and the task loss is exemplified in Figure 2 for a very simple knapsack problem with only two items. For this illustration, assume that both items have unit weights and the capacity of the knapsack is one, i.e., only one of the two items can be selected. The true values of the first and second items are 2.5 and 3 respectively, so the point (2.5, 3) represents the true item values. In this case, the true solution is (0, 1), which corresponds to selecting only the second item. Any prediction in the blue-shaded region of Figure 2 leads to this solution. For instance, the point (1.5, 3) corresponds to predicting 1.5 and 3 as the values of the two items, and this results in selecting the second item. On the other hand, the point (2.5, 2) triggers the wrong solution (1, 0), although the squared errors of the two predictions are identical. Also note that overestimating the value of the second item does not change the solution. For instance, the point (1.5, 4) overestimates the value of the second item as 4 while keeping the value of the first item at 1.5; it is positioned directly above (1.5, 3) and still lies in the blue-shaded region. Similarly, the point (0.5, 3), which results from underestimating the value of the first item, is in the blue-shaded region too. Although these two points have higher squared errors than the point (2.5, 2), they trigger the right solution, resulting in zero regret.
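The numbers in the Figure 2 example can be checked directly. The snippet below is an illustrative sketch (the unit-weight knapsack is solved by simply picking the top-valued items); it reproduces the claims that identical prediction errors can yield different regrets, and that a larger prediction error can still yield zero regret.

```python
import numpy as np

def solve_knapsack(values, capacity=1):
    # Unit-weight knapsack: pick the `capacity` items with the largest values.
    x = np.zeros_like(values, dtype=float)
    x[np.argsort(values)[-capacity:]] = 1.0
    return x

c = np.array([2.5, 3.0])       # true item values
x_true = solve_knapsack(c)     # true solution (0, 1): select the second item

def mse(c_hat):
    return float(np.mean((c_hat - c) ** 2))

def regret(c_hat):
    # Maximization problem: regret = true optimum minus value achieved.
    return float(c @ x_true - c @ solve_knapsack(c_hat))

# (1.5, 3): prediction error, but the same solution, so zero regret.
print(mse(np.array([1.5, 3.0])), regret(np.array([1.5, 3.0])))  # 0.5 0.0
# (2.5, 2): identical squared error, but the wrong item, so positive regret.
print(mse(np.array([2.5, 2.0])), regret(np.array([2.5, 2.0])))  # 0.5 0.5
# (1.5, 4): larger error from overestimating item 2, still zero regret.
print(mse(np.array([1.5, 4.0])), regret(np.array([1.5, 4.0])))  # 1.0 0.0
```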
Empirical risk minimization and bilevel form of DFL. The minimization of either the prediction loss in PFL or the task loss in DFL can be expressed as an empirical risk minimization (ERM) (Vapnik, 1999) problem over a training dataset D = {(z_i, c_i)}_{i=1}^N containing feature variables and their corresponding parameters. For concreteness, the respective ERM problems below assume the use of the MSE and regret loss functions, but the principles described here hold for a wide range of alternative loss functions. PFL, by minimizing the prediction error with respect to the ground-truth parameters directly, takes the form of a standard regression problem:

min_ω (1/N) Σ_{i=1}^N || m_ω(z_i) − c_i ||²,    (5)

which is an instance of unconstrained optimization. In the case of DFL, it is natural to view the ERM as a bilevel optimization problem:

min_ω (1/N) Σ_{i=1}^N Regret(x*(m_ω(z_i)), c_i)    (6a)
subject to x*(ĉ) = argmin_{x ∈ F} f(x, ĉ).    (6b)

The outer-level problem (6a) minimizes the task loss on the training set while the inner-level problem (6b) computes the mapping ĉ → x*(ĉ). Solving (6) is computationally more challenging than solving (5) in the prediction-focused paradigm. In both cases, optimization by stochastic gradient descent (SGD) is the preferred solution method for training neural networks. Algorithms 1 and 2 compare the gradient descent training schemes for each of these problems. Algorithm 1 is a standard application of gradient descent, in which the derivatives of Line 6 are generally well-defined and can be computed straightforwardly (typically by automatic differentiation). Line 7 of Algorithm 2 shows that direct differentiation of the mapping ĉ → x*(ĉ) can be used to form the overall task loss gradient dL/dω, by providing the required chain-rule term dx*(ĉ)/dĉ. However, this differentiation is nontrivial, as the mapping itself lacks a closed-form representation. Further, many interesting and practical optimization problems are inherently nondifferentiable, and even discontinuous, as functions of their parameters, precluding the direct application of Algorithm 2 to optimize (6) by gradient descent.
The following subsections review the main challenges of implementing Algorithm 2.
Algorithm 1 Gradient descent in prediction-focused learning
Input: training data D = {(z_i, c_i)}_{i=1}^N; hyperparameters: learning rate α
1: Initialize ω
2: for each epoch do
3:   for each instance (z, c) do
4:     ĉ = m_ω(z)
5:     L = MSE(ĉ, c)
6:     ω ← ω − α (dL/dĉ)(dĉ/dω)
7:   end for
8: end for

Algorithm 2 Gradient descent in decision-focused learning with regret as task loss
Input: training data D = {(z_i, c_i)}_{i=1}^N; hyperparameters: learning rate α
1: Initialize ω
2: for each epoch do
3:   for each instance (z, c) do
4:     ĉ = m_ω(z)
5:     x*(ĉ) = argmin_{x ∈ F} f(x, ĉ)
6:     L = f(x*(ĉ), c) − f(x*(c), c)
7:     ω ← ω − α (dL/dx*(ĉ))(dx*(ĉ)/dĉ)(dĉ/dω)
8:   end for
9: end for

Challenges to Implement DFL

Differentiation of CO mappings. To minimize a task loss by gradient descent training, its partial derivatives with respect to the prediction model parameters ω must be computed to carry out the parameter update at Line 7 of Algorithm 2. Since the task loss L is a function of x*(ĉ), the gradient of L with respect to ω can be expressed in the following terms by using the chain rule of differentiation:

dL/dω = (dL/dx*(ĉ)) (dx*(ĉ)/dĉ) (dĉ/dω).    (7)

The first term on the right side of (7) can be computed directly, as L(x*(ĉ), c) is typically a differentiable function of x*(ĉ). A deep learning library (such as TensorFlow (Abadi et al., 2015) or PyTorch (Paszke et al., 2019)) computes the last term by representing the neural network as a computational graph and applying automatic differentiation (autodiff) in the reverse mode (Baydin et al., 2018). However, the second term, dx*(ĉ)/dĉ, may be nontrivial to compute given the presence of two major challenges: (1) the mapping ĉ → x*(ĉ), as defined by the solution to an optimization problem, lacks a closed form which can be differentiated directly, and (2) for many interesting and useful optimization models, the mapping is nondifferentiable at some points and has zero-valued gradients at others, precluding the straightforward use of gradient descent. As shown in the next subsection, even the class of linear programming problems, widely used in decision modeling, is affected by both issues. Section 3 details the various existing approaches aimed at overcoming these challenges.
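Challenge (2) can be observed numerically: for a combinatorial problem, the mapping ĉ → x*(ĉ) is piecewise constant, so finite differences recover a zero Jacobian almost everywhere. In the small sketch below, the unit-weight knapsack solver is an assumed toy stand-in for a general CO solver.

```python
import numpy as np

def solve_knapsack(values, capacity=1):
    # Unit-weight knapsack: pick the `capacity` items with the largest values.
    x = np.zeros_like(values, dtype=float)
    x[np.argsort(values)[-capacity:]] = 1.0
    return x

c_hat = np.array([2.5, 3.0])
eps = 1e-4
# Finite-difference Jacobian of the mapping c -> x*(c) at c_hat, column by column.
J = np.column_stack([
    (solve_knapsack(c_hat + eps * e) - solve_knapsack(c_hat - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(J)  # zero matrix: small parameter changes do not move the solution
```

The Jacobian is zero almost everywhere and undefined at the points where the solution jumps, which is precisely why naive gradient descent through the solver fails.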
Computational cost. Another major challenge in decision-focused learning is the computational resources required to train the integrated prediction and optimization model. Note line 5 in Algorithm 2, which requires evaluating x*(ĉ). This requires solving, and differentiating through, the underlying optimization problem for each observed data sample, in each epoch. This imposes a significant computational cost even when dealing with small-scale and efficiently solvable problems, but can become an impediment in the case of large, (NP-)hard optimization problems. Section 3 reviews the techniques proposed thus far for reducing the computational demands of DFL and improving scalability.

Figure 3: In decision-focused learning, the neural network model is trained to minimize the task loss.

Optimization Problem Forms
The effectiveness of solving an optimization problem depends on the specific forms of the objective and constraint functions. Considerable effort has been devoted to developing efficient algorithms for certain optimization forms. Below, an overview is given of the key and most widely used types of optimization problem formulations.

Convex optimization
In convex optimization problems, a convex objective function is optimized over a convex feasible space. This class of problems is distinguished by the guarantee that any locally optimal solution is also globally optimal (Boyd et al., 2004). Since many optimization methods provably converge to local minima, convex problems are considered reliably and efficiently solvable relative to nonconvex problems. Despite this, convex optimization mappings still impose significant computational overhead on Algorithm 2, since they must be solved for each data sample in each epoch, and most convex optimizations are orders of magnitude more complex than conventional neural network layers. Like all parametric optimization problems, convex ones are implicitly defined mappings from parameters to optimal solutions, lacking a closed form that can be differentiated directly. However, as detailed in Section 3.1, they can be canonicalized to a standard form, which facilitates automation of their solution and backpropagation by a single standardized procedure (Agrawal et al., 2019a). The class of convex problems is broad enough to include some which yield mappings x*(ĉ) that are differentiable everywhere, and some which do not. Linear programs, which are convex but form nondifferentiable mappings with respect to their objective coefficients, are notable examples of the latter case and are discussed next. The portfolio optimization problem (44), which contains both linear and quadratic constraints, provides an example of a parametric convex problem that admits useful gradients over some regions of its parameter space but not others. Where the (quadratic) variance constraint (44b) is not active, it behaves as a linear program. Elsewhere, the optimal solution is a smooth function of its parameters.

Linear programming
Linear programs (LPs) are convex optimization problems whose objective and constraints are composed of affine functions. These programs are predominant as decision models in operations research, and have endless industrial applications, since the allocation and transfer of resources is typically modeled by linear relationships between variables (Bazaraa et al., 2008). The parametric LPs considered in this manuscript take the following standard form:

x*(c) = argmin_x c^T x  subject to  Ax = b, x ≥ 0.  (8)

Compared to other classes of convex problems, LPs admit efficient solution methods, even for large-scale problems (Bazaraa et al., 2008; Ignizio and Cavalier, 1994). From a DFL standpoint, however, LPs pose a challenge, because the mapping c → x*(c) is nondifferentiable. Although the derivatives of mapping (8) are defined almost everywhere, they provide no useful information for gradient descent training. To see this, first note the well-known fact that a linear program always takes its optimal value at a vertex of its feasible set (Bazaraa et al., 2008). Since the number of vertices in any such set is finite, (8) maps a continuous parameter space to a discrete set of solutions. As such, it is a piecewise constant mapping. Therefore its derivatives are zero almost everywhere, and undefined elsewhere. Prevalent strategies for incorporating linear programs in decision-focused learning thus typically rely on differentiating smooth approximations of the LP, as detailed in Section 3.2.
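The zero-gradient behavior described above can be observed numerically. Below is a minimal sketch (the two-variable LP instance is hypothetical) using SciPy's `linprog`: a small change in the cost vector leaves the optimal vertex unchanged, so the finite-difference "gradient" of the mapping is exactly zero.

```python
import numpy as np
from scipy.optimize import linprog

def lp_solution(c):
    """Solve min c^T x  s.t.  x1 + x2 = 1, x >= 0 (a tiny standard-form LP)."""
    res = linprog(c, A_eq=[[1.0, 1.0]], b_eq=[1.0], bounds=[(0, None)] * 2)
    return res.x

# The optimal solution sits at a vertex of the feasible segment.
x_base = lp_solution(np.array([1.0, 2.0]))   # vertex (1, 0)
x_pert = lp_solution(np.array([1.1, 2.0]))   # perturbed cost: same vertex

# The mapping c -> x*(c) is piecewise constant, so the finite-difference
# derivative is zero and carries no signal for gradient descent.
fd_grad = (x_pert - x_base) / 0.1
```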
Many operations research problems, such as the allocation and planning of resources, can be modeled as LPs. Also, many prototypical problems in algorithm design (e.g., sorting and top-k selection) can be formulated as LPs with continuous variables, despite admitting only discrete integer solutions, by relying on the total unimodularity of the constraint matrices (Bazaraa et al., 2008). In what follows, some examples of LPs, and how they might occur in a Predict-Then-Optimize context, are given.
Shortest paths. Given a directed graph with a given start and end node, the goal in the shortest path problem is to find a sequence of arcs of minimal total length that connects the start and the end node. The decision variables are binary indicators of each edge's inclusion in the path. The linear constraints ensure [0, 1] bounds on each indicator, as well as flow balance through each node. These flow balance constraints capture that, except for the start and end node, each node has as many incoming selected arcs as outgoing selected arcs. For the start node, there is one additional outgoing selected arc, and for the end node, there is one more incoming selected arc. The coefficients in the linear objective represent the arc lengths. In many realistic settings -as well as in several common DFL benchmarks (Elmachtoub and Grigas, 2022; Pogančić et al., 2020) -these are unknown, requiring them to be predicted from features before a shortest path can be computed. This motivating example captures the realistic setting in which the shortest route between two locations has to be computed, but in which the road traversal times are uncertain (due to unknown traffic conditions, for example), and have to be predicted from known features (such as day of the week, time of day and weather conditions).
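The flow-balance LP above can be assembled directly. The following is a small sketch on a hypothetical four-node graph, using SciPy's `linprog`; the total unimodularity of the incidence matrix guarantees that the LP relaxation returns an integral (0/1) vertex solution.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 4-node directed graph; each edge is (tail, head, length).
edges = [(0, 1, 1.0), (1, 3, 1.0), (0, 2, 1.0), (2, 3, 3.0)]
n_nodes, n_edges = 4, len(edges)

# Incidence matrix: +1 where an edge leaves a node, -1 where it enters.
A = np.zeros((n_nodes, n_edges))
for j, (u, v, _) in enumerate(edges):
    A[u, j], A[v, j] = 1.0, -1.0

# Net outflow: +1 at start node 0, -1 at end node 3, 0 elsewhere.
b = np.array([1.0, 0.0, 0.0, -1.0])
c = np.array([length for _, _, length in edges])

# Solve the LP relaxation; the optimum indicates the shortest path's edges.
res = linprog(c, A_eq=A, b_eq=b, bounds=[(0.0, 1.0)] * n_edges)
path_indicator = np.round(res.x)   # path 0 -> 1 -> 3 of length 2
```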
Bipartite matching. A graph is given consisting of two sets of nodes, with arcs connecting each node of the first set to each node of the second. The arcs are weighted, but the weights are unknown and must be predicted. The optimization task is to choose a subset of arcs such that each node from each set is involved in a selected arc at most once, and the total weight of the selected arcs is maximized. The variables lie in [0, 1] and indicate the inclusion of each edge. The constraints ensure that each node is involved at most once in a selected arc. The linear objective coefficients represent arc weights. With a complete bipartite graph, matchings can be construed as permutations and represented as permutation matrices, which can be employed in tasks such as learning to rank (Kotary et al., 2022).
Sorting and Ranking. The sorting of any list of predicted values can be posed as a linear program over a feasible region whose vertices correspond to all of the possible permutations of the list. The related ranking, or argsort, problem assigns to any length-n list a permutation of the sequential integers [n] which sorts the list. By smoothing the linear program, these basic operations can be differentiated and backpropagated.
Top-k selection. Given a set of items and item values that must be predicted, the task is to choose the subset of size k with the largest total value in selected items. In addition to [0, 1] bounds on the indicator variables, a single linear constraint ensures that the selected item indicators sum to k. A prevalent example can be found in multilabel classification (Martins and Astudillo, 2016).
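A minimal sketch of the top-k LP (the item values are hypothetical, standing in for ML predictions): maximize v^T x subject to the indicators summing to k and lying in [0, 1]. The LP's vertex solution selects exactly the k largest values.

```python
import numpy as np
from scipy.optimize import linprog

def top_k_lp(values, k):
    """Select the k most valuable items via the LP:
    max v^T x  s.t.  sum(x) = k,  0 <= x <= 1."""
    n = len(values)
    res = linprog(-np.asarray(values),               # linprog minimizes
                  A_eq=np.ones((1, n)), b_eq=[float(k)],
                  bounds=[(0.0, 1.0)] * n)
    return np.round(res.x)

# Hypothetical predicted item values; the two largest are at indices 1 and 2.
sel = top_k_lp([0.3, 2.0, 1.5, -0.7, 0.9], k=2)
```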
Computing the maximum. This is a special case of top-k selection where k = 1. When the LP's objective is regularized with the entropy term H(x) = Σ_i x_i log x_i, the mapping from predicted values to optimal solutions is equivalent to a softmax function (Agrawal et al., 2019a).
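This equivalence can be checked directly from the optimality conditions. For max_x c^T x − Σ_i x_i log x_i over the simplex, stationarity requires c_i − log x_i − 1 to equal the same Lagrange multiplier for every coordinate, and the softmax of c satisfies this exactly. A minimal sketch (the vector c is a hypothetical prediction):

```python
import numpy as np

def softmax(c):
    z = np.exp(c - c.max())   # shift for numerical stability
    return z / z.sum()

# For  max_x  c^T x - sum_i x_i log x_i  s.t.  sum(x) = 1, x >= 0,
# stationarity requires  c_i - log(x_i) - 1  to be the same constant
# (the multiplier of the simplex constraint) in every coordinate.
c = np.array([0.5, -1.2, 2.0])
p = softmax(c)
stationarity = c - np.log(p) - 1.0   # should be a constant vector
```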
Max-flow/min-cut. Given a network with predefined source and sink nodes, and predicted flow capacities on each arc, the task is to find the maximum flow rate that can be channeled from source to sink. Here the predicted flow capacities occupy the right-hand side of the linear constraints, which is not in line with the DFL problem description given in subsection 2.1. However, in the related min-cut problem, which is equivalent to the dual linear program of the max-flow problem, the flow capacities are the coefficients in the objective function. The max-flow problem can thus be cast as an equivalent min-cut problem, and DFL can be used to learn to predict the flow capacities.

Integer linear programming
Integer Linear Programs (ILPs) are another mainstay in operations research and computer science. ILPs differ from LPs in that the decision variables x are restricted to integer values, i.e., x ∈ Z^k, where Z^k is the set of integer vectors of appropriate dimension. Like LPs, ILPs are challenging to use in DFL because they yield discontinuous, nondifferentiable mappings. Computationally, however, they are more challenging due to their NP-hard complexity, which may preclude the exact computation of the mapping ĉ → x*(ĉ) at each step of Algorithm 2. Their differentiation is also significantly more challenging, since the discontinuity of their feasible regions precludes many smoothing techniques that can be applied in DFL with LPs. In the following, examples of how ILPs may occur in a Predict-Then-Optimize setting are provided.
Knapsack. The knapsack problem has been used as a benchmark in several papers on DFL (Demirović et al., 2019). Given are the weights of a set of items, as well as a capacity. The items also have associated values, which have to be predicted from features. The optimization task involves selecting a subset of the items that maximizes the total value of the selected items, whilst ensuring that the sum of their weights does not exceed the capacity.
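For intuition about the decision stage, a minimal dynamic-programming sketch of the 0/1 knapsack is given below (the instance is hypothetical; in a Predict-Then-Optimize setting the values would come from an ML model, while the weights and capacity are known).

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack by dynamic programming (integer weights):
    best[w] = maximum value achievable with total weight <= w."""
    best = [0.0] * (capacity + 1)
    for v, wt in zip(values, weights):
        for w in range(capacity, wt - 1, -1):   # reverse: each item used once
            best[w] = max(best[w], best[w - wt] + v)
    return best[capacity]

# Hypothetical instance: the best subset is items 2 and 3 (weight 5, value 22).
opt = knapsack(values=[6.0, 10.0, 12.0], weights=[1, 2, 3], capacity=5)
```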
Travelling salesperson problem. In the travelling salesperson problem, the list of cities and the distances between each pair of cities are given. The goal is to find a tour of minimal length that visits each city exactly once. In the Predict-Then-Optimize setting, the distances between the cities first have to be predicted (Pogančić et al., 2020) from observable empirical data.
Combinatorial portfolio optimization. Portfolio optimization involves making optimal investment decisions across a range of financial assets. In the combinatorial Predict-Then-Optimize variant, the decisions are discrete, and must be made on the basis of the predicted next period's increase in the value of several assets (Ferber et al., 2020).
Diverse bipartite matching. Diverse bipartite matching problems are similar to the bipartite matching problems described in 2.4.2, but are subject to additional diversity constraints (Ferber et al., 2020; Mulamba et al., 2021; Mandi et al., 2022). In this variant, edges have additional properties, and the diversity constraints enforce lower and upper bounds on the proportion of selected edges with a certain property. This precludes the LP formulation and makes an ILP formulation necessary.
Energy-cost aware scheduling. Energy-cost aware scheduling involves scheduling a set of tasks across a set of machines in a way that minimizes the overall energy cost involved. As future energy costs are unknown, they first have to be predicted (Mulamba et al., 2021; Mandi et al., 2022).

Integer nonlinear programming
In integer nonlinear programming, the objective function and/or the constraints are nonlinear. Performing DFL on integer nonlinear programs faces the same challenges as performing DFL on ILPs: integer nonlinear programs are computationally expensive to solve, are implicit mappings with zero-valued gradients almost everywhere, and have discontinuous feasible regions, hindering the use of the smoothing techniques that can be applied in DFL with LPs. Additionally, because of their nonlinear nature, many of the techniques developed for DFL with ILPs, which assume linearity, do not work on integer nonlinear programs (Elmachtoub and Grigas, 2022; Pogančić et al., 2020). To the best of our knowledge, no DFL method has specifically been developed for or tested on integer nonlinear programs. The most closely related work is (Ferber et al., 2022), which learns approximate ILP surrogates for integer nonlinear programs; these surrogates could then in turn be used in a DFL loop.

Review of Decision-focused Learning Methodologies
This section describes several methodologies that address the challenge of differentiating an optimization mapping for DFL in gradient-based training. In essence, the different approaches provide different ways of computing or approximating the gradient of the task loss with respect to the predicted parameters, which is used for backpropagation. This paper proposes the first categorization of existing DFL approaches, into the following four distinct classes: Analytical Differentiation of Optimization Mappings: Methodologies under this category aim to compute exact derivatives for backpropagation by differentiating the optimality conditions of certain optimization problem forms, for which the derivatives exist and are nonzero.
Analytical Smoothing of Optimization Mappings: These approaches deal with combinatorial optimization problems (for which the analytical derivatives are zero almost everywhere) by performing smoothing of combinatorial optimization problems, which results in approximate problems that can be differentiated analytically.
Smoothing by Random Perturbations: Methodologies under this category utilize implicit regularization through perturbations, constructing smooth approximations of optimization mappings.

Differentiation of Surrogate Loss Functions: Methodologies under this category propose convex surrogate loss functions for specific task losses such as regret. Some of these surrogates bypass the need for computing dL(x*(ĉ))/dĉ entirely: they reflect the quality of the decisions, but do not require computing the solution of the optimization problem for differentiation.

Figure 4 presents key characteristics of these four methodology classes, highlighting the types of problems that can be addressed within each class. Next, each category is thoroughly described.

Analytical Differentiation of Optimization Mappings
As discussed before, differentiating through parametric CO problems comes with two main challenges. First, since CO problems are complex, implicitly defined mappings from parameters to solutions, computing the derivatives is not straightforward. Second, some CO problems yield discontinuous, piecewise-constant mappings, whose derivatives are zero almost everywhere and hence uninformative. This subsection pertains to CO problems for which the second challenge does not apply, i.e., problems that are smooth mappings. For these problems, all that is required to implement DFL is direct differentiation of the mapping in Eq. (1).
Differentiating unconstrained relaxations. An early work discussing differentiation through constrained argmin problems in the context of machine learning is (Gould et al., 2016). It first proposes a technique to differentiate the argmin of a smooth, unconstrained convex function. When V(c) = argmin_x f(c, x), it can be shown that, when all second derivatives of f exist,

dV(c)/dc = −f_xx(c, V(c))^{-1} f_cx(c, V(c)),

where f_cx is the second partial derivative of f with respect to c followed by x, and f_xx is the second derivative with respect to x. This follows from implicit differentiation of the first-order optimality conditions with respect to c, and rearranging terms. Here the variables c are the optimization problem's defining parameters, and the variables x are the decision variables. This technique is then extended to find approximate derivatives of constrained problems with inequality constraints g_i(c, x) ≤ 0, 1 ≤ i ≤ m, by first relaxing the problem to an unconstrained one by means of the log-barrier function, and then differentiating argmin_x F(c, x), where F is the barrier-augmented objective, with respect to c for some choice of the scaling factor μ. Since this approach relies on approximations and requires hyperparameter tuning for the factor μ, subsequent works focus on differentiating constrained optimization problems directly via their own global conditions for optimality, as discussed next.
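A minimal one-dimensional sketch of this implicit-differentiation formula (the function f is a hypothetical example chosen so that the argmin has a closed form): for f(c, x) = exp(x) − c·x with c > 0, the minimizer is x* = log(c), so the exact derivative 1/c can be checked against the formula −f_xx^{-1} f_cx.

```python
import numpy as np

# V(c) = argmin_x f(c, x)  with  f(c, x) = exp(x) - c * x  (c > 0),
# whose minimizer is x* = log(c).
def argmin_f(c):
    return np.log(c)            # closed form here; a numerical solver in general

def implicit_grad(c):
    # dV/dc = -f_xx(c, V(c))^{-1} * f_cx(c, V(c))
    x_star = argmin_f(c)
    f_xx = np.exp(x_star)       # second derivative in x
    f_cx = -1.0                 # mixed second derivative (w.r.t. c then x)
    return -f_cx / f_xx

c0 = 3.0
grad = implicit_grad(c0)        # analytic answer: 1 / c0
fd = (argmin_f(c0 + 1e-6) - argmin_f(c0 - 1e-6)) / 2e-6   # finite-difference check
```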
Differentiating KKT conditions of quadratic programs. More recent approaches are based on differentiating the optimality conditions of a CO problem directly, i.e., without first converting it to an unconstrained problem. Consider an optimization problem and its optimal solution:

x* = argmin_x f(x)  subject to  g(x) ≤ 0, h(x) = 0,  (12)

and assume that f, g and h are differentiable functions of x. The Karush-Kuhn-Tucker (KKT) conditions are a set of equations expressing optimality conditions for a solution x* of problem (12) (Boyd et al., 2004):

∇f(x*) + ∇g(x*)^T λ + ∇h(x*)^T ν = 0  (13a)
g(x*) ≤ 0  (13b)
h(x*) = 0  (13c)
λ ≥ 0  (13d)
λ ⊙ g(x*) = 0,  (13e)

where λ and ν are the dual variables associated with the inequality and equality constraints, respectively. OptNet (Amos and Kolter, 2017) is a framework to differentiate through optimization mappings that are convex quadratic programs (QPs) by differentiating through these KKT conditions. In convex quadratic programs, the objective f is a convex quadratic function and the constraint functions g, h are linear over a continuous domain. In the most general case, each of f, g and h depends on a distinct set of parameters, in addition to the optimization variable x:

x*(Q, c, R, s, A, b) = argmin_x (1/2) x^T Q x + c^T x  subject to  Rx ≤ s, Ax = b.  (14)

When x ∈ R^k and the numbers of inequality and equality constraints are M_in and M_eq, respectively, a QP problem is specified by parameters Q ∈ R^{k×k}, c ∈ R^k, R ∈ R^{M_in×k}, s ∈ R^{M_in}, A ∈ R^{M_eq×k}, and b ∈ R^{M_eq}. Note that for this problem to be convex, Q must always be positive semidefinite, which can be ensured by learning instead parameters q ∈ R^k and taking Q = qq^T.
A defining characteristic of quadratic programs such as (14) is their straightforward parameterization. This is due to the fact that any quadratic or linear function can be fully specified by a square matrix or a vector of parameters, respectively. Here, problem (14) is viewed as a mapping (Q, c, R, s, A, b) → x*(Q, c, R, s, A, b), which parameterizes the space of all possible quadratic programs and their solutions. The presence of such a canonical form allows for separation of a problem's inherent structure from its parameters (Grant and Boyd, 2008), and is key to creating a differentiable mapping from parameters to optimal solutions in an automated way, without necessitating additional analytical transformations.
The gradients are sought with respect to each of the parameters in (Q, c, R, s, A, b). For this purpose, Amos and Kolter (2017) argue that the inequalities (13b) and (13d) can be dropped, resulting in a system of equalities representing optimality conditions for x*.
Exact gradients dx*/dP for any P ∈ {Q, c, R, s, A, b} can then be retrieved by solving the differentiated KKT conditions, where the shorthand d stands for the derivative d/dP. This is another example of implicit differentiation, and requires solving a linear system of equations. Later, Konishi and Fukunaga (2021) extended this method to compute the second-order derivative of the solution. This allows training gradient boosting models, which require the gradient as well as the Hessian matrix of the loss.
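A minimal sketch of this idea, restricted to an equality-constrained QP with hypothetical numeric data: the KKT conditions form a linear system, and differentiating that system with respect to c yields another linear system with the same matrix, from which dx*/dc is obtained.

```python
import numpy as np

# Equality-constrained QP:  min 1/2 x^T Q x + c^T x  s.t.  A x = b.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])   # KKT matrix

def solve_qp(c):
    # KKT equalities:  [Q A^T; A 0] [x; nu] = [-c; b]
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:2]

def dx_dc():
    # Differentiating the KKT equalities w.r.t. c keeps the same matrix on
    # the left and puts [-I; 0] on the right.
    rhs = np.vstack([-np.eye(2), np.zeros((1, 2))])
    return np.linalg.solve(K, rhs)[:2]

c0 = np.array([0.3, -0.8])
J = dx_dc()                                   # exact Jacobian dx*/dc
eps = 1e-6                                    # finite-difference check, column 0
fd = (solve_qp(c0 + np.array([eps, 0.0])) - solve_qp(c0 - np.array([eps, 0.0]))) / (2 * eps)
```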
In summary, the techniques in this category compute the derivatives of the solution with respect to the parameters (if they exist) by leveraging implicit differentiation of the KKT conditions.
Differentiating optimality conditions of conic programs. Another class of problems with a parametric canonical form are the conic programs, which take the form

x*(A, b, c) = argmin_x c^T x  subject to  Ax + s = b, s ∈ K,  (16)

where K is a nonempty, closed, convex cone. A framework for differentiating the mapping (16) for any K is proposed in (Agrawal et al., 2019c), which starts by forming the homogeneous self-dual embedding of (16), whose parameters form a skew-symmetric block matrix composed of A, b, and c. Following (Busseti et al., 2019), the solution to this embedding is expressed as the problem of finding a zero of a mapping containing a skew-symmetric linear function and projections onto the cone K and its dual. The zero of this function is implicitly differentiated, in a similar manner to the KKT conditions of a quadratic program. The overall mapping (16) is viewed as the composition of a function that maps (A, b, c) onto the skew-symmetric parameter space of the self-dual embedding, the root-finding problem that produces a solution to the embedding, and a transformation back to a solution of the primal and dual problems. The overall derivative is found by a chain rule applied over this composition.
Subsequent work (Agrawal et al., 2019a) leverages the above-described differentiation of cone programs to develop a more general differentiable convex optimization solver, Cvxpylayers. It is well known that conic programs of the form (16) can provide canonical representations of convex programs (Nemirovski, 2007). The approach described by Agrawal et al. (2019a) is based on this principle: a large class of parametric convex optimization problems can be recast as equivalent parametric cone programs, with an appropriate choice of the cone K. A major benefit of this representation is that it allows a convex program to be separated with respect to its defining parameters (A, b, c) and its structure K, allowing a generic procedure to be applied for solving and differentiating the transformed problem with respect to A, b and c.
The framework for transforming convex programs to cone programs of the form (16) is drawn from (Grant and Boyd, 2008), which is based on two related concepts. First is the notion of disciplined convex programming, which assists the automation of cone transforms by imposing a set of rules or conventions on how convex programs can be represented. Second is the notion of graph implementations, which represent functions as optimization problems over their epigraphs, for the purpose of generically representing optimization problems and assisting conversion between equivalent forms. The associated software system called cvx allows for disciplined convex programs to be converted to cone programs via their graph implementations. Subsequently, the transformed problem is solved using conic optimization algorithms, and its optimal solution is converted to a solution of the original disciplined convex program. Differentiation is performed through each operation and combined by the chain rule. The transformation of parameters between respective problem forms, and the solution recovery step, are differentiable by virtue of being affine mappings (Agrawal et al., 2019a). The intermediate conic program is differentiated via the methods of (Agrawal et al., 2019c).
Solver unrolling and fixed-point differentiation. While the methods described above for differentiation through CO problems are generic and applicable to broad classes of problems, other practical techniques have proven effective and even advantageous in some cases. A common strategy is that of solver unrolling, in which the solution to (1) is found by executing an optimization algorithm in the computational graph of the predictive model. Then, the mapping (1) is backpropagated simply by automatic differentiation or 'unrolling' through each step of the algorithm, thus avoiding the need to explicitly model dx*(c)/dc (Domke, 2012). While this approach leads to accurate backpropagation in many cases, it suffers disadvantages in efficiency due to the memory and computational resources required to store and apply backpropagation over the entire computational graph of an algorithm that requires many iterations. Additionally, it has been observed that unrolling over many solver iterations can lead to vanishing gradient issues reminiscent of recurrent neural networks (Monga et al., 2021). On the other hand, unrolling allows for the learning of unspecified algorithm parameters, such as gradient descent step sizes or weights in an augmented Lagrangian, which can be exploited to accelerate the forward-pass convergence of the optimization solver. A comprehensive survey of algorithm unrolling for image processing applications is provided in (Monga et al., 2021).
Another way in which a specific solution algorithm may provide gradients through a corresponding optimization mapping is by implicit differentiation of its fixed-point conditions.
Suppose that the solver iterations x_{k+1} = Φ(x_k, c) (17) converge as k → ∞ to a solution x*(c) of problem (1); then the fixed-point conditions x*(c) = Φ(x*(c), c) (18) are satisfied. Assuming the existence of all derivatives on an open set containing c, so as to satisfy the implicit function theorem, it follows by implicit differentiation with respect to c that

(I − ∂Φ/∂x (x*(c), c)) dx*(c)/dc = ∂Φ/∂c (x*(c), c),  (19)

which is a linear system to be solved for dx*(c)/dc. The relationship between unrolling and differentiation of the fixed-point conditions is studied by Kotary et al. (2023), which shows that backpropagation of (1) by unrolling (17) is equivalent to solving the linear system (19) by fixed-point iteration. As such, the convergence rate of the backward pass in unrolling is determined by the convergence rate of the equivalent linear system solve, and can be calculated in terms of the spectral radius of ∂Φ/∂x.
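A minimal scalar sketch of fixed-point differentiation (the contraction Φ is a hypothetical example): run the iteration to convergence, then solve the (here one-dimensional) linear system (1 − ∂Φ/∂x) dx*/dc = ∂Φ/∂c instead of unrolling.

```python
import math

# Contractive solver iteration  x_{k+1} = Phi(x_k, c)  with
# Phi(x, c) = 0.5 * cos(x) + c, so |dPhi/dx| <= 0.5 and iteration converges.
def phi(x, c):
    return 0.5 * math.cos(x) + c

def fixed_point(c, iters=200):
    x = 0.0
    for _ in range(iters):
        x = phi(x, c)
    return x

def implicit_grad(c):
    # Differentiating x* = Phi(x*, c):  (1 - dPhi/dx) dx*/dc = dPhi/dc.
    x_star = fixed_point(c)
    dphi_dx = -0.5 * math.sin(x_star)
    dphi_dc = 1.0
    return dphi_dc / (1.0 - dphi_dx)

c0 = 0.7
g = implicit_grad(c0)
fd = (fixed_point(c0 + 1e-6) - fixed_point(c0 - 1e-6)) / 2e-6   # check
```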
Discussion. In contrast to most other differentiable optimization methods surveyed in this article, the analytical approaches in this subsection allow for backpropagation of coefficients that specify the constraints as well as the objective function. For example, Amos and Kolter (2017) propose parametric quadratic programming layers whose linear objective coefficients are predicted by previous layers, and whose constraints are learned through the layer's own embedded parameters. This is distinct from most cases of DFL, in which the optimization problems have fixed constraints and no trainable parameters of their own. Furthermore, the techniques surveyed in this subsection are aimed at computing exact gradients of parametric optimization mappings. However, many applications of DFL contain optimization mappings that are discontinuous and piecewise constant. Such mappings, including parametric linear programs (8), have gradients that are zero almost everywhere and thus do not supply useful descent directions for SGD training. Therefore, the techniques of this subsection are often applied after regularizing the problem analytically with smooth functions, as detailed in the next subsection.

Analytical Smoothing of Optimization Mappings
To differentiate through combinatorial optimization problems, the optimization mapping first has to be smoothed. While techniques such as noise-based gradient estimation (surveyed in Section 3.3) provide smoothing and differentiation simultaneously, analytical smoothing first incorporates smooth analytical terms in the optimization problem's formulation, and then analytically differentiates the resulting optimization problem using the techniques discussed in Section 3.1.
Analytical smoothing of linear programs. Note that while an LP problem is convex and has continuous variables, only a finite number of its feasible solutions can potentially be optimal. These points coincide with the vertices of its feasible polytope (Bazaraa et al., 2008). Therefore the mapping x*(ĉ) in (8), as a function of ĉ, is discontinuous and piecewise constant, and thus requires smoothing before it can be differentiated through. An approach to do so was presented in Wilder et al. (2019a), which proposes to augment the linear LP objective function with the Euclidean norm of its decision variables, so that the new objective takes the following form:

x*(ĉ) = argmax_{x ∈ F} ĉ^T x − μ‖x‖² = argmin_{x ∈ F} ‖x − ĉ/(2μ)‖²,  (20)

where the above equality follows from expanding the square and cancelling constant terms, which do not affect the minimizer. This provides an intuition as to the effect of such quadratic regularization: it converts an LP into the problem of projecting the point ĉ/(2μ) onto the feasible polytope, which results in a continuous mapping ĉ → x*(ĉ). Wilder et al. (2019a) then train decision-focused models by solving and backpropagating the respective quadratic programming problem using the framework of Amos and Kolter (2017), in order to learn to predict cost coefficients with minimal regret. At test time, the quadratic smoothing term is removed. This article refers to such regret-based DFL with quadratically regularized linear programs as the Quadratic Programming Task Loss (QPTL) method.
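The projection view has a closed form when the feasible region is a box (the [0, 1]^n region and the numbers below are a hypothetical illustration; general polytopes require a QP solver): the regularized solution is a clip of ĉ/(2μ), which varies continuously with ĉ, in contrast to the 0/1 vertex jumps of the unregularized LP.

```python
import numpy as np

# With a box feasible region [0, 1]^n, the quadratically regularized problem
#   argmax_x  c^T x - mu * ||x||^2
# is the Euclidean projection of c / (2 * mu) onto the box, i.e. a clip.
def smoothed_lp(c, mu):
    return np.clip(c / (2.0 * mu), 0.0, 1.0)

c = np.array([0.4, -0.3, 1.7])
mu = 0.5
x_smooth = smoothed_lp(c, mu)          # continuous in c ...
x_pert = smoothed_lp(c + 1e-3, mu)     # ... small input change, small output change

# Without regularization, the LP solution over the box is a discontinuous
# 0/1 threshold on the sign of each coefficient.
x_lp = (c > 0).astype(float)
```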
Other forms of analytical smoothing for linear programs can be applied by adding different regularization functions to the objective. Common regularization terms for LPs include the entropy function H(x) = Σ_i x_i log x_i and the binary entropy function H_b(x) = Σ_i (x_i log x_i + (1 − x_i) log(1 − x_i)). To differentiate the resulting smoothed optimization problems, the framework of Agrawal et al. (2019a) can be used. Alternatively, problem-specific approaches that do not employ (Agrawal et al., 2019a) have also been proposed: for example, a method in which H smooths an LP for differentiable sorting and ranking, and a way to differentiate through problems where H_b is used in a multilabel classification problem. Both works propose fast implementations for both the forward and backward passes of their respective optimization problems.
In a related approach, Mandi and Guns (2020) propose a general, differentiable LP solver based on log-barrier regularization. For a parametrized LP of standard form (8), gradients are computed for the regularized form in which the constraints x ≥ 0 are replaced with log-barrier terms:

x*(ĉ) = argmin_x ĉ^T x − λ Σ_i log x_i  subject to  Ax = b.  (21)

While similar in this sense to (Gould et al., 2016), this method exploits several efficiencies specific to linear programming, in which the log-barrier term serves the dual purpose of rendering (21) differentiable and also aiding its solution. Rather than forming and solving this regularized LP directly, the solver uses an interior-point method to produce a sequence of log-barrier approximations to the LP's homogeneous self-dual (HSD) embedding. Early stopping is applied in the interior-point method, producing a solution to (21) for some λ, which serves as a smooth surrogate problem for differentiation. A major advantage of this technique is that it only requires optimization of a linear program, making it in general more efficient than direct solution of a regularized problem as in the approaches described above.
Analytical smoothing of integer linear programs. To differentiate through ILPs, Wilder et al. (2019a) propose to simply drop the integrality constraints, and to then smooth and differentiate through the resulting LP relaxation, which is observed to give satisfactory performance in some cases. Ferber et al. (2020) later extended this work by using a more systematic approach to generate the LP relaxation of the ILP problem. They use the method of cutting planes to discover an LP problem that admits the same solution as the ILP. Subsequently, the method of (Wilder et al., 2019a) is applied to approximate the LP mapping's derivatives. Although this results in enhanced performance with respect to regret, there are some practical scalability concerns, since the cut generation process is time consuming but must be repeated for each instance in each training epoch.

Smoothing by Random Perturbations
A central challenge in DFL is the need to smooth non-smooth optimization mappings. Techniques that perform the smoothing by adding explicit regularization functions to the optimization problem's objective function have been surveyed in Section 3.2. This section instead surveys techniques that use implicit regularization via perturbations. These techniques construct smooth approximations of the optimization mappings by adopting a probabilistic point of view. To introduce this point of view, the CO problem in this section is not viewed as a mapping from c to x*(c). Rather, it is viewed as a function that maps c onto a probability distribution over the feasible region F. From this perspective, x*(c) can be viewed as a random variable, conditionally dependent on c.
The motivation behind representing x*(c) as a random variable is that the rich literature on likelihood maximization with latent variables, in fields such as Probabilistic Graphical Models (PGMs) (Koller and Friedman, 2009), can be exploited.
Implicit differentiation by perturbation. One seminal work in the field of PGMs is by Domke (2010). This work contains an important proposition, which deals with a setup where a variable θ1 is conditionally dependent on another variable θ2 and the final loss L is defined on the variable θ1. Let p(θ1|θ2) and E[θ1|θ2] be the conditional distribution and the conditional mean of θ1. The loss L is measured on the conditional mean E[θ1|θ2], and the goal is to compute the derivative of L with respect to θ2. Domke (2010) proposes that the derivative of L with respect to θ2 can be approximated by the following finite-difference method for a small δ > 0: ∇_θ2 L ≈ (1/δ) ( E[θ1 | θ2 + δ ∇_θ1 L] − E[θ1 | θ2] ). (22) Notice that the first term in (22) is the conditional mean after perturbing the parameter θ2, where the magnitude of the perturbation is modulated by the derivative of L with respect to θ1. Taking inspiration from this proposition, by defining a conditional distribution p(x*(ĉ)|ĉ), one can compute the derivative of the regret with respect to ĉ in the context of DFL.
To perfectly represent the deterministic mapping c → x*(c), the straightforward choice is the Dirac mass distribution, which assigns all probability mass to the optimal point and none to other points, i.e., p(x|c) = 1 if x = x*(c), and 0 otherwise. (23)
Differentiation of blackbox combinatorial solvers. Note that with the distribution in (23), E_{x∼p(x|c)}[x|c] = x*(c). Hence, using this conditional distribution in the proposition above, the derivative of the loss can be computed in the following way: ∇_ĉ L ≈ (1/δ) ( x*(ĉ + δ (dL/dx)|_{x=x*(ĉ)}) − x*(ĉ) ). (24) The gradient computation methodology proposed by Pogančić et al. (2020) takes the form of (24). They interpret it as substituting the jump-discontinuous optimization mapping with a piecewise-linear interpolation: a linear interpolation of the mapping c → x*(c) between the points ĉ and ĉ + δ (dL/dx)|_{x=x*(ĉ)}. Pogančić et al. (2020) call this differentiation of blackbox (DBB) solvers, because this approach treats the CO solver as a blackbox oracle, i.e., it does not take cognizance of how the solver works internally.
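The two solver calls behind this finite-difference scheme can be sketched in a few lines. The toy enumeration solver and problem data below are ours, purely for illustration:

```python
def solve(c, feasible):
    """Blackbox CO oracle: the feasible point minimizing the linear cost c.x."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def finite_difference_grad(c_hat, dL_dx, feasible, delta=0.1):
    """Gradient estimate in the form of (24): difference of the blackbox
    solutions at c_hat and at c_hat perturbed along the incoming
    gradient dL_dx, divided by the step size delta."""
    x_base = solve(c_hat, feasible)
    x_pert = solve([c + delta * g for c, g in zip(c_hat, dL_dx)], feasible)
    return [(xp - xb) / delta for xp, xb in zip(x_pert, x_base)]

# Vertices of a toy two-item selection problem
feasible = [(0, 0), (0, 1), (1, 0), (1, 1)]
g = finite_difference_grad([1.0, -1.0], dL_dx=[-20.0, 0.0], feasible=feasible)
```

DBB uses exactly these two solver calls; the step size δ plays the role of the interpolation hyperparameter that trades off the informativeness of the gradient against its faithfulness.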
In a subsequent work, Sahoo et al. (2023) propose to treat dx*(ĉ)/dĉ as a negative identity matrix while backpropagating the loss. However, they note that such an approach might lead to unstable learning for scale-invariant optimization problems such as LPs and ILPs. To counteract this effect, they suggest multiplying the cost vector with the matrix of the invariant transformation; for LPs and ILPs this can be achieved by normalizing the cost vector through projection onto the unit sphere.
Perturb-and-MAP. At this point it is worth mentioning that Domke (2010) assumes, in his proposition, that the distribution p(θ1|θ2) in (22) belongs to the exponential family of distributions (Barndorff-Nielsen, 1978). The distribution defined in (23) is not a member of the exponential family. Instead, a tempered softmax distribution belonging to the exponential family can be defined to express the mapping in the following way: p_τ(x|c) = exp(−f(x, c)/τ) / ∑_{x′∈F} exp(−f(x′, c)/τ). (25) In this case, the unnormalized probability mass at each x ∈ F is proportional to exp(−f(x, c)/τ), the exponential of the negative tempered objective value. The idea behind (25) is to assign a probability to each feasible solution such that solutions with a better objective value have a larger probability. The parameter τ affects the way in which objective values map to probabilities: when τ → 0, the distribution becomes the argmax distribution in (23); when τ → ∞, the distribution becomes uniform. In other words, the value of τ determines how drastically the probability changes with a change in objective value. Good values for τ are problem-dependent, and thus tuning τ is advised. Note that (22) deals with conditional expectations. Since for the tempered softmax distribution the conditional expectation is not always equal to the solution of the CO problem, it must be computed first in order to use the finite-difference method in (22). However, computing the probability distribution function in (25) is not tractable, as the denominator (also called the partition function) requires iterating over all feasible points in F. Instead, Papandreou and Yuille (2011) propose a novel approach, known as perturb-and-MAP, to estimate the probability using perturbations. It states that the distribution of the maximizer after perturbing the log unnormalized probability mass by i.i.d. Gumbel(0, ε) noise has the same exponential-family distribution as (25).
To make this explicit, if c̃ = c + η, where the perturbation vector η is i.i.d. Gumbel(0, ε) noise, then x*(c̃) is distributed according to (25). (26) The perturb-and-MAP framework can be viewed as a method of stochastic smoothing (Abernethy et al., 2016): a smoothed approximation of the optimization mapping is created by averaging the solutions of a set of nearby perturbed points. With the help of (26), the conditional distribution, and hence the conditional mean, can be approximated by Monte Carlo simulation.
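A minimal Monte Carlo sketch of this smoothing follows; the toy enumeration solver and problem data are ours:

```python
import math
import random

def solve(c, feasible):
    """Toy CO oracle: the feasible point minimizing the linear cost c.x."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def perturb_and_map_mean(c, feasible, eps=1.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[x*(c + eta)] with eta_i ~ Gumbel(0, eps):
    a smooth surrogate for the piecewise-constant mapping c -> x*(c)."""
    rng = random.Random(seed)
    mean = [0.0] * len(c)
    for _ in range(n_samples):
        # Gumbel(0, eps) samples by inverse transform
        eta = [-eps * math.log(-math.log(rng.random())) for _ in c]
        x = solve([ci + ei for ci, ei in zip(c, eta)], feasible)
        for i, xi in enumerate(x):
            mean[i] += xi / n_samples
    return mean

# Two competing vertices; the exact solution jumps between them, the mean does not
feasible = [(1, 0), (0, 1)]
m = perturb_and_map_mean([0.0, 0.5], feasible)
```

The estimate lies strictly inside the convex hull of the feasible points and varies smoothly with c, at the cost of Monte Carlo noise.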
Differentiable perturbed optimizers. Another approach for perturbation-based differentiation is named differentiable perturbed optimizers (DPO). It makes use of the perturb-and-MAP framework to draw samples from the conditional distribution p(x|c). In particular, the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014) is used to generate samples from p(x|c). The reparameterization trick uses a change of variables to rewrite x as a deterministic function of c and a random variable η. In this reformulation, x is still a random variable, but the randomness comes from η, which is assumed to have a density proportional to exp(−ν(η)) for a twice-differentiable function ν. Moreover, the random variable η is multiplied by a temperature parameter ε > 0, which controls the strength of the perturbation of c. In summary, first c is perturbed with the random perturbation vector η, and then the maximizer for the perturbed vector c + εη is viewed as a sample from the conditional distribution for given values of c and ε, i.e., x*_ε(c) = x*(c + εη) is considered a sample drawn from p(x|c) for a given ε. x*_ε(c) is called a perturbed optimizer. Note that, for ε → 0, x*_ε(c) → x*(c). As before, the expectation of x*_ε(c) can be estimated by Monte Carlo simulation, by sampling i.i.d. random noise vectors η(m) from the aforementioned density. The advantage is that the Monte Carlo estimate is continuously differentiable with respect to c. This Monte Carlo estimate x̄*_ε(c) can be expressed as: x̄*_ε(c) = (1/M) ∑_{m=1}^{M} x*(c + εη(m)). (27) Moreover, its derivative can likewise be estimated by Monte Carlo simulation, using the score term ν′(η(m)), where ν′ is the first-order derivative of ν. This estimate of dx*_ε(c)/dc is used to implement the backward pass. As mentioned before, if ε → 0, the estimate becomes an unbiased estimate of x*(c).
However, in practice, for low values of ε, the variance of the Monte Carlo estimator increases, leading to unstable and noisy gradients. This is in line with the smoothing-versus-accuracy trade-off mentioned before. The DPO framework can be used to differentiate any optimization problem with a linear objective; for a CO problem with a discrete feasible space, the convex hull of the discrete feasible region is considered. Furthermore, a Fenchel-Young loss function can be constructed, for which the gradient can be approximated directly from the perturbed optimizer: ∇_ĉ L_FY(ĉ; x*(c)) = x*(c) − x̄*_ε(ĉ). (29) In a later work, Dalle et al. (2022) extend the perturbation approach by considering multiplicative perturbations. This is useful when the cost parameter vector is restricted to be non-negative, as in variants of the shortest path problem. Stochastic softmax tricks (SST), a framework generalizing the Gumbel-softmax distribution, can also be viewed as an extension of the DPO framework; it proposes differentiable methods for sampling from more complex categorical distributions.
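The Fenchel-Young gradient can be sketched with the same toy machinery. The enumeration solver and instance are ours, and the sign convention below is the one for a minimization problem; for a maximization problem the sign flips:

```python
import math
import random

def solve(c, feasible):
    """Toy CO oracle: the feasible point minimizing the linear cost c.x."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def fenchel_young_grad(c_hat, x_true, feasible, eps=1.0, n_samples=2000, seed=0):
    """Monte Carlo sketch of the Fenchel-Young loss gradient: the
    ground-truth solution minus the expected perturbed optimizer
    (minimization convention)."""
    rng = random.Random(seed)
    x_bar = [0.0] * len(c_hat)
    for _ in range(n_samples):
        eta = [-math.log(-math.log(rng.random())) for _ in c_hat]  # Gumbel(0, 1)
        x = solve([c + eps * e for c, e in zip(c_hat, eta)], feasible)
        for i, xi in enumerate(x):
            x_bar[i] += xi / n_samples
    return [xt - xb for xt, xb in zip(x_true, x_bar)]

# The prediction ranks the two vertices opposite to the ground truth (0, 1)
feasible = [(1, 0), (0, 1)]
g = fenchel_young_grad([0.0, 1.0], x_true=(0, 1), feasible=feasible)
```

A gradient step ĉ ← ĉ − αg raises the predicted cost of the competing vertex relative to the ground-truth one, without ever differentiating through the solver.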
Implicit maximum likelihood estimation (I-MLE). Niepert et al. (2021) also use the perturb-and-MAP framework. However, they do not sample noise from the Gumbel distribution; rather, they report better results when the noise η_γ is sampled from a Sum-of-Gamma distribution with hyperparameter γ. Combining the finite-difference approximation (22) with the perturb-and-MAP framework, the gradient takes the following form: ∇_ĉ L ≈ (1/ε) ( x*(ĉ + ε (dL/dx)|_{x=x*(ĉ)} + η_γ) − x*(ĉ + η_γ) ), (30) where ε > 0 is a temperature parameter which controls the strength of the noise perturbation. Clearly, (30) turns into (24) when there is no noise perturbation, i.e., if η_γ = 0.
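In code, this is the earlier finite-difference scheme applied at a noise-perturbed cost vector. The sketch below (toy solver and data ours) uses plain Gumbel noise for brevity, whereas Niepert et al. (2021) recommend Sum-of-Gamma noise:

```python
import math
import random

def solve(c, feasible):
    """Toy CO oracle: the feasible point minimizing the linear cost c.x."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def imle_grad(c_hat, dL_dx, feasible, eps=0.5, seed=0):
    """Single-sample I-MLE-style gradient estimate: perturb the predicted
    costs with noise, then take the finite difference of two solver calls.
    In practice the estimate is averaged over several noise samples."""
    rng = random.Random(seed)
    eta = [-math.log(-math.log(rng.random())) for _ in c_hat]  # Gumbel(0, 1)
    c_noisy = [c + e for c, e in zip(c_hat, eta)]
    c_shift = [c + eps * g for c, g in zip(c_noisy, dL_dx)]
    x_shift = solve(c_shift, feasible)
    x_noisy = solve(c_noisy, feasible)
    return [(a - b) / eps for a, b in zip(x_shift, x_noisy)]

feasible = [(0, 0), (0, 1), (1, 0), (1, 1)]
# With a zero incoming gradient the two solver calls coincide and the
# estimate is exactly zero, mirroring the noise-free case
g_zero = imle_grad([1.0, -1.0], dL_dx=[0.0, 0.0], feasible=feasible)
```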
Discussion. One major advantage of the methodologies explained in this section is that they call the optimization solver as a blackbox oracle and only use the solution it returns for gradient computation. In essence, these techniques are not concerned with how the CO problem is solved. Users can utilize any technique of their choice, such as constraint programming (CP) (Rossi et al., 2006), Boolean satisfiability (SAT) (Gomes et al., 2008), or linear programming (LP) and integer linear programming (ILP), to solve the CO problem.

Differentiation of Surrogate Loss Functions
The methodologies explained in the preceding sections can be viewed as implementations of differentiable optimization layers, which solve the optimization problem in the forward pass and return useful approximations of dx*(ĉ)/dĉ in the backward pass. Consequently, those methodologies can be used to introduce optimization layers anywhere in the network, and can be combined with arbitrary loss functions. In contrast, the methodologies that will be introduced next can only be used to differentiate regret (3), a specific task loss. Hence, models can only be trained in an end-to-end fashion using these techniques when the CO problem occurs in the final layer of the model. Also note that computing the regret requires both the ground-truth cost vector c and the ground-truth solution x*(c). If c is observed, x*(c) can be computed. However, if only x*(c) is observed, c cannot be recovered directly. Hence, the techniques discussed next are not suitable when the true cost vectors c are not observed in the training data.
Smart "Predict, Then Optimize". Elmachtoub and Grigas (2022) developed Smart "Predict, Then Optimize" (SPO), a seminal work in DFL. As the gradient of the regret with respect to the cost vector ĉ is zero almost everywhere, SPO instead uses a surrogate loss function whose subgradients are useful in training. They start by proposing a convex upper bound of the regret, which they call the SPO+ loss.
Then, they derive the following useful subgradient of the SPO+ loss: 2 ( x*(c) − x*(2ĉ − c) ) ∈ ∂_ĉ L_SPO+(ĉ, c). (32) This subgradient can be used in place of ∇_ĉ Regret(x*(ĉ), c) to update the model parameters in the backward pass. From a theoretical point of view, the SPO+ loss has the Fisher consistency property with respect to the regret under certain distributional assumptions. A surrogate loss function satisfies the Fisher consistency property if the function that minimizes the surrogate loss also minimizes the true loss in expectation (Zou et al., 2008). Concretely, this means that minimizing the SPO+ loss corresponds to minimizing the regret in expectation. When training ML models with a finite dataset, another property of considerable interest is risk bounds (Massart and Nédélec, 2006); Liu and Grigas (2021) derive such risk bounds for the SPO+ loss. The SPO framework is applicable not only to LPs, but to any CO problem where the cost parameters appear linearly in the objective function, including QPs, ILPs and MILPs. A follow-up study empirically investigated how the framework performs on ILP problems. However, as these problems are much more computationally expensive to solve than the ones considered in (Elmachtoub and Grigas, 2022), the regular SPO methodology was compared with a variant in which the solver is instead applied to the significantly cheaper LP relaxations of the ILPs that must be solved during training. These LP relaxations are obtained by considering the continuous relaxations of the ILPs, i.e., variants of the ILPs in which the integrality constraints are dropped. Using the LP relaxations significantly sped up training without any cost: no significant difference in the final achieved regret was observed between the two approaches, with both performing better than the prediction-focused approach. However, one should be cautious in generalizing this result across different problems, as it might depend on the integrality gap between the ILP and its LP relaxation.
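The SPO+ subgradient requires just two solver calls per training point. The enumeration solver and toy instance below are ours:

```python
def solve(c, feasible):
    """Exact oracle for the toy problem: minimize c.x over the feasible set."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def spo_plus_subgrad(c_hat, c_true, feasible):
    """Subgradient of the SPO+ loss for a minimization problem:
    2 * (x*(c_true) - x*(2*c_hat - c_true))."""
    x_true = solve(c_true, feasible)
    x_spo = solve([2 * ch - ct for ch, ct in zip(c_hat, c_true)], feasible)
    return [2 * (xt - xs) for xt, xs in zip(x_true, x_spo)]

# Two unit vertices; the predicted costs rank them opposite to the true costs
feasible = [(1, 0), (0, 1)]
g = spo_plus_subgrad(c_hat=[2.0, 1.0], c_true=[1.0, 3.0], feasible=feasible)
```

A descent step along −g lowers the predicted cost on the component selected by the true optimum, pushing the prediction to rank the solutions as c does.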
Next, within this category, a different type of DFL technique is surveyed: techniques in which the surrogate loss functions are supposed to reflect decision quality, but whose computation does not involve solving the CO problem, thereby avoiding the zero-gradient problem.
Noise contrastive estimation. One such approach is introduced by Mulamba et al. (2021). Although their aim is still to minimize regret, the computation of ∇_ĉ Regret(x*(ĉ), c) is avoided by using a surrogate loss function. In their work, the CO problem is viewed from a probabilistic perspective, as in (25). However, instead of maximum likelihood estimation, the noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) method is adopted. NCE has been applied extensively in applications such as language modeling (Mnih and Teh, 2012), information retrieval (Huang et al., 2013) and entity linking (Gillick et al., 2019). Its basic idea is to learn to discriminate between data coming from the true underlying distribution and data coming from a noise distribution. In the context of DFL, this involves contrasting the likelihood of the ground-truth solution x*(c) with that of a set of negative examples S. In other words, the following ratio is maximized: p_τ(x*(c)|ĉ) / p_τ(x′|ĉ), (33) where x′ ∈ S is a negative example. Because the probability p_τ(x*(c)|ĉ) is defined as in (25), when τ = 1, maximizing (33) corresponds to minimizing the following loss: L_NCE(ĉ, c) = ∑_{x′∈S} ( f(x*(c), ĉ) − f(x′, ĉ) ). (34) In other words, this approach learns to predict a ĉ for which the ground-truth solution x*(c) achieves a good objective value, and for which the other feasible solutions x′ achieve worse objective values. Note that when f(x*(c), ĉ) ≤ f(x′, ĉ) for all x′ ∈ F, it holds that x*(c) = x*(ĉ), and thus the regret is zero. Also note that computing L_NCE(ĉ, c) does not involve computing x*(ĉ), circumventing the zero-gradient problem.
As an alternative to NCE, Mulamba et al. (2021) also introduce a maximum a posteriori (MAP) approximation, in which they only contrast the ground-truth solution with the most probable negative example from S according to the current model. Note that whenever x*(ĉ) ∈ S, it holds that L_MAP(ĉ, c) = f(x*(c), ĉ) − f(x*(ĉ), ĉ). This is also known as self-contrastive estimation (SCE) (Goodfellow, 2015), since the ground-truth is contrasted with the most likely output of the current model itself.
Also note that for optimization problems with a linear objective, the losses become L_NCE(ĉ, c) = ∑_{x′∈S} ĉ⊤(x*(c) − x′) and L_MAP(ĉ, c) = ĉ⊤(x*(c) − x̄), where x̄ = argmin_{x∈S} f(x, ĉ). In order to prevent the model from simply learning to predict ĉ = 0, alternate loss functions are proposed for these kinds of problems.
Construction of S. Forming S by sampling points from the feasible region F is a crucial part of using the contrastive loss functions. To this end, Mulamba et al. (2021) propose to construct S by caching all the optimal solutions in the training data, which is why they call S a 'solution cache'. During training, more feasible points are gradually added to S by solving the CO problem for some of the predicted cost vectors. To limit the computational cost, however, the solver is not called for every predicted cost vector: whether to solve for a given prediction is decided by a biased coin toss with probability p_solve. Intuitively, the p_solve hyperparameter determines the proportion of instances for which the CO problem is solved during training. Experimentally, p_solve = 5% has been reported to often be adequate. Since solving the CO problems represents the major computational bottleneck in DFL training, this translates to reducing the computational cost by approximately 95%.
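A compact sketch of the solver-free NCE loss for linear objectives and the p_solve cache-growing rule follows; the toy data and helper names are ours:

```python
import random

def solve(c, feasible):
    """Toy CO oracle: the feasible point minimizing the linear cost c.x."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def nce_loss(c_hat, x_true, cache):
    """Linear-objective NCE loss: sum over negatives of c_hat.(x_true - x').
    No solver call is needed; only the cached solutions are used."""
    return sum(sum(ch * (xt - xn) for ch, xt, xn in zip(c_hat, x_true, x_neg))
               for x_neg in cache)

def maybe_grow_cache(c_hat, cache, feasible, p_solve=0.05, rng=random):
    """Biased coin toss: with probability p_solve, solve for the predicted
    costs and add the resulting solution to the cache."""
    if rng.random() < p_solve:
        x = solve(c_hat, feasible)
        if x not in cache:
            cache.append(x)

cache = [(1, 0), (0, 1)]  # optimal solutions seen in the training data
loss = nce_loss([1.0, 3.0], x_true=(1, 0), cache=cache)
```

The loss is negative here because the true solution already beats every cached negative under ĉ; gradient descent on ĉ keeps widening that margin.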
Approximation of a solver by solution cache. Furthermore, Mulamba et al. (2021) propose a solver-free training variant for any methodology that treats the optimization solver as a blackbox oracle, including the aforementioned I-MLE, DBB and SPO. In this solver-free implementation, solving the optimization problem is substituted with a cache lookup, where the minimizer within the cache S ⊂ F is used as a proxy for the solution to the optimization problem (i.e., the minimizer within F). This significantly reduces the computational cost, as solving an optimization problem is replaced by a linear search within a limited cache, at the price of an approximate solution. Such an approximation can be useful when the optimization problem takes long to solve.
DFL as a learning to rank (LTR) problem. In a later work, Mandi et al. (2022) observe that L_NCE (34) can be derived by formulating DFL as a pairwise learning to rank task (Joachims, 2002). The learning to rank task consists of learning the implicit ordering over the solutions in S induced by the objective values the solutions achieve with respect to c. In other words, it involves learning to predict a ĉ that ranks the solutions in S similarly to how c ranks them. In the pairwise approach, x*(c) and any x′ ∈ S are treated as a pair, and the model is trained to predict ĉ such that the ordering of each pair is the same for c and ĉ. The loss is considered to be zero if ĉ⊤x*(c) is smaller than ĉ⊤x′ by at least a margin Θ > 0. The pairwise loss is formally defined as: L_pairwise(ĉ, c) = ∑_{x′∈S} max(0, Θ + ĉ⊤(x*(c) − x′)). (38) Another loss function is formulated by considering the difference of the differences between the objective values at the true optimum x*(c) and a non-optimal x′, with c and ĉ as the parameters.
Further, motivated by the listwise learning to rank task (Cao et al., 2007), a loss function is proposed in (Mandi et al., 2022) in which the ordering of all items in S is considered, rather than the ordering of pairs of items. Cao et al. (2007) define this listwise loss based on a top-one probability measure; the top-one probability of an item is the probability of it being the best of the set. Such a probabilistic interpretation has already been defined in Section 3.3, and Mandi et al. (2022) make use of the tempered softmax probability defined in (25). Recall that p_τ(x|c) can be interpreted as a probability measure of x ∈ F being the minimizer of f(x, c) in F for a given c. However, as mentioned before, direct computation of p_τ(x|c) requires iterating over all feasible points in F, which is intractable. Therefore, in (Mandi et al., 2022), the probability is computed with respect to S ⊂ F. This probability measure is finally used to define a listwise loss: the cross-entropy between p_τ(x|c) and p_τ(x|ĉ), the distributions obtained for the ground-truth c and the predicted ĉ. This can be written as: L_listwise(ĉ, c) = − ∑_{x∈S} p_τ(x|c) log p_τ(x|ĉ). (40) The main advantage of (34), (35), (38), (39) and (40) is that they are differentiable and can be computed directly by any neural network library via automatic differentiation. Also note that the computation and differentiation of these loss functions are solver-free, i.e., the optimization problem need not be solved to compute the loss or its derivative.
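The listwise loss can be sketched directly over a solution cache; the toy data and function names below are ours, and a linear objective is assumed:

```python
import math

def listwise_loss(c_hat, c_true, cache, tau=1.0):
    """Sketch of the listwise loss: cross-entropy between the tempered
    softmax distributions over the solution cache S induced by the true
    and the predicted cost vectors."""
    def top_one_probs(c):
        weights = [math.exp(-sum(ci * xi for ci, xi in zip(c, x)) / tau)
                   for x in cache]
        total = sum(weights)
        return [w / total for w in weights]
    p_true = top_one_probs(c_true)
    p_hat = top_one_probs(c_hat)
    return -sum(pt * math.log(ph) for pt, ph in zip(p_true, p_hat))

cache = [(1, 0), (0, 1)]
l_aligned = listwise_loss([1.0, 3.0], [1.0, 3.0], cache)   # same ranking
l_reversed = listwise_loss([3.0, 1.0], [1.0, 3.0], cache)  # opposite ranking
```

The loss is computed entirely over the cache, so no solver call is needed; predictions that rank the cached solutions like the true costs incur a lower loss.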
Learning efficient surrogate solvers. Another research direction without optimization in the loop is based on reducing the computational cost associated with repeatedly solving optimization problems, by learning efficiently computable and differentiable surrogate losses that approximate and replace the true task loss. In (Shah et al., 2022), a surrogate of the regret function is learned via parametric local losses. Due to the difficulty of learning a single convex surrogate function that estimates regret globally, a convex local surrogate is learned for each data sample in training. By design, the surrogate losses are automatically differentiable, and thus they eliminate the need for a differentiable optimization solver.

Discussion
So far, this section has provided an extensive overview of different DFL methodologies. For the ease of the reader, a summary of some of the key DFL techniques discussed so far has been provided, indicating whether each technique is compatible with any generic task loss. Techniques termed implementations of differentiable optimization layers can be embedded in any stage of an NN architecture. The other techniques are applicable when optimization is the final stage of the pipeline (such as in Predict-Then-Optimize problem formulations) and a particular loss (most often regret) is used as the task loss.

Other Aspects of Decision-Focused Learning
In the following, some aspects related to DFL, which have not been discussed in this manuscript, will be highlighted. To begin with, it should be noted that certain CO problems may have multiple non-unique optimal solutions for a given cost vector. This can occur when the cost vector of an LP is parallel to one of the faces of the feasible polyhedron. Moreover, problems involving symmetric graphs often exhibit multiple optimal solutions, especially when a solution can be transformed into other solutions through automorphisms (Weisstein, 2000). It is important to note that if the predicted cost vector has multiple non-unique optimal solutions, each of these solutions may have a different value of regret. In such scenarios, Elmachtoub and Grigas (2022) propose to consider the worst-case regret. To do so, let X*(ĉ) denote the set of optimal solutions for ĉ; the worst-case regret can then be defined as max_{x∈X*(ĉ)} f(x, c) − f(x*(c), c). Having addressed the possibility of multiple non-unique optimal solutions in a CO problem, the focus now turns to other important facets of DFL.
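For a problem small enough to enumerate, the worst-case regret can be computed directly. The toy instance below (ours) uses ĉ = 0, under which every vertex is optimal, the extreme case of non-uniqueness:

```python
def worst_case_regret(c_hat, c_true, feasible, tol=1e-9):
    """Worst-case regret over the (possibly non-unique) optimal set of
    c_hat: the worst true cost among the predicted optima, minus the
    true optimal cost."""
    def obj(c, x):
        return sum(ci * xi for ci, xi in zip(c, x))
    best_hat = min(obj(c_hat, x) for x in feasible)
    optimal_set = [x for x in feasible if obj(c_hat, x) <= best_hat + tol]
    best_true = min(obj(c_true, x) for x in feasible)
    return max(obj(c_true, x) for x in optimal_set) - best_true

feasible = [(1, 0), (0, 1)]
r = worst_case_regret([0.0, 0.0], [1.0, 3.0], feasible)
```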

Prediction-focused vs. Decision-focused learning
DFL methodologies are expected to perform better than a PFL approach in Predict-Then-Optimize problems, as the ML model is directly trained to achieve low regret. However, as discussed before, the implementation of DFL poses significant challenges. In fact, practitioners may be tempted to resort to a PFL approach to circumvent the computational costs associated with DFL. To encourage practitioners to adopt DFL methodologies, it is crucial to investigate scenarios where DFL methodologies outperform the PFL approach. To this end, Elmachtoub et al. (2023) conduct a theoretical comparison of the limiting distributions of the optimality gaps between the two approaches in the context of stochastic optimization. They show that the PFL approach, which does not consider optimization while training the model, asymptotically outperforms the integrated prediction and optimization approach employed by DFL methodologies if the underlying prediction model is well-specified. This is intuitive, as a well-specified model tends to produce highly accurate predictions, which benefits the PFL approach. In such cases, DFL methodologies might perform worse than PFL, since training in DFL involves approximate gradients (because the true gradient is zero almost everywhere), whereas the gradient is well-defined for a PFL approach. On the other hand, if the model is not well-specified, a PFL approach performs suboptimally compared to the DFL approach. Hence, it is recommended to use DFL when there exists aleatoric or epistemic uncertainty. As most real-world settings include various sorts of uncertainty, both aleatoric and epistemic, DFL methodologies are expected to outperform the PFL approach. In a separate work, Cameron et al. (2022) show that the suboptimality of PFL becomes more pronounced in the presence of correlations between the predicted parameters.

Alternatives to gradient-based decision-focused learning
The methodologies explained so far implement DFL via gradient-descent training, which is the go-to approach for training neural networks. However, other machine learning frameworks exist, such as tree-based methods, which do not require gradient-based training. To avoid the problem of zero-valued gradients altogether, several works have considered alternatives to gradient-based learning instead.
In SPO Trees (SPOTs) (Elmachtoub et al., 2020), the predictive model is a decision tree or an ensemble of decision trees. Such models can be learned by recursive partitioning with respect to the regret directly, and thus do not require the SPO+ surrogate loss function introduced by Elmachtoub and Grigas (2022). Alternatively, the tree learning problem can be posed as a MILP and solved by an off-the-shelf solver, in the same spirit as Jeong et al. (2022). Jeong et al. (2022) formulate the problem of minimizing regret as a mixed-integer linear program (MILP) when the predictive model is linear. They start from the bilevel optimization formulation introduced in (6a) and (6b). First, the transition points where the solution of the lower-level program (6b) changes are identified; then the solution space is exhaustively partitioned, and each partition is annotated with its solution. This paves the way to constructing a MILP formulation of the outer program (6a), which is solved to learn the parameters ω of the linear predictive model. The resulting model is guaranteed to be globally optimal, which is not the case for gradient-based methods that might get stuck in a local optimum. However, their method is limited to linear ML models and optimization problems that are binary MILPs. Another line of work also considers linear ML models and represents the objective function of the CO problem as a piecewise-linear function of the ML parameters. In this technique, the ML parameters are updated via a coordinate descent algorithm, where one component of the cost vector is updated at a time to minimize the regret while keeping the other components fixed. This requires identifying, for each component of the cost vector, the transition points at which the regret changes; for CO problems that can be solved by dynamic programming, these transition points can themselves be identified by dynamic programming. In a later work, Guler et al. (2022) extend this technique by employing a 'divide-and-conquer' algorithm to identify the transition points for CO problems whose objective function is a bilinear function of the decision variables and the predicted parameters. This development generalizes the previous work to a much broader class of CO problems and offers a substantial speed improvement. The 'branch & learn' approach, which considers CO problems that can be solved by recursion, also extends this technique.

Predicting parameters in the constraints
The majority of the works in DFL aim to predict parameters in the objective function and assume that the feasible space is precisely known. However, in many applications the unknown parameters occur in the constraints as well as in the objective. When the parameters in the constraints are predicted and decisions are prescribed using the predicted parameters, one major issue is that the prescribed decisions might turn out to be infeasible with respect to the true parameters. In this case, the task loss should not only minimize the suboptimality of the prescribed decisions, but also penalize prescribed decisions that turn out to be infeasible. Hence, designing DFL algorithms suitable for such problems entails a few additional considerations. The first consideration is quantifying the extent of infeasibility when the prescribed decisions are infeasible with respect to the true parameters. In this regard, the notion of post-hoc regret has been introduced, wherein a non-negative penalty is added to the regret to account for the conversion of infeasible solutions into feasible ones. This idea of a penalty function bears a fundamental resemblance to the concept of recourse actions in stochastic programming (Ruszczyński and Shapiro, 2003). In a later work, Hu et al. (2023a) apply the 'branch & learn' approach to minimize post-hoc regret in CO problems solvable by recursion.
The second consideration is formulating a task loss that strikes a balance between the measure of suboptimality and the measure of infeasibility. The next consideration is computing the gradients of this task loss with respect to the parameters in the constraints. Some of the techniques discussed in Section 3.1 can be utilized for this purpose. For example, the gradient can be obtained by solver unrolling: Tan et al. (2019) compute the gradient by unrolling an LP. As the parameters in the constraints are also present in the KKT conditions (13), it is possible to compute gradients for optimization problems with differentiable constraint functions by differentiating the KKT conditions using the techniques discussed in Section 3.1; it has been shown how the gradient can be computed in this way for packing and covering LPs. For LPs, Tan et al. (2020) provide an empirical risk minimization formulation considering both the suboptimality of the prescribed decisions and the feasibility of the true optimal decisions. This formulation takes the form of a non-linear optimization program, and they propose to compute the derivative by considering its sequential quadratic programming (SQP) approximation.
Computing the gradients of the task loss with respect to the parameters in the constraints is particularly challenging for combinatorial optimization problems, which often involve a discrete feasible space. For such problems, it may happen that no constraint is active at the optimal point, so slight changes to the parameters in the constraints do not change the optimal solution, again leading to the problem of zero gradients. Hence, devising meaningful gradients for backpropagation is a major challenge for combinatorial optimization problems. Paulus et al. (2021) develop a differentiable optimization layer for ILPs, which takes the downstream gradient of the solution as input and returns directions for updating the parameters in the backward pass. The parameters are updated along directions that minimize the Euclidean distance between the solution under the updated parameters and the solution updated with the downstream gradient. For ILPs, Nandwani et al. (2023) view the task of constraint learning through the lens of learning hyperplanes, as is common in classification tasks. Such an approach requires negative samples; however, the negative samples in this setting must also include infeasible points, which differs from the framework proposed by Mulamba et al. (2021).

Model robustness in decision-focused learning
The issue of model robustness arises often in deep learning. As has been shown in many works, it is often possible for malicious actors to craft inputs to a neural network in such a way that the output is manipulated (evasion attacks) (Goodfellow et al., 2014), or to generate training data which cause adverse effects on the performance of the trained model (poisoning attacks). Since DFL models are machine learning models, some of these adversarial settings also apply to DFL.
Evasion attacks, despite being the most commonly studied adversarial attacks, do not generalize straightforwardly to DFL, since they inherently pertain to classification models with finite output spaces. On the other hand, Kinsey et al. (2023) show that effective poisoning attacks can be mounted against DFL models. The paper shows that while such attacks can be effective, they are computationally expensive due to the optimization problem that must be repeatedly solved to form the attacks. It is also demonstrated that poisoning attacks designed against two-stage models can be transferred to fully integrated DFL models.
Separately, Johnson-Yu et al. (2023) study robustness of decision-focused learning under label noise. The paper provides bounds on the degradation of regret when test-set labels are corrupted by noise relative to those of the training set. An adversarial training scheme is also proposed to mitigate this effect. The robust training problem is equivalent to finding the equilibrium solution to a Stackelberg game, in which a figurative adversary applies label noise that is optimized to raise regret, while the main player seeks model parameters that minimize regret.

Stochastic optimization
Settings in decision-focused learning based on stochastic optimization models are studied by Donti et al. (2017). In contrast to more typical settings, the downstream decision model is considered to be a stochastic optimization problem, for which it is only possible to predict parameters of a random distribution that models the coefficients of an optimization problem, such as the mean and variance of load demands in a power scheduling problem. Their work shows how such problems can be converted to DFL with deterministic decision models and solved using the techniques described in this article. To this end, it also introduces an effective technique for approximating the derivatives through arbitrary convex optimization problems, by forming and differentiating their quadratic programming approximations, as computed by sequential quadratic programming.
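The key idea of reducing a stochastic decision model to a deterministic one, once the distribution's parameters are predicted, can be illustrated with a classic newsvendor sketch (an illustrative example chosen here, not the power scheduling problem of Donti et al.): if an ML model predicts the mean and standard deviation of demand, the expected-cost-minimizing order quantity is a deterministic quantile of that distribution.

```python
from statistics import NormalDist

def newsvendor_decision(mu, sigma, cost_under, cost_over):
    """A stochastic newsvendor problem becomes deterministic once the
    demand distribution is parameterized: the expected-cost-minimizing
    order quantity is the critical-fractile quantile of N(mu, sigma)."""
    fractile = cost_under / (cost_under + cost_over)
    return NormalDist(mu, sigma).inv_cdf(fractile)

# An ML model would predict (mu, sigma) from features; here they are fixed.
q = newsvendor_decision(mu=100.0, sigma=10.0, cost_under=3.0, cost_over=1.0)
print(round(q, 2))
```

Because the decision is now a deterministic, differentiable function of the predicted distribution parameters, it can be trained end-to-end with the DFL techniques described in this article.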

Problems other than optimization problems
Furthermore, we believe that DFL can be extended to encompass problems beyond optimization problems, thereby broadening its applicability. For instance, to integrate symbolic reasoning into neural network architectures, Wang et al. (2019) make use of MAXSAT solvers and perform end-to-end training of the neural network by differentiating through the semidefinite programming (SDP) relaxations of the MAXSAT problems. Wilder et al. (2019b) consider K-means clustering in a graph as the optimization problem, i.e., the optimization problem in their case is to cluster the nodes of a given graph into K segments. They embed K-means clustering as a layer in a neural network architecture by differentiating through the clustering layer. Wang et al. (2021) further extend DFL to sequential decision-making problems, where the decision-making problems are formulated as Markov decision processes (MDPs). In such cases, the DFL problem deals with the challenge of predicting the unknown parameters of the MDPs.
Active learning algorithm for DFL
Active learning concerns ML problems where labeled data are scarce or expensive to obtain. To address the challenge of limited training data, active learning algorithms choose the most informative instances for labeling (Settles, 2009). Active learning has also been studied in the DFL paradigm. To choose datapoints for which to request a label, the notion of 'distance to degeneracy' (El Balghiti et al., 2019) has been proposed. Distance to degeneracy measures how far the predicted cost vector is from the set of cost vectors that have multiple optimal solutions. The argument is that if the distance to degeneracy is higher at a datapoint, there is more certainty regarding the solution of the CO problem; hence, the label of a datapoint is acquired only if its distance to degeneracy is lower than a threshold.
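A rough proxy for this acquisition rule can be sketched when the feasible set is a small finite list of candidate solutions (a simplification for illustration; the actual distance-to-degeneracy computation in El Balghiti et al. (2019) is more involved): the objective gap between the best and second-best candidates under the predicted costs plays the role of the distance, and a label is requested only when that gap is small.

```python
def degeneracy_margin(c_hat, candidates):
    """Proxy for 'distance to degeneracy' when the feasible set is a small
    finite list of candidate solutions: the objective gap between the best
    and second-best candidates under the predicted cost vector c_hat."""
    objs = sorted(sum(ci * xi for ci, xi in zip(c_hat, x)) for x in candidates)
    return objs[1] - objs[0]

candidates = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]

confident = degeneracy_margin([1.0, 5.0, 1.0], candidates)   # clear winner
ambiguous = degeneracy_margin([1.0, 1.1, 1.0], candidates)   # near-tie

threshold = 0.5
print(confident > threshold, ambiguous < threshold)  # query a label only when the margin is small
```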

Multi-task decision-focused learning
In most DFL works, a single task is considered. For instance, in the shortest path benchmark considered by Elmachtoub and Grigas (2022), the grid structure and the start and end nodes are the same in all instances. However, one often has to deal with multiple tasks at once, in which case it would be convenient to make decision-focused predictions without having to train a separate model for each task. A first step in this direction was recently taken in (Tang and Khalil, 2023a). This paper proposes a way of training a model in a decision-focused way with respect to multiple tasks at once. They consider two kinds of architectures. The first is a regular multi-layer perceptron that outputs a single vector ĉ which is used in the different tasks. The resulting task losses then get aggregated to inform the update to the weights ω, i.e., the weights ω are trained to produce a ĉ that generally works well across the different tasks considered. The second architecture is a multi-headed one, consisting of one or more shared first layers, followed by a dedicated head for every task. This means that a different vector ĉ_i is produced for every task. Their results show that a single model can make effective decision-focused predictions for multiple tasks at once, and that this is particularly beneficial when little training data is available. However, a remaining limitation is that the model still cannot be trained with the aim of generalizing to new tasks.
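The shape of the multi-headed architecture can be sketched in a few lines (a structural sketch only, with hand-set weights; Tang and Khalil (2023a) train such models with gradient-based DFL losses): a shared trunk maps features to a common representation, and each task's head maps that representation to its own predicted cost vector.

```python
class MultiHeadPredictor:
    """Sketch of the multi-headed architecture: a shared linear trunk
    followed by one dedicated linear head per task, so each task t gets
    its own predicted cost vector from the same features."""
    def __init__(self, shared, heads):
        self.shared = shared          # weight matrix of the shared layer
        self.heads = heads            # one weight matrix per task

    def _matvec(self, W, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

    def predict(self, z):
        h = self._matvec(self.shared, z)      # shared representation
        return [self._matvec(head, h) for head in self.heads]

shared = [[1.0, 0.0], [0.0, 1.0]]          # identity trunk for the sketch
heads = [[[2.0, 0.0]], [[0.0, 3.0]]]       # two tasks, scalar costs
model = MultiHeadPredictor(shared, heads)
print(model.predict([1.0, 1.0]))  # -> [[2.0], [3.0]]
```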

Applications of Decision-Focused Learning
The Predict-Then-Optimize problem occurs in many real-world applications: optimal decisions can be found by solving CO problems and, due to the presence of uncertainty, some parameters of the CO problems must be estimated. Having seen the development of DFL for Predict-Then-Optimize problems in the preceding section, practical uses of DFL in various application domains are presented below. As the DFL techniques reviewed in Section 3 predict cost parameters, the applications presented below focus on the task of predicting only the cost parameters.
Computer vision. The DBB framework (Pogančić et al., 2020) (reviewed in Section 3.3) has been used by Rolinek et al. (2020) for differentiating rank-based metrics such as precision and recall and by Rolínek et al. (2020) and Kainmueller et al. (2014) for differentiating bipartite matching in deep graph and multi-graph matching problems respectively in the application of semantic keypoint matching of images.
Fair Learning to Rank. In learning to rank (LTR), a machine learning model must produce rankings of documents in response to user search queries, in which those most relevant to a given query are placed in the highest ranking positions. In this setting, the relevance of documents to queries is often measured empirically by historical user click rates (Cao et al., 2007). In fair learning to rank (FLTR), this relevance-based matching must be performed subject to strict constraints on the relative exposure between predefined groups. Due to the difficulty of enforcing such constraints on the outputs of a machine learning model, many FLTR frameworks resort to a two-stage approach in which prediction of query-document relevance scores is learned by a typical LTR model without constraints on fairness of exposure. At test time, the predicted relevance scores inform the objective of a separate fair ranking optimization program (Singh and Joachims, 2018). Kotary et al. (2021) use DFL to unify the prediction of relevance scores with the subsequent optimization of fair rankings, in an end-to-end model trained by SPO which learns to map user queries directly to the fair ranking policies that optimize user relevance. The result is a FLTR model which outperforms previous penalty-based models in terms of both user relevance and fairness, with the ability to directly control their trade-offs by modifying the fairness constraints of the optimization layer.
Route optimization. Ferber et al. (2023) present an interesting application, where DFL is used to combat the challenge of wildlife trafficking. They consider the problem of predicting the flight trajectory of traffickers based on a given source and destination airports pair. It is framed as a shortest path problem in a graph, where each node is an airport. In the prediction stage, the probability of using a directed edge (i, j) to leave the node i is predicted. In the optimization stage, the most likely path from the source to the destination is found by solving a shortest path problem where the negative log probabilities are used as edge weights. In this Predict-Then-Optimize formulation, the probabilities are predicted via DFL, using the DBB framework for gradient computation.
Solving a shortest path problem with negative log probabilities as edge weights has also been explored by Mandi et al. (2021). There, the objective is to prescribe the most preferred routes in a capacitated vehicle routing problem (CVRP) (Toth and Vigo, 2015) for last-mile delivery applications. A high probability value for the edge (i, j) indicates that it is the preferred edge for leaving node i. However, they do not observe any advantage of the DFL paradigm over the PFL paradigm and attribute this to the lack of training instances (fewer than 200). DFL is also used for last-mile delivery by Chu et al. (2021), where the objective is instead to minimize total travel time: in the prediction stage, the travel times of all edges are predicted, and in the optimization stage, the CVRP is solved to minimize the total travel time. The underlying model is trained using the SPO framework to directly minimize the total travel time.
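The trick shared by these routing applications can be sketched directly (a minimal illustration with a hypothetical 4-node graph, not data from the cited works): maximizing the product of predicted edge probabilities along a path is equivalent to minimizing the sum of their negative logs, so an ordinary shortest path solver recovers the most likely path.

```python
import heapq
from math import log

def most_likely_path(probs, source, target):
    """Most-likely path via Dijkstra on weights -log p(edge): maximizing the
    product of edge probabilities equals minimizing the sum of -log probs."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, p in probs.get(u, {}).items():
            nd = d - log(p)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical predicted transition probabilities for a 4-node graph.
probs = {"A": {"B": 0.9, "C": 0.1}, "B": {"D": 0.8}, "C": {"D": 0.9}}
print(most_likely_path(probs, "A", "D"))
```

In the DFL setting, the probabilities themselves are the outputs of the trained predictive model, so the task loss is evaluated on the resulting path.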
Maritime transportation. The inspection of ships by port state control has been framed as a Predict-Then-Optimize problem by Yang et al. (2022). Due to the limited number of available personnel, the aim is to identify non-compliant ships with high detention risk beforehand and select those ships for inspection. A ship can be found non-compliant by port state control in multiple categories; if a ship is found non-compliant in a category, the deficiency number for that category is recorded as one. In the prediction stage, a linear model is built to estimate the deficiency numbers of the ships in all categories, and in the optimization stage, a CO problem is solved to select the ships maximizing the total number of deficiencies. Due to the large scale of the optimization problem, training in the SPO framework is not practical; therefore, they employ a pairwise-comparison-based loss function, similar to Eq. (38), to implement DFL. Ship maintenance activities by ship owners have been framed as Predict-Then-Optimize problems by Tian et al. (2023). Owners have to schedule regular maintenance activities to remain compliant. However, as maintenance activities are expensive, the objective is to identify categories that may warrant immediate detention. To do so, in the prediction stage, a random forest model is built to predict the deficiency number (likelihood of non-compliance) for each category. In the optimization stage, a CO problem considering maintenance cost and detention cost determines whether a maintenance activity should be scheduled for each category. The random forest models are trained to directly minimize regret using SPOTs (Elmachtoub et al., 2020).
Scheduling. Wahdany et al. (2023) provide a use case of DFL in a renewable power system application. In their work, the prediction stage involves generating wind power forecasts. As these forecasts are further used in power system energy scheduling, the task of minimizing power system operating costs is considered. Cvxpylayers (Agrawal et al., 2019b) is used to directly train the model with the objective of minimizing power system operating costs. DFL is also applied to a power system application by Sang et al. (2022): electricity prices are predicted in the prediction stage, and the optimization stage deals with optimal energy storage system scheduling to maximize arbitrage benefits. Lower values of regret are reported when the prices are predicted using the SPO framework.
Communication technology. DFL is applied to a mobile wireless communication application by Chai et al. (2022). The fluid antenna system (Wong et al., 2020) is one of the recent developments in mobile wireless communication technology. However, its effectiveness depends on the position of the radiating element, known as the port. Chai et al. (2022) frame the port selection problem as a Predict-Then-Optimize problem, where in the prediction stage the signal-to-noise ratio for each position of the port is predicted, and the optimal position of the port is then decided in the optimization stage. They use an LSTM as the predictive model and report that the SPO framework is very effective for such port selection applications.
Solving non-linear combinatorial optimization problems. Ferber et al. (2022) study the problem of learning a linear surrogate optimizer to solve non-linear optimization problems. The objective is to learn a surrogate linear optimizer whose optimal solution is the same as the solution to the non-linear optimization problem. Learning the parameters of the surrogate linear optimizer entails backpropagating through the optimizer, which is implemented using Cvxpylayers (Agrawal et al., 2019b).

Experimental Evaluation on Benchmark Problemsets
DFL has recently received increasing attention. The methodologies discussed in Section 3 have been tested on several different datasets. Because a common benchmark for the field has not yet been established, comparisons among methodologies are sometimes inconsistent. In this section, an effort is made to propose several benchmark test problems for evaluating DFL methodologies. 1 Then some of the methodologies explained in Section 3 are compared on these test problems.

Problem Descriptions
All the test problems selected for the evaluations have been previously used in the DFL literature, and their datasets are publicly available. Needless to say, all these problems encompass the two stages: prediction and optimization. Table 2 provides an overview of the experimental setup associated with each test problem, including the specification of the CO problem and the type of predictive model. Next, these test problems are described in detail.

Shortest path problem on a 5 × 5 grid
This experiment is adopted from (Elmachtoub and Grigas, 2022). It is a shortest path problem on a 5 × 5 grid, with the objective of going from the southwest corner of the grid to the northeast corner where the edges can go either north or east. This grid consists of 25 nodes and 40 edges.
Formulation of the optimization problem. The shortest path problem on a graph with a set V of vertices and a set E of edges can be formulated as an LP problem in the following form:

min_x c⊤x (42a)
s.t. Ax = b (42b)
x ≥ 0 (42c)

where A ∈ R |V |×|E| is the incidence matrix of the graph. The decision variable x ∈ R |E| is a binary vector whose entries are 1 only if the corresponding edge is selected for traversal. b ∈ R |V | is a vector whose entries corresponding to the source and sink nodes are 1 and −1 respectively; all other entries are 0. The constraint (42b) must be satisfied to ensure the path goes from the source to the sink node. The objective is to minimize the cost of the path with respect to the (predicted) cost vector c ∈ R |E| .

1. During the course of writing this manuscript, we became aware of the PyEPO project (Tang and Khalil, 2023b), which develops an interface for benchmarking DFL methodologies. However, it is important to emphasize that our work differs significantly from PyEPO. While PyEPO focuses on providing an interface for implementing DFL methodologies, our paper serves as a comprehensive survey that goes beyond benchmarking.
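How the incidence matrix encodes flow conservation can be checked on a toy graph (a 3-node example invented for illustration): each column of A has +1 at an edge's tail and −1 at its head, so Ax = b holds exactly for the indicator vector of a source-to-sink path.

```python
def incidence_matrix(n_nodes, edges):
    """Node-edge incidence matrix: +1 at an edge's tail, -1 at its head
    (so that A x = b encodes flow conservation along a path)."""
    A = [[0] * len(edges) for _ in range(n_nodes)]
    for k, (u, v) in enumerate(edges):
        A[u][k], A[v][k] = 1, -1
    return A

# Tiny graph: 0 -> 1 -> 2 plus a direct edge 0 -> 2.
edges = [(0, 1), (1, 2), (0, 2)]
A = incidence_matrix(3, edges)

b = [1, 0, -1]              # source node 0, sink node 2
x = [1, 1, 0]               # select the path 0 -> 1 -> 2

Ax = [sum(A[i][k] * x[k] for k in range(len(edges))) for i in range(3)]
print(Ax == b)  # flow conservation holds for the chosen path
```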
Synthetic data generation process. In this problem, the prediction task is to predict the cost vector c from the feature vector z. The feature and cost vectors are generated according to the data generation process defined by Elmachtoub and Grigas (2022). For the sake of completeness, the data generation process is described below. 2 Each problem instance has a cost vector of dimension |E| = 40 and a feature vector of dimension p = 5. The training data consist of {(z i , c i )} N i=1 , which are generated synthetically. The feature vectors are sampled from a multivariate Gaussian distribution with zero mean and unit variance, i.e., z i ∼ N(0, I p ). To generate the cost vectors, first a matrix B ∈ R |E|×p is generated, which represents the true underlying model. The cost vectors are then generated according to the following formula:

c ij = ( ( (1/√p) (B z i ) j + 3 )^Deg + 1 ) ξ j i ,

where c ij is the j th component of cost vector c i . The Deg parameter specifies the extent of model misspecification, as a linear model is used as the predictive model in the experiment. The higher the value of Deg, the more the true relation between the features and cost coefficients deviates from a linear one and the larger the errors will be. Finally, ξ j i is a multiplicative noise term sampled from the uniform distribution on [1−ϑ, 1+ϑ]. The experimental evaluation involves five values of the parameter Deg, namely 1, 2, 4, 6 and 8, with the noise-halfwidth parameter ϑ set to 0.5. Furthermore, for each setting, a different training set of size 1,000 is used. In each case, the final performance of the model is evaluated on a test set of size 10,000.
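The generation process above can be sketched as follows (a minimal sketch, assuming for illustration that B has Bernoulli(0.5) entries as in Elmachtoub and Grigas (2022); the function name `generate_instance` is our own):

```python
import math
import random

random.seed(0)

def generate_instance(p=5, n_edges=40, deg=2, noise=0.5):
    """One (z, c) pair following the described process: Gaussian features,
    a Bernoulli matrix B, a degree-`deg` polynomial link, and
    multiplicative uniform noise in [1 - noise, 1 + noise]."""
    B = [[random.choice([0, 1]) for _ in range(p)] for _ in range(n_edges)]
    z = [random.gauss(0.0, 1.0) for _ in range(p)]
    c = []
    for j in range(n_edges):
        lin = sum(B[j][k] * z[k] for k in range(p)) / math.sqrt(p)
        xi = random.uniform(1.0 - noise, 1.0 + noise)
        c.append(((lin + 3.0) ** deg + 1.0) * xi)
    return z, c

z, c = generate_instance()
print(len(z), len(c), all(cj > 0 for cj in c))
```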
Predictive model. In each setting, the underlying predictive model is a one-layer feedforward neural network without any hidden layer, i.e., a linear model. The input to the model is a p-dimensional vector, and the output is a |E|-dimensional vector. Note that a multi-layer neural network could be used to improve the accuracy of the predictive model. The intuition behind using a simple predictive model is to test the efficacy of the decision-focused models when the predictions are not 100% accurate. The decision-focused models are trained to minimize the regret, and the prediction-focused model is trained by minimizing the MSE loss between the true and predicted cost vectors.

Portfolio optimization problem
A classic problem that combines prediction and optimization is the Markowitz portfolio optimization problem, in which asset prices are predicted by a model based on empirical data, and then subsequently, a risk-constrained optimization problem is solved for a portfolio which maximizes expected return. This experiment is also adopted from (Elmachtoub and Grigas, 2022).
Formulation of the optimization problem. In the portfolio optimization problem, the objective is to choose a portfolio of assets with the highest return subject to a constraint on the total risk of the portfolio. The problem is formulated in the following form:

max_x ĉ⊤x (44a)
s.t. x⊤Σx ≤ γ (44b)
1⊤x ≤ 1 (44c)
x ≥ 0 (44d)

where 1 is the vector of all-ones of the same dimension as x, ĉ is the vector of asset prices, and Σ is a predetermined matrix of covariances between asset returns. The objective (44a) is to maximize the portfolio's total value. Eq. (44b) is a risk constraint, which bounds the overall variance of the portfolio, and (44c), (44d) model x as a vector of proportional allocations among assets.
Synthetic data generation process. Synthetic input-target pairs (z, c) are randomly generated, according to a random function with a specified degree of nonlinearity Deg ∈ N.
The procedure for generating the random data is as follows. Given the number of assets d and input features of size p, input samples z i ∈ R p are generated element-wise from i.i.d. standard normal distributions N(0, 1). A random matrix B ∈ R d×p is generated, whose elements B ij ∈ {0, 1} are drawn from i.i.d. Bernoulli distributions taking the value 1 with probability 0.5. For a chosen noise magnitude ϑ, a factor-loading matrix L ∈ R d×4 is generated, whose entries are drawn uniformly over [−0.0025ϑ, 0.0025ϑ]. Asset returns are first calculated in terms of their conditional means c̄ i , which are degree-Deg nonlinear functions of the features. The observed return vectors c i are then defined as c i := c̄ i + Lf + 0.01ϑξ, where f ∼ N(0, I 4 ) and the noise ξ ∼ N(0, I d ). This causes the c i to obey the covariance matrix Σ := LL⊤ + (0.01ϑ) 2 I, which is also used to form the constraint (44b), along with a bound on risk, defined as γ := 2.25 e⊤Σe, where e is the equal-allocation solution (a constant vector). Four values of the parameter Deg, namely 1, 4, 8 and 16, are used in the experimental evaluation. The value of the noise magnitude parameter ϑ is set to 1. It is assumed that the covariance matrix of the asset returns does not depend on the features. The values of Σ and γ are constant, and randomly generated for each setting.
Predictive model. Like the previous experiment, the underlying predictive model is a linear model, whose input is a feature vector z ∈ R p and whose output is the return vector c ∈ R d .

Warcraft shortest path problem
This experiment was adopted from (Pogančić et al., 2020). Each instance in this problem is an image of a terrain map from the Warcraft II tileset (Guyomarch, 2017). Each image represents a grid of dimension d×d. Each of the d 2 pixels has a fixed underlying cost, which is unknown and to be predicted. The objective is to identify the minimum-cost path from the top-left pixel to the bottom-right pixel. From one pixel, one can move to its eight neighboring pixels: up, down, left and right, as well as the four diagonal ones. Hence, it is a shortest path problem on a graph with d 2 vertices and O(d 2 ) edges.
Formulation of the optimization problem. Note that this is a node-weighted shortest path problem, where each node (pixel) in the grid is assigned a cost value, whereas in the previous shortest path problem, each edge is assigned a cost value. However, this problem can easily be reduced to the more familiar edge-weighted shortest path problem by 'node splitting': each node is split into two separate nodes, an entry node and an exit node, connected by an edge whose weight equals the node weight. Each original edge (u, v) is then replaced by a zero-weight edge from the exit node of u to the entry node of v.
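The node-splitting reduction can be sketched in a few lines (the function name `split_nodes` and the tiny two-node graph are our own, for illustration):

```python
def split_nodes(node_weights, edges):
    """Reduce a node-weighted graph to an edge-weighted one: each node v
    becomes (v, 'in') -> (v, 'out') with weight equal to the node weight,
    and every original edge (u, v) becomes a zero-weight edge
    (u, 'out') -> (v, 'in')."""
    new_edges = {}
    for v, w in node_weights.items():
        new_edges[((v, "in"), (v, "out"))] = w
    for u, v in edges:
        new_edges[((u, "out"), (v, "in"))] = 0.0
    return new_edges

node_weights = {"a": 2.0, "b": 5.0}
edges = [("a", "b")]
g = split_nodes(node_weights, edges)
print(len(g), g[(("a", "in"), ("a", "out"))])
```

Any standard edge-weighted shortest path algorithm can then be run on the transformed graph, with path costs equal to the sums of node costs in the original grid.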
Predictive model. The prediction task is to predict the cost associated with each pixel. The actual cost ranges from 0.8 to 9.2 and depends on visible characteristics of the pixel; for instance, the cost changes depending on whether the pixel represents a water body, land or wood. The predictive model used in this case is a convolutional neural network (CNN), which predicts the cost of each node (pixel). The model takes the d × d image as input and outputs the costs of the d 2 pixels. The ResNet18 (He et al., 2016) architecture is slightly modified to form the ML model: the first five layers of ResNet18 are followed by a max-pooling operation to predict the underlying cost of each pixel. Furthermore, a ReLU activation function (Agarap, 2019) is used to ensure the predicted costs remain non-negative, thereby avoiding negative cycles in the shortest path edge weights.

Energy-cost aware scheduling
This is a resource-constrained day-ahead job scheduling problem (Simonis et al., 1999) with the objective of minimizing energy cost. Tasks must be assigned to a given number of machines, where each task has a duration, an earliest start time, a latest end time, a resource requirement and a power usage. Each machine has a resource capacity constraint. Also, tasks cannot be interrupted once started, nor migrated to another machine, and must be completed before midnight. The scheduling is done one day in advance, so the prediction task is to predict the energy prices of the next day.
Formulation of the optimization problem. The scheduling problem is formulated as an MIP. Let J be the set of tasks to be scheduled on a set of machines I while respecting the requirements of a set W of resources. The tasks must be scheduled over T time slots. Each task j is specified by its duration ζ j , earliest start time ζ (1) j , latest end time ζ (2) j and power usage φ j . Let ρ jw be the resource usage of task j for resource w, and q iw the capacity of machine i for resource w. Let x jit be a binary variable which takes the value 1 only if task j starts at time t on machine i. The objective of minimizing the energy cost while satisfying the required constraints can be expressed by the following ILP:

min_x Σ j∈J Σ i∈I Σ t∈T x jit φ j Σ t≤t′<t+ζ j c t′ (46a)
s.t. Σ i∈I Σ t∈T x jit = 1 ∀j ∈ J (46b)
x jit = 0 ∀i ∈ I, t < ζ (1) j (46c)
x jit = 0 ∀i ∈ I, t > ζ (2) j − ζ j (46d)
Σ j∈J ρ jw Σ t−ζ j <t′≤t x jit′ ≤ q iw ∀i ∈ I, w ∈ W, t ∈ T (46e)
x jit ∈ {0, 1} ∀j ∈ J, i ∈ I, t ∈ T

Constraint (46b) ensures each task is scheduled once and only once. Constraints (46c) and (46d) ensure that the schedule respects the earliest start time and latest end time of each task. (46e) imposes the resource capacity constraints.
Data description. The prediction task is to predict the energy prices one day in advance. The energy price dataset comes from the Irish Single Electricity Market Operator (SEMO) (Ifrim et al., 2012). It consists of historical energy price data at 30-minute intervals, starting from midnight on the 1st of November, 2011 until the 31st of December, 2013. In this setup, each day forms an optimization instance, which comprises 48 time slots corresponding to 48 half-hour periods. Each half-hour instance of the data has calendar attributes; day-ahead estimates of weather characteristics; SEMO day-ahead forecasted energy load, wind-energy production and prices; and actual wind speed, temperature and CO 2 intensity, which are used as features. So, the dimension of the feature vector is 8. Note that, in this dataset, each c t in the cost vector is associated with an eight-dimensional feature vector, i.e., c ∈ R 48 and z ∈ R 48×8 .
Predictive model. As the energy price of each half-hour slot is associated with 8 features, the input to the predictive model is a feature vector of dimension 8 and the output is a scalar. In this case also, the predictive model is a linear model, i.e., a feed-forward neural network without any hidden layer.

Knapsack problem
This experimental setup is adopted from the DFL literature. The objective of the knapsack problem is to choose a maximal-value subset from a given set of items, subject to a capacity constraint. In this case, the weights of all items and the knapsack capacity are known; what is unknown are the values of the items. Hence, the prediction task is to predict the value of each item.
Formulation of the optimization problem. The formulation of the knapsack optimization problem with unit weights has already been provided in Eq. (4). However, in general the weights of the items are not equal, so a general knapsack optimization problem can be formulated as follows:

max_x c⊤x
s.t. w⊤x ≤ C
x ∈ {0, 1} n

where w and c are the vectors of weights and values respectively, and C is the knapsack capacity.
Data description. For this problem too, the dataset is adapted from the Irish Single Electricity Market Operator (SEMO) (Ifrim et al., 2012). In this setup, each day forms an optimization instance and each half-hour corresponds to a knapsack item, so the cost vector c and the weight vector w are of length 48, corresponding to the 48 half-hours. As in the energy scheduling problem, each item of the cost vector is associated with a feature vector of dimension 8. The weight vector is fixed and generated synthetically as follows. First, a weight w i is assigned to each of the 48 half-hour slots by sampling from the set {3, 5, 7}. In order to introduce correlation between the item weights and the item values, the energy price vector is multiplied element-wise with the weight vector, and randomness is incorporated by adding Gaussian noise ξ ∼ N(0, 25), which produces the final item values c i . The motivation behind introducing this correlation stems from the fact that knapsack problems with correlated item weights and values are considered hard to solve (Pisinger, 2005). The sum of the weights of each instance is 240. The experiments are performed with three capacity values: 60, 120 and 180.
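The correlated generation scheme can be sketched on a smaller scale (8 items instead of 48, with hypothetical base prices standing in for the SEMO energy prices; all names are ours):

```python
import random
from itertools import product

random.seed(0)

# Weights sampled from {3, 5, 7}; values are correlated with weights by
# multiplying a base "price" by the weight and adding Gaussian noise,
# mirroring the generation process described above on a smaller scale.
n = 8
weights = [random.choice([3, 5, 7]) for _ in range(n)]
prices = [random.uniform(10, 50) for _ in range(n)]
values = [p * w + random.gauss(0, 5) for p, w in zip(prices, weights)]

def solve_knapsack(values, weights, capacity):
    """Brute-force 0-1 knapsack for this tiny instance."""
    best, best_x = float("-inf"), None
    for x in product([0, 1], repeat=len(values)):
        if sum(w * b for w, b in zip(weights, x)) <= capacity:
            v = sum(c * b for c, b in zip(values, x))
            if v > best:
                best, best_x = v, x
    return best_x

x = solve_knapsack(values, weights, capacity=sum(weights) // 2)
print(sum(w * b for w, b in zip(weights, x)) <= sum(weights) // 2)
```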
Predictive model. Like the previous problem, the predictive model here also is a linear model, i.e., a feed forward neural network without any hidden layer.

Diverse bipartite matching
This experimental setup is adopted from Ferber et al. (2020). In this problem, two disjoint sets of nodes are provided and the objective is to match the nodes of the two sets. The graph topologies are taken from the CORA citation network (Sen et al., 2008), where a node represents a publication and an edge represents a citation. The matching problem is thus to identify the citations between the two sets of publications. Furthermore, the matching must obey diversity constraints, as described later. Note that this problem falls under the category of structured output prediction tasks (Nowozin and Lampert, 2011), which require capturing dependencies and relationships between different parts of the output. In this matching problem, each edge does not have an associated cost in the true sense. Therefore, in the prediction-focused approach, the model is trained by directly predicting the presence or absence of each edge. The DFL approaches, on the other hand, consider the likelihood of the existence of each edge as the edge weight and then determine which edges should be present while ensuring all the constraints are satisfied.
Optimization problem formulation. Let S 1 and S 2 denote the two sets. The matching must satisfy the following diversity constraints: a minimum of ρ 1 % and ρ 2 % of the suggested pairings should belong to the same and to distinct fields of study respectively. Let c ij be the likelihood of an edge existing between articles i and j, ∀i ∈ S 1 , j ∈ S 2 . With these likelihoods, the matching can be performed by solving the following MIP, which enforces the diversity constraints:

max_x Σ i∈S 1 Σ j∈S 2 c ij x ij
s.t. Σ j∈S 2 x ij ≤ 1 ∀i ∈ S 1 ; Σ i∈S 1 x ij ≤ 1 ∀j ∈ S 2
Σ i,j φ ij x ij ≥ ρ 1 Σ i,j x ij ; Σ i,j (1 − φ ij ) x ij ≥ ρ 2 Σ i,j x ij
x ij ∈ {0, 1}

where φ ij is an indicator parameter, which takes the value 1 only if articles i and j are of the same field, and 0 if they belong to two different fields.
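The effect of the diversity constraints can be illustrated on a tiny brute-force instance (restricted to perfect matchings for brevity; the 2×2 likelihoods and field indicators are invented for illustration): the constraint can force a lower-likelihood but more diverse matching.

```python
from itertools import permutations

def diverse_matching(c, same_field, rho1, rho2):
    """Brute-force diverse bipartite matching on a tiny instance:
    maximize total likelihood subject to at least rho1 (rho2) fraction of
    matched pairs being from the same (a different) field of study."""
    n = len(c)
    best, best_m = float("-inf"), None
    for perm in permutations(range(n)):
        pairs = list(enumerate(perm))
        same = sum(same_field[i][j] for i, j in pairs)
        if same < rho1 * n or (n - same) < rho2 * n:
            continue  # diversity constraints violated
        val = sum(c[i][j] for i, j in pairs)
        if val > best:
            best, best_m = val, pairs
    return best_m

c = [[0.9, 0.2], [0.3, 0.8]]        # predicted edge likelihoods
same_field = [[1, 0], [0, 1]]       # does pair (i, j) share a field?

# Requiring half the pairs to be cross-field rules out the
# high-likelihood same-field matching.
m = diverse_matching(c, same_field, rho1=0.0, rho2=0.5)
print(m)
```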
Data description. The network is divided into 27 disjoint topologies, each containing 100 nodes. Each topology forms an optimization instance. In each instance, the 100 nodes are split into two sets of 50 nodes, S 1 and S 2 ; so each instance forms a bipartite matching problem between two sets of cardinality 50. Each publication (node) has 1433 bag-of-words features. The feature vector of an edge is formed by concatenating the features of its two endpoint nodes. The prediction task is to estimate the c ij values. In this problem, each individual c ij is associated with a feature vector of length 2866.
Predictive model. The predictive model is a neural network model. The input to the neural network is a 2866 dimensional vector and final output is a scalar between 0 and 1.
The neural network has one hidden layer and uses a sigmoid activation function on the output.

Subset selections
This experiment is a structured prediction task, in which the objective is to learn a mapping from feature vectors to binary vectors representing subset selections. Unlike in the other experiments above, the ground-truth data take the form of optimal solutions to an optimization problem, rather than its corresponding problem parameters. Thus, the regret loss is not suitable for training a prediction model. Instead, a task loss based on the error of the predicted solutions with respect to the ground-truth solutions is used in this experiment.
Optimization problem formulation. For any c ∈ R n , the objective of the optimization problem is to output a binary vector in R n whose non-zero values correspond to the top-k values of c. This can be formulated as an LP problem in the following form:

max_x c⊤x
s.t. 1⊤x = k
0 ≤ x ≤ 1

As a totally unimodular linear program with integral parameters, this problem has (binary) integer optimal solutions. This mapping is known for its ability to represent subset selections in structured prediction, and is useful for multilabel classification.
Data description. Let U(0, 1) be a uniform distribution; a collection of feature vectors z is generated by z ∼ U(0, 1)^n. For each z, its corresponding target is a binary vector containing unit values at the positions of the top-k values of z, and zero values elsewhere. Three datasets are generated, each of 1000 training samples, in which the selection problem takes size n = 25, n = 50, and n = 100 respectively. The subset size k is chosen to be one fifth of n in each case.
Predictive model. As in the previous problem, the predictive model here is a linear model, i.e., a feed-forward neural network without any hidden layer. The task loss used to train the predictive model is the negated inner product between the true selection x and the prescribed selection x̂, i.e., L(x, x̂) = −x̂ · x, which is minimized when x̂ = x. Since this loss is not regret-based and does not assume access to the ground-truth parameters c, techniques that rely on such assumptions are not tested on this problem.
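A minimal sketch of this task loss (the function name is ours): for binary selection vectors, the negated inner product counts, up to sign, how many selected items agree.

```python
import numpy as np

def task_loss(x_true, x_pred):
    """Negated inner product between the true and prescribed selections.
    For binary vectors with a fixed number k of ones, the loss equals -k
    exactly when the two selections coincide, and is larger otherwise."""
    return -float(np.dot(x_pred, x_true))

x_true = np.array([1, 0, 1, 0])   # ground-truth subset
x_pred = np.array([1, 1, 0, 0])   # predicted subset
loss = task_loss(x_true, x_pred)  # -1.0: only one selected item agrees
```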

Experimental Results and Analysis
In this subsection, results of comparative evaluations of 11 of the methodologies introduced in Section 3 on the datasets described in Section 5 are presented. The prediction-focused (PF) approach is included because it serves as a benchmark. Note that among these methodologies, Listwise, Pairwise, Pairwise(diff), and MAP make use of a solution cache, implemented using the procedure proposed by Mulamba et al. (2021): the cache is initialized with all the solutions in the training data and is later expanded by employing a p solve parameter value greater than zero. Since Mulamba et al. (2021) and Mandi et al. (2022) report that p solve = 5% is adequate for most applications, the value of p solve is set to 5%. Next, the procedure followed for the empirical evaluations is explained.
Experimental setup and procedures. The performance of a methodology is sensitive to the choice of methodology-specific hyperparameters as well as fundamental hyperparameters common to any neural network training, such as the learning rate. These are called hyperparameters because they cannot be estimated by training the model; rather, they must be selected before training begins. Hyperparameter tuning is the process of identifying the set of hyperparameter values expected to produce the best model outcome. In the experimental evaluations, hyperparameter tuning is performed via grid search: each hyperparameter is tried over a predetermined set of values. Grid search suffers from the curse of dimensionality in the hyperparameter space, as the number of combinations grows exponentially with the number of hyperparameters. However, since the combinations are independent, models for different hyperparameter combinations can be trained in parallel.
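The grid search described above can be sketched as follows; the hyperparameter names and the toy validation function are illustrative, not those of the benchmark.

```python
from itertools import product

def grid_search(grid, validate):
    """Exhaustive grid search: evaluate every hyperparameter combination
    and return the one with the lowest validation regret.  Combinations
    are independent, so in practice they can be evaluated in parallel."""
    keys = list(grid)
    best_cfg, best_regret = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        regret = validate(cfg)
        if regret < best_regret:
            best_cfg, best_regret = cfg, regret
    return best_cfg

# toy example: hypothetical grid, and a validation function whose
# minimum lies at lr=1e-2, temperature=1.0
grid = {"lr": [1e-3, 1e-2, 1e-1], "temperature": [0.1, 1.0]}
best = grid_search(grid, lambda c: abs(c["lr"] - 1e-2) + abs(c["temperature"] - 1.0))
```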
The hyperparameters of each model for each experiment are selected based on performance on the validation dataset. For each hyperparameter, a range of values as defined in Table 3 is considered; the combination producing the lowest average regret on the validation dataset is considered the 'optimal' one. For both validation and testing, 10 trials are run, where in each trial the network weights are initialized with a different seed (seed values 0 to 9). Each model for each setup is trained using PyTorch (Paszke et al., 2019) and PyTorch-Lightning (Falcon et al., 2019) with the Adam optimizer (Kingma and Ba, 2014) and the 'ReduceLROnPlateau' (PyTorch, 2017) learning rate scheduler. As mentioned before, the learning rate of the Adam optimizer is treated as a hyperparameter. For QPTL, the QP problems are solved using Cvxpylayers (Agrawal et al., 2019b). For the other methodologies, which treat the CO solver as a blackbox, Gurobi (Gurobi Optimization, 2021) or OR-Tools (Perron and Furnon, 2020) is used as the solver. For MAP and the LTR losses, the experiments are run with p solve set to 5%.
Evaluation metric. After selecting the 'optimal' hyperparameter combination for each test problem, 10 trials of all the methodologies with that combination are run on the test dataset. Unless otherwise mentioned, the comparative evaluation is based on the relative regret on the test dataset. For a minimization problem, the relative regret is defined as (c⊤ x⋆(ĉ) − c⊤ x⋆(c)) / (c⊤ x⋆(c)), where x⋆(·) denotes an optimal solution for the given cost vector. In practice, c (or ĉ) can have non-unique optimal solutions. However, if all the entries of c are continuous, this is very unlikely: for instance, an LP can have multiple optimal solutions only when the objective hyperplane is parallel to one of the faces of the LP polyhedron. Nevertheless, when the cost vector is predicted by an ML model, a pathological case might occur, especially at the beginning of training, when all the cost parameters are zero; this makes every feasible solution optimal with zero cost. To avoid this complexity, it is assumed in the experiments that the solution x⋆(ĉ) is obtained by calling an optimization oracle which, when multiple optima exist, returns a single optimal solution by breaking ties in a pre-specified manner. This holds when a commercial solver such as Gurobi is used to solve the CO problem.
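For a maximization problem solved by a deterministic oracle, the relative regret can be computed as in this sketch (the oracle and names are illustrative; the benchmark's problems use Gurobi or OR-Tools):

```python
import numpy as np

def topk_oracle(c, k=2):
    """Deterministic optimization oracle for a toy top-k maximization
    problem; ties are broken in a fixed (stable argsort) order."""
    x = np.zeros_like(c, dtype=float)
    x[np.argsort(-c, kind="stable")[:k]] = 1.0
    return x

def relative_regret(c_true, c_pred, oracle):
    x_hat = oracle(c_pred)    # decision taken under the prediction
    x_star = oracle(c_true)   # retrospective optimal decision
    best = float(c_true @ x_star)
    return (best - float(c_true @ x_hat)) / best  # undefined if best == 0

c_true = np.array([5.0, 4.0, 3.0, 2.0])
c_pred = np.array([5.0, 2.0, 3.0, 4.0])  # mis-ranks items 1 and 3
rr = relative_regret(c_true, c_pred, topk_oracle)  # (9 - 7) / 9
```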

Comparative Evaluations
Next, the performances of the 11 methodologies on the 7 problems are presented, along with insights.
Shortest path problem on a 5 × 5 grid. The comparative evaluation for the synthetic shortest path problem is shown in Figure 5 with the aid of box plots. To conserve space, boxplots for two values of Deg are shown in Figure 5; the boxplots for all five degrees are shown in Figure A1 in the Appendix. The predictive model is a linear model with no hidden layers. For Deg 1, the linear predictive model perfectly captures the data generation process; consequently, the PF approach is very accurate and results in the lowest regret. SPO has slightly higher regret than the PF approach, and all the other models have considerably higher regrets, with MAP and FY coming next. For Deg 2, both PF and SPO have the lowest regret, although their differences with the other models shrink; FY, MAP and I-MLE come in the next three places, respectively. For Deg 4, the PF model starts to produce high regret. In this case, I-MLE has the lowest regret, closely followed by FY and SPO; the next three spots are taken by MAP, Listwise and Pairwise, respectively, while DBB and HSD perform worse than the PF approach. For Deg 6, the best performer is FY, although its test regret is not very different from SPO, I-MLE and QPTL; Listwise, DBB, Pairwise and MAP come next. For Deg 8, FY has the lowest regret, closely followed by I-MLE; then come the Listwise and Pairwise ranking losses, followed by QPTL and DBB. In this case, SPO performs worse than them, while MAP and HSD have very high regret, although still lower than the PF approach. Overall, the relative regret of the PF approach worsens as the value of the Deg parameter increases.
Overall, FY and I-MLE are the two best-performing approaches for Deg > 2, while for Deg values of 1 and 2 the PF approach has the lowest regret. Note that the performance of SPO is very consistent too: it performs considerably worse than I-MLE and FY only for Deg 8. On the other hand, HSD exhibits higher regret than the other DFL approaches, doing better than the PF approach only for Deg 6 and 8, and it also exhibits higher variance.
Portfolio optimization problem. Note that this is an optimization problem with continuous decision variables, quadratic constraints and a linear objective function. Hence, the HSD approach, which cannot handle nonlinear constraints, is not applicable to this problem. The boxplots of test regrets for noise magnitude parameter ϑ = 1 are shown in Figure 6.
In some instances of this problem, all the return values are negative, which makes a portfolio with zero return optimal. In such cases, the relative regret becomes infinite, as the denominator in Eq. (50) is zero. Hence, for this problem set, the absolute regret instead of the relative regret is reported in Figure 6. The boxplots for Deg values of 1 and 16 are shown in Figure 6; the boxplots for all four degrees are shown in Figure A2 in the Appendix.
The PF approach performs very well on this problem, but SPO manages to outperform PF slightly in all cases except Deg 1. It is evident in Figure 6 that DBB, I-MLE, FY and QPTL perform poorly, generating regret even higher than the PF approach. All these methodologies were proposed for problems with linear constraints, which raises concerns about their suitability in the presence of quadratic constraints. On the other hand, the LTR losses, Pairwise and Pairwise(diff), and the contrastive loss function MAP perform even better than SPO for Deg 16. For Deg 1, PF is the best, followed by MAP, SPO, Pairwise and Pairwise(diff), in that order. For Deg 4 and 8, the Pairwise loss function has the lowest regret, closely followed by Pairwise(diff), MAP and SPO. For Deg 16, Pairwise is again the best-performing model, followed by Listwise, Pairwise(diff), MAP and SPO, in that order. The Listwise loss function exhibits high variance for Deg 1, generates high regret on a few instances for Deg 4 and 8, and attains average test regret lower than SPO for Deg 16. In general, Figure 6 reveals that DBB, I-MLE, FY and QPTL perform poorly on this problem, whereas SPO, MAP, Pairwise and Pairwise(diff) appear to be suitable methodologies for it.
Warcraft shortest path problem. Recall that this is a shortest path problem on an image of dimension d × d. Since the underlying costs of all pixels are non-negative, the optimization problem can be solved efficiently using Dijkstra's algorithm (Dijkstra, 1959); hence, the shortest path problem is solved with Dijkstra's algorithm for the methodologies that view the CO solver as a blackbox oracle. HSD and QPTL, however, require the problem to be formulated as an LP and solved with a primal-dual solver. Note that in this experiment the predictive ML model is a CNN, which predicts the cost of each pixel. Training this model is challenging due to its large number of parameters, and combining it with computation-intensive modules such as an interior point optimizer poses significant challenges. We could not run the experiments with HSD and QPTL because of this computational burden.
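A minimal sketch of such a blackbox solver, assuming 4-neighbour moves for simplicity (the Warcraft setting also allows diagonal moves), where a path pays the cost of every cell it visits:

```python
import heapq

def grid_shortest_path_cost(costs):
    """Dijkstra on a d x d grid of non-negative cell costs: return the
    cheapest cost of a path from the top-left to the bottom-right cell,
    counting the cost of every visited cell, including the start."""
    d = len(costs)
    dist = {(0, 0): costs[0][0]}
    heap = [(costs[0][0], 0, 0)]
    while heap:
        c, i, j = heapq.heappop(heap)
        if (i, j) == (d - 1, d - 1):
            return c
        if c > dist[(i, j)]:
            continue  # stale heap entry
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= ni < d and 0 <= nj < d:
                nc = c + costs[ni][nj]
                if nc < dist.get((ni, nj), float("inf")):
                    dist[(ni, nj)] = nc
                    heapq.heappush(heap, (nc, ni, nj))

costs = [[1, 9, 1],
         [1, 9, 1],
         [1, 1, 1]]
total = grid_shortest_path_cost(costs)  # 5: down the left edge, then right
```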
The dataset contains four values of d: 12, 18, 24 and 30. Clearly, as d increases, the optimization problem contains more parameters. The comparative evaluations are summarized as boxplots in Figure 7; the boxplots for the other two values of d can be found in Figure A3 in the Appendix. First note that the PF approach, which is trained by minimizing the MSE loss between the predicted and true costs, performs significantly worse than the DFL methodologies; in fact, its performance deteriorates as the image size increases. As the size of the image increases, the same level of prediction error induces greater inaccuracies in the solution, because a larger image entails more decision variables in the CO problem: at a constant level of prediction error, the probability that the error changes at least one of the decision variables increases, and with it the likelihood of error in the final solution. As the regret of the PF approach is significantly higher, note that the scale of the y-axis is changed to fit it into the plot.
Among the DFL methodologies, Listwise performs best for sizes 12, 18 and 30, and SPO performs best for size 24. In fact, for sizes 12, 18 and 24, there is little variation between SPO, Listwise and MAP; the next three best-performing methodologies are Pairwise(diff), I-MLE and DBB. For size 30, however, DBB comes third after Listwise and MAP, followed by Pairwise(diff), SPO and I-MLE, in that order. FY and Pairwise perform slightly worse than the other DFL methodologies. In general, this set of experiments shows the advantage of the DFL approaches, as all of them outperform the PF approach.
Energy-cost aware scheduling. There are three instances of this scheduling problem, all with 3 machines; the first, second, and third instances contain 10, 15, and 20 tasks, respectively. The underlying ML model is a simple linear model implemented as a neural network with no hidden layers. The boxplot of comparative evaluations for the first instance is presented in Figure 8; the boxplots of the other instances can be found in Figure A4 in the Appendix. Note that the scheduling problem is an ILP. For HSD and QPTL, the LPs obtained by relaxing the integrality constraints are considered. For the first instance, MAP and SPO result in the lowest average regret, closely followed by I-MLE; DBB, FY and Pairwise(diff) perform better than the PF approach, while the Listwise and Pairwise ranking losses perform worse than PF. QPTL and HSD also perform poorly in all three instances, probably because the LP obtained by removing the integrality constraints is not a proper representation of the ILP; in fact, QPTL fails to learn on this problem instance. In the second instance, FY, SPO and I-MLE are the three best-performing models, followed by MAP and DBB, then Pairwise(diff); again, the Listwise and Pairwise ranking losses perform worse than PF. In the third instance, MAP and SPO again deliver the lowest average regret, followed by I-MLE and FY, whose test regrets are very similar. In this case, the performance of Pairwise(diff) is slightly worse than the PF approach, whereas, as before, Listwise and Pairwise are significantly worse. In general, some common patterns can be identified across the three problem instances. First, relaxing the integrality constraints fails to capture the combinatorial nature of the ILP; consequently, HSD and QPTL perform poorly.
Secondly, the Listwise and Pairwise ranking losses perform significantly worse than the PF approach. The learning curves (refer to Appendix B) suggest that these models fail to converge on these problem instances: although in some epochs they perform significantly better than the PF approach, their performances never plateau. Lastly, SPO, MAP, FY and I-MLE perform consistently better than the other models.
Knapsack problem. Three instantiations of the knapsack problem are considered for the experiment, each with a different capacity: 60, 120 and 180. The boxplot corresponding to capacity 60 is presented in Figure 9; the boxplots of the other two capacities can be found in Figure A5 in the Appendix. With a capacity of 60, the three best models are QPTL, DBB and I-MLE, in that order; HSD, SPO and MAP come next and perform better than the PF approach, while FY and the LTR losses perform worse than PF. With a capacity of 120, the top three models are DBB, I-MLE and QPTL, followed by SPO, HSD and MAP; the Pairwise(diff) model performs slightly better than the PF approach, but the other two LTR losses and FY perform worse. With a capacity of 180, the three best models are DBB, I-MLE and SPO; HSD and QPTL perform better than the PF approach, but MAP, the LTR losses and FY perform worse. In general, for this problem, DBB and I-MLE are the best-performing models across the three capacity values, and QPTL, SPO and HSD also consistently perform better than the PF approach in all three cases. FY and the LTR losses, however, perform poorly on this problem.
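As a reference point for the underlying optimization task, a single 0/1 knapsack instance can be solved exactly with a textbook dynamic program (a sketch with illustrative data, not the benchmark's Gurobi model):

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack: maximize total value subject to the capacity.
    Classic DP over capacities: items in the outer loop, capacities
    iterated downwards so that each item is used at most once."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for cap in range(capacity, w - 1, -1):
            best[cap] = max(best[cap], best[cap - w] + v)
    return best[capacity]

opt = knapsack([60, 100, 120], [10, 20, 30], 50)  # 220: take the last two items
```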
Diverse bipartite matching. Three instantiations of the diverse bipartite matching problem are formed by changing the values of ρ1 and ρ2; the (ρ1, ρ2) values for the three instantiations are (10%, 10%), (25%, 25%) and (50%, 50%), respectively. The boxplot of comparative evaluations for (ρ1, ρ2) = (50%, 50%) is presented in Figure 10. As mentioned before, in this problem each edge is not associated with an edge weight in the true sense. Hence, the PF approach is trained by directly learning to predict whether an edge exists, using the BCE loss for supervised learning. The DFL approaches consider the predicted probability of each edge as the edge weight and then aim to minimize regret.
In this problem instance, QPTL is the best-performing model; FY, Pairwise(diff) and Listwise take the next three places. MAP, Pairwise, I-MLE and SPO also perform better than the PF approach, while the performances of HSD and DBB are similar to that of PF. Also note that the relative regrets of all the models are very high (above 80%) for all three instances. With ρ1 and ρ2 set to 10%, I-MLE performs considerably better than all the other models, followed by HSD, FY, Pairwise and Pairwise(diff), then SPO, MAP, DBB and Listwise. When ρ1 and ρ2 take the value of 25%, QPTL, I-MLE and HSD are the top three models, with significantly lower regret than the rest; in this instance, the regrets of Listwise, Pairwise, SPO, FY and MAP are higher than the PF approach. Across the instances, the performances of I-MLE and QPTL are consistently better than the PF approach. In the first two instances, no DFL model other than I-MLE and QPTL performs significantly better than PF; DFL approaches such as FY, Listwise, Pairwise and MAP perform considerably better than PF only in the third instance. The test regret of DBB, on the other hand, is similar to the PF approach across the instances.
Learning subset selections. Subset selection problems of three dimensions, n = 25, n = 50 and n = 100, are considered for evaluation; in each case, the subset size k is chosen to be n/5. The error of a predicted subset x̂ with respect to the ground truth x is the fraction of items which are selected in x but not in x̂; such occurrences are referred to as mismatches. Figure 11 shows the average mismatch rates over the n = 25 instances achieved by each DFL methodology listed in Table 1, excluding those which assume ground-truth data in the form of problem parameters. Here, the ground-truth data are optimal solutions of (49) representing subset selections. For each assessed method, a distribution of results is shown, corresponding to 10 different randomly generated training datasets. Figure A7 shows similar results over the larger problem instances.
Note that it has been suggested that the entropy function H(x) = Σ_i x_i log x_i is particularly well-suited as a regularizer of the objective in (49) for the purpose of multilabel classification, a task identical to this one in terms of its optimization component and the form of its target data. Hence, a Cvxpylayers implementation of this model is included and referred to as ENT. Figure A7 shows that most of the assessed methods perform similarly, with DBB performing worst regardless of the problem's dimension. HSD is the most sensitive to the randomly generated training set; the rest show consistent performance across datasets. QPTL and I-MLE each show a marginal advantage over the other methods, but DPO and ENT are also competitive. Across all methods, variation in performance over the randomly generated datasets tends to diminish as problem size increases.

Comparison on Runtime
While deriving a useful gradient is considered the primary challenge of DFL, as mentioned in Section 2.3, the computational cost of repeatedly solving CO problems constitutes the second challenge. DFL methodologies with low computational cost are essential for scaling DFL to real-world large-scale problems; scalability becomes especially significant when dealing with large-scale, NP-hard combinatorial optimization problems. Note that while the shortest path and knapsack problems are relatively easy to solve, the energy-cost aware scheduling problem is much more challenging and can be considered an example of a real-world large-scale NP-hard combinatorial optimization problem. That is why the scheduling problem is used to compare the computational costs of the DFL methodologies.
The median training time per epoch of each methodology, for two instances of the scheduling problem, is shown in Figure 12. Recall that the first, second and third instances contain 10, 15 and 20 tasks, respectively; the first is the easiest of the three and the third the hardest. The complexity of the scheduling problem is evident from the fact that a single instance of the knapsack problem takes 0.001 seconds to solve, while the most difficult instance of the scheduling problem takes 0.1 seconds, both using the Gurobi MIP solver. Readers are cautioned against placing excessive emphasis on the absolute training times in Figure 12, as they are subject to system overhead; however, some general conclusions can be drawn from their relative ordering. It is not surprising that the training time of the PF approach is the lowest, as it does not require solving the CO problem during model training. The training times of SPO, DBB, I-MLE and FY are almost 100 times higher than the PF approach. Although QPTL and HSD consider the relaxed LP problem, they do not always have lower training times: QPTL and HSD solve and differentiate the optimization problem using a primal-dual solver, which involves matrix factorization, whereas SPO, DBB, I-MLE and FY can leverage faster commercial optimization solvers, as they only require the optimal solution. However, for Instance 3, solving the MIP appears to be more computationally expensive than solving and differentiating the underlying QP problem using Cvxpylayers.
On the other hand, Listwise, Pairwise, Pairwise(diff) and MAP, all of which are run with p solve = 5%, exhibit significantly lower training times than the other DFL methodologies; in fact, their training times are comparable to the PF approach. From this perspective, these methodologies can be viewed as bridging the gap between the PF and DFL approaches. The same conclusion generally holds for the other experiments as well. However, for relatively easy CO problems, system overhead sometimes dominates model training time, which might disrupt the ordering of the training times.

Discussion
Our experimental evaluation reveals that no single methodology performs best across all experiments: certain methodologies excel on specific test problems, while others perform better on different ones. Nevertheless, some interesting characteristics emerge from the experimental evaluations. Firstly, the performance of SPO is consistently robust across problem sets, even though it does not outperform the other techniques in every experiment. Secondly, MAP also demonstrates consistent performance across most problem sets; it exhibits low-quality performance only for Capacity = 180 in the knapsack problem and in the bipartite matching problem when ρ1 and ρ2 are 25%. Additionally, among the LTR losses, Listwise and Pairwise often exhibit high variance, especially in the scheduling and knapsack problems. Pairwise(diff) stands out among the LTR losses due to its lower variance; its performance is comparable to or slightly worse than MAP for most problems, other than the synthetic shortest path problem with high values of Deg, i.e., when the underlying predictive model is completely misspecified. Surprisingly, I-MLE, FY, DBB and QPTL perform worse than the PF approach on the portfolio optimization problem, where a quadratic constraint is present. Across the remaining problems, the performance of I-MLE is comparable to that of SPO and MAP, and DBB performs considerably worse than I-MLE only in the bipartite matching problem. FY performs well in certain cases, but it is more susceptible to high variance than I-MLE, which is particularly evident in the knapsack problem. Moreover, QPTL demonstrates good performance in most experiments; in fact, it outperforms the other models by a substantial margin in the bipartite matching problem. However, QPTL performs poorly compared to the others in the scheduling problem, which is an ILP.
In this case, the poor performance may be attributed to the fact that QPTL considers a relaxation of the ILP; in the scheduling problem, the LP solution might differ significantly from the true ILP solution. This is not the case for the knapsack problem, where the relaxed LP solution does not deviate significantly from the ILP solution. HSD also considers relaxed LPs for ILP problems; however, it performs worse than QPTL on all but the scheduling problem, where it performs considerably better. Finally, due to limited computational resources, we were unable to run QPTL and HSD on the Warcraft shortest path problem. This highlights the advantage of DFL methodologies that can make use of any blackbox combinatorial solver (Dijkstra's shortest path algorithm, for instance) to solve the CO problem. Continuing on the topic of computational cost, MAP and the LTR losses are considerably faster and less computationally intensive when run with low values of p solve. As MAP tends to have regret as low as SPO for most problem sets, it may be considered a favorable DFL technique for tackling large-scale real-world optimization problems.

Future Research Directions
While there is increasing interest in decision-focused learning research, the field still needs to evolve to incorporate new characteristics for tackling real-world problems. This section summarizes the wide range of challenges that remain open and presents a few promising directions for future research.
DFL for related tasks/ Task generalization. In the current DFL framework, the ML model is tailored to a particular optimization task. However, in many applications the CO problem might differ slightly across instantiations. For example, in the recent MIT-Amazon Last Mile Routing Challenge (Merchán et al., 2022), a TSP is solved every day to decide the routing of last-mile package delivery, but the nodes of the TSP change every day as the delivery locations vary. An interesting research direction would be to investigate how a model trained to minimize the regret of one optimization problem performs when evaluated on a similar but different optimization problem. Future work needs to advance the approach proposed by Tang and Khalil (2023a) by training the ML model with the aim of generalizing to new tasks.
Noise contrastive loss functions to learn parameters in the constraints. One key advantage of the noise contrastive loss function (called MAP in the experimental evaluations) proposed by Mulamba et al. (2021) is that it is differentiable. They view DFL as learning to contrast the likelihood of the ground-truth solution against a set of negative examples. However, this work does not consider the case of predicting parameters in the constraints. Future studies could extend the noise contrastive estimation approach to the prediction of parameters within the constraints, by learning to contrast the likelihood of feasible points with that of infeasible ones. The efficacy of such an approach, however, may depend on how the infeasible points are selected, which is why an empirical investigation into this aspect would provide valuable insights.
Robust decision-focused learning framework to learn parameters in the constraints. When parameters in the constraints of an optimization problem are predicted, the prescribed optimal decision might not be feasible with respect to the true parameters. In such scenarios, an interesting direction would be to recommend a solution that remains feasible under extreme distributional variations of the parameters. We believe a framework for optimizing average performance while minimizing worst-case constraint violations could open new tracks for theoretical research as well as practical applications. Research in this regard can take inspiration from the well-established field of robust optimization (Ben-Tal et al., 2009).
Surrogate loss functions in the absence of ground-truth cost parameters. In many real-world applications, the true cost parameters of the objective function might be latent variables: only the solutions, not the parameters, are observed. The parameters are then unavailable for supervised learning, which entails the use of a task loss other than regret. DFL frameworks that implement a differentiable optimization layer, such as DBB, QPTL or I-MLE, are compatible with any task loss. However, the SPO approach, which comes with a theoretical proof of convergence, requires the ground-truth cost vector for gradient computation; this is also true of the noise contrastive and LTR losses, whose computation and differentiation do not involve solving the CO problem. Developing surrogate loss functions that require neither solving the CO problem nor the true cost vector would be a valuable contribution with potential in real-world applications.
Decision-focused learning by score function gradient estimation. Most DFL techniques compute the derivative dx⋆(ĉ)/dĉ analytically or construct a surrogate task that provides useful gradients. However, there exists an alternative: zeroth-order estimation of the gradient. A widely used approach to zeroth-order optimization is score function gradient estimation (Williams, 1992). To apply it in DFL, one assumes the predicted parameters follow a distribution and computes a Monte Carlo estimate of the regret by sampling cost vectors from that distribution. Score function gradient estimation then returns a gradient that moves the parameters of the distribution in directions that favor sampling cost vectors with low regret (or task loss in general). Although the score function gradient estimator is unbiased, a major challenge is its high variance, which might destabilize learning. Hence, further research examining the potential application of score function gradient estimation in DFL would be a valuable contribution.
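A minimal sketch of this idea on a toy top-k problem (the names and the Gaussian parameterization are our illustrative choices): the mean of the predicted-cost distribution would be nudged in directions that lower the expected regret.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk(c, k=2):
    """Toy CO oracle: binary vector selecting the top-k entries of c."""
    x = np.zeros_like(c)
    x[np.argsort(-c)[:k]] = 1.0
    return x

def score_function_grad(mu, c_true, k=2, sigma=0.1, n_samples=2000):
    """Monte Carlo estimate of d E[regret] / d mu, where the predicted
    cost is sampled as c_hat ~ N(mu, sigma^2 I).  The score of the
    Gaussian is (c_hat - mu) / sigma^2, so the REINFORCE estimator
    averages regret(c_hat) * score(c_hat) over the samples."""
    best = c_true @ topk(c_true, k)
    grad = np.zeros_like(mu)
    for _ in range(n_samples):
        c_hat = mu + sigma * rng.standard_normal(mu.shape)
        regret = best - c_true @ topk(c_hat, k)
        grad += regret * (c_hat - mu) / sigma**2
    return grad / n_samples

c_true = np.array([3.0, 2.0, 1.0])
# at mu = c_true, the sampled rankings (almost) never change, so the
# regret, and hence the estimated gradient, is zero
g = score_function_grad(c_true.copy(), c_true)
```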
Non-linear objective function. Most works in DFL consider optimization problems with linear objectives, which is the primary reason such problems have been considered for the experimental evaluations in this work. Any convex optimization problem with a nonlinear objective function can be differentiated through Cvxpylayers (Agrawal et al., 2019a); however, no DFL technique considers nonlinear objectives with discrete decision variables. As many real-world problems in OR are combinatorial optimization problems with discrete decision variables, developing ML techniques for such problems could be beneficial in practice. For example, the problem of optimally locating substations in an electrical network to minimize distribution costs is formulated as a nonlinear program (Lakhera et al., 2011); another classic OR problem without a linear objective is the minimization of makespan in flowshop scheduling. Most of the methodologies discussed in this paper cannot handle such problems.
Bilevel optimization techniques for DFL. As mentioned in Section 2.2, the empirical regret minimization problem can be cast as a pessimistic bilevel optimization problem. We believe that understanding the mathematical structure behind the learning process can lead to better algorithms for DFL, leaving the door open for the bilevel optimization community to tackle this problem.
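Concretely, the pessimistic bilevel form of empirical regret minimization can be sketched as follows; this is our paraphrase using generic notation ($m_\omega$ the ML model with weights $\omega$, $z_i$ the features, $c_i$ the true cost vectors, $f$ the objective, $\mathcal{F}$ the feasible region), not a verbatim reproduction of the formulation in Section 2.2:

```latex
\min_{\omega} \; \sum_{i=1}^{N} \; \max_{x \in X^\star(m_\omega(z_i))}
  \Big[ f(x, c_i) - f\big(x^\star(c_i), c_i\big) \Big]
\qquad \text{where} \qquad
X^\star(\hat{c}) = \operatorname*{arg\,min}_{x \in \mathcal{F}} f(x, \hat{c}).
```

The inner maximization over the optimal solution set $X^\star$ is what makes the problem pessimistic: when the predicted costs admit multiple optimal solutions, the worst one for the true costs is charged to the learner.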
Optimization as an intermediate layer within neural networks. In a Predict-Then-Optimize problem, the final task is to make a decision by solving a CO problem. However, in many other applications the optimization task may appear as an intermediate task. For instance, consider the task of selecting relevant patches in high-resolution images, where the patches are used for a downstream image recognition task. In (Cordonnier et al., 2021), the patch selection task is modeled as a Top-k selection CO problem. Note that the Top-k selection is embedded as an intermediate layer between two neural networks: the upstream neural network assigns a score to each patch and the downstream neural network performs the recognition task. Techniques such as I-MLE, DBB, QPTL and DPO, which are implementations of differentiable optimization layers, can be applied to tackle problems like this. Although the existence of layers downstream of the CO problem may give rise to novel challenges, embedding the CO problem as an intermediate layer could find extensive use across various domains.
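As a sketch of how such an intermediate layer can be smoothed, the following implements the forward pass of a perturbed Top-k in the spirit of DPO: hard Top-k masks are averaged under Gaussian perturbations of the scores. The scores, noise scale, and sample count are illustrative assumptions, and a full implementation would also propagate gradients through the perturbation:

```python
import numpy as np

def hard_topk(scores, k):
    """Hard CO oracle: 0/1 mask selecting the k highest-scoring items."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

def perturbed_topk(scores, k, eps=0.5, n_samples=1000, rng=None):
    """Smoothed Top-k: average hard Top-k masks over Gaussian perturbations
    of the scores, yielding a differentiable-in-expectation relaxation."""
    rng = np.random.default_rng(0) if rng is None else rng
    masks = [hard_topk(scores + eps * rng.normal(size=scores.shape), k)
             for _ in range(n_samples)]
    return np.mean(masks, axis=0)

scores = np.array([2.0, 0.1, 1.5, -0.3])   # e.g., upstream network's patch scores
soft = perturbed_topk(scores, k=2)
# 'soft' is a fractional selection mask that a downstream network can consume.
```

Each averaged mask still sums exactly to k, so the relaxation stays on the face of the feasible polytope while becoming smooth in the scores.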
Construction of solution caches. Loss functions that utilize a solution cache are very effective at addressing the computational cost of DFL and are promising for large NP-hard real-life CO problems. However, we believe there is room for research studying the tradeoff between the size of the solution cache and solution quality.
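A minimal sketch of such a cache follows; the wrapper interface, the probability of exact solving, the enumeration-based oracle, and the toy instance are all our own illustrative assumptions rather than any particular published implementation:

```python
import random

def solve_exact(c, feasible):
    """Exact oracle (here: plain enumeration): cheapest feasible x under costs c."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

class CachedSolver:
    """Wrap an exact solver with a solution cache: call the (expensive) exact
    solver only with probability p_solve; otherwise return the best cached
    solution under the current predicted costs."""
    def __init__(self, solver, p_solve=0.1, seed=0):
        self.solver = solver
        self.p_solve = p_solve
        self.cache = []
        self.rng = random.Random(seed)

    def __call__(self, c):
        if not self.cache or self.rng.random() < self.p_solve:
            x = self.solver(c)            # expensive exact solve; grow the cache
            if x not in self.cache:
                self.cache.append(x)
            return x
        # cheap path: linear scan over previously seen solutions only
        return min(self.cache, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

feasible = [(1, 0), (0, 1), (1, 1)]
cached = CachedSolver(lambda c: solve_exact(c, feasible), p_solve=0.1)
x = cached((5.0, 1.0))   # first call: cache is empty, so an exact solve occurs
```

The tradeoff flagged above is visible here: a small `p_solve` keeps training cheap, but a cache that grows too slowly may return stale solutions and bias the loss.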

Conclusion
The survey article begins by underscoring the significance of Predict-Then-Optimize problem formulations, wherein an ML model is followed by a CO problem. The Predict-Then-Optimize problem has emerged as a powerful driving force in numerous real-world applications of artificial intelligence, operations research and business analytics. The key challenge in Predict-Then-Optimize problems is predicting the unknown CO problem parameters in a manner that yields high-quality solutions, in comparison to the retrospective solutions obtained using the ground-truth parameters. To address this challenge, the DFL paradigm has been proposed, wherein ML models are trained directly with task losses that capture the error incurred after solving the CO problems. However, to date, there has been no comprehensive survey on DFL. This survey provides a comprehensive overview of DFL, highlighting recent technological advancements and applications, and identifying potential future research directions. The problem description was laid out at the beginning with examples, followed by the fundamental challenges in decision-focused learning. Section 3 then presented a categorisation of DFL techniques into five categories, thoroughly explaining each and highlighting the trade-offs among them. Section 4 provided examples of applications of DFL techniques to real-world Predict-Then-Optimize problems across different domains. Furthermore, an extensive comparative evaluation of eleven DFL techniques on different problem sets was provided in Section 5. Finally, a discussion of open problems in DFL and an outline of potential research directions were presented. While there has been significant recent progress in DFL, challenges remain. For instance, the development of DFL techniques that can handle uncertain parameters occurring anywhere within a generic CO problem would have a significant impact on various industrial applications.
We hope this survey article will help readers understand the paradigm of decision-focused learning and grasp the fundamental challenges of implementing it in real-world applications. We also hope it will act as a catalyst, inspiring the application of decision-focused learning in diverse domains and contexts, as well as stimulating further methodological research and advancements.