sunny-as2: Enhancing SUNNY for Algorithm Selection

SUNNY is an Algorithm Selection (AS) technique originally tailored for Constraint Programming (CP). Given a CP problem, SUNNY schedules a subset of solvers from a portfolio to be run on it. This approach has proved effective for CP problems, and its parallel version won many gold medals in the Open category of the MiniZinc Challenge -- the yearly international competition for CP solvers. In 2015, the ASlib benchmarks were released to compare AS systems coming from disparate fields (e.g., ASP, QBF, and SAT), and SUNNY was extended to deal with generic AS problems. This led to the development of sunny-as2, an algorithm selector based on SUNNY for ASlib scenarios. A preliminary version of sunny-as2 was submitted to the Open Algorithm Selection Challenge (OASC) in 2017, where it turned out to be the best approach for the runtime minimization of decision problems. In this work, we present the technical advancements of sunny-as2, including: (i) wrapper-based feature selection; (ii) a training approach combining feature selection and neighbourhood size configuration; (iii) the application of nested cross-validation. We show how the performance of sunny-as2 varies depending on the considered AS scenarios, and we discuss its strengths and weaknesses. Finally, we show how sunny-as2 improves on its preliminary version submitted to the OASC.


Introduction
Solving combinatorial problems is hard and, especially for NP-hard problems, no single algorithm dominates on every class of problems. A natural way to face the disparate nature of combinatorial problems and obtain a globally better solver is to use a portfolio of different algorithms (or solvers) to be selected on different problem instances. The task of identifying suitable algorithm(s) for specific instances of a problem is known as per-instance Algorithm Selection (AS). By using AS, portfolio solvers are able to outperform state-of-the-art single solvers in many fields such as Propositional Satisfiability (SAT), Constraint Programming (CP), Answer Set Programming (ASP), and Quantified Boolean Formula (QBF).
A significant number of domain-specific AS strategies have been studied. However, it is hard if not impossible to judge which of them is the best strategy in general. To address this problem, the Algorithm Selection library (ASlib) (Bischl et al., 2016) has been proposed. ASlib consists of scenarios collected from a broad range of domains, with the aim of comparing different AS techniques on the same ground. Based on the ASlib benchmarks, rigorous validations and AS competitions have recently been held.
In this paper, we focus on the SUNNY portfolio approach, originally developed to solve Constraint Satisfaction Problems (CSPs). SUNNY is based on the k-nearest neighbors (k-NN) algorithm. Given a previously unseen problem instance P, it first extracts its feature vector F(P), i.e., a collection of numerical attributes characterizing P, and then finds the k training instances "most similar" to F(P) according to the Euclidean distance. Afterwards, SUNNY selects the best solvers for these k instances, and assigns to each selected solver a time slot proportional to the number of instances it solves. Finally, the selected solvers are sorted by average solving time and then executed on P.
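The neighbor-finding step just described can be sketched as follows. This is a toy illustration, not the sunny-as2 code: the data layout (a dict mapping instance names to feature vectors) is a hypothetical simplification.

```python
import math

def k_nearest(feat, training, k):
    """Return the k training instances closest to feature vector `feat`
    (Euclidean distance), as in SUNNY's k-NN step. `training` maps
    instance names to feature vectors (a hypothetical layout)."""
    dist = lambda f: math.sqrt(sum((a - b) ** 2 for a, b in zip(feat, f)))
    return sorted(training, key=lambda name: dist(training[name]))[:k]

train = {"p1": [0.10, 0.90], "p2": [0.80, 0.20], "p3": [0.15, 0.85]}
print(k_nearest([0.12, 0.88], train, 2))  # → ['p1', 'p3']
```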
Initially designed for CSPs, SUNNY has then been customized to solve Constraint Optimization Problems (COPs) and to enable the parallel execution of its solvers. The resulting portfolio solver, called sunny-cp, won the gold medal in the Open Track of the MiniZinc Challenge (Stuckey, Feydy, Schutt, Tack, & Fischer, 2014)-the yearly international competition for CP solvers-in 2015, 2016, and 2017 (Amadini, Gabbrielli, & Mauro, 2018).
In 2015, SUNNY was extended to deal with general AS scenarios-of which CP problems are a particular case (Amadini, Biselli, Gabbrielli, Liu, & Mauro, 2015b). The resulting tool, called sunny-as, natively handled ASlib scenarios and was submitted to the 2015 ICON Challenge on Algorithm Selection (Kotthoff, Hurley, & O'Sullivan, 2017) to be compared with other AS systems. Unfortunately, the outcome was not satisfactory: only a few competitive results were achieved by sunny-as, which turned out to be particularly weak on SAT scenarios. We therefore decided to improve the performance of sunny-as by following two main paths: (i) feature selection, and (ii) neighborhood size configuration.
Feature selection (FS) is a well-known process that consists of removing redundant and potentially harmful features from feature vectors. It is well established that a good feature selection can lead to significant performance gains. In the 2015 ICON challenge, one version of sunny-as used a simple filter method based on information gain that, however, did not bring any benefit. This is not surprising, because filter methods are efficient but agnostic of the specific predictive task to be performed-they work as a pre-processing step regardless of the chosen predictor. Hence we decided to move to wrapper methods, which are more computationally expensive-they use the prediction system of interest to assess the selected features-but typically more accurate.
The neighborhood size configuration (shortly, k-configuration) consists in choosing an optimal k-value for the k-nearest neighbors algorithm on which SUNNY relies. sunny-as did not use any k-configuration in the 2015 ICON challenge, and this definitely penalized its performance. For example, Lindauer, Bergdoll, and Hutter (2016) pointed out that SUNNY can be significantly boosted by training and tuning its k-value.
The new insights on feature selection and k-configuration led to the development of sunny-as2, an extension of sunny-as that enables the SUNNY algorithm to learn both the supposed best features and the neighborhood size. We are not aware of other AS approaches selecting features and neighborhood size in the way sunny-as2 does it. Moreover, sunny-as2 exploits a polynomial-time greedy version of SUNNY making the training phase more efficient-the worst-case time complexity of the original SUNNY is indeed exponential in the size of the portfolio.
In 2017, a preliminary version of sunny-as2 was submitted to the Open Algorithm Selection Challenge (OASC), a revised edition of the 2015 ICON challenge. Thanks to the new enhancements, sunny-as2 achieved much better results (Lindauer, van Rijn, & Kotthoff, 2019): it reached the overall third position and, in particular, it was the approach achieving the best runtime minimization for satisfaction problems (i.e., the goal for which SUNNY was originally designed). Later on, as we shall see, the OASC version of sunny-as2 was further improved. In particular, we tuned the configuration of its parameters (e.g., cross-validation mode, size of the training set, etc.) after conducting a comprehensive set of experiments over the OASC scenarios.
In this work, we detail the technical improvements of sunny-as2 by showing their impact on different scenarios of the ASlib. The original contributions of this paper include:
• the description of sunny-as2 and its variants, i.e., sunny-as2-f, sunny-as2-k and sunny-as2-fk, performing respectively feature selection, k-configuration, and both;
• extensive and detailed empirical evaluations showing how the performance of sunny-as2 can vary across different scenarios, and motivating the default settings of sunny-as2 parameters;
• an original and in-depth study of the SUNNY algorithm, including insights on the instances unsolved by sunny-as2 and the use of a greedy approach as a surrogate of the original SUNNY approach;
• an empirical comparison of sunny-as2 against different state-of-the-art algorithm selectors, showing a promising and robust performance of sunny-as2 across different scenarios and performance metrics.
We performed a considerable number of experiments to understand the impact of the new technical improvements. Among the lessons we learned, we mention that:
• feature selection and k-configuration are quite effective for SUNNY, and perform better when integrated;
• the greedy approach enables a training methodology which is faster and more effective w.r.t. a training performed with the original SUNNY approach;
• the "similarity assumption" on which the k-NN algorithm used by SUNNY relies, stating that similar instances have similar performance, is weak if not wrong;
• the effectiveness of an algorithm selector is strongly coupled to the evaluation metric used to measure its performance. Nonetheless, sunny-as2 appears to be more robust than other approaches when changing the performance metric.
The performance of sunny-as2 naturally varies according to the peculiarities of the given scenario and the chosen performance metric. We noticed that sunny-as2 performs consistently well on scenarios having a reasonable amount of instances and where the theoretical speedup of a portfolio approach, w.r.t. the best solver of the scenario, is not minimal.
We also noticed that a limited amount of training instances is enough to reach a good prediction performance and that the nested cross-validation leads to more robust results. In addition, the results of our experiments corroborate some previous findings, e.g., that it is possible to reach the best performance by considering only a small neighborhood size and a small number of features.
Paper structure. In Sect. 2 we review the literature on Algorithm Selection. In Sect. 3 we give background notions before describing sunny-as2 in Sect. 4. Sect. 5 describes the experiments over different configurations of sunny-as2, while Sect. 6 provides more insights on the SUNNY algorithm, including a comparison with other AS approaches. We draw concluding remarks in Sect. 7, while the Appendix contains additional experiments and information for the interested reader.

Related Work
Algorithm Selection (AS) aims at identifying on a per-instance basis the relevant algorithm, or set of algorithms, to run in order to enhance the problem-solving performance. This concept finds wide application in decision problems as well as in optimization problems, although most of the AS systems have been developed for decision problems -in particular for SAT/CSP problems. However, given the generality and flexibility of the AS framework, AS approaches have also been used in other domains such as combinatorial optimization, planning, scheduling, and so on. In the following, we provide an overview of the most known and successful AS approaches we are aware of. For further insights about AS and related problems, we refer the interested reader to the comprehensive surveys in Kerschke, Hoos, Neumann, and Trautmann (2019); Kotthoff (2016); Amadini, Gabbrielli, and Mauro (2015c); Smith-Miles (2008).
About a decade ago, AS began to attract the attention of the SAT community and portfolio-based techniques started their spread. In particular, suitable tracks were added to the SAT competition to evaluate the performance of portfolio solvers. SATzilla (Xu, Hutter, Hoos, & Leyton-Brown, 2008) was one of the first SAT portfolio solvers. Its first version (Xu et al., 2008) used a ridge regression method to predict the effectiveness (i.e., the runtime or a performance score) of a SAT solver on unforeseen SAT instances. This version won several gold medals in the 2007 and 2009 SAT competitions.
In 2012, a new version of SATzilla was introduced. This implementation improved the previous version with a weighted random forest approach provided with a cost-sensitive loss function, punishing misclassifications in direct proportion to their performance impact. These improvements allowed SATzilla to outperform the previous version and to win the SAT Challenge in 2012.
Another well-known AS approach for SAT problems is 3S (Kadioglu, Malitsky, Sabharwal, Samulowitz, & Sellmann, 2011). Like SUNNY, the 3S selector relies on k-NN under the assumption that the performances of different solvers are similar for instances with similar features. 3S combines AS and algorithm scheduling, in static and dynamic ways.
Besides the SAT and CSP settings, the flexibility of the AS framework led to the construction of effective algorithm portfolios in related settings. For example, portfolio solvers such as Aspeed and claspfolio have been proposed for solving Answer-Set Programming (ASP) problems. Aspeed (Hoos, Kaminski, Lindauer, & Schaub, 2015) is a variant of 3S where the per-instance long-running solver selection has been replaced by a solver schedule. Lindauer et al. (2016) released ISA, which further improved Aspeed by introducing a "timeout-minimal" optimization objective in the schedule generation. Claspfolio (Hoos, Lindauer, & Schaub, 2014) supports different AS mechanisms (e.g., ISAC-like, 3S-like, SATzilla-like) and was a gold medallist in different tracks of the 2009 and 2011 ASP Competitions. The contribution of ME-ASP (Maratea, Pulina, & Ricca, 2013a) is also worth mentioning. ME-ASP identifies one solver per ASP instance. To make its prediction robust, it exploits the strength of several independent classifiers (six in total, including k-NN, SVM, random forests, etc.) and chooses the best one according to their cross-validation performances on training instances. An improvement of ME-ASP is described in Maratea, Pulina, and Ricca (2013b), where the authors added the capability of updating the learned policies when the original approach fails to give good predictions. The idea of coupling classification with policy adaptation methods comes from AQME (Pulina & Tacchella, 2009), a multi-engine solver for quantified Boolean formulas (QBF).
SATzilla has been rather influential also outside the SAT domain. For example, in AI planning, Planzilla (Rizzini, Fawcett, Vallati, Gerevini, & Hoos, 2017) and its improved variants (model-based approaches) were all inspired by the random forests and regression techniques proposed by SATzilla/Zilla. Similarly, for Satisfiability Modulo Theories (SMT) problems, MachSMT (Scott, Niemetz, Preiner, Nejati, & Ganesh, 2021) was recently introduced and its essential parts also rely on random forests. The main difference between the model-based Planzilla selector and MachSMT is that the former chooses solvers minimizing the ratio between solved instances and solving time, while the latter only considers the solving time of candidate solvers.
A number of AS approaches have been developed to tackle optimization problems. In this case, mapping SAT/CSP algorithm selection techniques to the more general Max-SAT (Ansótegui, Gabàs, Malitsky, & Sellmann, 2016) and Constraint Optimization Problem (COP) (Amadini et al., 2016b) settings is not so straightforward. The main issue here is how to evaluate sub-optimal solutions, and optimal solutions for which optimality has not been proved by a solver. A reasonable performance metric for optimization problems computes a (normalized) score reflecting the quality of the best solution found by a solver in a given time window. However, one can also consider other metrics taking into account the anytime performance of a solver, i.e., the sub-optimal solutions it finds during the search (see, e.g., the area score of Amadini et al. (2016b)).
We reiterate here the importance of tracking the sub-optimal solutions for AS scenarios, especially for those AS approaches that, like SUNNY, schedule more than one solver. The importance of a good anytime performance has also been acknowledged by the MiniZinc Challenge (Stuckey, Becket, & Fischer, 2010), the yearly international competition for CP solvers, which starting from 2017 introduced the area score: the area under the curve defined by f_s(t) = v, where v is the best value found by solver s at time t. To our knowledge, SUNNY is the only general-purpose AS approach taking into account the area score to select a solver: the other approaches only consider the best value f_s(τ) at the stroke of the timeout τ.
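As a rough illustration of an anytime metric, the sketch below integrates the step curve f_s(t) for a minimizing solver. This is a simplified stand-in: the actual MiniZinc Challenge area score involves a normalization not reproduced here.

```python
def area_score(solutions, timeout, v_worst):
    """Area under the step curve f_s(t): best objective value (minimization)
    found by time t; v_worst is assumed before the first solution is found.
    Simplified sketch of an area-style anytime metric (lower is better)."""
    area, t_prev, v = 0.0, 0.0, v_worst
    for t, val in sorted(solutions):      # (time, value) pairs
        area += v * (t - t_prev)
        t_prev, v = t, val
    return area + v * (timeout - t_prev)

# Value 10 found at t=100, improved to 4 at t=600 (τ = 1000, worst = 20):
print(area_score([(100, 10), (600, 4)], 1000, 20))  # 20*100 + 10*500 + 4*400 = 8600.0
```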
Considering the AS approaches that attended the 2015 ICON challenge (Lindauer et al., 2019), apart from sunny-as, five other AS systems were submitted: ASAP, AutoFolio, FlexFolio, Zilla, and ZillaFolio. It is worth noticing that, unlike SUNNY, all of them are hybrid systems combining different AS approaches.
ASAP (Algorithm Selector And Prescheduler system) (Gonard, Schoenauer, & Sebag, 2017) relies on random forests and k-NN. It combines pre-solving scheduling and per-instance algorithm selection by training them jointly.
AutoFolio (Lindauer, Hoos, Hutter, & Schaub, 2015) combines several algorithm selection approaches (e.g., SATZilla, 3S, SNNAP, ISAC, LLAMA (Kotthoff, 2013)) in a single system and uses algorithm configuration (Hutter, Hoos, & Leyton-Brown, 2011) to search for the best approach and its hyperparameter settings for the scenario at hand. Along with the scenarios in ASlib, AutoFolio also demonstrated its effectiveness in dealing with Circuit QBFs (Hoos, Peitl, Slivovsky, & Szeider, 2018). Unsurprisingly, this paper also shows that the quality of the selected features can substantially impact the selection accuracy of AutoFolio.
FlexFolio (Lindauer, 2015) is a claspfolio-based AS system (Hoos et al., 2014) integrating various feature generators, solver selection approaches, solver portfolios, as well as solver-schedule-based pre-solving techniques into a single, unified framework.
Zilla is an evolution of SATzilla (Xu et al., 2008) using pair-wise, cost-sensitive random forests combined with pre-solving schedules. ZillaFolio combines Zilla and AutoFolio by first evaluating both approaches on the training set. Then, it chooses the best one for generating the predictions for the test set.
The OASC 2017 challenge included a preliminary version of sunny-as2, improved versions of ASAP (i.e., ASAP.v2 and ASAP.v3), an improved version of Zilla (i.e., *Zilla) and a new contestant which came in two flavors: AS-ASL and AS-RF. Both AS-ASL and AS-RF (Malone, Kangas, Järvisalo, Koivisto, & Myllymäki, 2017) used a greedy wrapper-based feature selection approach, with the AS selector as evaluator, to locate relevant features. The system was trained differently for the two versions: AS-ASL uses an ensemble learning model while AS-RF uses random forests. A final schedule is built on the trained model.
One thing that ASAP.v2/3, *Zilla and AS-RF/ASL have in common is that all of them attempt to solve an unseen problem instance by statically scheduling a number of solvers before the AS process. AS-ASL selects a single solver, while ASAP and *Zilla define a static solver schedule. A comprehensive summary of the above approaches, as well as several challenge insights, is given in Lindauer et al. (2019).
For the sake of completeness we also mention parallel AS approaches, although they do not fall within the scope of this paper. The parallel version of SUNNY won several gold medals in the MiniZinc Challenges by selecting relevant solvers to run in parallel on a per-instance basis. In contrast, the work by Lindauer, Hoos, Leyton-Brown, and Schaub (2017) studied methods for static parallel portfolio construction. In addition to selecting relevant solvers, they also identify well-performing parameter values for the selected solvers. Given a limited time budget for training, a large number of candidate solvers, and their wide configuration space, the task of building a parallel portfolio is not trivial. Therefore, they examined greedy techniques to speed up their procedures, and clause sharing for algorithm configuration to improve prediction performance. Likewise, in the domain of AI planning, portfolio parallelization has also been investigated. An example is the static parallel portfolio proposed by Vallati, Chrpa, and Kitchin (2018), where planners are scheduled to each available CPU core.
We conclude by mentioning some interesting AS approaches that, however, did not attend the 2015 and 2017 challenges. The work by Ansotegui, Sellmann, and Tierney (2018) is built upon CSHC. They first estimate the confidence of the predicted solutions and then use the estimations to decide whether it is appropriate to substitute the solution with a static schedule. By using the OASC dataset, the authors demonstrated a significant improvement over the original CSHC approach, reaching state-of-the-art performance in several scenarios. In Mısır and Sebag (2017), the AS problem is seen as a recommendation problem solved with the well-known technique of collaborative filtering (Ekstrand, Riedl, & Konstan, 2011). This approach has a performance similar to the initial version of sunny-as. In Loreggia, Malitsky, Samulowitz, and Saraswat (2016), the authors introduce an original approach that transforms the text-encoded instances for the AS into a 2-D image. These images are later processed by a Deep Neural Network system to predict the best solver to use for each of them. This approach makes it possible to discover (and also generate) relevant features for Algorithm Selection. Preliminary experiments are quite encouraging, even though this approach still lags behind state-of-the-art approaches that exploit crafted instance features.

Preliminaries
In this section we formalize the Algorithm Selection problem (Bischl et al., 2016) and the metrics used to evaluate algorithm selectors. We then briefly introduce the feature selection process and the SUNNY algorithm on which sunny-as and sunny-as2 rely. We conclude by providing more details about the OASC and its scenarios.

Algorithm Selection Problem and Evaluation Metrics
To create an algorithm selector we need a scenario with more than one algorithm to choose, some instances on which to apply the selector, and a performance metric to optimize. This information can be formally defined as follows.
Definition 1 (AS scenario). An AS scenario is a triple (I, A, m) where:
• I is a set of instances,
• A is a set (or portfolio) of algorithms (or solvers) with |A| > 1,
• m : I × A → R is a performance metric.
Without loss of generality, from now on we assume that lower values for the performance metric m are better, i.e., the goal is to minimize m.
An algorithm selector, or shortly a selector, is a function that for each instance of the scenario aims to return the best algorithm, according to the performance metric, for that instance. Formally:
Definition 2 (Selector). Given an AS scenario (I, A, m), a selector s is a total mapping from I to A.
The algorithm selection problem (Rice, 1976) consists in creating the best possible selector. Formally:
Definition 3 (AS Problem). Given an AS scenario (I, A, m), the AS Problem is the problem of finding the selector s such that the overall performance ∑_{i∈I} m(i, s(i)) is minimized.
If the performance metric m is fully defined, the AS Problem can be easily solved by assigning to every instance the algorithm with the lowest value of m. Unfortunately, in the real world, the performance metric m on I is only partially known. In this case, the goal is to define a selector able to estimate the value of m for the instances i ∈ I where it is unknown. A selector can be validated by partitioning I into a training set I_tr and a test set I_ts. The training instances of I_tr are used to build the selector s, while the test instances of I_ts are used to evaluate the performance of s: ∑_{i∈I_ts} m(i, s(i)). As we shall see, the training set I_tr can be further split to tune and validate the parameters of the selector.
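The evaluation scheme just described boils down to summing m over the test instances. A minimal sketch with a hypothetical two-solver scenario (instance names, solver names, and runtimes are invented for illustration):

```python
# Hypothetical toy scenario: m[(instance, algorithm)] = runtime in seconds.
m = {("i1", "A"): 3.0, ("i1", "B"): 9.0,
     ("i2", "A"): 7.0, ("i2", "B"): 2.0}

def cost(selector, instances):
    """Cumulative performance of a selector: sum of m(i, s(i)) over I_ts."""
    return sum(m[(i, selector(i))] for i in instances)

always_A = lambda i: "A"                                 # a trivial selector
oracle   = lambda i: min("AB", key=lambda a: m[(i, a)])  # the oracle (VBS)
print(cost(always_A, ["i1", "i2"]))  # → 10.0
print(cost(oracle,   ["i1", "i2"]))  # → 5.0
```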
Different approaches have been proposed to build and evaluate an algorithm selector. First of all, since the instances of I are often too hard to solve in a reasonable time, typically a solving timeout τ is set. For this reason, the performance metric is often extended with other criteria to penalize an algorithm selector that does not find any solution within the timeout. One of the most used is the Penalized Average Runtime (PAR) score with penalty λ > 1, which counts each instance not solved within the timeout as λ times the timeout.
Formally, if m denotes the runtime, the PAR score with penalty λ is defined as:

PAR_λ(i, A) = m(i, A) if m(i, A) ≤ τ, and λ · τ otherwise.

For example, in both the 2015 ICON challenge and the OASC, the PAR10 score was used for measuring the selectors' performance on every single scenario. Unfortunately, the PAR value can greatly change across different scenarios according to the timeout, making it difficult to assess the global performance across all the scenarios. Hence, when dealing with heterogeneous scenarios, it is often better to consider normalized metrics. As baselines, one can consider the performance of the single best solver (SBS, the best individual solver according to the performance metric) of the scenario as an upper bound and the performance of the virtual best solver (VBS, the oracle selector always able to pick the best solver for all the instances in the test set) as a lower bound. Ideally, the performance of a selector should be in between the performance of the SBS and that of the VBS. However, while an algorithm selector can never outperform the VBS, it might happen that it performs worse than the SBS. This is more likely to happen when the gap between SBS and VBS is exiguous.
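A per-instance PAR score is straightforward to compute; the sketch below uses toy runtimes and λ = 10:

```python
def par(runtime, timeout, lam=10):
    """Per-instance PAR score: actual runtime if solved within the timeout,
    lam * timeout otherwise (lam = 10 gives the PAR10 score)."""
    return runtime if runtime < timeout else lam * timeout

runtimes = [50.0, 1200.0, 1800.0]   # the third run hit the 1800s timeout
scores = [par(t, 1800) for t in runtimes]
print(sum(scores) / len(scores))    # PAR10 average: (50 + 1200 + 18000) / 3
```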
Two metrics are often used in the literature to compare algorithm selectors: the speedup or improvement factor (Lindauer et al., 2019) and the closed gap. The speedup is a number that measures the relative performance of two systems. If m_s and m_VBS are respectively the cumulative performances of a selector s and the virtual best solver across all the instances of a scenario, the speedup of the VBS w.r.t. the selector is defined as the ratio between m_s and m_VBS. Since the selector cannot be faster than the VBS, this value is always at least 1, and values closer to 1 are better. To normalize this metric into a bounded interval (its upper bound varies across different scenarios), the fraction can be inverted by considering the ratio between m_VBS and m_s. In this case the value always falls in (0, 1], and the greater the value, the better the selector.
Unlike the speedup, the closed gap score measures how good a selector is in improving the performance of the SBS w.r.t. the VBS in the AS scenario. Assuming that m_SBS is the cumulative performance of the SBS across all the instances of the scenario, the closed gap is defined as:

(m_SBS − m_s) / (m_SBS − m_VBS)

A good selector will have a performance m_s close to that of the virtual best solver, which makes the closed gap score close to 1. On the contrary, a poor performance consists of having m_s close to m_SBS, thus making the closed gap close to 0 if not even lower.
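Both metrics are one-line computations. The sketch below uses hypothetical cumulative performances for a selector, the SBS, and the VBS:

```python
def normalized_speedup(m_s, m_vbs):
    """m_VBS / m_s: always in (0, 1], closer to 1 is better."""
    return m_vbs / m_s

def closed_gap(m_s, m_sbs, m_vbs):
    """(m_SBS - m_s) / (m_SBS - m_VBS): close to 1 means VBS-like,
    close to 0 (or below) means no better than the SBS."""
    return (m_sbs - m_s) / (m_sbs - m_vbs)

# Hypothetical cumulative performances on a scenario:
m_s, m_sbs, m_vbs = 400.0, 900.0, 300.0
print(normalized_speedup(m_s, m_vbs))   # → 0.75
print(closed_gap(m_s, m_sbs, m_vbs))    # (900-400)/(900-300) ≈ 0.833
```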
An alternative way to evaluate the performance of algorithm selectors is to use comparative scores without considering the SBS and VBS baselines. For example, in the MiniZinc Challenge (Stuckey et al., 2014) a Borda count is used to measure the performance of CP solvers. The Borda count lets voters (here, instances) rank the candidates (solvers) according to their preferences, giving each candidate a number of points corresponding to the number of candidates ranked below it. Once all votes have been counted, the candidate with the most points is the winner. This scoring system can be applied to algorithm selectors in a straightforward way.
Formally, let (I, A, m) be a scenario, S a set of selectors, and τ the timeout. Let us denote with m(i, s) the performance of selector s on problem i. The Borda score of selector s ∈ S on instance i ∈ I is

Borda(i, s) = ∑_{s' ∈ S − {s}} cmp(m(i, s), m(i, s'))

where, following the MiniZinc Challenge convention, the comparative function cmp(t, t') scores 0 if s does not solve i within τ, 1 if s solves i while s' does not, and t'/(t + t') if both solve it. Since cmp is always in [0, 1], the score Borda(i, s) is always in [0, |S| − 1]: the higher its value, the more selectors s can beat. When considering multiple instances, the winner is the selector s that maximizes the sum of the scores over all instances, i.e., ∑_{i∈I} Borda(i, s).
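A toy rendering of this scoring scheme, assuming the MiniZinc-style pairwise comparison described above (selector names and runtimes are invented):

```python
def cmp_score(t, t_other, timeout):
    """Pairwise comparison: 0 if s fails on i, 1 if s solves i and the
    opponent fails, otherwise the opponent's share of the total runtime."""
    if t >= timeout:
        return 0.0
    if t_other >= timeout:
        return 1.0
    return t_other / (t + t_other)

def borda(runtimes_on_i, s, timeout):
    """Borda(i, s): sum of cmp against every other selector on instance i.
    runtimes_on_i maps selector name -> runtime on that instance."""
    return sum(cmp_score(runtimes_on_i[s], runtimes_on_i[o], timeout)
               for o in runtimes_on_i if o != s)

runs = {"s1": 100.0, "s2": 300.0, "s3": 1800.0}   # s3 timed out (τ = 1800)
print(borda(runs, "s1", 1800))  # 300/400 + 1 = 1.75
print(borda(runs, "s3", 1800))  # → 0.0
```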

Feature Selection
Typically, AS scenarios characterize each instance i ∈ I with a corresponding feature vector F(i) ∈ R^n, and the selection of the best algorithm A for i is actually performed according to F(i), i.e., A = s(F(i)). The feature selection (FS) process allows one to consider smaller feature vectors F'(i) ∈ R^m, derived from F(i) by projecting a number m ≤ n of its features. The purpose of feature selection is to simplify the prediction model, lower the training and feature-extraction costs, and hopefully improve the prediction accuracy.
FS techniques (Guyon & Elisseeff, 2003) basically consist of a combination of two components: a search technique for finding good subsets of features, and an evaluation function to score these subsets. Since exploring all the possible subsets of features is computationally intractable for non-trivial feature spaces, heuristics are employed to guide the search of the best subsets. Greedy search strategies usually come in two flavors: forward selection and backward elimination. In forward selection, features are progressively incorporated into larger and larger subsets. Conversely, in backward elimination features are progressively removed starting from all the available features. A combination of these two techniques, genetic algorithms, or local search algorithms such as simulated annealing are also used.
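A minimal sketch of greedy forward selection driven by a wrapper-style evaluator. The evaluator here is a toy stand-in (the `useful` weights are invented); a real wrapper would train and score the selector on a validation split for each candidate subset:

```python
def forward_selection(features, evaluate, max_size):
    """Greedy forward selection: repeatedly add the feature that most
    improves the wrapper score (lower is better); stop when nothing helps."""
    selected, best = [], evaluate(frozenset())
    while len(selected) < max_size:
        scored = [(evaluate(frozenset(selected + [f])), f)
                  for f in features if f not in selected]
        if not scored:
            break
        score, f = min(scored)
        if score >= best:   # no candidate improves the score: stop early
            break
        selected, best = selected + [f], score
    return selected

# Toy wrapper evaluator: pretend only "a" and "c" carry signal.
useful = {"a": 5.0, "c": 3.0}
evaluate = lambda subset: 10.0 - sum(useful.get(f, 0.0) for f in subset)
print(forward_selection(["a", "b", "c"], evaluate, 3))  # → ['a', 'c']
```

Backward elimination works symmetrically, starting from the full feature set and greedily removing the feature whose deletion improves (or least degrades) the score.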
There are different ways of classifying FS approaches. A well-established distinction is between filters and wrappers. Filter methods select the features regardless of the prediction model, trying to suppress the least interesting ones. These methods are particularly efficient and robust to overfitting. In contrast, wrappers evaluate subsets of features by running the prediction model on them, thus possibly detecting interactions between features. Wrapper methods can be more accurate than filters, but have two main disadvantages: they are more exposed to overfitting, and they have a much higher computational cost. More recently, hybrid and embedded FS methods have also been proposed (Jovic, Brkic, & Bogunovic, 2015). Hybrid methods combine wrappers and filters to get the best of these two worlds. Embedded methods are instead integrated into the learning algorithm, i.e., they perform feature selection during the model training.
In this work we do not consider filter methods. We refer the interested readers to Amadini, Biselli, Gabbrielli, Liu, and Mauro (2015a) to know more about SUNNY with filter-based FS.

SUNNY and sunny-as
The SUNNY portfolio approach was first introduced in Amadini et al. (2014). SUNNY relies on a number of assumptions: (i) a small portfolio is usually enough to achieve a good performance; (ii) solvers either solve a problem quite quickly, or cannot solve it in a reasonable time; (iii) solvers perform similarly on similar instances; (iv) a too heavy training phase is often an unnecessary burden. In this section we briefly recap how SUNNY works, while in Sect. 6 we shall address these assumptions in more detail, especially in light of the experiments reported in Sect. 5.
SUNNY is based on the k-nearest neighbors (k-NN) algorithm and embeds built-in heuristics for schedule generation. Although the original version of SUNNY handled CSPs only, here we describe its generalised version, the one we used to tackle general ASlib scenarios.
Let us fix the set of instances I = I_tr ∪ I_ts, the set of algorithms A, the performance metric m, and the runtime timeout τ. Given a test instance x ∈ I_ts, SUNNY produces a sequential schedule σ = [(A_1, t_1), . . . , (A_h, t_h)] where algorithm A_i ∈ A runs for t_i seconds on x and ∑_{i=1}^{h} t_i = τ. The schedule is obtained as follows. First, SUNNY employs k-NN to select from I_tr the subset I_k of the k instances closest to x according to the Euclidean distance computed on the feature vector F(x). Then, it uses three heuristics, H_sel, H_all, and H_sch, to compute σ: they respectively select a subset of solvers, allocate time to them, and schedule their sequential execution according to their performance in I_k.

Table 1: Runtime (in seconds). τ means the solver timeout.
The heuristics H_sel, H_all, and H_sch are based on the performance metric m, and depend on the application domain. For CSPs, H_sel selects the smallest set of solvers S ⊆ A that solves the most instances in I_k, using the runtime for breaking ties; H_all allocates to each A_i ∈ S a time t_i proportional to the number of instances it solves in I_k, using a special backup solver to cover the instances of I_k that are not solvable by any solver; finally, H_sch sorts the solvers by increasing solving time in I_k. For Constraint Optimization Problems the approach is similar, but different evaluation metrics are used to also consider the objective value and sub-optimal solutions (Amadini et al., 2016b). For more details about SUNNY we refer the interested reader to Amadini et al. (2014, 2016b). Example 1 below illustrates how SUNNY works on a given CSP.

Example 1. Let x be a CSP, A = {A_1, A_2, A_3, A_4} a portfolio, A_3 the backup solver, τ = 1800 seconds the solving timeout, I_k = {x_1, ..., x_5} the k = 5 neighbors of x, and the runtime of solver A_i on problem x_j defined as in Tab. 1. In this case, the smallest sets of solvers that solve the most instances in the neighborhood are {A_1, A_2, A_3}, {A_1, A_2, A_4}, and {A_2, A_3, A_4}. The heuristic H_sel selects S = {A_1, A_2, A_4} because these solvers are faster in solving the instances in I_k. Since A_1 and A_4 solve 2 instances each, A_2 solves 1 instance, and x_1 is not solved by any solver, the time window [0, τ] is partitioned into 2 + 2 + 1 + 1 = 6 slots: 2 assigned to A_1, 2 to A_4, 1 to A_2, and 1 to the backup solver A_3. Finally, H_sch sorts the solvers in ascending order by average solving time in I_k. The final schedule produced by SUNNY is, therefore, σ = [(A_4, 600), (A_1, 600), (A_3, 300), (A_2, 300)].
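To make the three heuristics concrete, here is a toy re-implementation of the schedule construction on its own invented neighborhood (not the authors' code, and not the Table 1 data; the tie-breaking and the per-instance slot assignment are simplifying assumptions):

```python
from itertools import combinations

def sunny_schedule(runtimes, backup, timeout):
    """Sketch of SUNNY's schedule construction on a neighborhood I_k.
    runtimes[s][x]: runtime of solver s on neighbor x; >= timeout = unsolved."""
    solvers = list(runtimes)
    instances = list(next(iter(runtimes.values())))
    solved = {s: {x for x in instances if runtimes[s][x] < timeout}
              for s in solvers}

    # H_sel: maximize covered neighbors, then minimize subset size and runtime.
    def key(sub):
        cov = set().union(*(solved[s] for s in sub))
        rt = sum(min(runtimes[s][x] for s in sub) for x in cov)
        return (-len(cov), len(sub), rt)
    best = min((sub for r in range(1, len(solvers) + 1)
                for sub in combinations(solvers, r)), key=key)
    covered = set().union(*(solved[s] for s in best))

    # H_all: one slot per covered neighbor (assigned here to its fastest
    # selected solver, a simplification); uncovered neighbors each give
    # one slot to the backup solver.
    slots = {s: 0 for s in best}
    for x in covered:
        slots[min(best, key=lambda s: runtimes[s][x])] += 1
    slots[backup] = slots.get(backup, 0) + len(instances) - len(covered)
    slot_time = timeout / sum(slots.values())

    # H_sch: sort scheduled solvers by total runtime on the neighborhood.
    order = sorted((s for s in slots if slots[s] > 0),
                   key=lambda s: sum(runtimes[s][x] for x in instances))
    return [(s, slots[s] * slot_time) for s in order]

# Hypothetical neighborhood of 4 instances, τ = 1800, backup solver "A3":
runtimes = {"A1": {"x1": 10, "x2": 1800, "x3": 20, "x4": 1800},
            "A2": {"x1": 1800, "x2": 50, "x3": 1800, "x4": 100},
            "A3": {"x1": 1800, "x2": 1800, "x3": 700, "x4": 1800}}
print(sunny_schedule(runtimes, "A3", 1800))  # → [('A1', 900.0), ('A2', 900.0)]
```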
One of the goals of SUNNY is to avoid overfitting w.r.t. the performance of the solvers in the selected neighbors. For this reason, their runtime is only marginally used to allocate time to the solvers. A similar but more runtime-dependent approach like, e.g., CPHydra (Bridge et al., 2012) would instead compute a runtime-optimal allocation [(A_1, 3), (A_2, 593), (A_4, 122)], able to cover all the neighborhood instances, and then distribute this allocation over the solving time window [0, τ]. SUNNY does not follow this logic, to avoid being too tied to the strong assumption that the runtime in the neighborhood faithfully reflects the runtime on the instance to be solved. To understand the rationale behind this choice, consider the CPHydra-like schedule above: A_1 is the solver with the best average runtime in the neighborhood, but its time slot is about 200 times smaller than that of A_2, and about 40 times smaller than that of A_4. This schedule is clearly skewed towards A_2, which after all is the solver with the worst average runtime in the neighborhood.
As one might expect, the design choices of SUNNY have pros and cons. For example, unlike the CPHydra-like schedule, the schedule produced by SUNNY in Example 1 cannot solve the instance x_2, although x_2 is actually part of the neighborhood. More insights on SUNNY are provided in Sect. 6.
By default, SUNNY does not perform any feature selection: it simply removes all the features that are constant over each F(x), and scales the remaining features into the range [−1, 1] (scaling features is important for algorithms based on k-NN). The default neighborhood size is √|I_tr|, possibly rounded to the nearest integer. The backup solver is the solver A* ∈ A minimising the sum Σ_{i∈I_tr} m(i, A*), which is usually the SBS of the scenario.
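These defaults can be rendered as follows. This is a hypothetical sketch of the described behavior, not the sunny-as implementation; function names and interfaces are ours:

```python
import math

def preprocess(feature_matrix):
    """Drop constant features and min-max scale the rest into [-1, 1].
    feature_matrix: one row (feature vector) per training instance."""
    cols = [c for c in zip(*feature_matrix) if min(c) < max(c)]
    scaled_cols = [[-1 + 2 * (v - min(c)) / (max(c) - min(c)) for v in c]
                   for c in cols]
    return [list(row) for row in zip(*scaled_cols)]

def defaults(train_instances, portfolio, m):
    """Default neighbourhood size and backup solver, where m(i, s) is the
    performance metric of solver s on instance i (lower is better)."""
    k = round(math.sqrt(len(train_instances)))
    backup = min(portfolio, key=lambda s: sum(m(i, s) for i in train_instances))
    return k, backup
```

For instance, with 100 training instances the default neighborhood size is k = 10, and the backup solver is the one with the best cumulative metric, i.e., typically the single best solver.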
The sunny-as (Amadini, Biselli, et al., 2015b) tool implements the SUNNY algorithm to handle generic AS scenarios of the ASlib. In its optional pre-processing phase, performed offline, sunny-as can perform a feature selection based on different filter methods and select a pre-solver to be run for a limited amount of time. At runtime, it produces the schedule of solvers by following the approach explained above.

2017 OASC Challenge
In 2017, the COnfiguration and SElection of ALgorithms (COSEAL) group (COSEAL group, 2013) organized the first Open Algorithm Selection Challenge (OASC) to compare different algorithm selectors.
The challenge is built upon the Algorithm Selection library (ASlib) (Bischl et al., 2016) which includes a collection of different algorithm selection scenarios. ASlib distinguishes between two types of scenarios: runtime scenarios and quality scenarios. In runtime scenarios the goal is to select an algorithm that minimizes the runtime (e.g., for decision problems). The goal in quality scenarios is instead to find the algorithm that obtains the highest score according to some metric (e.g., for optimization problems). ASlib does not consider the anytime performance: the sub-optimal solutions computed by an algorithm are not tracked. This makes it impossible to reconstruct ex-post the score of interleaved executions. For this reason, in the OASC the scheduling was allowed only for runtime scenarios.
The 2017 OASC consisted of 11 scenarios: 8 runtime and 3 quality scenarios. Differently from the previous ICON challenge for Algorithm Selection held in 2015, the OASC used scenarios from a broader range of domains, coming from recent international competitions on CSP, MAXSAT, MIP, QBF, and SAT. In the OASC, each scenario was evaluated with a single pair of training and test sets, replacing the 10-fold cross-validation of the ICON challenge. The participants had access to the performance and feature data of the training instances (2/3 of the total), and only to the instance features of the test instances (1/3 of the total).
In this paper, since SUNNY produces a schedule of solvers, which is not usable for quality scenarios, we focus only on runtime scenarios. An overview of these scenarios, with their number of instances, algorithms, features, and timeouts, is shown in Tab. 2. 3

sunny-as2
sunny-as2 is the evolution of sunny-as and the selector that entered the 2017 OASC competition. The most significant innovation of sunny-as2 is arguably the introduction of an integrated approach where the features and the k-value can be co-learned during the training step. This makes sunny-as2 "less lazy" than the original SUNNY approach, which only scaled the features into [−1, 1] without performing any actual training. 4 The integrated approach we developed is similar to what has been done by Zyout, Abdel-Qader, and Jacobs (2011) and Park and Kim (2015) in the context of biology and medicine. However, to the best of our knowledge, no similar approach has been developed for algorithm selection. Based on training data, sunny-as2 automatically selects the most relevant features and/or the most promising value of the neighborhood parameter k to be used for online prediction. We recall that, differently from sunny-as2, sunny-as had only limited support for filter-based feature selection, only allowed the manual configuration of SUNNY parameters, and did not support all the evaluation modalities of the current selector.
The importance of feature selection and parameter configuration for SUNNY was independently discussed, with empirical experiments, by Lindauer et al. (2016) and Amadini, Biselli, et al. (2015a). In particular, Amadini, Biselli, et al. (2015a) demonstrated the benefits of a filter-based feature selection, while Lindauer et al. (2016) highlighted that parameters like the schedule size |σ| and the neighborhood size k can have a substantial impact on the performance of SUNNY. In this regard, the authors introduced TSUNNY, a version of SUNNY that, by allowing the configuration of both the |σ| and k parameters, yielded a remarkable improvement over the original SUNNY. Our work is however different because: first, we introduce a greedy variant of SUNNY for selecting subsets of solvers; second, we combine wrapper-based feature selection and k-configuration, while their system does not deal with feature selection.
To improve the configuration accuracy and robustness, and to assess the quality of a parameter setting, sunny-as2 relies on cross-validation (CV) (Kohavi, 1995). Cross-validation is useful to mitigate the well-known problem of overfitting. In this regard, it is fundamental to split the dataset properly. For example, in the OASC only one split between test and training instances was used to evaluate the performance of algorithm selectors. As also noticed by the OASC organizers (Lindauer et al., 2019), randomness played an important role in the competition. In particular, they stated that "this result demonstrates the importance of evaluating algorithm selection systems across multiple random seeds, or multiple test sets".
To evaluate the performance of our algorithm selector by overcoming the overfitting problem and to obtain more robust and rigorous results, in this work we adopted a repeated nested cross-validation approach (Loughrey & Cunningham, 2005). A nested cross-validation consists of two CVs, an outer CV which forms test-training pairs, and an inner CV applied on the training sets used to learn a model that is later assessed on the outer test sets.
The original dataset is split into five folds, thus obtaining five pairs (T_1, S_1), ..., (T_5, S_5) where the T_i are the outer training sets and the S_i are the (outer) test sets, for i = 1, ..., 5.
For each T_i we then perform an inner 10-fold CV to get a suitable parameter setting. We split each T_i into ten further sub-folds T_i,1, ..., T_i,10 and, in turn for j = 1, ..., 10, we use the sub-fold T_i,j as a validation set to assess the parameter setting computed on the inner training set, i.e., the union ∪_{k≠j} T_i,k of the other nine sub-folds. We then select, among the 10 configurations obtained, the one for which SUNNY achieves the best PAR10 score on the corresponding validation set. The selected configuration is used to run SUNNY on the paired test set S_i. Finally, to reduce the variability and increase the robustness of our approach, we repeated the whole process five times by using different random partitions. The performance of sunny-as2 on each scenario was then assessed by considering the average closed gap scores over all the 5 × 5 = 25 test sets.
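The repeated nested CV just described can be sketched as follows. This is an illustrative skeleton, not the sunny-as2 code: `learn_config` stands for the inner training step (returning a configuration and its validation PAR10, lower being better) and `evaluate` for the outer test score; both names are ours:

```python
import random

def repeated_nested_cv(instances, learn_config, evaluate,
                       repeats=5, outer=5, inner=10, seed=0):
    """Repeated nested cross-validation: 5 repetitions, 5 outer folds,
    10 inner folds; returns the average score over the 25 outer test sets.

    learn_config(train, validation) -> (config, par10_on_validation)
    evaluate(config, test)          -> score on the outer test set
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        data = instances[:]
        rng.shuffle(data)                       # fresh random partition
        folds = [data[i::outer] for i in range(outer)]
        for i in range(outer):
            test = folds[i]                     # outer test set S_i
            train = [x for j in range(outer) if j != i for x in folds[j]]
            subs = [train[j::inner] for j in range(inner)]
            # inner CV: keep the config with the best (lowest) validation PAR10
            best_cfg, _ = min(
                (learn_config([x for l in range(inner) if l != j for x in subs[l]],
                              subs[j])
                 for j in range(inner)),
                key=lambda cs: cs[1])
            scores.append(evaluate(best_cfg, test))
    return sum(scores) / len(scores)            # average over 5 x 5 = 25 tests
```

Note that the inner folds here are formed by simple round-robin dealing; sunny-as2 additionally supports the stratified and rank splits described below.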
Before explaining how sunny-as2 learns features and k-value, we first describe greedy-SUNNY, the "greedy variant" of SUNNY.

greedy-SUNNY
The selection of solvers performed by SUNNY might be too computationally expensive, i.e., exponential in the size of the portfolio in the worst case. Therefore, to perform a quicker estimation of the quality of a parameter setting, we introduced a simpler variant of SUNNY that we called greedy-SUNNY.
As for SUNNY, the mechanism of schedule generation in greedy-SUNNY is driven by the concept of marginal contribution , i.e., how much a new solver can improve the overall portfolio. However, greedy-SUNNY differs from SUNNY in the way the schedule of solvers is computed. Given the set N of the instances of the neighborhood, the original SUNNY approach computes the smallest set of solvers in the portfolio that maximizes the number of solved instances in N . The worst-case time complexity of this procedure is exponential in the number of available solvers.
To overcome this limitation, greedy-SUNNY starts from an empty set of solvers S and adds to it one solver at a time by selecting the one that is able to solve the largest number of instances in N . The instances solved by the selected solver are then removed from N and the process is repeated until a given number λ of solvers is added to S or there are no more instances to solve (i.e., N = ∅). 5 Based on some empirical experiments, the default value of λ was set to a small value (i.e., 3) as also suggested by the experiments in Lindauer et al. (2016). If λ is a constant, the time-complexity of greedy-SUNNY is O(nk) where k = |N | and n is the number of available solvers.
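The greedy selection just described can be sketched as follows. This is an illustrative rendering of the procedure, not the sunny-as2 source; the function name and the `solves` predicate are ours:

```python
def greedy_sunny_select(portfolio, neighbourhood, solves, lam=3):
    """Greedy sub-portfolio selection of greedy-SUNNY.

    solves(s, x) -> True iff solver s solves instance x within the timeout.
    Adds one solver at a time, each time the one covering the most of the
    still-unsolved neighbourhood, until lam solvers are picked or no
    instances remain.
    """
    S, N = [], set(neighbourhood)
    while N and len(S) < lam:
        best = max(portfolio, key=lambda s: sum(solves(s, x) for x in N))
        covered = {x for x in N if solves(best, x)}
        if not covered:          # remaining instances are unsolvable: stop
            break
        S.append(best)
        N -= covered
    return S
```

Each of the at most λ iterations scans the n solvers over at most k = |N| instances, giving the O(nk) worst-case complexity stated above for constant λ.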

Learning the Parameters
sunny-as2 provides different procedures for learning features and/or the k value. The configuration procedure is performed in two phases: (i) data preparation, and (ii) parameters configuration.

Data Preparation
The dataset is first split into 5 folds (T_1, S_1), ..., (T_5, S_5) for the outer CV, and each T_i is in turn split into T_i,1, ..., T_i,10 for the inner CV by performing the following four steps: 1) each training instance is associated to the solver that solves it in the shortest time; 2) for each solver, the list of its associated instances is ordered from the hardest to the easiest in terms of runtime; 3) we select one instance at a time from the set associated to each solver until a global limit on the number of instances is reached; 4) the selected instances are finally divided into 10 folds for cross-validation. 6 At step 4), sunny-as2 offers three choices: random split, stratified split (Kohavi, 1995), and rank split (a.k.a. systematic split in Reitermanova (2010)). The random split simply partitions the instances randomly. The stratified split guarantees that for each label (in our context, the best solver for that instance) all the folds contain roughly the same percentage of instances. The rank split ranks the instances by their hardness, represented by the sum of their runtimes; then each fold takes one instance in turn from the ranked list.
While the stratified approach distributes the instances based on the best solver able to solve them, the rank method tries to distribute the instances based on their hardness. In the former case, every fold will likely have a witness for every label, while in the latter every fold will be a mixture of easy and hard instances.
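The rank split can be sketched as follows, assuming hardness is measured as the sum of the solvers' runtimes on an instance; the function name and the `runtimes` mapping are ours:

```python
def rank_split(instances, runtimes, n_folds=10):
    """Rank (systematic) split: order instances from hardest to easiest
    by total runtime across all solvers, then deal them round-robin so
    that each fold mixes easy and hard instances."""
    hardness = lambda x: sum(runtimes[x])   # runtimes[x]: per-solver times
    ranked = sorted(instances, key=hardness, reverse=True)
    return [ranked[i::n_folds] for i in range(n_folds)]
```

A stratified split would instead group the instances by best solver and deal each group across the folds, balancing labels rather than hardness.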

Parameters Configuration
sunny-as2 enables the automated configuration of the features and/or the k-value by means of the greedy-SUNNY approach introduced in Sect. 4.1. The user can choose between three different learning modes, namely: 1. sunny-as2-k. In this case, all the features are used and only the k-configuration is performed by varying k in the range [1, maxK ] where maxK is an external parameter set by the user. The best value of k is then chosen.
2. sunny-as2-f. In this case, the neighborhood size k is set to its default value (i.e., the square root of the number of instances, rounded to the nearest integer) and a wrapper-based feature selection is performed. Iteratively, starting from the empty set, sunny-as2-f adds to the set of already selected features the one that best decreases the PAR10 score. The iteration stops when the PAR10 score increases or a time cap is reached.
3. sunny-as2-fk. In this case, both the features and the k-value are learned: the iteration stops when the addition of a feature, with k varying in [1, maxK], does not improve the PAR10 score or a given time cap is reached. The resulting feature set and k-value are chosen for the online prediction. 7
Algorithm 1 (Configuration procedure of sunny-as2-fk) shows through pseudocode how sunny-as2-fk selects the features and the k-value. The learnFK algorithm takes as input the portfolio of algorithms A, the maximum schedule size λ for greedy-SUNNY, the set of training instances I, the maximum neighborhood size maxK, the original set of features F, and the upper bound maxF on the maximum number of features to be selected. learnFK returns the learned value bestK ∈ [1, maxK] for the neighborhood size and the learned set of features bestF ⊆ F having |bestF| ≤ maxF.
After the i-th iteration of the outer for loop (Lines 7-17), the current set of features currFeatures consists of exactly i features. Each time currFeatures is set, the inner for loop is executed n times to also evaluate different values of k on the dataset I. The evaluation is performed by the function getScore, returning the score of a particular setting obtained with greedy-SUNNY (cf. Sect. 4.1). However, getScore can easily be generalized to assess the score of a setting obtained with an algorithm different from greedy-SUNNY (e.g., the original SUNNY approach). At the end of the outer for loop, if adding a new feature could not improve the score obtained in the previous iteration (i.e., with |currFeatures| − 1 features), the learning process terminates. Otherwise, both the features and the k-value are updated and a new iteration begins, until the score cannot be further improved or the maximum number of features maxF is reached.
7. Since sunny-as2-fk integrates the feature selection into the k-configuration process, it may be considered an embedded FS method.
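The learnFK procedure can be sketched in Python as follows. This is a simplified rendering of the pseudocode, with the time cap omitted; `get_score` plays the role of getScore and returns a PAR10-like score (lower is better):

```python
def learn_fk(features, instances, max_k, max_f, get_score):
    """Joint greedy feature selection and neighbourhood-size configuration.

    get_score(selected_features, k) -> PAR10-like score (lower is better).
    Returns the learned feature subset and k-value.
    """
    best_feats, best_k = [], 1
    best_score = float('inf')
    n = min(max_k, len(instances))
    for _ in range(min(max_f, len(features))):   # outer loop: add one feature
        cand = None
        for f in [f for f in features if f not in best_feats]:
            for k in range(1, n + 1):            # inner loop: try every k
                score = get_score(best_feats + [f], k)
                if score < best_score:
                    best_score, cand = score, (f, k)
        if cand is None:                         # no feature improves: stop
            break
        best_feats.append(cand[0])
        best_k = cand[1]
    return best_feats, best_k
```

Dropping the outer loop yields the sunny-as2-k variant (all features, learned k), while fixing k and dropping the inner loop yields sunny-as2-f.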
If d = min(maxF, |F|), n = min(maxK, |I|), and the worst-case time complexity of getScore is γ, then the overall worst-case time complexity of learnFK is O(d²nγ). This cost is still polynomial w.r.t. |A|, d, and n because getScore is polynomial, thanks to the fact that λ is a constant.
From learnFK one can easily deduce the algorithms for learning either the k-value alone (for sunny-as2-k) or the selected features alone (for sunny-as2-f): in the first case, the outer for loop is omitted because the features do not vary; in the second case, the inner loop is skipped because the value of k is constant.
We conclude this section by summarizing the input parameters that, unlike features and k-value, are not learned automatically by sunny-as2: 1. split mode: the way of creating validation folds for the inner CV, including: random, rank, and stratified split. Default: rank.
2. training instances limit: the maximum number of instances used for training. Default: 700.
3. feature limit: the maximum number of features for feature selection, used by sunny-as2-f and sunny-as2-fk. Default: 5.
5. schedule limit for training (λ): the limit of the schedule size for greedy-SUNNY. Default: 3.
6. seed: the seed used to split the training set into folds. Default: 100.
7. time cap: the time cap used by sunny-as2-f and sunny-as2-fk to perform the training. Default: 24 h.
The default values of these parameters were decided by conducting an extensive set of manual experiments over ASlib scenarios, with the goal of reaching a good trade-off between the performance and the time needed for the training phase (i.e., at most one day). In Sect. 5.3 we shall report some of these experiments.

Experiments
In this section we present (part of) the experiments we conducted over several different configurations of sunny-as2. We first present the benchmarks and the methodology used (Sect. 5.1). Then, in Sect. 5.2 we assess the impact of the new components of sunny-as2 to quantify what we can gain by learning the neighborhood size, by using a smaller number of features, and by using greedy-SUNNY instead of, or together with, the original SUNNY approach. Finally, in Sect. 5.3 we use as baseline sunny-as2-fk, i.e., sunny-as2's most comprehensive approach that exploits both the learning of the neighborhood size and the feature selection, to understand how its performance can vary by tuning one parameter at a time and by leaving the other parameters to their default values.
In the following, unless otherwise specified, sunny-as2 always denotes the sunny-as2-fk variant.

Experimental Setting
We evaluated sunny-as2 on the runtime scenarios of the ASlib. In particular, we selected the 8 runtime scenarios of the OASC challenge described in Sect. 3.4 (see Tab. 2). These scenarios contain problem instances belonging to the following domains: Constraint Satisfaction, Mixed-Integer Programming, SAT solving, Max-SAT solving, Quantified Boolean Formulas, and learning in Bayesian networks. To avoid biases towards a specific domain, 8 we added four more ASlib scenarios representing all those domains that were not considered in the OASC, namely: Answer Set Programming, Pre-marshalling problem, Subgraph Isomorphisms, and Traveling Salesman Problem (see Tab. 3).
We used the repeated nested cross-validation with 5 repetitions, 5 folds in the outer loop and 10 folds in the inner loop, explained in Sect. 4. For the OASC scenarios, we used only the instances belonging to the training set of the OASC competition since we later on wanted to check the performance of the last version of sunny-as2 on the OASC test sets. For the four additional scenarios, since they did not come with a separation between training and test sets, we instead applied the repeated cross-validation on all their instances. For each scenario, the performance of sunny-as2 was evaluated with the average closed gap score over all the repetitions. In each repetition, the closed gap score was calculated as explained in Sect. 3.1 by using the PAR 10 as performance metric m.
All the experiments were conducted on Linux machines equipped with Intel Core i5 3.30GHz processors and 8 GB of RAM. We used a time cap of 24 hours for learning the parameters. All the ASlib scenarios are publicly available at https://github.com/coseal/aslib_data.

Assessment of New Components
In this section we measure the impact of the new components we introduced in this paper. We assess what we can gain by learning the neighborhood size and/or the number of features, and how greedy-SUNNY can improve the original SUNNY algorithm.

Learning Modes
We compared the sunny-as2-f, sunny-as2-k, and sunny-as2-fk variants of sunny-as2 against the original version of sunny-as, which does not exploit any parameter configuration. We ran the different sunny-as2 learning modes with their default parameters on all the scenarios. Tab. 4 shows the average closed gap of each approach across all the repetitions performed. Interestingly, there is not a dominant learning mode. As also shown in Lindauer et al. (2016), a proper k-configuration leads to a good performance improvement for SUNNY: indeed, sunny-as2-k is able to reach the peak performance in 7 scenarios out of 12. However, sunny-as2-fk has the best average closed gap. One reason for this is the poor performance of sunny-as2-k in the TSP scenario.
The original sunny-as is clearly less promising than any other variant of sunny-as2, even though for the GRAPHS scenario it achieves the best performance. What we can conclude from Tab. 4 is that most of the performance improvement is due to the selection of the right neighborhood size k. However, feature selection can also give a positive contribution. Fig. 1 depicts with boxplots the closed gap scores reported in Tab. 4. Specifically, for each scenario we collected the corresponding 25 closed gap scores, one for each test set. Each box of Fig. 1 delimits the first and third quartiles of the closed gap distribution, while the horizontal line inside each box is the median. The vertical whiskers indicate the rest of the distribution, excluding the diamonds, which are considered outliers since they fall outside the inter-quartile ranges. The larger the box, the less stable a learning system is. For example, sunny-as2-f is quite unstable in the Caren, Mira, and Graphs scenarios. sunny-as2-fk looks more robust than sunny-as2-f, while sunny-as2-k and sunny-as seem to be slightly more stable in most cases.
Tab. 5 shows the average time (in minutes) spent for training each fold. As we can see, sunny-as2-k is the fastest approach, followed by sunny-as2-f and sunny-as2-fk. This is not surprising because learning features is a computationally expensive task, especially when wrapper methods are used.

greedy-SUNNY vs SUNNY
As mentioned in Sect. 4.1, greedy-SUNNY was introduced to speed up the training process.
Here we empirically show that greedy-SUNNY not only speeds up the training of sunny-as2, but also improves on the performance achieved by using the original SUNNY approach for training. In the following experiments we use sunny-as2 with its default parameters, varying only the approach adopted for generating the schedule of solvers, both for training and for testing. When the original SUNNY approach is used for training, we use a time limit of a week due to the long computation time it requires to create schedules.
Note that SUNNY and greedy-SUNNY are interchangeable because they share the same underlying AS approach. The only difference is the way they select the subsets of solvers: greedy-SUNNY uses a possibly sub-optimal greedy approach, while SUNNY relies on an exhaustive search, possibly exponential in the portfolio size. The output of the learnFK algorithm (Algorithm 1) is always a set of features F and a neighborhood size k, regardless of whether SUNNY or greedy-SUNNY is used on the validation set. These parameters are then used to compute the schedules on the unforeseen test instances, regardless of whether SUNNY or greedy-SUNNY is used to select the solvers.
The results are reported in Tab. 6. Column names denote the pairs of functions used for the training and testing respectively. For brevity, we write GSUNNY instead of greedy-SUNNY. For instance, the second column "SUNNY-GSUNNY" means that SUNNY has been used for training, and greedy-SUNNY for testing. Tab. 6 shows that the peak performance in each scenario is always reached when SUNNY is used for testing. This makes sense: using greedy-SUNNY on an unforeseen instance might be useful in a time-sensitive context where an exponential-time solvers' selection is not acceptable but, in general, SUNNY provides a more precise scheduling. However, the GSUNNY-GSUNNY column shows that on average the score of sunny-as2 using greedy-SUNNY only is not far from the best performance achieved by GSUNNY-SUNNY.
The most interesting aspect of Tab. 6 is probably that, on average, using greedy-SUNNY for learning the features and the k-value not only speeds up the training but also improves the prediction accuracy. Indeed, the scores achieved by GSUNNY-SUNNY and GSUNNY-GSUNNY are consistently better than those of SUNNY-SUNNY and SUNNY-GSUNNY. We conjecture that, in the training phase, it might be more important to prioritise the first λ solvers solving the most instances in the neighborhood rather than selecting a sub-portfolio from all the available solvers as done by SUNNY (we recall that the maximum schedule size λ for the default greedy-SUNNY is 3).
greedy-SUNNY can be particularly useful in scenarios with a large number of solvers. This is evident in Tab. 7, which reports the hours spent for training with the different approaches. 9 As expected, greedy-SUNNY is quicker than the original SUNNY approach for every considered scenario.

Tuning the Parameters
In this section we study the sensitivity of the parameters that sunny-as2 cannot learn, namely: the split mode for cross-validation, the limit on the number of features to select, the limit on the number of training instances, and finally the schedule limit λ. We conclude the section by reporting an analysis of the performance variability of sunny-as2. For all the experiments, we set the parameters of sunny-as2 to their default values and varied one parameter at a time. We mark with 'Timeout' the cases where the training phase for at least one fold did not finish within a day. When a training timeout occurs for a specific scenario, we assign to it a closed gap score of 0, i.e., the score of the single best solver. In other terms, if we cannot train a scenario within 24 hours, we simply assume that the single best solver is used for that scenario.

Cross-Validation
We study the effects of different cross-validation splits when training the model. Tab. 8 compares the different approaches on all the scenarios in our benchmark. For these experiments we set the internal parameters of sunny-as2 to their default values (cf. Sect. 4.2.2), except for the split mode. The three split modes we examined are: random, stratified, and rank. The random mode generates folds in a random way; the stratified mode generates folds based on the class label (the fastest algorithm); the rank mode first orders the instances according to their hardness (cf. Sect. 4.2.1), then systematically partitions them into the folds.
As shown in Tab. 8, the closed gap of rank CV is on average better than both random and stratified CV. It appears that distributing instances according to their hardness leads to more balanced folds, and this in turn implies a better training. However, there is not a single dominant approach: stratified is the best in four scenarios, rank in six, and random in only two scenarios. It appears that stratified CV performs better than rank in scenarios with a higher number of instances. Fig. 2 shows the boxplots of Tab. 8. We can see that the rank mode looks more stable than random and stratified modes in most scenarios except Svea.

Number of Training Instances
Here we study the impact of the number of training instances. As above, we fixed the default parameter values listed in Sect. 4.2.2, and we just varied the limit of training instances.
It is worth noting that, as detailed in the data preparation procedure (cf. Sect. 4.2), when the limit is below the total number of instances of a scenario, the instances are not selected randomly but chosen according to their best solvers and their hardness, in order to obtain a more representative training set. We ran sunny-as2 with different instance limits starting from 100 (the smallest scenario has fewer than 100 instances) with increments of 100, with a time cap of 1 day of computation per fold.
Tab. 9 presents the average closed gap scores for experiments with up to 1000 instances. The last column reports the results achieved by considering all the training instances, while the other columns contain the results achieved by considering a fixed number of training instances (i.e., 100, 200, ..., 1000). The last row reports the average closed gap score across all scenarios. If a scenario has fewer instances than required, we simply consider all of them.
We omit the results for the GRAPHS and TSP scenarios with more than 1000 instances, since their closed gaps are below the maximal value, reached with 800 instances for GRAPHS and with 500 instances for TSP. The GRAPHS scenario times out with 2500 instances, while TSP times out with 1500 instances.
We can note that, by reducing the number of training instances, the closed gap of sunny-as2 does not worsen significantly. Beyond 200 instances, increasing the number of training instances has a limited impact: the score oscillates between 0.41 and 0.46. The best average score of 0.4556 is obtained with 700 instances. We conjecture that this is partially due to the data preparation procedure (cf. Sect. 4.2.1), which picks the instances after ordering them by hardness, thus reducing the skewness of the folds. The resulting set of instances is large enough to form a homogeneous set reflecting the instance class distribution of the entire scenario, even after a random or stratified split. Adding more instances is not always beneficial. First, a large number of training instances deteriorates the runtime performance of the k-NN algorithm on which SUNNY relies, slowing down the solver selection process for both SUNNY and greedy-SUNNY. Second, it can also cause a performance degradation, probably because more instances can introduce additional noise that affects the selection of solvers by sunny-as2.

Number of Features
It is well established that a small number of features is often enough to provide competitive performance for an AS system (and for a machine learning system in general). For example, according to the reduced set analysis performed by Bischl et al. (2016), no AS scenario required more than 9 features. To better understand the impact of the number of features, we ran sunny-as2 with a feature limit from one to ten (i.e., maxF ∈ [1, 10] when calling the learnFK function shown in Algorithm 1), leaving the other parameters set to their default values as specified in Sect. 4.2.2. Tab. 10 shows the closed gap results. As we can see, the highest performance was often reached with a limited number of features, and in no scenario was the best performance achieved exclusively with the original feature set. Although there is not a dominant value for all the scenarios, the best overall average score is achieved when the limit of features is set to five.

Schedule Size for greedy-SUNNY
In the training process, greedy-SUNNY uses a parameter λ to limit the size of the generated schedule and to be faster than the original SUNNY approach when computing the schedule of solvers. We then investigated the performance of sunny-as2 by varying the λ parameter of greedy-SUNNY (see Algorithm 1 in Sect. 4).
One thing to note, before discussing the impact of varying λ for greedy-SUNNY, is that in general the original SUNNY approach does not produce large schedules. This can be seen, e.g., in Tab. 11, reporting the average size of the schedules found by the original SUNNY approach, and its standard deviation, when using a 5-fold cross-validation to train SUNNY. Although no limit on the schedule size is given, SUNNY tends to produce schedules with an average size that varies from one to three, generally around two. This happens because SUNNY aims to select the smallest subset of solvers solving the most instances in the neighborhood. This witnesses that, in order to understand how sunny-as2 performance is affected by the λ parameter, there is no need to consider high values of λ. For this reason, in Tab. 12 we report the closed gap score achieved when running sunny-as2 with its default values except λ, which is varied from one to six. By observing the average results for each λ value, we can see that the overall best performance was reached with λ set to three. If λ is less than three, the results are worse for most scenarios, and when λ is greater than three the performances are the same, if not slightly worse.

Variability
One of the major concerns when dealing with predictions is the potentially huge impact of randomness (e.g., how the folds are split) on the results. To cope with the variability of our experiments, we adopted the repeated nested cross-validation approach, which produces more robust results since the randomness of the inner cross-validation is evened out in the outer cross-validation. Additionally, the repetition of the experiments takes into account the variability induced by randomly creating the folds for the outer cross-validation. If we look at the performance on the single outer folds, we can notice that sunny-as2's performance can vary significantly. For example, Tab. 13 compares the closed gap scores on the training and test sets when running sunny-as2 with default parameters on the first repetition of the 5-fold cross-validation.
It is unsurprising that the closed gap is higher on the training instances, because in that case the instances used for training are also used for testing. It is more interesting to observe that the closed gap on the training set is sometimes not uniform (e.g., the TSP scenario) and that there can be significant differences between the closed gap of the training and test sets for certain folds (e.g., the Caren scenario). This could mean that a random fold split might have a big impact on the learning.
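The repeated nested cross-validation scheme described above can be sketched as follows. This is a minimal illustration, not the actual sunny-as2 training code: `repeated_nested_cv`, `knn_score` and the synthetic data are made up for the example; the inner loop configures a parameter on the training folds only, the outer loop estimates test performance, and repetitions reshuffle the outer folds.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def repeated_nested_cv(X, y, fit_eval, grid, n_rep=2, n_out=5, n_in=3, seed=0):
    """Repeated nested CV (sketch): inner CV picks a parameter value,
    outer CV measures its test score, repetitions vary the outer folds."""
    outer_scores = []
    for rep in range(n_rep):
        outer = KFold(n_splits=n_out, shuffle=True, random_state=seed + rep)
        for tr, te in outer.split(X):
            inner = KFold(n_splits=n_in, shuffle=True, random_state=seed)
            # configure the parameter using the training part only
            best = max(grid, key=lambda p: np.mean(
                [fit_eval(X[tr][i], y[tr][i], X[tr][j], y[tr][j], p)
                 for i, j in inner.split(X[tr])]))
            outer_scores.append(fit_eval(X[tr], y[tr], X[te], y[te], best))
    return float(np.mean(outer_scores))

def knn_score(Xtr, ytr, Xva, yva, k):
    # toy stand-in for a training/evaluation routine parameterized by k
    return KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr).score(Xva, yva)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
score = repeated_nested_cv(X, y, knn_score, grid=[1, 3, 5])
print(round(score, 2))
```

The per-fold scores collected in `outer_scores` are exactly the quantities whose spread Tab. 13 examines.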

Insights on SUNNY
In this section we study the SUNNY algorithm more in depth, exploring its strengths and weaknesses with the aim of finding meaningful patterns. We also show the virtual performance that the version of sunny-as2 described in this paper would have achieved in the 2017 OASC, and we provide a new empirical comparison including new scenarios and AS approaches. Before that, let us recall from Sect. 3.3 the informal assumptions on which the original SUNNY algorithm relied, namely: (i) a small portfolio is usually enough to achieve good performance; (ii) solvers either solve a problem quite quickly, or cannot solve it in a reasonable time; (iii) solvers perform similarly on similar instances; (iv) a too heavy training phase is often an unnecessary burden.
Points (i) and (ii) motivate the way SUNNY selects and schedules its solvers, respectively. In fact, if a few solvers are enough to solve a given set of problems, then the solver selection of SUNNY never incurs its worst case, which is exponential in the size of the portfolio. If condition (ii) holds, then the time allocated to each selected solver can be small, provided, of course, that the right solvers are chosen.
Points (iii) and (iv) explain why the k-NN algorithm was chosen for SUNNY. If assumption (iii) holds, then the solvers' performance over the neighborhood of a new, unforeseen instance is a good estimate of the solvers' performance on that instance. Assumption (iv) guided the choice of a lazy approach: indeed, the k-NN algorithm does not explicitly build a prediction model.
As seen in Sect. 4, sunny-as2 partly disagrees with point (iv): the sunny-as2-f and sunny-as2-fk variants of sunny-as2 actually mitigate the laziness of SUNNY by adding a training phase where k-configuration and/or feature selection are performed. The experiments of Sect. 5 somewhat confirmed hypothesis (i) (see, e.g., Sect. 5.3.4 and in particular Tab. 11) and rejected hypothesis (iv): a proper training phase, even if computationally expensive, may remarkably boost the performance of SUNNY. Let us now try to empirically understand whether conditions (ii) and (iii) are verified on the scenarios evaluated in Sect. 5. Figs. 3 and 4 plot the runtime of the SBS (green), the VBS (blue), and sunny-as2 (yellow) on every instance of each scenario. The instances are sorted in ascending order by the runtime of the corresponding algorithm selector. The runtime distributions depicted in the plots provide evidence for hypothesis (ii): the runtime curves are essentially flat until they grow very quickly towards the end.
Hypothesis (iii) informally states that solvers perform similarly on similar instances, assuming that the feature vectors are able to describe the nature of the instances. To get an idea of the similarity between instances and solvers' performance, we use the Jaccard index which, given sets A and B, is computed as J(A, B) = |A ∩ B| / |A ∪ B|. This index is a value between 0 (when A ∩ B = ∅) and 1 (when A = B) that measures the similarity of sets A and B: the higher the index, the more similar the sets. For each instance i ∈ I of a given scenario (I, A, m) we compute J(F i , P i ), where F i ⊆ I is the "regular" k-neighborhood computed by sunny-as2 according to the feature vectors, and P i ⊆ I is the "oracle-like" k-neighborhood computed by sunny-as2 according to the performance vectors (m(i, A 1 ), . . . , m(i, A n )), where A = {A 1 , . . . , A n }. Fig. 5 shows the average Jaccard index computed over all the training instances for each repetition, using the runtime as metric for the performance vectors. As we can see, the average index is usually pretty low: it is below 0.1 for the majority of scenarios and the maximum index is below 0.2. The average value over all the scenarios is 0.0636, i.e., on average, for every ten instances of P i ∪ F i less than one belongs to the intersection P i ∩ F i . The low Jaccard index raises major doubts about the assumption that solvers perform similarly on similar instances. We discuss this aspect more in depth in the next section.
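The computation of the average Jaccard index between the two kinds of k-neighborhoods can be sketched as follows. This is a minimal illustration with random, hypothetical feature and performance vectors; the distance metric (Euclidean) and the helper names are assumptions for the example.

```python
import numpy as np

def knn_ids(vectors, i, k):
    """Indices of the k nearest neighbours of instance i (Euclidean)."""
    d = np.linalg.norm(vectors - vectors[i], axis=1)
    d[i] = np.inf  # exclude the instance itself from its neighbourhood
    return set(np.argsort(d)[:k])

def avg_jaccard(features, perfs, k):
    """Average J(F_i, P_i) between the feature-based neighbourhood F_i
    and the performance-based neighbourhood P_i over all instances."""
    js = []
    for i in range(len(features)):
        F, P = knn_ids(features, i, k), knn_ids(perfs, i, k)
        js.append(len(F & P) / len(F | P))
    return float(np.mean(js))

rng = np.random.default_rng(1)
feats = rng.random((50, 10))   # hypothetical feature vectors
perfs = rng.random((50, 5))    # hypothetical runtime vectors (m(i, A1..A5))
print(avg_jaccard(feats, perfs, k=8))  # unrelated spaces → low index
```

When `features` and `perfs` induce unrelated neighbourhoods, the index stays close to the one expected for random k-subsets, which matches the low values observed in Fig. 5.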

Hard Scenarios for SUNNY
Let us now closely investigate the scenarios where sunny-as2 struggled, trying to extract meaningful patterns.
We start by extending the study performed in Sect. 5.3.5 by considering the closed gap distribution over the 25 training/test folds of all the repetitions (Tab. 13 of Sect. 5 refers to the first repetition only). Fig. 6 shows the closed gap performance with boxplots. We found two indicators that seem to represent well the link between sunny-as2's performance and the AS scenarios, i.e., the number of instances unsolved by the SBS and the speedup of the VBS w.r.t. the SBS. Both metrics somehow measure the distance between SBS and VBS: the former only focuses on the problems solved, while the latter also takes runtime into account. Figs. 6a and 6b show the closed gap score distributions for each scenario, sorted respectively by increasing number of SBS-unsolved instances and by speedup of the VBS w.r.t. the SBS. For representation purposes, the few closed gap scores below −1 were replaced with −1.
From Fig. 6a and Fig. 6b one can clearly see that sunny-as2 tends to have a stronger and more stable performance in scenarios with higher values of the two indicators (e.g., Bado and CPMP). Conversely, its performance is poor in scenarios with lower values for these indicators (e.g., TSP and Mira).
The plots in Fig. 6a and Fig. 6b are similar. However, the position of the Caren scenario in Fig. 6b may suggest that the number of SBS-unsolved instances is a more reliable indicator for analyzing the performance of sunny-as2 in terms of closed gap.
Overall, SUNNY seems not to work well when the SBS has little room for improvement. We argue that the difficulties of SUNNY in scenarios with low values of the above indicators are quite natural. Conversely, an algorithm selector performing too well in those scenarios might denote overfitting w.r.t. the few instances for which the SBS is not a good choice. Moreover, there can be other (co-)explanations for the bad performance of SUNNY on TSP, Mira and Caren. In fact, TSP is the scenario with the lowest number of algorithms (only 4), and the performance of the SBS almost overlaps with that of the VBS (see Fig. 3a). Caren and Mira are instead the scenarios with the fewest instances: only 66 and 145 respectively.
We further investigated the cases where sunny-as2 did not work well by focusing on the instances that it could not solve. We distinguish them into two categories: (i) those unsolved because the wrong solvers were scheduled, i.e., no solver in the schedule could actually solve the instance within the timeout; and (ii) those unsolved because not enough time was allocated, i.e., at least one of the scheduled solvers could actually solve the instance with a time slot larger than the allocated one. Fig. 7 shows the instances unsolved by sunny-as2 for each scenario, grouped by the above categories. The plot also shows the portfolio size of each scenario. It is quite interesting to see that in all the scenarios except CPMP around 70% of sunny-as2's failures are due to a wrong identification of the solvers from the neighborhood instances. This means that the Achilles' heel of SUNNY is probably not the way the solvers are scheduled, but rather the way they are predicted. The underlying k-NN algorithm might not be the best choice, because the assumption that similar instances have similar behavior does not always hold.
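The two failure categories can be told apart mechanically given the true runtimes. The sketch below assumes a schedule represented as (solver, allocated slot) pairs; the function name, solver names and runtimes are hypothetical, chosen only to illustrate the classification.

```python
def classify_failure(schedule, runtimes, timeout):
    """Classify an instance w.r.t. a schedule [(solver, slot), ...]:
    'solved'        -> some scheduled solver finishes within its slot;
    'short-slots'   -> a scheduled solver would finish within the timeout,
                       but its allocated slot was too small (category ii);
    'wrong-solvers' -> no scheduled solver can solve it at all (category i)."""
    if any(runtimes[s] <= slot for s, slot in schedule):
        return "solved"
    if any(runtimes[s] < timeout for s, _ in schedule):
        return "short-slots"
    return "wrong-solvers"

# Hypothetical runtimes (seconds) of three solvers on one instance, timeout 600
rt = {"A": 450, "B": 700, "C": 30}
print(classify_failure([("A", 300), ("B", 300)], rt, 600))  # 'short-slots'
print(classify_failure([("B", 600)], rt, 600))              # 'wrong-solvers'
```

Counting each unsolved instance under one of the last two labels yields the per-scenario breakdown shown in Fig. 7.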
Despite the good closed gap score reached by sunny-as2 on the CPMP scenario, this is the only scenario where the number of instances unsolved due to time allocation is greater than the number of instances unsolved due to wrongly scheduled solvers. We conjecture that this behavior is explained by two co-factors: CPMP has the lowest number of available solvers, and the SBS performance is quite far from the VBS performance (see Fig. 4d). For these reasons, SUNNY tends to allocate less time to the SBS and more time to the other solvers w.r.t. scenarios where the speedup of the VBS is low, even when few solvers are available (e.g., the TSP scenario). In fact, we note from Tab. 11 that the average number of scheduled solvers for the CPMP scenario is 2.4, i.e., 60% of the overall portfolio size.
Summarizing, according to the experiments we conducted in this work, hypotheses (i) and (ii), stating that a small portfolio is usually enough to achieve good performance and that solvers either solve a problem quite quickly or cannot solve it in a reasonable time, are mostly true. Conversely, hypothesis (iii), "solvers perform similarly on similar instances", and hypothesis (iv), "a too heavy training phase is often an unnecessary burden", are not empirically confirmed.

Comparison with Other Approaches
In this section we provide a comparison between sunny-as2 and other state-of-the-art AS approaches. First of all, we show what the performance of the improved sunny-as2 would have been in the 2017 OASC, since the version of sunny-as2 submitted to the OASC was a preliminary one. 11 Tab. 14 presents the virtual performance of sunny-as2 in the 2017 OASC. In addition to the original competitors (viz., *Zilla, ASAP, AS-RF, and the preliminary versions of sunny-as2 called sunny-as2-fk-OASC and sunny-as2-k-OASC in Tab. 14), we added three more baselines: AutoFolio, 12 the best performing approach of the 2015 ICON challenge in terms of PAR 10 score; the original SUNNY approach (Amadini et al., 2014), not performing any training; and an off-the-shelf random forest approach trained on the whole training set without additional cross-validations. The latter was implemented with Scikit-learn (Pedregosa et al., 2011) by labeling each instance with the fastest solver solving it (i.e., it maps the AS problem into a classification problem and uses a random forest to tackle it). The number of estimators was set to 200, as done by ASAP.v2.
sunny-as2 was trained as explained in Sect. 4.2.1. For each scenario we picked the configuration that achieved the highest closed gap score among the 5 different configurations obtained on the training set (one for each fold of the outer cross-validation; we did not perform repetitions here). For the *Zilla (Cameron, Hoos, Leyton-Brown, & Hutter, 2017), AS-RF (Malone et al., 2017), and ASAP (Gonard et al., 2017) approaches, we only present the results they obtained in the OASC 2017, because no new version of these systems has been released since then. 13 Note that *Zilla, AutoFolio and AS-RF configure their hyperparameters automatically, thus they do not require manual tuning. ASAP instead identified good-performing parameters before the competition. We tried several other parameters for ASAP, without however outperforming the ones used in the challenge (cf. Tab. 23 in Appendix D).

11. Appendix B describes in detail the technical differences between the current version of sunny-as2 and the one submitted to the OASC.
12. The version of AutoFolio we used is AutoFolio 2015, which attended the ICON challenge. Unfortunately, we experienced some issues with the most recent version (Lindauer, 2016) due to the external library dependencies used by AutoFolio for parameter tuning. Without parameter tuning, the recent version of AutoFolio has worse results than the 2015 edition; therefore, for fairness reasons, we report only the results of AutoFolio 2015.
13. The performance of these approaches is available at COSEAL group (2013). Note that the competition reported in (Lindauer et al., 2019) used a different closed gap metric, i.e., 1 − (m SBS − m s)/(m SBS − m VBS), and the scoring tool was slightly amended. This work considers the fixed version of *Zilla, since the original one submitted to the competition had a critical bug (Lindauer et al., 2019).

Tab. 14 shows that sunny-as2 has the highest average closed gap, and it is the best approach in the Bado and Sora scenarios.
Its performance is quite close to that of sunny-as2-fk-OASC: the difference is greater than 0.2 only in two scenarios, i.e., Caren and Mira. This is not surprising, since sunny-as2 is quite similar to sunny-as2-fk-OASC. ASAP.v2 does not attain the best score in any scenario, but in general its performance is robust and effective; this confirms what is reported in (Gonard et al., 2019). AutoFolio is slightly behind ASAP.v2; nevertheless, it achieves good results and is the best approach for the Magnus scenario. Like sunny-as2, AutoFolio also suffers in scenarios like Caren and Mira, which have a small number of instances. *Zilla and ASAP.v3 also close more than 50% of the gap between the SBS and the VBS. sunny-as2-k-OASC is instead slightly below this threshold: the performance difference w.r.t. sunny-as2-fk-OASC denotes the importance of a proper feature selection. The original SUNNY approach is even worse: this confirms the effectiveness of the strategies introduced by sunny-as2. At the bottom of the table we find the AS approaches based on random forest. This suggests that turning an AS problem into a classification problem is not a good idea in general.
One thing to note is that the results of the OASC competition are based on a single training-test split. As discussed also by Lindauer et al. (2019), this "increases the risk of a particular submission with randomized components getting lucky". For this reason, we also compared the performance of sunny-as2 and the other AS approaches we could reproduce 14 using the default 10 cross-validation splits of the ASlib. In addition to the ASlib scenarios considered so far, we included all the runtime scenarios added to the ASlib after the OASC challenge, viz. GLUHACK-2018, SAT18-EXP and MAXSAT19-UCMS (cf. Tab. 15). In the experiments of Sect. 5 we did not consider these scenarios because we already had two SAT scenarios (i.e., Magnus and Monty) and two Max-SAT scenarios (i.e., Svea and Sora).
Tab. 16 shows the results using the default ASlib splits for each scenario. The last two rows of the table report, respectively, the average closed gap across all the scenarios, and the average closed gap across all the scenarios excluding the TSP scenario. The latter scenario considerably unbalances the results, because the performance of all the selectors except ASAP is on average worse than the SBS, hence the closed gap score is negative. We recall that for the TSP scenario the performance of SBS and VBS is very close (cf. Fig. 3a). The best approach in these new experiments is ASAP, which looks fairly robust across all the scenarios. sunny-as2, however, is not far behind, especially if we exclude the TSP scenario.
It is worth noting that the closed gap score can over-penalize an approach performing worse than the SBS. Indeed, the average closed gap score has upper bound 1 (the VBS cannot be outperformed), but no lower bound: one bad result in a fold of a scenario can considerably drop the overall average. For example, in the TSP scenario the difference in terms of solved instances between ASAP and sunny-as2 is less than 0.3%; however, the closed gap scores are 0.40 versus −0.26. We further investigated these results by keeping the very same scenarios and splits while changing the evaluation metric. We considered the Borda count used in the MiniZinc Challenge (Stuckey et al., 2014) and defined in Sect. 3.1. We recall that for every instance i of a scenario, a selector s ∈ S gets a score Borda(i, s) ∈ [0, |S| − 1] proportional to how many other selectors in S − {s} it beats.
Tab. 17 reports the normalized Borda scores for each scenario: if I is the dataset of the scenario, the normalized Borda score of a selector s is (1/|I|) ∑ i∈I Borda(i, s). The results are interesting: with this metric, adopted by the MiniZinc Challenge since 2008, the ranking of Tab. 16 is turned upside down. Random Forest, which was the approach with the worst closed gap score, is now the best approach in terms of Borda count. AutoFolio and sunny-as2 are not far from it, whereas, surprisingly, ASAP is the one with the worst Borda count.
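A minimal sketch of the normalized Borda score follows. The pairwise comparison mirrors the MiniZinc-style scoring recalled above (a solver beating another in 1 vs. 2 seconds scores 2/3 vs. 1/3); our handling of unsolved instances (score 1 against a failing opponent, 0 when failing) is a simplifying assumption, and all names and runtimes are hypothetical.

```python
def cmp(t, t_other, timeout):
    """Pairwise score of a selector with runtime t against one with
    runtime t_other: 1 if it alone solves, 0 if it fails, otherwise a
    share proportional to the opponent's runtime (sketch)."""
    if t >= timeout:
        return 0.0
    if t_other >= timeout:
        return 1.0
    return t_other / (t + t_other)

def normalized_borda(times, timeout):
    """times[s][i]: runtime of selector s on instance i.
    Returns the Borda score of each selector averaged over instances."""
    names = list(times)
    n = len(next(iter(times.values())))
    return {s: sum(cmp(times[s][i], times[o][i], timeout)
                   for i in range(n) for o in names if o != s) / n
            for s in names}

# Hypothetical runtimes of two selectors on two instances, timeout 600 s
t = {"s1": [1, 500], "s2": [2, 1000]}
print(normalized_borda(t, 600))  # s1: (2/3 + 1)/2, s2: (1/3 + 0)/2
```

On the first instance s1 earns 2/3 of the point; on the second it earns the full point because s2 times out, illustrating how small absolute time differences and timeouts are weighted very differently.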
There can be different reasons for this overturning. As also discussed in , the Borda count of the MiniZinc Challenge can excessively penalize minimal time differences. For instance, if s and s′ solve a problem in 1 and 2 seconds respectively, s scores 2/3 ≈ 0.667 while s′ scores 1/3 ≈ 0.333. However, if they solve another problem in 500 and 1000 seconds respectively, the scores would remain unvaried even though the absolute time difference in the latter case is 500 seconds.
To investigate whether the difference between the closed gap and Borda count scores is due to the amplification of small time differences, we define a parametric variant of the Borda score that considers equivalent two selectors having a runtime difference below a given threshold. Formally, given a threshold δ ≥ 0, we define Borda δ (i, s) = ∑ s′∈S−{s} cmp δ (m(i, s), m(i, s′)), where cmp δ (t, t′) = 0.5 if |t − t′| < δ, and cmp δ (t, t′) = cmp(t, t′) otherwise.
If δ = 0, Borda δ is exactly the Borda score defined in Sect. 3.1, and cmp 0 corresponds to the cmp function. A score of 0.5 is instead given if the runtime difference is below the threshold δ. Fig. 8 shows how the cumulative Borda score of each approach varies when increasing δ (note the log scale on the x-axis). We can clearly see a reversal of performance between ASAP and Random Forest as δ increases. This means that Random Forest solves easy instances faster, while the ASAP and sunny-as2 approaches are better at dealing with harder instances. The reason for this behavior might be that ASAP uses a pre-scheduling that could hamper its performance on easy instances that cannot be solved in a short time by the solver(s) in the pre-schedule. sunny-as2 seems less susceptible to this problem because no pre-solving is performed, and its scheduling heuristic prioritizes the solvers having lower runtime in the neighborhood.
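The parametric variant can be implemented directly from its definition. In this sketch the baseline cmp (proportional sharing, 1/0 for one-sided timeouts) and the restriction of the δ-tie to instances solved by both selectors are simplifying assumptions; runtimes are hypothetical.

```python
def cmp(t, t_other, timeout):
    # baseline pairwise score, as recalled in Sect. 3.1 (sketch)
    if t >= timeout:
        return 0.0
    if t_other >= timeout:
        return 1.0
    return t_other / (t + t_other)

def cmp_delta(t, t_other, timeout, delta):
    """cmp_delta: runtimes within delta seconds count as a tie (0.5)."""
    if t < timeout and t_other < timeout and abs(t - t_other) < delta:
        return 0.5
    return cmp(t, t_other, timeout)

def borda_delta(times, s, i, timeout, delta):
    """Borda_delta(i, s): sum of thresholded pairwise scores of selector s
    against every other selector on instance i."""
    return sum(cmp_delta(times[s][i], times[o][i], timeout, delta)
               for o in times if o != s)

# Hypothetical runtimes of three selectors on one instance, timeout 600 s
tm = {"s1": [10], "s2": [12], "s3": [300]}
print(borda_delta(tm, "s1", 0, 600, delta=5))  # 10 vs 12 now counts as a tie
```

With delta = 0 the function collapses to the plain Borda score, matching the observation that cmp 0 is exactly cmp.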
When small values of δ are considered, the best approach is the one based on a simple Random Forest classification, followed by AutoFolio for values of δ between 5 and 24 seconds. sunny-as2 becomes the best approach from δ = 25 to δ = 405; then ASAP takes over. This behavior reflects the fact that ASAP solves slightly more instances than sunny-as2, while on average sunny-as2 is slightly faster, making good choices for the easy instances. This is also corroborated by the numbers in Tab. 18, reporting the average PAR 1 and PAR 10 scores normalized w.r.t. the timeouts, and the average percentage of solved instances. Compared to ASAP, sunny-as2 has a lower average PAR 1 but a higher PAR 10, due to the fact that PAR 10 penalizes timeouts more and sunny-as2 solved on average 0.16% fewer instances.
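For reference, the PAR k score can be sketched as follows; the division by the timeout reflects our reading of "normalized w.r.t. the timeouts" in Tab. 18, and the runtimes are hypothetical.

```python
def par_score(runtimes, timeout, penalty):
    """PAR-k score: average runtime where each timeout counts as
    penalty * timeout, normalized here by the timeout itself."""
    total = sum(t if t < timeout else penalty * timeout for t in runtimes)
    return total / (len(runtimes) * timeout)

# Hypothetical runtimes on four instances (one timeout), cutoff 600 s
rt = [30, 130, 600, 40]
print(round(par_score(rt, 600, 1), 3))   # PAR1  → 0.333
print(round(par_score(rt, 600, 10), 3))  # PAR10 → 2.583
```

The example makes the difference visible: a single timeout barely moves PAR 1 but dominates PAR 10, which is why sunny-as2's slightly lower solve rate costs it more under PAR 10 than under PAR 1.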
The performance of sunny-as2 and ASAP asymptotically coincides, and interestingly also AutoFolio and SUNNY-original seem to converge.
Overall, sunny-as2 achieves a good and robust performance with different evaluation metrics, even if it is not always the best approach. Importantly, it consistently outperforms the original SUNNY approach on which sunny-as was based.

Conclusions and Future Work
In this work we described sunny-as2, an algorithm selector that, by applying techniques like wrapper-based feature selection and neighborhood size configuration, significantly outperforms its early version sunny-as and improves on its preliminary version submitted to the OASC 2017, where it reached the first position in the runtime minimization track.
We conducted an extensive study by varying different parameters of sunny-as2, showing how its performance can fluctuate across different scenarios of the ASlib. We also performed an original and in-depth study of the SUNNY algorithm, including insights on the instances unsolved by sunny-as2 and the use of a greedy approach as an effective surrogate of the original SUNNY approach. We compared sunny-as2 against other state-of-the-art AS approaches, and observed how results can change when different evaluation metrics are adopted.
What we experimentally learned from the evaluations performed is that feature selection and k-configuration are quite effective for SUNNY, and perform better when integrated. Moreover, the greedy approach we introduced enables a faster and more effective training w.r.t. the schedule generation procedure of the original SUNNY approach. Concerning the SUNNY algorithm itself, we exposed the weakness of the similarity assumption on which the k-NN algorithm used by SUNNY relies. The empirical evaluations we performed confirm both the effectiveness of sunny-as2 on several AS scenarios, and its robustness under different performance metrics.
A natural future direction for SUNNY that emerges from our experiments is the study of alternative sub-portfolio selection mechanisms not relying on k-NN. Moreover, we plan to improve sunny-as2 by targeting the solution quality in the optimization scenarios of the OASC competition, where sunny-as2 is strongly penalized because the scheduling of solvers is not allowed. We would also like to consider different strategies for scenarios having a low speedup and a limited number of instances unsolved by the best solver of the portfolio.
Another direction for future work is to further study the correlation between simple, easy-to-obtain properties of a scenario (e.g., skewness, distribution of labels, distribution of hard instances, number of instances, solver marginal contribution) and the best parameters for sunny-as2, in the hope of deriving good parameter values from these simple scenario properties. Our initial findings have already excluded, e.g., the use of mutual information between features to limit the number of features. However, additional investigations are needed.

Appendix A. Composition of OASC Scenarios
In this section we provide more details about the composition of OASC scenarios, by focusing on the performance of the best solvers on the training/test set of every scenario.
In particular, Tab. 19 shows the three fastest algorithms for each scenario (by merging training and test sets). For each scenario, the first column indicates the algorithm ID, while the second and third columns show the fraction of solved instances in the training and test sets respectively. In case of skewed scenarios, e.g., Caren and Monty, the values in the training and test sets are significantly different.

Appendix C. Experiments with Relative Values
In this section we show the relative values w.r.t. the total number of instances, features, and solvers of a given scenario. In particular:
• Tab. 20 shows the fraction between the number of instances on that column and the total number of instances of the scenario on that row;
• Tab. 21 shows the fraction between the number of features on that column and the total number of features of the scenario on that row;
• Tab. 22 shows the fraction between the number of solvers on that row and the total number of solvers of the scenario on that column.
In each table, we mark in bold font the cell corresponding to the best closed-gap performance for the given scenario.

Appendix D. Experiments Related to ASAP-v2 Parameter Tuning
As described by Gonard et al. (2017, 2019), the relevant parameters for ASAP-v2 are the number of estimators (decision trees) and the weight for regularization. In Tab. 23 we present the results obtained by performing various experiments with ASAP-v2, choosing different value combinations for these two parameters. For an entry Asap-t-i-w-j, i refers to the number of estimators and j to the weight. Overall, the results show that ASAP-v2 is quite stable, but we did not find a combination of hyperparameters that dominates all the other values for all the scenarios.