Effectiveness of Tree-based Ensembles for Anomaly Discovery: Insights, Batch and Streaming Active Learning

The anomaly detection (AD) task corresponds to identifying the true anomalies among a given set of data instances. AD algorithms score the data instances and produce a ranked list of candidate anomalies. The ranked list is then analyzed by a human to discover the true anomalies. Ensembles of tree-based anomaly detectors trained in an unsupervised manner, with member scores combined using uniform weights, have been shown to work well in practice. However, the manual process of analysis can be laborious for the human analyst when the number of false positives is very high. Therefore, in many real-world AD applications including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort on false positives. One important way to configure the detector is by providing true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the weights of ensemble detectors based on label feedback allows us to quickly discover true anomalies. This paper makes four main contributions to improve the state-of-the-art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical success of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support our insights, and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm to improve the diversity of discovered anomalies, based on a formalism called compact description that describes the discovered anomalies. Third, we develop a novel active learning algorithm to handle the streaming data setting.
We present a data drift detection algorithm that not only detects the drift robustly, but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms under the streaming-data setup are competitive with the batch setup.


Introduction
We consider the problem of anomaly detection (AD), where the goal is to detect unusual but interesting data (referred to as anomalies) among the regular data (referred to as nominals). This problem has many real-world applications including credit card transactions, medical diagnostics, computer security, etc., where anomalies point to the presence of phenomena such as fraud, disease, malpractices, or novelty which could have a large impact on the domain, making their timely detection crucial. Anomaly detection poses severe challenges that are not seen in traditional learning problems. First, anomalies are significantly fewer in number than nominal instances. Second, unlike classification problems, there may not be a hard decision boundary to separate anomalies and nominals.
Typically, anomaly detection algorithms train models to compute scores for all instances, and report instances which receive the highest scores as anomalies. The candidate anomalies in the form of top-ranked data instances are analyzed by a human analyst to identify the true anomalies. Prior work has shown that a class of ensemble anomaly detectors trained in an unsupervised manner and scoring based on uniform weights work very well (Das, Wong, Dietterich, Fern, & Emmott, 2016). Algorithms based on tree-based ensembles such as isolation forest (IFOR) consistently perform well on real-world data (Emmott, Das, Dietterich, Fern, & Wong, 2015). However, purely unsupervised algorithms are usually ineffective in the following scenarios: a) The underlying model assumptions are not suited to the real-world task (Chandola, Banerjee, & Kumar, 2009); b) The data volume is so high that even a detector with a low false-positive rate reports a very large absolute number of potential anomalies; c) The nature/type of anomalies is not known beforehand (unknown unknowns) and the user wants to explore the data by directing the detector with feedback in the spirit of collaborative problem solving; and d) The same algorithm is able to identify multiple types of anomalies but only a subset of them are relevant for a particular domain. Consider the following simple example. Let us assume that there are three features (say A, B, and C) and we use a generic unsupervised detector which reports instances lying outside two standard deviations along each axis as anomalies. However, if this rule is relevant only for features A and B for a particular domain, then the purely unsupervised algorithm would be less useful for this domain as it would report too many false positives. As a concrete real-world scenario, assume that in a computer network security application, the features A, B, and C correspond to the total bytes transferred over the network, the number of unique IP addresses connected to, and the number of logins in off-business hours, respectively, for a server. With work-from-home mandates, the last feature might become irrelevant. Unless the detector is adjusted to account for this, the number of false positives could be high.
If the candidate set of anomalies contains many false positives, it will significantly increase the effort of the human analyst in discovering true anomalies. To overcome the above drawbacks of unsupervised AD algorithms, recent work has explored the active learning framework, where the anomaly detector is configured by the human analyst to minimize the overall effort on false positives (Das et al., 2016; Siddiqui, Fern, Dietterich, Wright, Theriault, & Archer, 2018). In this framework, a human analyst provides label feedback (nominal or anomaly) for one or more data instances during each round of interaction. This feedback is used by the anomaly detector to change the scoring mechanism of data instances towards the goal of increasing the number of true anomalies appearing at the top of the ranked list. Specifically, prior work employs a greedy query selection strategy by selecting the top-scoring unlabeled instance for feedback. As with unsupervised anomaly detection ensembles, prior work showed that active learning using tree-based ensembles performs best (Das et al., 2016; Siddiqui et al., 2018).
The effectiveness of tree-based ensembles for anomaly discovery in both unsupervised and active-learning settings raises two fundamental questions: (Q1) Why does the average score across ensemble members perform best in most cases (Chiang & Yeh, 2015), instead of other score combination strategies (e.g., min, max, median, etc.)? and (Q2) Why does the greedy query selection strategy for active learning almost always perform best? In this paper, we investigate these two fundamental questions related to tree-based ensembles and provide an important insight that explains their effectiveness for anomaly discovery. We also provide empirical evidence on real-world data to support our intuitive understanding, and theoretical analysis to support active anomaly discovery. Prior work on active learning for anomaly discovery has some important shortcomings. First, algorithmic work on enhancing the diversity of discovered anomalies is lacking (Görnitz, Kloft, Rieck, & Brefeld, 2013). Second, most algorithms are designed to handle batch data well, but there are few principled algorithms to handle the streaming data setting that arises in many real-world applications. To fill this important gap in active anomaly discovery, we exploit the inherent strengths of tree-based ensembles and propose novel batch and streaming active learning algorithms.
Contributions. The main contribution of this paper is to study and evaluate anomaly discovery algorithms based on tree-based ensembles in both unsupervised and active learning settings. Specific contributions include: • We present an important insight into how tree-based anomaly detector ensembles are naturally suited for active learning, and why the greedy querying strategy of seeking labels for instances with the highest anomaly scores is efficient. We also provide theoretical analysis to support active anomaly discovery using tree-based ensembles.
• A novel formalism called compact description (CD) is developed to describe the discovered anomalies using tree-based ensembles.We develop batch active learning algorithms based on CD to improve the diversity of discovered anomalies.
• To handle streaming data setting, we develop a novel algorithm to robustly detect drift in data streams and design associated algorithms to adapt the anomaly detector on-the-fly in a principled manner.
• We present extensive empirical evidence in support of our insights and algorithms on several benchmark datasets. The results show the efficacy of the proposed active learning algorithms in both batch and streaming data settings. Our results show that in addition to discovering significantly more anomalies than state-of-the-art unsupervised baselines, our learning algorithm under the streaming-data setup is competitive with the batch setup. Our code and data are publicly available.
Outline of the Paper. The remainder of the paper is organized as follows. In Section 2, we discuss the prior work related to this paper. We introduce our problem setup and give a high-level overview of the generic human-in-the-loop learning framework for active anomaly discovery in Section 3. In Section 4, we describe the necessary background on tree-based anomaly detector ensembles and provide intuition along with empirical evidence and theoretical analysis to explain their suitability for active anomaly discovery. In Sections 5 and 6, we present a series of algorithms to support the framework for the batch data and streaming data settings respectively. Section 7 presents our experimental results, and finally Section 8 provides a summary and directions for future work.

Related Work
Unsupervised anomaly detection algorithms are trained without labeled data and have assumptions baked into the model about what defines an anomaly or a nominal (Breunig, Kriegel, Ng, & Sander, 2000; Liu, Ting, & Zhou, 2008; Pevný, 2016; Emmott et al., 2015). They typically cannot change their behavior to correct for false positives after they have been deployed. Ensembles of unsupervised anomaly detectors (Aggarwal & Sathe, 2017) try to guard against the bias induced by a single detector by incorporating decisions from multiple detectors. The potential advantage of ensembles is that when the data is seen from multiple views by more than one detector, the set of anomalies reported by their joint consensus is more reliable and has fewer false positives. Different methods of creating ensembles include collections of heterogeneous detectors (Ted, Goldberg, Memory, Young, Rees, Pierce, Huang, Reardon, Bader, Chow, et al., 2013), feature bagging (Lazarevic & Kumar, 2005), varying the parameters of existing detectors such as the number of clusters in a Gaussian Mixture Model (Emmott et al., 2015), sub-sampling, bootstrap aggregation, etc. Some unsupervised detectors such as LODA (Pevný, 2016) are not specifically designed as ensembles but may be treated as such because of their internal structure. Isolation forest (IFOR) (Liu et al., 2008) is a state-of-the-art tree-based ensemble anomaly detector. More recently, prior work has looked at ensemble selection via transfer learning (Campos, Zimek, & Jr, 2018), ensemble score aggregation based on the locality of a data point (Zhao & Hryniewicki, 2019), combining outlier scores computed over approximated neighborhoods (Kirner, Schubert, & Zimek, 2017), and feature selection for outlier detection (Cheng, Wang, Liu, & Li, 2020; Micenková, McWilliams, & Assent, 2014). Zero++ (Pang, Ting, Albrecht, & Jin, 2016) is an unsupervised approach that works only for categorical data. It is similar to IFOR, but there is no statistically significant difference in detection accuracy between Zero++ and IFOR for numerical data that is transformed to categorical data.
Active learning corresponds to the setup where the learning algorithm can selectively query a human analyst for labels of input instances to improve its prediction accuracy.The overall goal is to minimize the number of queries to reach the target performance.
Active learning for anomaly detection has been explored by researchers in several earlier works (Almgren & Jonsson, 2004; Abe, Zadrozny, & Langford, 2006; Stokes, Platt, Kravis, & Shilman, 2008; He & Carbonell, 2007; Pichara & Soto, 2011; Görnitz et al., 2013). In this setting, the human analyst provides feedback to the algorithm on true labels (nominal or anomaly). If the algorithm makes wrong predictions, it updates its model parameters to be consistent with the analyst's feedback. Interest in this area has gained prominence in recent years due to the significant rise in the volume of data in real-world settings, which has made the reduction of false positives much more critical. Most of the popular unsupervised anomaly detection algorithms have been adapted to the active learning setting. Some of these methods are based on ensembles and support streaming data (Veeramachaneni, Arnaldo, Korrapati, Bassias, & Li, 2016; Stokes et al., 2008), but they maintain separate models for anomalies and nominals internally and do not exploit the inherent strength of the ensemble in a systematic manner. (Nissim, Cohen, Moskovitch, Shabtai, Edry, Bar-Ad, & Elovici, 2014) incorporates user feedback into SVM, a classification algorithm. (Das et al., 2016) and (Das, Wong, Fern, Dietterich, & Siddiqui, 2017), based on LODA and IFOR respectively, re-weight the components of the ensemble with every user feedback in a manner similar to our work; however, they do not support streaming data and discovery of diverse anomalies. (Siddiqui et al., 2018) updates the same model as (Das et al., 2017) in an online incremental manner when incorporating user feedback. (Ikeda & Watanabe, 2018) employs an autoencoder as the base anomaly detector and adds a penalty term for the false positives (i.e., normal data that are incorrectly reported as anomalies). The user examines the anomalies reported by the algorithm and labels the false positives. The model is then retrained with the new labeled data in an online manner. (Pimentel, Monteiro, Veloso, & Ziviani, 2020) adds a classifier layer on top of any unsupervised deep-network-based anomaly detector and updates the parameters of the deep network based on human feedback. (Trittenbach & Böhm, 2019) extends SVDD, a one-class classifier model for outlier detection, to multiple subspaces. The multiple subspaces are intended to help with interpretability of the outliers and support better query strategies for active learning. (Kim, Kim, Yu, & Choi, 2023) learns an adaptive boundary using a deep SVDD model by searching for locations having a similar number of normal and abnormal samples. This boundary is iteratively adjusted by getting labeling feedback from users with an uncertainty-based query strategy. Deep semi-supervised anomaly detection techniques have also been used to incorporate feedback from users (Ruff, Vandermeulen, Göernitz, Deecke, Siddiqui, Binder, Müller, & Kloft, 2018; Ruff, Vandermeulen, Görnitz, Binder, Müller, Müller, & Kloft, 2019; Li, Qiu, Kloft, Smyth, Mandt, & Rudolph, 2023). The performance gain for such techniques on classical tabular datasets can be quite low in comparison to their performance on complex raw data such as images. SOEL (Li et al., 2023) employs a loss function that has opposing effects on normal and anomalous data points. Using this loss, SOEL infers latent labels {anomaly, normal} for each training example using a semi-supervised EM-style algorithm. In this work, we present a comparison of our algorithm with SOEL in Section 7.3.1.
Our proposed algorithmic instantiation of HiLAD for the batch data setting (HiLAD-Batch) and recent work on feedback-guided AD via online optimization (Feedback-guided Online) (Siddiqui et al., 2018) both build upon the same tree-based model proposed by (Das et al., 2017). Therefore, there is no fundamental difference in their performance (Section 7.3). The uniform initialization of weights employed by both HiLAD-Batch and Feedback-guided Online plays a critical role in their overall effectiveness. Feedback-guided Online also adopts the greedy query selection strategy mainly because it works well in practice. In this work, we present the fundamental reason why the greedy query selection strategy is label-efficient for human-in-the-loop learning with tree-based anomaly detection ensembles.
Anomaly detection in the streaming data setting has many real-life applications. Under this setup, data is sent continuously as a stream to the anomaly detector. (Gupta, Gao, Aggarwal, & Han, 2014) present a broad overview of this setup and categorize the outlier detection techniques into three sub-categories: (a) evolving prediction models, (b) distance-based outlier detection for sliding windows, and (c) methods for high-dimensional data streams. Our proposed approach falls under the class of distance-based outlier detection over sliding windows, where prior techniques detect either local (Breunig et al., 2000; Salehi, Leckie, Bezdek, Vaithianathan, & Zhang, 2016; Na, Kim, & Yu, 2018) or global anomalies (Angiulli & Fassetti, 2007; Yang, Rundensteiner, & Ward, 2009; Kontaki, Gounaris, Papadopoulos, Tsichlas, & Manolopoulos, 2011). (Guha, Mishra, Roy, & Schrijvers, 2016) proposed a tree-based streaming anomaly detector. The key idea is to create a sketch (or summary) of the data using a random cut forest and to report a data point as an anomaly when its displacement score is high. However, unlike our work, none of these prior streaming anomaly detection methods incorporate human feedback.

Problem Setup and Human-in-the-Loop Learning Framework
We are given a dataset D = {x_1, ..., x_n}, where x_i ∈ R^d is a data instance that is associated with a hidden label y_i ∈ {−1, +1}. Instances labeled +1 represent the anomaly class and are at most a small fraction τ of all instances. The label −1 represents the nominal class. We also assume the availability of an ensemble E of m anomaly detectors which assigns scores z = {z_1, ..., z_m} to each instance x such that instances labeled +1 tend to have scores higher than instances labeled −1. Our framework is applicable to any ensemble of detectors, but we specifically study the instantiation for tree-based ensembles because of their beneficial properties. We denote the ensemble score matrix for all unlabeled instances by H. The score matrix for the set of instances labeled +1 is denoted by H+, and the matrix for those labeled −1 is denoted by H−. We set up a scoring function Score(x) to score data instances, which is parameterized by w and the ensemble of detectors E. For example, with tree-based ensembles, we consider a linear model with weights w ∈ R^m that combines the scores of the m anomaly detectors as follows: Score(x) = w • z, where z ∈ R^m corresponds to the scores from the anomaly detectors for instance x. This linear hyperplane separates anomalies from nominals. We will denote the optimal weight vector by w*.
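To make the scoring model concrete, the linear combination Score(x) = w • z can be sketched in a few lines. This is a minimal illustration; the helper names `score_instances` and `uniform_weights` are ours, not the paper's:

```python
import numpy as np

def score_instances(Z, w):
    """Z is the (n, m) ensemble score matrix (one row of member scores z
    per instance); w is the (m,) weight vector. Returns w . z per row."""
    return Z @ w

def uniform_weights(m):
    """Uniform weight vector w_unif with ||w_unif|| = 1."""
    return np.ones(m) / np.sqrt(m)

# Tiny example with m = 3 members; higher scores imply more anomalous.
Z = np.array([[0.9, 0.8, 0.7],   # likely anomaly
              [0.1, 0.2, 0.1]])  # likely nominal
scores = score_instances(Z, uniform_weights(3))
ranking = np.argsort(-scores)    # candidate anomalies ranked first
```

Active learning then amounts to moving w away from the uniform initialization toward the optimal w* using label feedback.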
The generic human-in-the-loop learning framework HiLAD assumes the availability of an analyst who can provide the true label for any instance in an interactive loop, as shown in Figure 1. In each iteration of the learning loop, we perform the following steps: 1) Select one or more unlabeled instances from the input dataset D according to a query selection strategy; 2) Query the human analyst for labels of the selected instances; and 3) Update the weights of the scoring function based on the aggregate set of labeled and unlabeled instances. The overall goal of the framework is to learn optimal weights for maximizing the number of true anomalies shown to the human analyst. We provide algorithmic solutions for key elements of this human-in-the-loop framework:

HiLAD Approach
• Initializing the parameters w of the scoring function Score(x) based on a key insight for tree-based anomaly detection ensembles.
• Query selection strategies to improve the label-efficiency of learning.
• Updating the weights of scoring function based on label feedback.
• Updating ensemble members as needed to support the streaming data setting.
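The interactive loop described above can be sketched as follows. This is a schematic, not the paper's exact algorithm: the `oracle` callable stands in for the human analyst, and `simple_update` is a hypothetical perceptron-style weight update used only for illustration.

```python
import numpy as np

def hilad_loop(Z, oracle, n_feedback, update_weights):
    """Generic human-in-the-loop anomaly discovery loop.

    Z              : (n, m) ensemble score matrix.
    oracle         : callable idx -> +1 (anomaly) or -1 (nominal),
                     standing in for the human analyst.
    n_feedback     : label budget.
    update_weights : callable (w, Z, labels) -> new weight vector.
    """
    n, m = Z.shape
    w = np.ones(m) / np.sqrt(m)            # uniform prior over weights
    labels = {}                            # idx -> analyst label
    for _ in range(n_feedback):
        scores = Z @ w
        # Greedy query selection: top-scoring unlabeled instance.
        q = next(i for i in np.argsort(-scores) if i not in labels)
        labels[q] = oracle(q)
        w = update_weights(w, Z, labels)
    return w, labels

def simple_update(w, Z, labels, lr=0.1):
    """Illustrative update only: nudge w toward labeled anomalies and
    away from labeled nominals, then renormalize."""
    for i, y in labels.items():
        w = w + lr * y * Z[i]
    return w / np.linalg.norm(w)

# Toy run: instance 0 is the lone anomaly and scores high across members.
Z = np.array([[0.9, 0.95], [0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.2, 0.2]])
w, labels = hilad_loop(Z, lambda i: 1 if i == 0 else -1,
                       n_feedback=2, update_weights=simple_update)
```

The sections that follow fill in each of these pieces with the actual algorithmic choices.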

Suitability of Tree-based Ensembles for Human-in-the-Loop Learning
In this section, we first provide background on tree-based anomaly detector ensembles and describe Isolation Forest (IFOR) in more depth, as we employ it for our experimental evaluation. Subsequently, we provide intuition on how tree-based ensembles are naturally suited for human-in-the-loop learning and show empirical evidence using isolation forest.

Tree-based Anomaly Detection Ensembles
Ensembles of tree-based anomaly detectors have several attractive properties that make them an ideal candidate for human-in-the-loop learning: • They can be employed to construct large ensembles inexpensively.
• Treating the nodes of a tree as ensemble members allows us both to focus our feedback on fine-grained subspaces and to increase the capacity of the model in terms of separating anomalies (positives) from nominals (negatives) using complex class boundaries.
In this work, we will focus mainly on IFOR because it performed best across all datasets (Emmott et al., 2015). However, we also present results on HST and RSF wherever applicable. Isolation Forest (IFOR) comprises an ensemble of isolation trees. Each tree partitions the original feature space at random by recursively splitting an unlabeled dataset. At every tree node, first a feature is selected at random, and then a split point for that feature is sampled uniformly at random (Figure 2). This partitioning operation is carried out until every instance reaches its own leaf node. The key idea is that anomalous instances, which are generally isolated in the feature space, reach the leaf nodes faster by this partitioning strategy than nominal instances which belong to denser regions (Figure 3). Hence, the path from the root node is shorter to the leaves of anomalous instances than to the leaves of nominal instances. This path length is assigned as the unnormalized score for an instance by an isolation tree.

Figure 2: Illustration of an isolation tree from (Das et al., 2017). (b) A single isolation tree for the Toy dataset. (c) Regions having deeper red belong to leaf nodes which have shorter path lengths from the root and, correspondingly, higher anomaly scores. Regions having deeper blue correspond to longer path lengths and lower anomaly scores.

After training an IFOR with T trees, we extract the leaf nodes as the members of the ensemble. Such members could number in the thousands (typically 4000−7000 when T = 100). Assume that a leaf is at depth l from the root. If an instance belongs to the partition defined by the leaf, it gets assigned a score of −l by the leaf, and 0 otherwise. As a result, anomalous instances receive higher scores on average than nominal instances. Since every instance belongs to only a few leaf nodes (equal to T), the score vectors are sparse, resulting in low memory and computational costs.
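The leaf-level scoring described above can be reproduced with any isolation forest implementation that exposes its trees. The sketch below uses scikit-learn's IsolationForest as a convenient stand-in (the paper's own implementation may differ) and assigns each instance a score of −l from the leaf it falls into in each tree:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(0, 1, size=(200, 2))          # dense nominal cluster
X = np.vstack([X, [[6.0, 6.0]]])             # one obvious anomaly

forest = IsolationForest(n_estimators=50, random_state=0).fit(X)

def node_depths(tree):
    """Depth of every node in a fitted sklearn tree (root = 0)."""
    depth = np.zeros(tree.node_count, dtype=int)
    stack = [(0, 0)]
    while stack:
        node, d = stack.pop()
        depth[node] = d
        left, right = tree.children_left[node], tree.children_right[node]
        if left != -1:                        # internal node
            stack.append((left, d + 1))
            stack.append((right, d + 1))
    return depth

def leaf_scores(forest, X):
    """Per-tree leaf-level scores: an instance gets -depth(leaf) from the
    leaf it falls into and 0 from every other leaf, so the full ensemble
    score vector is sparse. Here we return the dense (n, T) per-tree view."""
    cols = []
    for est in forest.estimators_:
        depth = node_depths(est.tree_)
        cols.append(-depth[est.apply(X)])     # shorter path => higher score
    return np.column_stack(cols)

Z = leaf_scores(forest, X)
avg = Z.mean(axis=1)                          # uniform-weight combination
```

With uniform weights, averaging these per-tree leaf scores should rank the isolated point at (6, 6) above the points from the dense cluster, since it is isolated at a shallow depth in most trees.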
HST and RSF apply different node-splitting criteria than IFOR, and compute the anomaly scores on the basis of the sample counts and densities at the nodes. We apply a log-transform to the leaf-level scores so that their unsupervised performance remains similar to the original model and yet improves with feedback. The trees in HST and RSF have a fixed depth, which needs to be large in order to achieve good accuracy. In contrast, trees in IFOR have adaptive depth, and most anomalous subspaces are shallow. Larger depths are associated with smaller subspaces, which are shared by fewer instances. As a result, feedback on any individual instance gets passed on to very few instances in HST and RSF, but to many more instances in IFOR. Therefore, it is more efficient to incorporate feedback in IFOR than it is in HST or RSF (see Figure 4). We set the depth for HST (Tan et al., 2011) and RSF (Wu et al., 2014) to 8 (Figure 4c) in our experiments in order to balance accuracy and feedback efficiency.

Beneficial Properties of Tree-based Anomaly Detector Ensembles
In this section, we provide intuition for the effectiveness of unsupervised tree-based anomaly detector ensembles and list their beneficial properties for the human-in-the-loop learning framework. Our arguments are motivated by active learning theory for standard classification. Without loss of generality, we assume that all scores from the members of a tree-based ensemble of anomaly detectors are normalized (i.e., they lie in [−1, 1] or [0, 1]), with higher scores implying more anomalous. For the following discussion, w_unif ∈ R^m represents a vector of equal values with ||w_unif|| = 1. Figure 5 illustrates a few possible distributions of normalized scores from the ensemble members in 2D. In each figure, there are two ensemble members, and each member tends to assign higher scores to anomalies than to nominals such that, on average, anomalies tend to have higher scores than nominals. Let z represent the vector of anomaly scores for an instance and let m be the number of ensemble members. Then, the score for the same instance can be computed as Score(x) = w_unif • z = (1/√m) Σ_{i=1}^m z_i, which is proportional to the average score across ensemble members. We observe that under this setup, points closest to the top-right region of the score space have the highest average scores, since they are most aligned with the direction of w_unif. When ensemble members are "good", they assign higher scores to anomalies and push them to this region of the score space. The most general case is illustrated in case C1 in Figure 5a. This behavior makes it easier to separate anomalies from nominals by a hyperplane. Moreover, it also sets a prior on the position and alignment of the separating hyperplane (as shown by the red solid line). Having priors on the hyperplane helps determine the region of label uncertainty (anomaly vs. nominal) even without any supervision, and this, in theory, strongly motivates active learning using the human analyst in the loop.
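Q1 (why averaging works) is tied to this geometry: with equal weights and unit norm, w_unif • z is a positive multiple of the mean of the member scores, so the two produce identical rankings. A quick numerical check (illustrative only):

```python
import numpy as np

rng = np.random.RandomState(42)
Z = rng.uniform(0, 1, size=(100, 10))    # normalized member scores in [0, 1]

m = Z.shape[1]
w_unif = np.ones(m) / np.sqrt(m)         # ||w_unif|| = 1

# w_unif . z = (1/sqrt(m)) * sum_i z_i, a positive multiple of the mean,
# so ranking by w_unif . z and ranking by the average score coincide.
avg_rank = np.argsort(-Z.mean(axis=1))
dot_rank = np.argsort(-(Z @ w_unif))
```

The interesting question, addressed next, is what happens when the optimal direction w* deviates from w_unif.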
Most theoretical research on active learning for classification (Kearns, 1998; Balcan et al., 2007; Kalai, Klivans, Mansour, & Servedio, 2008; Dasgupta et al., 2009; Balcan & Feldman, 2015) makes simplifying assumptions such as a uniform data distribution over a unit sphere and homogeneous (i.e., passing through the origin) hyperplanes. However, for anomaly detection, arguably, the idealized setup is closer to case C2 (Figure 5b), where non-homogeneous decision boundaries are more important. More importantly, in order for the theory to be relevant for anomaly detection, the assumptions behind it should be realizable with multi-dimensional real-world data and a competitive anomaly detector. We present empirical evidence (Section 7) which shows that scores from the state-of-the-art Isolation Forest (IFOR) detector are distributed in a similar manner as case C3 (Figure 5c). C3 and C2 are similar in theory (for active learning) because both involve searching for the optimum non-homogeneous decision boundary. In all cases, the common theme is that when the ensemble members are ideal, the scores of true anomalies tend to lie at the farthest possible location in the positive direction of the uniform weight vector w_unif by design. Consequently, the average score for an instance across all ensemble members works well for anomaly detection. However, not all ensemble members are ideal in practice, and the true weight vector (w*) is displaced by an angle θ from w_unif. Figure 6c shows an illustration of this scenario on a Toy dataset. In large datasets, even a small misalignment between w_unif and w* results in many false positives. While the performance of the ensemble on the basis of the AUC metric may be high, the detector could still be impractical for use by analysts.
The property that the misalignment is usually small can be leveraged by active learning to learn the optimal weights efficiently using a small amount of label feedback. To understand this, observe that the top-ranked instances are close to the decision boundary and are, therefore, in the uncertainty region. The key idea is to design a hyperplane that passes through the uncertainty region, which then allows us to select query instances by uncertainty sampling. Selecting instances on which the model is uncertain for labeling is efficient for active learning (Cohn, Atlas, & Ladner, 1994; Balcan et al., 2007; Cao et al., 2023). Specifically, greedily selecting instances with the highest scores is, first of all, more likely to reveal anomalies (i.e., true positives), and even if the selected instance is nominal (i.e., a false positive), it still helps in learning the decision boundary efficiently. This is an important insight and has significant practical implications. We present the label complexity of this greedy (but efficient) active learning algorithm for anomaly detection below.

Label complexity of active learning with fixed weights: Our trained ensemble of two anomaly detectors assigns scores uniformly distributed over a unit sphere (Figure 5). Since we have assumed that a τ fraction of the data is anomalous, the marginal density of anomalies is then τ/(2π). We also have a pool of unlabeled data. For simplicity, we assume that anomalies have a Gaussian distribution as a function of the angle ω ∈ [−π, π] from u on the unit sphere: f(y = +1|ω) ∼ N(0, σ^2). We query an analyst for the labels of only the top β fraction of instances sorted on scores, i.e., {z : arccos(w • z) ≤ βπ}. Let the angle between u and w_unif be θ. Let β_l and β_r be the minimum and maximum of the values {θ − βπ, θ + βπ} respectively. As per our query selection strategy, the probability of labeling a true anomaly is then p_θ = τ(Φ(β_r/σ) − Φ(β_l/σ))/(2βπ), where Φ(•) is the c.d.f. of the Gaussian distribution. If |θ| is large, i.e., w_unif and u are not "close", then p_θ will be small.

Proposition 4.1. Let δ ∈ [0, 1]. For the 2D case, the number of labels needed to learn the decision boundary with probability (1 − δ) with pool-based active learning is T = O(log(1/σ) · (1/p_θ) · log(1/δ)).

Proof: Essentially, we need to locate the mean of the Gaussian distribution. This can be done through binary search because the density is highest at the mean and decreases continuously as we move away from it. For the search, we estimate the density of anomalies at O(log(1/σ)) locations (for accuracy to one s.d.). For any estimate, at least one anomaly must be sampled. When the angle is θ (i.e., w = w_unif) at the initial location, the number of samples required for this to hold with probability (1 − δ) is l = log(1/δ)/p_θ, since (1 − p_θ)^l ≤ δ. The number of samples at the initial location sets the baseline estimate; therefore, the total number of samples is T = O(log(1/σ) · (1/p_θ) · log(1/δ)).

Remark. If we ignore the initial location of the search, the number of labels required would be O(log(1/σ) · (2π/τ) · log(1/δ)). Thus, initializing at w_unif helps when p_θ is larger than τ/(2π).
Label complexity of active learning with varying weights: In this case, we select one data instance from the top ranked β fraction of instances at each time step for querying, and then adjust the weights w in response to the label received from the analyst.We denote the weight vector and its angular displacement from u at each time step t by w t and θ t respectively, starting with w 1 = w unif and θ 1 = θ.
Proposition 4.2. Let the algorithm A update the weights with each new label at time t while learning the decision boundary such that θ_{t+1} ≤ θ_t. If the label complexity of algorithm A is T′, then T′ ≤ T (where T is defined in Proposition 4.1).
Proof: As in the previous case, we sample at O(log(1/σ)) locations for the binary search. Let l′ be the number of samples required at the initial location of the binary search for density estimation, and set θ_1 = θ. Now, l′ sets the baseline for the number of samples in each round of the search. Since, by assumption, θ_{t+1} ≤ θ_t at each label iteration t, we have p_{θ_{t+1}} ≥ p_{θ_t}. For any estimate, at least one anomaly needs to be selected with probability (1 − δ), i.e., 1 − (1 − p_{θ_t})^{l_t} ≥ 1 − δ; since p_{θ_t} ≥ p_{θ_1} for all t, each location requires at most l′ samples, and therefore T′ ≤ T. When the number of members in the ensemble is m, p_θ will be a function of m, and the number of locations for the binary search will be O(m log(1/σ)).
Summary. Tree-based anomaly detector ensembles have several beneficial properties for human-in-the-loop learning for anomaly discovery:

• Uniform prior over weights helps in improving the label-efficiency of discovering true anomalies. Indeed, in our experiments, we demonstrate the effectiveness of uniform weights over random weights for initialization in discovering anomalies. We also show the histogram distribution of the angles between score vectors from IFOR and uniform weights w_unif to demonstrate that anomalies are aligned closer to w_unif.
• With a uniform prior over weights, the greedy strategy of querying the labels for top-ranked instances is efficient, and is therefore a good yardstick for evaluating the performance of other querying strategies as well. This point will be particularly significant when we evaluate a different querying strategy to enhance the diversity of discovered anomalies as part of this work.
• Learning the weights of anomaly detector ensembles with human-in-the-loop learning that generalizes to unseen data helps in limited-memory or streaming data settings. Indeed, our experimental evaluation corroborates this hypothesis.

Algorithmic Instantiation of HiLAD for Batch Data Setting
In this section, we describe a series of algorithms to instantiate the generic human-in-the-loop learning framework HiLAD for the batch data setting (HiLAD-Batch). First, we present a novel formalism called compact description that describes groups of instances compactly using a tree-based model, and apply it to improve the diversity of instances selected for labeling (Section 5.1). Second, we describe an algorithm to update the weights of the scoring function based on label feedback in the batch setting, where the entire data is available at the outset (Section 5.2).

Compact Description for Diversified Querying and Interpretability
In this section, we first describe the compact description formalism to describe a group of instances. Subsequently, we propose algorithms for selecting diverse instances for querying and for generating succinct interpretable rules using compact description.
Compact Description (CD). The tree-based model assigns a weight and an anomaly score to each leaf (i.e., subspace). We denote the vector of leaf-level anomaly scores by d, and the overall anomaly scores of the subspaces (corresponding to the leaf nodes) by a = [a_1, ..., a_m] = w ∘ d, where ∘ denotes the element-wise product operation. The score a_i provides a good measure of the relevance of the i-th subspace. This relevance for each subspace is determined automatically through the label feedback. Our goal is to select a small subset of the most relevant and "compact" (by volume) subspaces which together contain all the instances in a group that we want to describe. We treat this problem as a specific instance of the set covering problem. We illustrate this idea on a synthetic dataset in Figure 7; this approach can potentially be interpreted as a form of non-parametric clustering. Figure 7(b) shows that after incorporating the labels of 35 instances, the subspaces around the labeled anomalies have emerged as the most relevant. Figure 7(c) shows the set of important subspaces which compactly cover all labeled anomalies; these were computed by solving Equation 1. Note that the compact subspaces only cover anomalies that were discovered in the 35 feedback iterations. Anomalies which were not detected are likely to fall outside these compact subspaces.
Let Z be the set of instances that we want to describe, where |Z| = p. For example, Z could correspond to the set of anomalous instances discovered by our human-in-the-loop learning approach. Let s_i be the δ most relevant subspaces (i.e., leaf nodes) which contain z_i ∈ Z, i = 1, ..., p. Let S = s_1 ∪ ... ∪ s_p and |S| = k. Denote the volumes of the subspaces in S by the vector v ∈ R^k. Suppose x ∈ {0, 1}^k is a binary vector which contains 1 in locations corresponding to the subspaces in S which are included in the covering set, and 0 otherwise. Let u_i ∈ {0, 1}^k denote a vector for each instance z_i ∈ Z which contains 1 in all locations corresponding to the subspaces in s_i. Let U = [u_1^T, ..., u_p^T]^T. A compact set of subspaces S* which contains (i.e., describes) all the candidate instances can be computed using the optimization formulation in Equation 1. We employ an off-the-shelf ILP solver (CVXOPT) to solve this problem.
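Equation 1 is a weighted set-cover problem: minimize v · x subject to Ux ≥ 1 with x binary. The paper solves it exactly with an ILP solver (CVXOPT); as an illustrative stand-in, the sketch below uses the classical greedy set-cover approximation instead, on made-up volumes and subspace memberships.

```python
def compact_description(volumes, membership):
    """Greedy weighted set cover. volumes[j] is the volume of subspace j;
    membership[j] is the set of instance indices contained in subspace j.
    Returns indices of a small-volume set of subspaces covering all
    instances (an approximation to the ILP in Equation 1)."""
    universe = set().union(*membership)
    uncovered, chosen = set(universe), []
    while uncovered:
        # pick the subspace with minimum volume per newly covered instance
        j = min(
            (j for j in range(len(volumes)) if membership[j] & uncovered),
            key=lambda j: volumes[j] / len(membership[j] & uncovered),
        )
        chosen.append(j)
        uncovered -= membership[j]
    return chosen

# Toy example: 4 instances to describe, 3 candidate subspaces
# (volumes and memberships are made up for illustration).
vols = [1.0, 0.6, 0.5]
member = [{0, 1, 2, 3}, {0, 1}, {2, 3}]
print(compact_description(vols, member))  # → [0]
```

When the large subspace becomes relatively expensive (e.g., volumes [1.0, 0.3, 0.3]), the greedy rule instead selects the two small subspaces, which mirrors the preference for compact (low-volume) descriptions.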
Applications of Compact Description. Compact descriptions have multiple uses, including:

• Discovery of diverse classes of anomalies very quickly by querying instances from different subspaces of the description.
• Improved interpretability of anomalous instances. We assume that in a practical setting, the analyst(s) will be presented with instances along with their corresponding description(s). Additional information can be derived from the description and shown to the analyst (e.g., the number of instances in each compact subspace), which can help prioritize the analysis.
In this work, we present empirical results on improving query diversity and also compare with another state-of-the-art algorithm that extracts interpretations (Wang, Rudin, Doshi-Velez, Liu, Klampfl, & MacNeille, 2016).
Diversity-based Query Selection Strategy. In Section 4, we reasoned that the greedy strategy of selecting top-scored instances (referred to as Select-Top) for labeling is efficient. However, this strategy might lack diversity in the types of instances presented to the human analyst. It is likely that different types of instances belong to different subspaces in the original feature space. Our proposed strategy (Select-Diverse), which is described next, is intended to increase the diversity by employing tree-based ensembles and compact description to select groups of instances from subspaces that have minimum overlap.

Algorithm 2 Select-Diverse
Let Z = n top-ranked instances as candidates ⊆ X (blue points in Figure 8a)
Let S* = subspaces computed with Equation 1 that contain Z (rectangles in Figures 8b and 8c)
Set Q = ∅
while |Q| < b do
  Let x = instance with highest anomaly score ∈ Z s.t. x has minimal overlapping regions in S* with instances in Q
  Set Q = Q ∪ {x} (green circles in Figure 8c)
  Set Z = Z \ {x}
end while
return Q
Assume that the human analyst can label a batch of b instances, where b > 1, in each feedback iteration. Algorithm 2 employs the compact description to achieve this diversity. The algorithm first selects the n (> b) top-ranked anomalous instances Z and computes the corresponding compact description (a small set of subspaces S*). Subsequently, it performs an iterative selection of b instances from Z by minimizing the overlap in the corresponding subspaces from S*, as illustrated in Figure 8.
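A minimal sketch of the Select-Diverse loop in Algorithm 2 follows. The data structures are our own simplification: each candidate is a (score, subspace-ids) pair, where the id set records which compact-description subspaces contain the instance.

```python
def select_diverse(candidates, b):
    """candidates: list of (score, subspace_ids) pairs, where subspace_ids
    is the set of compact-description subspaces containing the instance.
    Greedily picks b instances, preferring instances whose subspaces
    overlap least with those already selected, breaking ties by score."""
    remaining = list(candidates)
    selected, covered = [], set()
    while remaining and len(selected) < b:
        # least overlap with already-selected subspaces; ties go to the
        # highest anomaly score (hence the negated score in the key)
        best = min(remaining, key=lambda c: (len(c[1] & covered), -c[0]))
        selected.append(best)
        covered |= best[1]
        remaining.remove(best)
    return selected

# Two candidates share subspace 0; diversification skips the duplicate.
cands = [(0.9, {0}), (0.8, {0}), (0.7, {1}), (0.6, {2})]
picked = select_diverse(cands, b=3)
print([s for s, _ in picked])  # → [0.9, 0.7, 0.6]
```

Note that the second-highest-scoring candidate is passed over because it lives in the same subspace as the top instance, which is exactly the behavior Select-Diverse aims for.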

Algorithmic Approach to Update Weights of Scoring Function
In this section, we provide an algorithm to update the weights of the scoring function for the HiLAD-Batch instantiation in the batch setting, where the entire data D is available at the outset.
Algorithm 3 HiLAD-Batch (B, w^(0), H, H+, H−)
Input: Query budget B, initial weights w^(0), unlabeled instances H, labeled instances H+ and H−
Set t = 0
while t ≤ B do
  Set t = t + 1
  Set a = H · w (i.e., a is the vector of anomaly scores)
  Let q = z_i, where i = arg max_i(a_i)
  Get the label y_i for z_i from the analyst; add z_i to H+ if y_i = +1, else to H−
  Set H = H \ z_i
  w^(t) = learn new weights; normalize ∥w^(t)∥ = 1
end while
return w^(t), H, H+, H−

Recall that our scoring function is of the following form: Score(x) = w · z, where z ∈ R^m corresponds to the scores from the anomaly detectors for instance x. We extend the AAD approach (based on LODA projections) (Das et al., 2016) to update the weights for tree-based models. AAD makes the following assumptions: (1) a τ fraction of instances (i.e., nτ) are anomalous, and (2) anomalies should lie above the optimal hyperplane while nominals should lie below. AAD tries to satisfy these assumptions by enforcing constraints on the labeled examples while learning the weights of the hyperplane. If the anomalies are rare and we set τ to a small value, then the two assumptions make it more likely that the hyperplane will pass through the region of uncertainty. Our previous discussion then suggests that the optimal hyperplane can now be learned efficiently by greedily asking the analyst to label the most anomalous instance in each feedback iteration. We simplify the AAD formulation with a more scalable unconstrained optimization objective. Crucially, the ensemble weights are updated with the intent to maintain the hyperplane in the region of uncertainty through the entire budget B. The HiLAD-Batch learning approach is presented in Algorithm 3 and depends on only one hyper-parameter, τ. We first define the hinge loss ℓ(q, w; (z_i, y_i)) in Equation 2 that penalizes the model when anomalies are assigned scores lower than q and nominals higher. Equation 3 then formulates the optimization problem for learning the optimal weights in Line 14 of the batch algorithm (Algorithm 3).
Figure 9 illustrates how HiLAD-Batch changes the anomaly score contours across feedback iterations on the synthetic dataset using an isolation forest with 100 trees. λ^(t) determines the influence of the prior. For the batch setup, we set λ^(t) = 0.5 / (|H+| + |H−|) such that the prior becomes less important as more instances are labeled. When there are no labeled instances, λ^(t) is set to 1/2. The third and fourth terms of Equation 3 encourage the scores of anomalies in H+ to be higher than that of z_τ^(t−1) (the nτ-th ranked instance from the previous iteration), and the scores of nominals in H− to be lower than that of z_τ^(t−1). We employ gradient descent to learn the optimal weights w in Equation 3. Our prior knowledge that w_unif is a good prior provides a good initialization for gradient descent. We later show empirically that w_unif is a better starting point than random weights. Here, z_τ^(t−1) and q_τ(w^(t−1)) are computed by ranking the anomaly scores with w = w^(t−1).
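To make the optimization concrete, here is a stripped-down sketch of the weight update: plain gradient descent on hinge terms that penalize anomalies scored below a margin q and nominals scored above it, plus the w_unif prior. This is our own simplification of Equations 2 and 3 (the actual objective has separate terms for the current query and the accumulated labeled sets, and recomputes q from the nτ-th ranked score each iteration).

```python
import math

def update_weights(Z, y, w, q, lam=0.5, lr=0.01, iters=200):
    """Z: per-instance ensemble score vectors; y: labels (+1 anomaly,
    -1 nominal); w: current weights; q: margin score. Minimizes a hinge
    loss that penalizes anomalies scored below q and nominals above q,
    plus lam * ||w - w_unif||^2, then normalizes ||w|| = 1."""
    m = len(w)
    w_unif = [1.0 / math.sqrt(m)] * m
    w = list(w)
    for _ in range(iters):
        # gradient of the prior term pulling w toward w_unif
        grad = [2.0 * lam * (w[j] - w_unif[j]) for j in range(m)]
        for z, yi in zip(Z, y):
            s = sum(wj * zj for wj, zj in zip(w, z))
            if yi * (s - q) < 0:        # hinge active: push the score
                for j in range(m):      # toward the correct side of q
                    grad[j] -= yi * z[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [wj / norm for wj in w]

# One labeled anomaly scored below q shifts weight mass toward the
# ensemble member that scores it highly.
w = update_weights(Z=[[0.9, 0.1]], y=[+1], w=[0.7, 0.7], q=0.8)
assert w[0] > w[1]
```

The prior term keeps the learned hyperplane close to w_unif when few labels are available, which mirrors the role of λ^(t) described above.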

Algorithmic Instantiation of HiLAD for Streaming Data Setting
In this section, we describe algorithms to support human-in-the-loop anomaly detection using tree-based ensembles in the streaming data setting (HiLAD-Stream), where the data comes as a continuous stream.
In the streaming setting, we assume that the data is input to the algorithm continuously in windows of size K and is potentially unlimited. The HiLAD-Stream instantiation is shown in Algorithm 4. Initially, we train all the members of the ensemble with the first window of data. When a new window of data arrives, the underlying tree model is updated as follows: in case the model is an HST or RSF, only the node counts are updated while keeping the tree structures and weights unchanged; whereas, if the model is an IFOR, a subset of the current set of trees is replaced as shown in Update-Model (Algorithm 5). The updated model is then employed to determine which unlabeled instances to retain in memory, and which to "forget". This step, referred to as Merge-and-Retain, applies the simple strategy of retaining only the most anomalous instances among those in the memory and in the current window, and discarding the rest. Next, the weights are fine-tuned with analyst feedback through a human-in-the-loop learning loop similar to the batch setting, with a small budget Q. Finally, the next window of data is read, and the process is repeated until the stream is empty or the total budget B is exhausted. In the rest of this section, we will assume that the underlying tree model is IFOR.
Algorithm 4 HiLAD-Stream (K, B, Q, E^(0), X_0, w^(0), α_KL)
Input: Stream window size K, total query budget B, queries per window Q, anomaly detector ensemble E^(0), initial instances X_0 (used to create E^(0)), initial weights w^(0), significance level α_KL
For each new window of instances X_t:
  Set E^(t), q_KL^(t), P^(t) = Update-Model(X_t, E^(t−1), q_KL^(t−1), P^(t−1), α_KL)
  // Merge-and-Retain(w, H, K) retains the K most anomalous instances in H
  Set H = Merge-and-Retain(w^(t−1), H ∪ X_t, K)
  Query the analyst for up to Q labels from H and update the weights as in the batch setting, stopping once the total budget B is exhausted

Algorithm 5 Update-Model (X, E, q_KL, P, α_KL)
Input: Instances X, anomaly detector ensemble E, current KL threshold q_KL, baseline distributions P, significance level α_KL
Set S = indexes of trees in E whose KL-divergence from their baseline distribution in P exceeds q_KL
if |S| < 2·α_KL·T then
  // the number of trees with divergence is not significant
  return E, q_KL, P
end if
Set E′ = replace trees in E whose indexes are in S, with new trees trained using X
// Recompute threshold and baseline distributions
Set q′_KL = Get-KL-Threshold(X, E′, α_KL, 10)
Set P′ = Get-Ensemble-Distribution(X, E′)
return E′, q′_KL, P′

When we replace a tree in Update-Model, its leaf nodes and corresponding weights get discarded. On the other hand, adding a new tree implies adding all its leaf nodes with weights initialized to a default value v. We first set v = 1/√m′, where m′ is the total number of leaves in the new model, and then re-normalize the updated w to unit length. The HiLAD instantiation for streaming data is presented in Algorithm 4. In all the HiLAD-Stream experiments, we set the number of queries per window Q = 20, and λ^(t) = 1/2.
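The leaf-weight bookkeeping described above can be sketched as follows (a simplified stand-alone version of the update, with hypothetical weight values): new leaves receive the default v = 1/√m′, and the whole weight vector is then renormalized to unit length.

```python
import math

def reinit_weights(old_weights, n_new_leaves):
    """old_weights: surviving leaf weights after discarding the replaced
    trees. New leaves get the default v = 1/sqrt(m'), where m' is the
    total leaf count of the updated model; the combined vector is then
    renormalized to unit length."""
    m_new = len(old_weights) + n_new_leaves
    v = 1.0 / math.sqrt(m_new)
    w = list(old_weights) + [v] * n_new_leaves
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

# Three surviving leaves plus two new ones from a freshly trained tree.
w = reinit_weights([0.5, 0.5, 0.5], n_new_leaves=2)
assert len(w) == 5 and abs(sum(x * x for x in w) - 1.0) < 1e-9
```

Renormalizing keeps the score scale stable across model updates, so the feedback-learned weights of the surviving leaves remain comparable to the freshly initialized ones.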
The HiLAD-Stream approach can be employed in two different situations: (1) limited memory with no concept drift, and (2) streaming data with concept drift. The type of situation determines how fast the model needs to be updated in Update-Model. If there is no concept drift, we need not update the model at all. If there is a large change in the distribution of data from one window to the next, then a large fraction of the members need to be replaced. When we replace a member tree in our tree-based model, all its corresponding nodes along with their learned weights have to be discarded. Thus, some of the "knowledge" is lost with the model update. In general, it is hard to determine the true rate of drift in the data. One approach is to replace, in the Update-Model step, a reasonable number (e.g., 20%) of older ensemble members with new members trained on new data. Although this ad hoc approach often works well in practice, a more principled approach is preferable.
Algorithm 6 Get-KL-Threshold (X, E, α_KL, n)
Input: Instances X, anomaly detector ensemble E, significance level α_KL, number of repetitions of the KL-divergence computations n
Set T = number of trees in E
Initialize KL = 0 ∈ R^T // mean KL-divergence for each tree
for i in 1..n do
  Compute, for each tree, the KL-divergence between the leaf distributions of two random sub-samples of X, and add it to the corresponding component of KL
end for
Set KL = KL / n // average the values
Set q_KL = the (1 − α_KL) × 100 quantile value in KL
return q_KL

Drift Detection Algorithm. Algorithm 5 presents a principled methodology that employs KL-divergence (denoted by D_KL) to determine which trees should be replaced. The set of all leaf nodes in a tree is treated as a set of histogram bins which are then used to estimate the data distribution. We denote the total number of trees in the model by T, and the t-th tree by T_t. When T_t is initially created with the first window of data, the data from the same window is also used to initialize the baseline distribution for T_t, denoted by p_t (Get-Ensemble-Distribution in Algorithm 7). After computing the baseline distributions for each tree, we estimate the D_KL threshold q_KL at the α_KL (typically 0.05) significance level by sub-sampling (Get-KL-Threshold in Algorithm 6). When a new window is read, we first use it to compute the new distribution q_t (for T_t). Next, if q_t differs significantly from p_t (i.e., D_KL(p_t||q_t) > q_KL) for at least 2α_KL·T trees, then we replace all such trees with new ones created using the data from the new window. Finally, if any tree in the forest is replaced, then the baseline densities for all trees are recomputed with the data in the new window.
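The per-tree drift test can be sketched as follows. The leaf-count inputs are hypothetical, and the additive-smoothing constant is our own choice to keep D_KL finite when a leaf is empty in one window.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(p || q) between two leaf-occupancy histograms of the same tree,
    with additive smoothing so empty leaves do not produce infinities."""
    p_tot = sum(p_counts) + eps * len(p_counts)
    q_tot = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_tot
        q = (qc + eps) / q_tot
        kl += p * math.log(p / q)
    return kl

def trees_to_replace(baselines, currents, q_kl, alpha_kl):
    """Flag trees whose divergence from baseline exceeds q_kl; replace
    them only if at least 2 * alpha_kl * T trees drifted, mirroring the
    significance test in Algorithm 5."""
    T = len(baselines)
    drifted = [t for t in range(T)
               if kl_divergence(baselines[t], currents[t]) > q_kl]
    return drifted if len(drifted) >= 2 * alpha_kl * T else []

# Toy forest of 3 trees with 4 leaves each; tree 0's occupancy shifts sharply.
base = [[10, 10, 10, 10]] * 3
curr = [[40, 0, 0, 0], [10, 11, 9, 10], [10, 10, 10, 10]]
print(trees_to_replace(base, curr, q_kl=0.5, alpha_kl=0.05))  # → [0]
```

Only the tree whose leaf-occupancy histogram changed substantially is flagged; small sampling fluctuations (tree 1) stay below the threshold.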
Algorithm 7 Get-Ensemble-Distribution (X, E)
Input: Instances X, anomaly detector ensemble E
Set P = ∅
for t in 1..T do
  Let T_t = t-th tree in E
  Set p_t = Get-Tree-Distribution(X, T_t)
  Set P = P ∪ p_t
end for
return P

Algorithm 8 Get-Tree-Distribution (X, T)
Input: Instances X, tree T
Set p = distribution of instances in X at the leaves of T
return p

Experiments and Results
Datasets. We evaluate our human-in-the-loop learning framework on ten publicly available benchmark datasets ((Woods, Doss, Bowyer, Solka, Priebe, & Kegelmeyer, 1993), (Ditzler & Polikar, 2013), (Harries & University of New South Wales, 1999), UCI (Dheeru & Karra Taniskidou, 2017)) listed in Table 2. The anomaly classes in Electricity and Weather were down-sampled to be 5% of the total.

Evaluation Methodology. For each variant of the human-in-the-loop framework HiLAD, we plot the percentage of the total number of anomalies shown to the analyst versus the number of instances queried; this is the most relevant metric for an analyst in any real-world application. A higher plot means the algorithmic instantiation of the framework is better in terms of discovering anomalies. All results presented are averaged over 10 different runs, and the error bars represent 95% confidence intervals.
Human-in-the-Loop Learning for Batch Data Setting

Experimental Setup. All versions of HiLAD-Batch employ IFOR with the number of trees T = 100 and subsample size 256. The initial starting weights are denoted by w^(0). We normalize the score vector for each instance to unit length such that the score vectors lie on a unit sphere. This normalization helps adhere to the discussion in Section 4, but is otherwise unnecessary. Figure 10 shows that w_unif tends to have a smaller angular separation from the normalized IFOR score vectors of anomalies than from those of nominals. This holds true for most of our datasets (Table 2). Weather is a hard dataset for all anomaly detectors (Wu et al., 2014), as reflected in its angular distribution in Figure 10i. In all our experiments, Unsupervised Baseline shows the number of anomalies detected without any feedback, i.e., using the uniform ensemble weights w_unif; HiLAD-Batch (No Prior - Unif) and HiLAD-Batch (No Prior - Rand) impose no priors on the model, and start human-in-the-loop learning with w^(0) set to w_unif and a random vector, respectively; HiLAD-Batch sets w_unif as the prior, and starts with w^(0) = w_unif. For HST, we present two sets of results with batch input only: HST-Batch with the original settings (T = 25, depth = 15, no feedback) (Tan et al., 2011), and HST-Batch (Feedback), which supports feedback with the HiLAD-Batch strategy (with T = 50 and depth = 8, a better setting for feedback). For RSF, we present the results (RSF-Batch) with only the original settings (T = 30, depth = 15) (Wu et al., 2014), since it was not competitive with the other methods on our datasets. We also compare the HiLAD-Batch variants with the AAD approach (Das et al., 2016) in the batch setting (AAD-Batch). The batch data setting is the most optimistic for all algorithms.

Results for HiLAD-Batch Instantiation
We set the budget B to 300 for all datasets in the batch setting. The results on the four smaller datasets Abalone, ANN-Thyroid-1v3, Cardiotocography, and Yeast are shown in Figure 11. When the algorithm starts from a sub-optimal initialization of the weights and with no prior knowledge (HiLAD-Batch (No Prior - Rand)), more queries are spent hunting for the first few anomalies, and thereafter detection improves significantly. When the weights are initialized to w_unif, which is a reliable starting point (HiLAD-Batch (No Prior - Unif) and HiLAD-Batch), fewer queries are required to find the initial anomalies, and this typically results in a lower variance in accuracy. Setting w_unif as the prior in addition to informed initialization (HiLAD-Batch) performs better than without the prior (HiLAD-Batch (No Prior - Unif)) on Abalone, ANN-Thyroid-1v3, and Yeast. We believe this is because the prior helps guard against noise.
Figure 12 shows the relative performance (% of anomalies discovered) compared to the Unsupervised Baseline. It is clear that HiLAD-Batch discovered up to 250% more anomalies than the baseline algorithm. A key observation is that HiLAD-Batch discovers the biggest portion of anomalies in the early phase of label collection. For the Abalone dataset, HiLAD-Batch discovered 40-60% more anomalies in the initial phase, whereas Loda-AAD-Batch under-performed initially and recovered later. The biggest gain we saw was for the Thyroid dataset, where HiLAD-Batch discovered 250% more anomalies compared to the baseline. The other datasets (Cardiotocography, Yeast) also resulted in higher anomaly discovery using the HiLAD-Batch method.

Results for Diversified Query Strategy
The diversified querying strategy Select-Diverse (Algorithm 2) employs compact descriptions to select instances. Therefore, the evaluation of its effectiveness is presented first. An interactive system can potentially ease the cognitive burden on the analysts by using descriptions to generate a "summary" of the anomalous instances.
We perform a post hoc analysis on the datasets with the knowledge of the original classes (Table 2). It is assumed that each class in a dataset represents a different data-generating process. To measure the diversity at any point in our feedback cycle, we compute the difference between the number of unique classes presented to the analyst per query batch, averaged across all the past batches. The parameter δ for Select-Diverse was set to 5 in all experiments. We compare three query strategies in the batch data setup: HiLAD-Batch-T, HiLAD-Batch-D, and HiLAD-Batch-R. HiLAD-Batch-T simply presents the top three most anomalous instances per query batch. HiLAD-Batch-D employs Select-Diverse to present three diverse instances out of the ten most anomalous instances. HiLAD-Batch-R presents three instances selected at random from the top ten anomalous instances. Finally, HiLAD-Batch greedily presents only the single most anomalous instance for labeling. We find that HiLAD-Batch-D presents a more diverse set of instances than both HiLAD-Batch-T (solid lines) and HiLAD-Batch-R (dashed lines) on most datasets. Figure 13b shows that the number of anomalies discovered (on representative datasets) with the diversified querying strategy is similar to the greedy strategy, i.e., there is no loss in discovery rate to improve diversity.

Human-in-the-Loop Learning for Streaming Data Setting
Experimental Setup. For the HiLAD-Stream instantiation, we employ IFOR with the number of trees T = 100 and subsample size 256. This is similar to the setup for HiLAD-Batch. In all HiLAD-Stream experiments, we set the number of queries per window Q = 20. The total budget B and the stream window size K for the datasets were set respectively as follows: Covtype (3000, 4096), KDD-Cup-99 (3000, 4096), Mammography (1500, 4096), Shuttle (1500, 4096), Electricity (1500, 1024), Weather (1000, 1024). These values are reasonable w.r.t. the dataset's size, the number of anomalies, and the rate of concept drift. The maximum number of unlabeled instances residing in memory is K. When the last window of data arrives, human-in-the-loop learning is continued with the final set of unlabeled data retained in memory until the total budget B is exhausted. The instances are streamed in the same order as they appear in the original public sources. When a new window of data arrives: HiLAD-Stream (KL Adaptive) dynamically determines which trees to replace based on KL-divergence, HiLAD-Stream (Replace 20% Trees) replaces the 20% oldest trees, and HiLAD-Stream (No Tree Replace) creates the trees only once with the first window of data and only updates the weights of the fixed leaf nodes using feedback.

Figure 14: Percentage of total anomalies seen vs. number of queries for the larger datasets in the limited-memory setting. HiLAD-Stream (KL Adaptive) and HiLAD-Stream-D apply the Select-Top and Select-Diverse query strategies, respectively. Mammography, KDD-Cup-99, and Shuttle have no significant drift. Covtype, which has a higher drift, is included here for comparison because it is large.

Results for HiLAD-Stream Instantiation
Limited memory setting with no concept drift. The results on the four larger datasets are shown in Figure 14. The performance is similar to what is seen on the smaller datasets. Among the unsupervised algorithms in the batch setting, IFOR (Unsupervised Baseline) and HST (HST-Batch) are competitive, and both are better than RSF (RSF-Batch). With feedback, HiLAD-Batch is consistently the best performer. HST with feedback (HST-Batch (Feedback)) always performs better than HST-Batch. The streaming algorithm with feedback, HiLAD-Stream (KL Adaptive), significantly outperforms the Unsupervised Baseline and is competitive with HiLAD-Batch. HiLAD-Stream (KL Adaptive) performs better than HST-Batch (Feedback) as well. HiLAD-Stream-D, which presents a more diverse set of instances for labeling, performs similarly to HiLAD-Stream (KL Adaptive). These results demonstrate that the feedback-tuned anomaly detectors generalize to unseen data.

Streaming setting with concept drift. Figure 17 shows the results after integrating drift detection and label feedback with HiLAD-Stream for the datasets which are expected to have significant drift. Both Covtype and Electricity show more drift in the data than Weather (top row in Figure 17). The rate of drift determines how fast the model should be updated. If we update the model too fast, then we lose valuable knowledge gained from feedback which is still valid. On the other hand, if we update the model too slowly, then the model continues to focus on stale subspaces based on past feedback. It is hard to find a common rate of update that works well across all datasets (such as replacing 20% of the trees with each new window of data). Figure 17 (bottom row) shows that the adaptive strategy (HiLAD-Stream (KL Adaptive)), which replaces obsolete trees using KL-divergence as illustrated in Figure 17 (top row), is robust and competitive with the best possible configuration, i.e., HiLAD-Batch.

Parameter sensitivity analysis for HiLAD-Stream
HiLAD-Stream depends on its drift detection measure. It has several approaches to replace ensemble members: i) KL-adaptive, ii) a fixed amount of tree replacement, and iii) no replacement. In the KL-adaptive case, we focused on a 95th percentile threshold to discard less important ensemble members (trees). We varied the decision threshold to observe HiLAD-Stream's sensitivity. A similar exploration was performed for the fixed tree-replacement parameter over a wide range of values: 5%, 10%, 15%, 20%, and 25%. Table 3 presents our algorithm's detection performance under this variation of parameters. The KL-based threshold selection (95%) is the most stable one. In a real-world scenario, determining the fixed replacement factors is infeasible, and the KL-divergence-based threshold approach is the recommended choice. We present support for this claim in Tables 6, 7, and 8 for various numbers of feedback iterations. Our recommended choice discovers a comparable number of anomalies to the best fixed settings. We highlight the performance of our algorithm in bold and underline the best result achieved with a fixed drift amount. In a practical scenario, the amount of drift is unknown, hence setting the right configuration is always a challenging decision. In contrast, our principled approach HiLAD-Stream determines such thresholds dynamically and can reduce the burden on human experts of setting a fixed parameter.

Figure 16: The last data window in each dataset usually has much fewer instances and, therefore, its distribution is very different from that of the previous window despite there being no data drift. Therefore, we ignore the drift in the last window. We did not expect Abalone, Cardiotocography, KDDCup99, Mammography, Shuttle, Yeast, and ANN-Thyroid-1v3 (Figure 15a) to have much drift in the data. This can also be seen in the plots, where most of the windows in the middle of streaming did not result in many trees being replaced (the numbers in parentheses are mostly zero).

Anomaly characterization from label feedback using HiLAD
HiLAD can identify anomalies from different subspaces based on feedback. It takes into account the user's preferences (data point type and spatial location) to re-weight different subspaces. In contrast, classical anomaly detection approaches always propose candidates from the same ranked list. We demonstrate the performance of HiLAD on three cases: one on a synthetic dataset, and two on representative real-world datasets (Yeast and Abalone).
1. Synthetic dataset with two groups of anomalies created to provide controlled feedback. It is clear from the score contour plots (Figure 18) that our algorithm focuses on the regions in which the domain expert is interested.
2. Abalone contains two anomaly classes (Class 3 and Class 21). In our experiments, we provided feedback from one class as anomalies and found that HiLAD ranked anomalies from that class higher, as shown in Figure 19.
Table 3: The fraction of total anomalies seen across all three datasets in the streaming experiment. We explored our algorithm's tree-replacement fraction using both approaches (fixed replacement and KL-threshold). The KL-divergence-based technique with the 95% threshold showed stable performance over fixed tree replacement (marked in bold). We also underline the best results obtained for all datasets.
3. Yeast: We see similar behavior of the algorithm for classes VAC (Figure 20c) and ERL (Figure 20d) as we did for both classes of the Abalone dataset. However, for class POX, we see an initial higher gain with HiLAD-Batch than the unsupervised baseline before the unsupervised baseline overtakes HiLAD-Batch. We took a closer look into the cause for this using t-SNE plots (Figure 21). Here we see that there is one clustered group of anomalies (lower left) that gets discovered fast initially by HiLAD-Batch, which explains its initial steeper rise. However, the rest of the anomalies belonging to POX are mixed within the nominal data and hence are harder to discover. Since these might be spread out among nominals, positive feedback on one such anomaly does not get propagated to the other anomalies efficiently. Moreover, negative feedback on a nominal data instance (a very likely scenario in this situation) lowers the priority of an anomaly that is present in its proximity.

Comparison with Recent Semi-Supervised Approaches
We propose effective instantiations of the generic human-in-the-loop learning framework HiLAD for anomaly discovery. Prior works including (Siddiqui et al., 2018), (Das et al., 2016), and (Das et al., 2017) are notable methods in this line of work. All of these prior methods used ensemble-based approaches as a black-box system (Pevný, 2016). In contrast, we use tree-based ensembles / Isolation Forest (Liu et al., 2008) and exploit the inherent structure of the trees to effectively utilize human feedback. Our experimental evaluation considers all of these works as baselines. Moreover, we performed a fine-grained experimental comparison with (Siddiqui et al., 2018). Figure 22 shows the comparison of HiLAD-Batch with feedback-guided anomaly detection via online optimization (Feedback-guided Online) (Siddiqui et al., 2018). The results for KDDCup99 and Covtype could not be included for Feedback-guided Online because their code resulted in an error (Segmentation Fault) when run with 3000 feedback iterations (a reasonable budget for the large datasets).

Comparison with Limited Labeled Set
In this section, we present a comparison with SOEL using the same experimental setup followed in (Li et al., 2023). SOEL is a semi-supervised algorithm which, as per its original design, does not expect a continuous user feedback cycle. Instead, SOEL assumes that a limited amount of labeled data is already available. Under this experimental setup, the datasets are first split into two parts (train, test). Next, a fixed number of labels B are collected by following one of the nine query strategies from previous works: (1) Mar (Görnitz et al., 2013), (2) Hybr1 (Görnitz et al., 2013), (3) Pos1 (Pimentel et al., 2020), (4) Pos2 (Barnabé-Lortie, Bellinger, & Japkowicz, 2015), (5) Rand1 (Ruff et al., 2018), (6) Rand2 (Trittenbach, Englhardt, & Böhm, 2021), (7) Hybr2, (8) Hybr3 (Ning, Chen, Zhou, & Wen, 2022), and (9) SOEL (Li et al., 2023). Finally, the F1-score for each query strategy is reported, averaged over five runs. The train-test splits and initialization were random for each experiment run. Hybr2 is attributed to a strategy discussed in https://github.com/shubhomoydas/ad_examples but was implemented incorrectly in (Li et al., 2023) with minimum Euclidean distance between queried instances. The correct implementation maximizes the Euclidean distance to improve diversity.

Figure 22: Results comparing HiLAD-Batch with feedback-guided anomaly detection via online optimization (Siddiqui et al., 2018). HiLAD-Batch is the tree-based model implemented in our codebase and employs the AAD loss (anomalies score higher than the τ-th quantile score and nominals lower). Feedback-guided Online employs the linear loss from Siddiqui et al., 2018. Unsupervised Baseline is the unsupervised Isolation Forest baseline. Both approaches perform similarly on most datasets. While HiLAD-Batch has slightly poorer accuracy on Mammography than Feedback-guided Online, HiLAD-Batch performs much better on Weather.
Our first set of comparisons with SOEL (Table 4) is based on the four tabular datasets from (Li et al., 2023). In these datasets, more than 30% of the samples are considered anomalies, especially in the test set. HiLAD performed best on two out of four datasets; SOEL performed best on Satellite, and Hybr1 was best on Ionosphere. Arguably, this setup has a much higher proportion of anomalies than typical real-world scenarios.
Our second set of comparisons with SOEL is based on the seven datasets from Table 2, where the percentage of anomalies in the test set is more realistic (< 10%). We follow the same training methodology as in the previous experiment and report the results in Table 5. Here, HiLAD achieved the best F1-score on six out of seven datasets. A key takeaway from both sets of results (Table 4 and Table 5) is that HiLAD's performance is consistent across different datasets.
Our third set of experiments is again based on datasets from Table 2. Here, we compare the performance of the algorithms as the amount of labeled data B available at training time is increased in the following order: 10, 20, 30, 40, 50. The results are reported in Figure 23. For some datasets, 50 anomalous samples were not available. One interesting observation is that with 10 labeled samples, HiLAD was not the best performer on Cardiotocography. However, as more labeled samples were added, HiLAD's performance improved, unlike the other methods, whose performance did not.

Broader Applicability of Insights and Algorithmic Ideas
We employed the tree-based approach as a device to demonstrate how our generic mechanism for incorporating labeled anomalies can be applied to ensembles of anomaly detectors. This technique can be applied in situations where an ensemble of detectors can be trained inexpensively (e.g., using feature bagging (Lazarevic & Kumar, 2005), LODA (Pevný, 2016), or IFOR (Liu et al., 2008)) or with more sophisticated methods (Ted et al., 2013). We have focused on tree-based methods in this paper due to their popularity and frequent state-of-the-art performance. Our algorithm assumes that the data is i.i.d. and in the form of tabular data with feature values; hence, we did not apply it directly to raw data such as text and images. However, this is not a serious limitation. Anomaly detection methods that use deep learning to extract features (Pang, Shen, Cao, & Hengel, 2021) can employ our proposed algorithm on the representation of the data in the latent space, where the data can be assumed to be i.i.d.
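To illustrate the kind of inexpensive ensemble mentioned above, here is a minimal LODA-style sketch: each member is a random one-dimensional projection with a histogram density estimate, and an instance's score vector holds its negative log-density under each member. This is our simplification for exposition, not Pevný's exact construction (LODA uses sparse projections and principled bin selection), and all names are illustrative.

```python
import numpy as np

def fit_loda(X, n_members=10, n_bins=10, seed=0):
    # Each member: a random projection vector plus a histogram
    # density estimate of the projected training data.
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        w = rng.normal(size=X.shape[1])
        z = X @ w
        hist, edges = np.histogram(z, bins=n_bins, density=True)
        members.append((w, hist, edges))
    return members

def score_vectors(members, X, eps=1e-9):
    # One column per ensemble member; higher values = more anomalous.
    cols = []
    for w, hist, edges in members:
        z = X @ w
        idx = np.clip(np.searchsorted(edges, z) - 1, 0, len(hist) - 1)
        cols.append(-np.log(hist[idx] + eps))
    return np.column_stack(cols)
```

Each column of the returned matrix plays the role of one ensemble member's score, which is exactly the form that a weight-tuning feedback mechanism like ours consumes.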
The proposed technique treats the subspaces at the tree leaves as ensemble components. GLAD (Islam, Das, Doppa, & Natarajan, 2020) builds on the insights from this algorithm and extends it to the extreme case where each instance is treated as a subspace; then, instead of trees, it uses neural networks to incorporate feedback with a loss function conceptually similar to the one proposed in this paper. The strong results of the GLAD algorithm demonstrate that the key ideas behind our algorithm are versatile.
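A minimal sketch of this leaf-as-ensemble-member view: each instance has a vector of per-leaf contributions, the anomaly score is a weighted sum, and label feedback nudges the weights so labeled anomalies score above the current τ-quantile and nominals below it. The function names, the hinge-style update, the learning rate, and the unit-norm projection are our simplifications; the full AAD loss in the paper is richer.

```python
import numpy as np

def scores(W_leaf, w):
    # W_leaf: (n_instances, n_leaves) matrix of per-leaf contributions
    # (e.g., depth-based anomaly contributions of the leaves an instance
    # falls into). Final score is the weighted sum W_leaf . w.
    return W_leaf @ w

def update_weights(W_leaf, w, labels, tau=0.97, lr=0.1):
    # labels: dict {instance_index: +1 (anomaly) or -1 (nominal)}.
    s = scores(W_leaf, w)
    q = np.quantile(s, tau)          # current tau-th quantile score
    for i, y in labels.items():
        margin = y * (s[i] - q)
        if margin < 0:               # hinge-style: only violated labels move w
            w = w + lr * y * W_leaf[i]
    return w / np.linalg.norm(w)     # keep w on the unit sphere
```

Starting from the uniform weight vector, an anomaly labeled below the quantile pulls weight toward the leaves it occupies, raising its relative score on the next ranking round.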

Summary of Experimental Findings
We briefly summarize the main findings of our empirical evaluation of the HiLAD framework.
• A uniform prior over the weights of the scoring function with tree-based ensembles to rank data instances is very effective for human-in-the-loop anomaly detection. The histogram distributions of the angles between score vectors from the Isolation Forest (IFOR) and w_unif show that anomalies are aligned closer to w_unif.
• The diversified query selection strategy (Select-Diverse), based on compact descriptions, improves diversity over the greedy query selection strategy with no loss in anomaly discovery rate.
• The KL-divergence based drift detection algorithm is very robust in detecting and quantifying the amount of drift. In limited-memory settings with no concept drift, feedback-tuned anomaly detectors generalize to unseen data. In the streaming setting with concept drift, our HiLAD-Stream algorithm is robust and competitive with the best possible configuration, namely, HiLAD-Batch.
• HiLAD can discover more anomalies with limited labeled feedback. With more labeled samples, it discovers more anomalies than the baseline methods.
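As a sketch of the drift check summarized above: given per-tree leaf-occupancy counts on a reference window and the current window, trees whose KL divergence exceeds a calibrated threshold are flagged for replacement. In the paper the threshold is calibrated from KL values measured on drift-free reference data (e.g., a 95% quantile); here it is simply passed in, and the helper names are illustrative.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    # KL divergence between two count vectors, smoothed and normalized.
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def drifted_trees(leaf_counts_ref, leaf_counts_cur, threshold):
    # leaf_counts_*: one leaf-occupancy count vector per tree.
    # A tree is flagged when the KL divergence between its reference
    # and current leaf distributions exceeds `threshold`.
    flagged = []
    for t, (p, q) in enumerate(zip(leaf_counts_ref, leaf_counts_cur)):
        if kl(p, q) > threshold:
            flagged.append(t)
    return flagged
```

Replacing only the flagged trees (rather than the whole forest) is what lets the detector adapt to drift while retaining the feedback already incorporated into the stable trees.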

Summary and Future Work
This paper studied a human-in-the-loop learning framework using tree-based anomaly detector ensembles for discovering and interpreting anomalous data instances. We first explained the reason behind the empirical success of tree-based anomaly detector ensembles and called attention to an under-appreciated property that makes them uniquely suitable for label-efficient human-in-the-loop anomaly detection. We demonstrated the practical utility of this property by designing efficient learning algorithms to support this framework. We also showed that tree-based ensembles can be used to compactly describe groups of anomalous instances to discover diverse anomalies and to improve interpretability. To handle streaming data settings, we developed a novel algorithm to detect data drift and associated algorithms to take corrective actions. This algorithm is not only robust, but can also be employed broadly with any ensemble anomaly detector whose members can compute sample distributions, such as tree-based and projection-based detectors.
Our immediate future work includes deploying human-in-the-loop anomaly detection algorithms in real-world systems to measure their accuracy and usability (e.g., qualitative assessment of interpretability and explanations). Developing algorithms for interpretability and explainability of anomaly detection systems is a very important future direction.

Figure 1: High-level overview of the human-in-the-loop learning framework for anomaly detection. Our goal is to maximize the number of true anomalies presented to the analyst.
Illustration of an Isolation Tree on simple data. (a) Toy dataset.

Illustration of differences among different tree-based ensembles. The red rectangles show the union of the 5 most anomalous subspaces across each of the 15 most anomalous instances (blue). These subspaces have the highest influence in propagating feedback across instances through gradient-based learning under our model. HST has a fixed depth, which needs to be high for accuracy (recommended 15).

Figure 5: Illustration of candidate score distributions from an ensemble in 2D. The two axes represent two different ensemble members. (a) C1 represents the common case where both anomaly detectors score anomalous data points higher. (b) C2 illustrates how active learning helps the model learn the slight angle deviation θ. (c) C3 is specific to the IFOR case, where anomalous data points have smaller path lengths; anomalous data points are therefore located at the two extremes where path lengths are smallest and can be separated by a non-homogeneous hyperplane.
Isolation Tree. (a) Toy dataset (Das et al., 2017), which will be used as the running example throughout the text to illustrate the ideas. Red points are anomalies and black points are nominals. (b) Anomaly scores assigned by IFOR to the Toy dataset. (c) Histogram distribution of the angles between score vectors from IFOR and w_unif. The red and green histograms show the angle distributions for anomalies and nominals, respectively. Since the red histograms are closer to the left, anomalies are aligned closer to w_unif.
Top 30 subspaces ranked by w · d (relevance). Red points are anomalies. (a) shows the top 30 most relevant subspaces (w.r.t. their anomalousness) without any feedback. We can see that initially these subspaces simply correspond to the exterior regions of the dataset. Our human-in-the-loop anomaly detection approach learns the true relevance of subspaces via label feedback. (b) provides an illustration comparing the Select-Top and Select-Diverse query selection strategies.
Illustration of compact description and diversity of selected instances for labeling using IFOR. The most anomalous 15 instances (blue checks) are selected as the query candidates. The red rectangles in (a) form the union of the δ (= 5 works well in practice) most relevant subspaces across each of the query candidates. (b) and (c) show the most "compact" set of subspaces which together cover all the query candidates. (b) shows the most anomalous 5 instances (green circles) selected by the greedy Select-Top strategy. (c) shows the 5 "diverse" instances (green circles) selected by Select-Diverse.
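The "compact" set of subspaces in the caption above amounts to a set cover over the query candidates. A minimal greedy sketch, assuming a precomputed boolean membership matrix (the function and parameter names are hypothetical):

```python
import numpy as np

def compact_description(cover, delta=5):
    # cover[s][i] is True iff candidate subspace s contains query
    # candidate i. Greedily pick up to `delta` subspaces, each time
    # taking the one covering the most still-uncovered candidates.
    cover = np.asarray(cover, bool)
    uncovered = np.ones(cover.shape[1], bool)
    chosen = []
    while uncovered.any() and len(chosen) < delta:
        gains = (cover & uncovered).sum(axis=1)  # new candidates per subspace
        s = int(np.argmax(gains))
        if gains[s] == 0:
            break
        chosen.append(s)
        uncovered &= ~cover[s]
    return chosen
```

The greedy rule is the standard approximation for set cover; the returned subspaces serve both as the description of the anomalous region and as the grouping used by Select-Diverse.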
Histogram distribution of the angles between score vectors from IFOR and w_unif for all datasets. The red and green histograms show the angle distributions for anomalies and nominals, respectively. Since the red histograms are closer to the left, anomalies are aligned closer to w_unif.
Percentage of total anomalies seen vs. the number of queries for the smaller datasets in the batch setting.
Relative % of anomalies seen compared to the Unsupervised Baseline for smaller datasets in the batch setting.
Figure 13: Results comparing the diversified querying strategy HiLAD-Batch-D with the baseline query strategies HiLAD-Batch-T and HiLAD-Batch-R. The x-axis in (a) shows the number of query batches (of batch size 3). The y-axis shows the difference in the number of unique classes seen, averaged across all batches up to a particular batch. The solid lines in (a) show the average difference in unique classes seen between HiLAD-Batch-D and HiLAD-Batch-T; the dashed lines show the average difference between HiLAD-Batch-D and HiLAD-Batch-R. (b) presents the discovery performance comparison between HiLAD-Batch-D (dashed line) and the HiLAD-Batch baseline (solid line).
Figure 15: Results for drift detection across windows. (a) When there is no drift, as in ANN-Thyroid-1v3, no trees are replaced for most of the windows, and the older model is retained. (b) When there is drift, as in Covtype, the trees are more likely to be replaced.
Results for integrated drift detection and label feedback with the HiLAD-Stream algorithm. The top row shows the number of trees replaced per window when drift was detected relative to the previous window(s). The bottom row shows the percentage of total anomalies seen vs. the number of queries for the streaming datasets with significant concept drift.

Figure 18: Class-wise feedback experiment with synthetic datasets. Figures (a) and (c) present two types of anomalies located at two corners, while (b) and (d) show how the algorithm ranks data points across the data space.
The Abalone dataset contains anomalies from two different classes, known as Class 3 and Class 21. The figures show (a) feedback with HiLAD without class preferences, (b) the case where the domain expert is interested in examples from Class 21, and (c) the case where Class 3 is the class of interest. In all cases, we notice that feedback on a particular class helps us recognize more anomalies from that class. (Note that in (b) and (c) the y-axis represents the fraction of anomalies discovered from that class.)

Yeast contains anomalies from three classes, known as VAC, POX, and ERL. These plots show that HiLAD-Batch is able to identify anomalous examples of a specific type; the results of the experiment where we provide feedback on only one class of anomalies are shown in Figure 20. (Note that the y-axis represents the fraction of anomalies discovered from that class.)
Figure 21: tSNE plots for Abalone and Yeast. (a) Abalone tSNE with class labels. (b) Yeast tSNE with class labels. Samples of each class are marked and color-coded; they are spread widely across the data space.

Table 2: Description of benchmark datasets used in our experiments.

Table 4: F1-score (%) with standard deviation for anomaly detection on tabular data with query budget |B| = 10. HiLAD performs best on two of the four datasets, while SOEL and Hybr1 perform best on the remaining two.

Table 5: F1-score (%) with standard deviation for anomaly detection on tabular data with query budget |B| = 10. HiLAD performs best on six of the seven datasets.

Table 8: Fraction of total anomalies discovered for the Weather dataset. We varied the fraction of replaced trees for both the fixed and KL-divergence based approaches. The KL-divergence based approach with a 95% threshold shows stable results compared with the other parameter settings (this row is marked in bold). The best result for each label-feedback budget is underlined.