FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. Training such models and exploring their hyperparameter space, however, is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, their analysis has been limited to homogeneous models that use fixed dimensionality throughout the network, which leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, whose encoder layers vary in both the set of possible operations and the hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. A FlexiBERT model that matches the performance of the best homogeneous model is 2.6× smaller. FlexiBERT-Large, another proposed model, achieves state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.


Introduction
In recent years, self-attention (SA)-based transformer models (Vaswani et al., 2017; Devlin et al., 2019) have achieved state-of-the-art results on tasks that span the natural language processing (NLP) domain. Large-scale pre-training datasets, increasing computational power, and robust training techniques (Liu et al., 2019) drive this burgeoning success. A challenge that remains is efficient optimal model selection for a specific task and a set of user requirements. In this context, one should train only models with the maximum predicted performance. This falls in the domain of neural architecture search (NAS) (Zoph & Le, 2017).

Challenges
The design space of transformer models is vast. Rigorous search has yielded several successful models in the past; popular ones include BERT, XLM, XLNet, BART, ConvBERT, and FNet (Devlin et al., 2019; Conneau & Lample, 2019; Yang et al., 2019; Lewis et al., 2020; Jiang et al., 2020; Lee-Thorp et al., 2022). Transformer design involves a choice of several hyperparameters, including the number of layers, the size of the hidden embeddings, the number of attention heads, and the size of the hidden layer in the feed-forward network (Khetan & Karnin, 2020). This makes the design space exponentially large, so a brute-force approach to exploring it is computationally infeasible (Ying et al., 2019). The aim is to converge to an optimal model as quickly as possible by testing the lowest possible number of datapoints (Pham et al., 2018). Moreover, model performance may not be deterministic, requiring heteroscedastic modeling (Ru et al., 2020).

Existing Solutions and Motivation
Recent NAS advancements use various techniques to explore and optimize different models in the deep learning domain, from image recognition to speech recognition and machine translation (Zoph & Le, 2017; Mazzawi et al., 2019). In the computer-vision domain, various search approaches, such as genetic algorithms, reinforcement learning, and structure adaptation, realize diverse convolutional neural network (CNN) architectures. Some even introduce new basic operations (Zhang et al., 2018) to enhance performance on different tasks. Many works leverage a performance predictor, often called a surrogate model, to reliably predict model accuracy. One can train such a surrogate through active learning by querying a few models from the design space and regressing their performance to the remaining space (under some theoretical assumptions), thus significantly reducing search times (Siems et al., 2020; White et al., 2021b).
Unlike CNN frameworks (Ying et al., 2019; Tan & Le, 2019), meant for vision tasks, there is no universal framework for NLP that differentiates among transformer architectural hyperparameters. Works that do compare different design decisions often do not consider heterogeneity and flexibility in their search space and explore the space over a limited hyperparameter set (Khetan & Karnin, 2020; Xu et al., 2021; Gao et al., 2022). For instance, Primer (So et al., 2021) only adds depth-wise convolutions to the attention heads; AutoBERT-Zero (Gao et al., 2022) lacks deep feed-forward stacks; AutoTinyBERT (Yin et al., 2021) does not consider linear transforms (LTs) that outperform traditional SA operations in terms of parameter efficiency; AdaBERT (Chen et al., 2021) only considers a design space of convolution and pooling operations. Most works in the field of NAS for transformers target model compression while trying to maintain the same performance (Chen et al., 2021; Yin et al., 2021; Wang et al., 2020), which is orthogonal to our objectives in this work, i.e., searching for novel architectures that push the performance frontier.

Search techniques of related works: AdaBERT (Chen et al., 2021), DS; AutoTinyBERT (Yin et al., 2021), ST; DynaBERT (Hou et al., 2020), ST; NAS-BERT (Xu et al., 2021), ST; AutoBERT-Zero (Gao et al., 2022), ES; FlexiBERT (ours), BOSHNAS.

Table 1: Comparison of related works with different parameters (✓ indicates that the corresponding feature is present). Adaptive width refers to different architectures having possibly different hidden dimensions (albeit each layer within the architecture having the same hidden dimension). Full flexibility corresponds to each encoder layer having, possibly, a different hidden dimension.

In addition, all previous works only consider rigid architectures. For instance, DynaBERT (Hou et al., 2020) only adapts the width of the network by varying the number of attention heads (and not the hidden dimension of each head), which is only a simple extension to traditional architectures. Further, their individual models still have the same hidden dimension throughout the network.
AutoTinyBERT (Yin et al., 2021) and HAT (Wang et al., 2020), among others, fix the input and output dimensions for each encoder layer (see Appendix A.1 for a background on the SA operation), which leads to rigid architectures.
Table 1 gives an overview of various baseline NAS frameworks for transformer architectures. It presents the aforementioned works and the respective features they include. Primer (So et al., 2021) and AutoBERT-Zero (Gao et al., 2022) exploit evolutionary search (ES), which faces various drawbacks that limit elitist algorithms (Dang et al., 2021; White et al., 2021a; Siems et al., 2020). AdaBERT (Chen et al., 2021) leverages differentiable architecture search (DS), a popular technique used in many CNN design spaces (Siems et al., 2020). On the other hand, some recent works like AutoTinyBERT (Yin et al., 2021), DynaBERT (Hou et al., 2020), and NAS-BERT (Xu et al., 2021) leverage super-network training (ST), where they train one large transformer and search its sub-networks in a one-shot manner. However, this technique is not amenable to diverse design spaces, as the super-network size would drastically increase, limiting the gains from weight transfer to the relatively minuscule sub-network. Moreover, previous works limit their search to either the standard SA operation, i.e., the scaled dot-product (SDP), or the convolution operation. We extend the basic attention operation to also include the weighted multiplicative attention (WMA). Taking motivation from recent advances with LT-based transformer models (Lee-Thorp et al., 2022), we also add the discrete Fourier transform (DFT) and discrete cosine transform (DCT) to our design space. AutoTinyBERT and DynaBERT also allow adaptive widths in the transformer architectures in their design space. However, each instance still has the same dimensionality throughout the network (in other words, every encoder layer has the same hidden dimension, as explained above). We mathematically detail why this is inherently a limitation in traditional transformer architectures in Appendix A.1. FlexiBERT, to the best of our knowledge, is the first framework to allow full flexibility: not only can different transformer instances in the design space have distinct widths, but each encoder layer within a transformer instance can also have a different hidden dimension. This results in a massive design space with 3.32 billion transformer architectures. Searching this space via a brute-force technique would be computationally infeasible. Hence, we leverage a novel NAS technique, Bayesian Optimization using Second-Order Gradients and Heteroscedastic Models for Neural Architecture Search (BOSHNAS), to search for the best-performing architecture in this enormous design space.

Our Contributions
To address the limitations of homogeneous and rigid models, we make the following technical contributions:
• We expand the design space of transformer hyperparameters to incorporate heterogeneous architectures that venture beyond simple SA by employing other operations like convolutions and LTs.
• We propose novel projection layers and relative/trained positional encodings to make hidden sizes flexible across layers, hence the name FlexiBERT.
• We propose Transformer2vec that uses similarity measures to compare computational graphs of transformer models to obtain a dense embedding that captures model similarity in a Euclidean space.
• We propose a novel NAS framework, namely, BOSHNAS. It uses a neural network as a heteroscedastic surrogate model and second-order gradient-based optimization using backpropagation to input (GOBI) (Tuli et al., 2021) to speed up the search for the next query in the exploration process. It leverages nearby trained models to transfer weights in order to reduce the amortized search time for every query.
• Experiments on the GLUE benchmark (Wang et al., 2018) show that BOSHNAS applied to the FlexiBERT design space results in a score improvement of 0.4% compared to the baseline, i.e., NAS-BERT (Xu et al., 2021). The proposed model, FlexiBERT-Mini, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. FlexiBERT also outperforms the best homogeneous architecture by 3%, while requiring 2.6× fewer parameters. FlexiBERT-Large, our BERT-Large (Devlin et al., 2019) counterpart, outperforms the state-of-the-art models by at least 5.7% average accuracy on the first eight tasks in the GLUE benchmark (Wang et al., 2018).
We organize the rest of the paper as follows. Section 2 presents related work. Section 3 describes the set of steps and decisions that undergird the FlexiBERT framework. In Section 4, we present the results of design space exploration experiments. Finally, Section 5 concludes the article.

Background and Related Work
We briefly describe related work next.

Transformer Design Space
Traditionally, transformers have primarily relied on the SA operation (Vaswani et al., 2017). Nevertheless, several works have proposed various compute blocks to reduce the number of model parameters, and hence computational cost, without compromising performance. For instance, ConvBERT uses dynamic span-based convolutional operations that replace SA heads to model local dependencies directly (Jiang et al., 2020). Recently, FNet improved model efficiency using LTs instead (Lee-Thorp et al., 2022). MobileBERT, another recent architecture, uses bottleneck structures and multiple feed-forward stacks to obtain smaller and faster models while achieving competitive results on well-known benchmarks (Sun et al., 2020). For completeness, we present other previously proposed advances to improve the BERT model in Appendix A.2.

Neural Architecture Search
NAS is an important machine learning technique that algorithmically searches for new neural network architectures within a pre-specified design space under a given objective (He et al., 2021). Prior work implements NAS using various techniques, albeit limited to the CNN design space. A popular approach is to use a reinforcement learning algorithm, REINFORCE, that is superior to other tabular approaches (Williams, 1992). Other approaches include Gaussian-Process-based Bayesian Optimization (GP-BO) (Snoek et al., 2012), ES (Real et al., 2019; Lu et al., 2019), etc. However, these methods come with challenges that limit their ability to reach state-of-the-art results in the CNN design space (White et al., 2021a).
Recently, NAS has also seen the application of surrogate models for performance prediction in CNNs (Siems et al., 2020). This results in the training of much fewer models to predict accuracy for the entire design space under some confidence constraints. However, these predictors are computationally expensive to train. This leads to a bottleneck, especially in large design spaces, in the training of subsequent models since we produce new queries only after we train this predictor for every batch of trained models in the search space. Siems et al. (2020) use a Graph Isomorphism Net (Xu et al., 2019) that regresses performance values directly on the computational graphs formed for each CNN model.
Although previously restricted to CNNs (Zoph et al., 2018), NAS has recently seen applications in the transformer space as well. So et al. (2019) use standard NAS techniques to search for optimal transformer architectures. However, their method trains every new model from scratch. Furthermore, they do not employ knowledge transfer, which transfers weights from previously trained neighboring models to speed up subsequent training. This is important in the transformer space since pre-training every model is computationally expensive. Further, the attention heads in their model follow the same dimensionality, i.e., are not fully flexible.
One of the state-of-the-art NAS techniques, BANANAS, implements Bayesian Optimization (BO) over a neural network model and predicts performance uncertainty using ensemble networks that are, however, too compute-heavy (White et al., 2021a). BANANAS uses mutation/crossover on the current set of best-performing models and obtains the next best-predicted model in this local space. Instead, we propose using GOBI (Tuli et al., 2021) to efficiently search for the next query in the global space. Thanks to random cold restarts, GOBI can search over diverse models in the architecture space. BANANAS also uses path embeddings, which perform sub-optimally for search over a diverse space (Cheng et al., 2021).

Graph Embeddings that Drive NAS
Many works on NAS for CNNs use graph embeddings to model their performance predictor. Each computational graph has a corresponding embedding, representing a specific CNN architecture in the design space. A popular approach to learning with graph-structured data is to make use of graph kernel functions that measure similarity between graphs. A recent work, NASGEM (Cheng et al., 2021), uses the Weisfeiler-Lehman (WL) sub-tree kernel, which compares tree-like substructures of two computational graphs. This helps distinguish between substructures that other kernels, like random walk, may deem identical (Shervashidze et al., 2011). The WL kernel also has an attractive computational complexity, which has made it one of the most widely used graph kernels. Graph-distance-driven NAS often leads to enhanced representation capacity that yields optimal search results (Cheng et al., 2021). However, the WL kernel only computes sub-graph similarities based on overlap in graph nodes. It does not consider whether or not two nodes are inherently similar. For example, a computational 'block' (or its respective graph node) for an SA head with h = 128 and o = SDP should be considered closer to another attention block with, say, h = 256 and o = WMA, but farther from a block representing a feed-forward layer.
Once we have computed similarities between every possible graph pair in the design space, we learn dense embeddings whose Euclidean distances should follow the similarity function. These embeddings are helpful not only for effective visualization of the design space but also for fast computation of neighboring graphs in the active-learning loop. Further, a dense embedding lets us train a finite-input surrogate function in practice (as opposed to the sparse path encodings used by White et al., 2021a). Many works have achieved this using different techniques. Narayanan et al. (2017) train task-specific graph embeddings using a skip-gram model and negative sampling, taking inspiration from word2vec (Mikolov et al., 2013). In this work, we instead take inspiration from GloVe (Pennington et al., 2014) by applying manifold learning to all distance pairs (Kruskal, 1964). Hence, using global similarity distances built over domain knowledge and batched gradient-based training, we obtain the proposed Transformer2vec embeddings, which are superior to traditional generalized graph embeddings.
We take motivation from NASGEM (Cheng et al., 2021), which showed that training a WL kernel-guided encoder has advantages in scalable and flexible search. Thus, we train a performance predictor on the Transformer2vec embeddings, which not only aid in the transfer of weights between neighboring models but also support better-posed continuous performance approximation. More details on the computation of these embeddings are given in Section 3.3.

Methodology
In this work, we train a heteroscedastic surrogate model that predicts the performance of a transformer architecture and use it to run second-order optimization in the design space.
We do this by decoupling the training procedure from the pre-processing of the embedding of every model in the design space to speed up training. First, we train embeddings to map the space of computational graphs to a Euclidean space (Transformer2vec) and then train the surrogate model on these embeddings.
Our work involves exploring a vast and heterogeneous design space and searching for optimal architectures for a given task. To this end, we (a) define a design space via a flexible set of architectural choices (see Section 3.1), (b) generate possible computational graphs (G; see Section 3.2), (c) learn an embedding for each point in the space using a distance metric for graphs (∆; see Section 3.3), and (d) employ a novel search technique (BOSHNAS) based on surrogate modeling of the performance and its uncertainty over the continuous embedding space (see Section 3.4). In addition, to tackle the enormous design space, we propose a hierarchical search technique that iteratively searches over finer-grained models derived from (e) a crossover of the best models obtained in the current iteration and their neighbors. Figure 1 gives a broad overview of the FlexiBERT pipeline, as explained above. We show an unrolled version of this iterative flow below:

G_1 → ∆_1 → BOSHNAS → best models in G_1 → crossover → G_2 → ∆_2 → BOSHNAS → best models in G_2 → crossover → G_3 → ∆_3 → BOSHNAS → best architecture.

However, for simplicity of notation, we omit the iteration index in further references. We now discuss the key elements of this pipeline in detail.

Table 2: Design space description. Superscript (j) depicts the value for layer j.

FlexiBERT Design Space
We now describe the FlexiBERT design space, i.e., box (a) in Figure 1.

Set of Operations in FlexiBERT
The traditional BERT model comprises multiple layers, each containing a bidirectional multi-headed SA module followed by a feed-forward module. Previous works propose several modifications to the original encoder, primarily to the attention module. This gives rise to a richer design space. We consider WMA-based SA in addition to SDP-based operations (Luong et al., 2015). We also incorporate LT-based attention in FNet (Lee-Thorp et al., 2022) and dynamic-span-based convolution (DSC) in ConvBERT (Jiang et al., 2020), in place of the vanilla SA mechanism. Whereas the original FNet implementation uses DFT, we also consider DCT. The motivation behind using DCT is its widespread application in lossy data compression, which we believe can lead to sparse weights, thus leaving room for optimizations with sparsity-aware machine learning accelerators (Yu & Jha, 2022). Our design space allows variable kernel sizes for convolution-based attention. Consolidating different attention module types that vary in their computational costs into a single design space enables the models to have inter-layer variance in expression capacity. Inspired by MobileBERT (Sun et al., 2020), we also consider architectures with multiple feed-forward stacks. We summarize the entire design space with the range of each operation type in Table 2. The ranges of different hyperparameters are in accordance with the design space spanned by BERT-Tiny to BERT-Mini (Turc et al., 2019), with additional modules included as discussed. We call this the Tiny-to-Mini space. This restricts our curated testbed to models with up to 3.3 × 10^7 trainable parameters. This curated parameter space allows us to perform extensive experiments, comparing the proposed approach against various baselines.
We can express every model in the design space via a model card, a dictionary containing the chosen value for each design decision. We represent BERT-Tiny (Turc et al., 2019) in this formulation as

{l: 2, o: [SA, SA], h: [128, 128], n: [2, 2], f: [[512], [512]], p: [SDP, SDP]},

where the length of the list for every entry in f denotes the size of the feed-forward stack. We employ the model card to derive the computational graph of the model using smaller modules inferred from the design choices (details in Section 3.2).

Flexible Hidden Dimensions
Traditional transformer architectures restrict the flow of information using a constant embedding dimension across the network (a matrix of dimensions N_T × h from one layer to the next, where N_T denotes the number of tokens and h the hidden dimension; more details in Appendix A.1). Instead, we allow architectures in our design space to have flexible dimensions across layers. This enables different layers to capture information of different dimensions, as the network learns more abstract features deeper into the network. For this, we make the following modifications:
• Projection layers: We add an affine projection network between encoder layers with dissimilar hidden sizes to transform encoding dimensionality.
• Relative positional encoding: The vanilla-BERT implementation uses an absolute positional encoding at the input and propagates it ahead through residual connections.
Since we relax the restriction of a constant hidden size across layers, this does not apply to many models in our design space (as the learned projections for absolute encodings may not be one-to-one). Instead, we add a relative positional encoding at each layer (Shaw et al., 2018; Huang et al., 2018; Yang et al., 2019). Such an encoding can entirely replace absolute positional encodings with relative position representations learned using the SA mechanism. Whereas the SA module implementation remains the same as in previous works, for DSC-based and LT-based attention, we learn the relative encodings separately using SA and add them to the output of the attention module.
Formally, let Q and V denote the query and the value layers, respectively. Let R denote the relative embedding tensor that the model needs to learn. Let Z and X denote the output and the input tensors of the attention module, respectively. In addition, let us define LT-based attention and DSC-based attention as LT(•) and DSC(•), respectively. Then,

Z = LT(X) + Softmax(QR^T/√h)V for LT-based attention, and
Z = DSC(X) + Softmax(QR^T/√h)V for DSC-based attention.

One should note that the proposed approach is only applicable when the positional encodings are trained instead of being predetermined (Vaswani et al., 2017). The proposed relative and trained positional encodings enable us to make the dimensionality of data flow flexible across the network layers. This also means that each layer in the feed-forward stack can have a distinct hidden dimension.
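To make this concrete, the following is a minimal PyTorch sketch of an LT-based attention head with learned relative encodings, mirroring the LT case above; the module names, the DFT-over-tokens choice, and the per-position parameterization of R are illustrative assumptions rather than the exact FlexiBERT implementation.

```python
import torch
import torch.nn as nn

class LTAttentionWithRelPos(nn.Module):
    """LT-based attention head with learned relative encodings (sketch).
    The relative term Softmax(QR^T/sqrt(h))V is added to the output of the
    linear transform, here a DFT over the token dimension."""
    def __init__(self, hidden_dim: int, max_tokens: int = 128):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)
        # One learnable relative embedding per position (illustrative parameterization of R).
        self.rel_emb = nn.Parameter(torch.randn(max_tokens, hidden_dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, tokens, hidden)
        lt_out = torch.fft.fft(x, dim=1).real              # LT(X): DFT over the sequence
        q = self.query(x)                                   # (batch, tokens, hidden)
        r = self.rel_emb[: x.size(1)]                       # (tokens, hidden)
        scores = torch.softmax(q @ r.T / x.size(-1) ** 0.5, dim=-1)
        return lt_out + scores @ self.value(x)              # Z = LT(X) + Softmax(QR^T/sqrt(h))V

x = torch.randn(2, 16, 128)                                 # 2 sequences, 16 tokens, h = 128
print(LTAttentionWithRelPos(128)(x).shape)                  # torch.Size([2, 16, 128])
```

Because the relative encodings are learned per layer, the same construction applies unchanged when adjacent layers use different hidden dimensions.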

Graph Library
We now describe the graph library, i.e., box (b) in Figure 1.

Block-level Computational Graphs
To learn a lower-dimensional dense manifold of the given design space, characterized by a large number of FlexiBERT models, we convert each model into a computational graph. We formulate this graph based on the forward flow of connections for each compute block. For our design space, we take all possible combinations of the compute blocks derived from the design decisions presented in Table 2 (see Appendix B.1 for a list of possible compute blocks supported in FlexiBERT). Using this design space and the set of compute blocks, we create all possible computational graphs within the design space for every transformer model. We then use recursive hashing as follows (Ying et al., 2019). For every node in this graph, we concatenate the hash of its input, the hash of that node, and the hash of its output, and then take the hash of the result. We use SHA256 as our hashing function. Doing this for all nodes and then hashing the concatenated hashes gives us the resultant hash of a given computational graph. This helps us detect isomorphic graphs and remove redundancy.
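A small Python sketch of this hashing scheme is shown below; the graph representation (a label dictionary plus predecessor and successor lists) and the single-round neighborhood hashing are simplifying assumptions for illustration.

```python
import hashlib

def sha256(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def graph_hash(labels, preds, succs):
    """Hash a block-level computational graph for isomorphism detection (sketch).
    For every node, hash the concatenation of (hashes of its input blocks, the hash
    of the node itself, hashes of its output blocks); hashing the sorted concatenation
    of all node hashes gives the graph hash."""
    node_hashes = []
    for n, label in labels.items():
        in_h = ''.join(sorted(sha256(labels[m]) for m in preds.get(n, [])))
        out_h = ''.join(sorted(sha256(labels[m]) for m in succs.get(n, [])))
        node_hashes.append(sha256(in_h + sha256(label) + out_h))
    return sha256(''.join(sorted(node_hashes)))

# A linear graph: input -> attention head -> feed-forward layer -> output.
labels = {0: 'input', 1: 'h-128/SA-SDP', 2: 'f-512', 3: 'output'}
preds = {1: [0], 2: [1], 3: [2]}
succs = {0: [1], 1: [2], 2: [3]}
print(graph_hash(labels, preds, succs))  # identical hashes flag isomorphic graphs
```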
Figure 2 shows the block-level computational graph for BERT-Tiny. Using the connection patterns for every possible block permutation, we can generate multiple graphs for the given design space.

Levels of Hierarchy
The total number of possible graphs in the design space with heterogeneous feed-forward hidden layers is ∼3.32 billion. This is substantially larger than any transformer design space used in the past.
To make our approach tractable, we propose a hierarchical search method. We consider each model in the design space to be composed of multiple stacks containing at least one encoder layer. In the first step, we restrict each stack to s = 2 layers, where each layer in a stack shares the same design configuration. Naturally, this limits the search space size (we denote the set of all graphs in this space by G_1). Hence, for instance, BERT-Tiny falls under G_1 since its two encoder layers have the same configuration. We learn embeddings in this space and then run NAS to obtain the best-performing models. In the subsequent step, we consider a design space constituted by a finer-grained neighborhood of these models. We derive the neighborhood using pairwise crossover between the best-performing models and their neighbors in a space where the number of layers per stack is s/2 = 1, denoted by G_2 (a detailed explanation of the crossover operation is in Appendix B.4). Finally, we include heterogeneous feed-forward stacks (s = 1*) and denote the space by G_3.

Transformer2vec
We now describe the Transformer2vec embedding and how we create an embedding library from a graph library G, i.e., box (c) in Figure 1.

Graph Edit Distance
Taking inspiration from Cheng et al. (2021) and Pennington et al. (2014), we train dense embeddings using global distance metrics, such as the Graph Edit Distance (GED) (Abu-Aisheh et al., 2015). These embeddings enable fast derivation of neighboring graphs in the active learning loop to facilitate the transfer of weights. We call them Transformer2vec embeddings. Unlike other approaches like the WL kernel, GED bakes domain knowledge into graph comparisons, as explained in Section 2.3, by using a weighted sum of node insertion, deletion, and substitution costs.
For the GED computation, we first sort all possible compute blocks in the order of their computational complexity. Then, we weight the insertion and deletion cost for every block based on its index in this sorted list, and the substitution cost between two blocks based on the difference in their indices in this sorted list. For computing the GED, we use a depth-first algorithm that requires less memory than traditional methods (Abu-Aisheh et al., 2015).
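The sketch below illustrates such a weighted GED computation with networkx, which accepts user-defined node insertion, deletion, and substitution costs; the block list, its complexity ordering, and the toy graphs are assumptions for illustration rather than the exact costs used in FlexiBERT.

```python
import networkx as nx

# Compute blocks sorted by assumed computational complexity; the index acts as the weight.
BLOCKS = ['f-512', 'f-1024', 'h-128/LT-DFT', 'h-128/LT-DCT', 'h-128/DSC-5',
          'h-128/SA-SDP', 'h-128/SA-WMA', 'h-256/SA-SDP', 'h-256/SA-WMA']
IDX = {b: i + 1 for i, b in enumerate(BLOCKS)}

def ins_cost(n):         # insertion cost weighted by the block's complexity index
    return IDX[n['block']]

def del_cost(n):         # deletion cost weighted the same way
    return IDX[n['block']]

def subst_cost(n1, n2):  # substitution cost ~ difference of complexity indices
    return abs(IDX[n1['block']] - IDX[n2['block']])

def chain(blocks):       # a linear computational graph over the given blocks
    g = nx.DiGraph()
    for i, b in enumerate(blocks):
        g.add_node(i, block=b)
        if i:
            g.add_edge(i - 1, i)
    return g

g1 = chain(['h-128/SA-SDP', 'f-512', 'h-128/SA-SDP', 'f-512'])
g2 = chain(['h-128/LT-DFT', 'f-1024', 'h-128/SA-SDP', 'f-512'])
print(nx.graph_edit_distance(g1, g2, node_subst_cost=subst_cost,
                             node_del_cost=del_cost, node_ins_cost=ins_cost))
```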

Training Embeddings
Given that there are S graphs in G, we compute the GED for all possible computational graph pairs. This gives us a dataset of N = S(S−1)/2 distances. To train the embedding, we minimize the mean-square error between the predicted Euclidean distance and the corresponding GED. For the design space in consideration, we generate d-dimensional embeddings for every level of the hierarchy. Concretely, to train embedding T, we minimize the loss

L_T = Σ_{i<j} ( d(T(g_i), T(g_j)) − GED(g_i, g_j) )²,

where d(•, •) is the Euclidean distance and we calculate the GED for the corresponding computational graphs g_i, g_j ∈ G.
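A minimal PyTorch sketch of this training step follows; it assumes the pairwise GED matrix is precomputed and uses plain Adam on the mean-square error between embedding distances and GEDs (the toy GED matrix and hyperparameters are illustrative).

```python
import torch

def train_embeddings(ged, dim=16, epochs=2000, lr=0.05):
    """Learn dense embeddings whose pairwise Euclidean distances match the GEDs (sketch)."""
    n = ged.shape[0]
    emb = torch.randn(n, dim, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    i, j = torch.triu_indices(n, n, offset=1)          # all unordered graph pairs
    for _ in range(epochs):
        opt.zero_grad()
        dist = torch.norm(emb[i] - emb[j], dim=-1)      # predicted Euclidean distances
        loss = torch.mean((dist - ged[i, j]) ** 2)      # mean-square error against the GED
        loss.backward()
        opt.step()
    return emb.detach()

# Toy example with 5 graphs and a symmetric random GED matrix.
torch.manual_seed(0)
upper = torch.triu(torch.rand(5, 5) * 10, diagonal=1)
print(train_embeddings(upper + upper.T, dim=2).shape)   # torch.Size([5, 2])
```

In practice, sampling batches of pairs rather than all pairs at once keeps this tractable for the full graph library.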

Weight Transfer among Neighboring Models
Pre-training each model in the design space is computationally expensive. Hence, we rely on weight sharing to initialize a query model in order to directly fine-tune it and minimize exploration time (details in Appendix B.3). We generate the k nearest neighbors of a graph in the design space (we use k = 100 for our experiments). Then, naturally, we would like to transfer weights from the corresponding fine-tuned neighbor that is closest to the query, as such models intuitively have similar initial internal representations. We calculate this similarity using a biased overlap measure that counts the number of encoder layers from the input to the output that are common to the current graph (i.e., have exactly the same hyperparameter values). We stop counting the overlap on encountering different encoder layers, regardless of subsequent overlaps. In this ranking, there could be more than one graph with the same biased overlap with the current graph. Since the learned internal representations depend on the subsequent set of operations as well, we break ties based on the embedding distance of these graphs from the current graph. This gives us, for every query model q, a set of neighbors, denoted by N_q, ranked based on both the biased overlap and the embedding distance. It helps increase the probability of finding a trained neighbor with high overlap.
As a hard constraint, we only consider transferring weights if the biased overlap fraction (O_f(q, n) = biased overlap / l_q, where q is the query model, n ∈ N_q is the neighbor in consideration, and l_q is the number of layers in q) between the queried model and its neighbor is above a threshold τ. If the query-neighbor pair meets the constraint, we transfer the weights of the shared part from the corresponding neighbor to the query and fine-tune it. Otherwise, we pre-train the query. We denote the weight transfer operation by W_q ← W_n.
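The following Python sketch shows one way to compute the biased overlap and apply the threshold τ; the layer-configuration tuples, neighbor records, and embedding distances are hypothetical.

```python
def biased_overlap(query_layers, neighbor_layers):
    """Count encoder layers that match from the input onwards; stop at the first
    mismatch and ignore any later matches."""
    count = 0
    for lq, ln in zip(query_layers, neighbor_layers):
        if lq != ln:
            break
        count += 1
    return count

def pick_transfer_source(query_layers, neighbors, distance, tau=0.8):
    """Rank neighbors by (biased overlap, then smaller embedding distance) and return
    the top one if its overlap fraction O_f exceeds tau; otherwise return None (pre-train)."""
    ranked = sorted(neighbors,
                    key=lambda n: (biased_overlap(query_layers, n['layers']),
                                   -distance[n['id']]),
                    reverse=True)
    if ranked:
        best = ranked[0]
        if biased_overlap(query_layers, best['layers']) / len(query_layers) > tau:
            return best
    return None

# Hypothetical query and neighbors; each layer config is a tuple of design choices.
q = [('SA-SDP', 128, 2), ('SA-SDP', 128, 2), ('LT-DFT', 256, 4), ('LT-DFT', 256, 4)]
n1 = {'id': 0, 'layers': q[:3] + [('SA-WMA', 256, 4)]}   # biased overlap 3/4 = 0.75
n2 = {'id': 1, 'layers': [('SA-WMA', 128, 2)] + q[1:]}   # biased overlap 0 (first layer differs)
print(pick_transfer_source(q, [n1, n2], distance={0: 1.2, 1: 0.4}))  # None -> pre-train the query
```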

BOSHNAS
We now describe the BOSHNAS search policy, i.e., box (d) in Figure 1.

Uncertainty Types
To overcome the challenges of an unexplored design space, it is important to consider the uncertainty in model predictions to guide the search process. Predicting model performance deterministically is not enough to estimate the next most probably best-performing model. We therefore leverage upper confidence bound (UCB) exploration on the predicted performance of unexplored models (Russell & Norvig, 2010). Uncertainty in the performance estimate could arise not only from the approximations in the surrogate modeling process but also from parameter initializations and variations in model performance due to different training recipes. These are called epistemic and aleatoric uncertainties, respectively. The former, also called reducible uncertainty, arises from a lack of knowledge or information; the latter, also called irreducible uncertainty, refers to the inherent variation in the system to be modeled.

Surrogate Model
In BOSHNAS, we use Monte-Carlo (MC) dropout (Gal & Ghahramani, 2016) and a Natural Parameter Network (NPN) (Wang et al., 2016) to model the epistemic and aleatoric uncertainties, respectively. The NPN not only helps with a distinct prediction of aleatoric uncertainty that we use for optimizing the training recipe once we are close to the optimal architecture, but also serves as a superior model to Gaussian Processes, Bayesian Neural Networks (BNNs), and other Fully-Connected Neural Networks (FCNNs) (Tuli et al., 2021). Consider the NPN network f_S(x; θ) with a transformer embedding x as input and parameters θ. The output of such a network is the pair (μ, σ) ← f_S(x; θ), where μ is the predicted mean performance and σ is the aleatoric uncertainty. To model the epistemic uncertainty, we use two deep surrogate models: (1) a teacher (g_S) and (2) a student (h_S) network. The teacher network, a surrogate for the performance of a transformer that also takes the embedding x as input, is an FCNN with MC dropout (parameters θ′). To compute the epistemic uncertainty, we generate n samples using g_S(x; θ′). The standard deviation of the sample set is denoted by ξ. To run GOBI (Tuli et al., 2021) and avoid numerical gradients due to their poor performance, we use a student network (an FCNN with parameters θ″) that directly predicts the output ξ ← h_S(x; θ″), a surrogate of ξ (Tuli et al., 2022).
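Below is a minimal PyTorch sketch of these surrogate components; the NPNHead is a simple mean/variance stand-in rather than a full natural parameter network, and the layer sizes, dropout rate, and sample count are assumptions.

```python
import torch
import torch.nn as nn

class NPNHead(nn.Module):
    """Stand-in for the NPN surrogate f_S: predicts mean performance and aleatoric
    variance from a Transformer2vec embedding."""
    def __init__(self, dim=16, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.log_var = nn.Linear(hidden, 1)

    def forward(self, x):
        z = self.body(x)
        return self.mu(z), torch.exp(self.log_var(z))    # (mu, sigma^2)

class Teacher(nn.Module):
    """FCNN with MC dropout (g_S); sampling with dropout active yields epistemic uncertainty."""
    def __init__(self, dim=16, hidden=64, p=0.2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Dropout(p), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def epistemic_uncertainty(teacher, x, n_samples=20):
    teacher.train()                                       # keep dropout active (MC dropout)
    samples = torch.stack([teacher(x) for _ in range(n_samples)])
    return samples.std(dim=0)                             # xi, later regressed by the student h_S

x = torch.randn(4, 16)                                    # four transformer embeddings
mu, var = NPNHead()(x)
xi = epistemic_uncertainty(Teacher(), x)
print(mu.shape, var.shape, xi.shape)
```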

Active Learning and Optimization
For a design space G, we first form an embedding space ∆ by transforming all graphs in G using the Transformer2vec embedding. Assuming we have the three networks f_S, g_S, and h_S for our surrogate model, we use the following UCB estimate:

UCB(x) = μ + k_1 · σ + k_2 · ξ,        (5)

where x ∈ ∆, (μ, σ) = f_S(x; θ), ξ = h_S(x; θ″), and k_1 and k_2 are hyperparameters.
To generate the next transformer to test, we execute GOBI using neural network inversion and the AdaHessian optimizer (Yao et al., 2021), which uses second-order updates to x (∇²_x UCB), up till convergence. From this, we get a new query embedding, x′. We find the nearest transformer architecture based on the Euclidean distance over all transformer architectures in the design space ∆, giving the next closest model x. We fine-tune this model (or pre-train it if there is no nearby trained model with sufficient overlap; see Section 3.3) on the required task to obtain the respective performance. Once we receive the new datapoint, (x, o), we train the models using the following loss functions on the updated corpus δ:

L_NPN = Σ_{(x,o)∈δ} [ (o − μ)²/(2σ²) + (1/2) ln σ² ],   L_Teacher = Σ_{(x,o)∈δ} (g_S(x; θ′) − o)²,   L_Student = Σ_{x∈δ} (h_S(x; θ″) − ξ)²,        (6)

where μ, σ = f_S(x; θ) and we obtain ξ by sampling g_S(x; θ′). The first is the aleatoric loss to train the NPN model (Wang et al., 2016); the other two are squared-error loss functions. We run multiple random cold restarts of GOBI to get multiple queries for the next step in the search process.
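The following sketch shows one GOBI query under these simplifications: the surrogate weights are frozen and the UCB objective is optimized with respect to the input embedding, with plain first-order Adam substituted for AdaHessian for brevity; the toy networks merely stand in for f_S and h_S.

```python
import torch
import torch.nn as nn

def gobi_query(f_s, h_s, x_init, k1=0.5, k2=0.5, steps=200, lr=0.05):
    """One GOBI run (sketch): optimize the UCB objective with respect to the input
    embedding while keeping the surrogate parameters fixed."""
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mu, var = f_s(x)
        ucb = mu + k1 * var.sqrt() + k2 * h_s(x)       # Eq. (5): mu + k1*sigma + k2*xi
        (-ucb.sum()).backward()                         # improve the UCB objective via descent on -UCB
        opt.step()
    return x.detach()

# Toy surrogates standing in for f_S (mean, aleatoric variance) and h_S (epistemic).
mean_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
f_s = lambda x: (mean_net(x), torch.full((x.shape[0], 1), 0.01))
h_s = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
x_query = gobi_query(f_s, h_s, torch.randn(1, 16))
print(x_query.shape)   # the architecture in Delta nearest to x_query is queried next
```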
Figure 3 shows the different surrogate models in the BOSHNAS pipeline (f_S, g_S, and h_S) in the order of flow. As explained in Section 3.4, the NPN network (f_S) models the performance and the aleatoric uncertainty, and the student network (h_S) models the epistemic uncertainty from the teacher network (g_S).
Algorithm 1 summarizes the BOSHNAS workflow. Starting from an initial pre-trained set δ in the first level of the hierarchy, G_1, we run the following steps in a multi-worker compute cluster until convergence. To trade off between exploration and exploitation, we consider two probabilities: uncertainty-based exploration (α) and diversity-based exploration (β). With probability 1 − α − β, we run second-order GOBI using the surrogate model to optimize the UCB in Eq. (5). Adding the converged point (x, o) to δ, we minimize the loss values in Eq. (6) (line 6 in Algorithm 1). We then generate a new query point, transfer weights from a neighboring model, and train it (lines 7-11). With probability α, we sample the search space using the combination of aleatoric and epistemic uncertainties, k_1 · σ + k_2 · ξ, to find a point where the performance estimate is uncertain (line 15). To avoid getting stuck in a localized search subset, we also choose a random point with probability β (line 18). Once we converge in the first level, we continue with the second and third levels, G_2 and G_3, as described in Section 3.2.
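A compact sketch of the per-iteration exploration/exploitation decision follows; the GOBI step is approximated here by an argmax over candidate embeddings, and the toy scoring functions are purely illustrative.

```python
import random

def next_query(candidates, ucb, uncertainty, alpha=0.1, beta=0.1):
    """One BOSHNAS query decision (sketch): with probability 1 - alpha - beta exploit
    the surrogate (GOBI approximated by an argmax over candidates), with probability
    alpha sample where k1*sigma + k2*xi is largest, and with probability beta pick a
    random point for diversity."""
    r = random.random()
    if r < beta:                                     # diversity-based exploration
        return random.choice(candidates)
    if r < alpha + beta:                             # uncertainty-based exploration
        return max(candidates, key=uncertainty)
    return max(candidates, key=ucb)                  # exploitation

random.seed(1)
cands = list(range(100))                             # indices into the embedding space Delta
pick = next_query(cands,
                  ucb=lambda i: -abs(i - 42),        # toy surrogate estimate, best near 42
                  uncertainty=lambda i: abs(i - 90)) # toy uncertainty, largest near 90
print(pick)
```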

Experimental Results
In this section, we show how the FlexiBERT model obtained from BOSHNAS outperforms the baselines.

Setup
For our experiments, we set the number of layers in each stack to s = 2 for the first level of the hierarchy, where models have the same configurations in every stack. In the second level, we use s = 1. Finally, we also make the feed-forward stacks heterogeneous (s = 1*) in the third level (details given in Section 3.2).

Algorithm 1: BOSHNAS. Result: best architecture. Initialize: overlap threshold (τ), convergence criterion, uncertainty sampling probability (α), diversity sampling probability (β), and the surrogate model (f_S, g_S, and h_S) on an initial corpus δ, with design space g ∈ G ⇔ x ∈ ∆; while the convergence criterion is not met, wait until a worker is free and send the next query x to that worker.

For the range of design choices in Table 2 and setting s = 2, we obtain 9312 unique graphs after removing isomorphic graphs. We set the dimension of the Transformer2vec embedding to d = 16 after running a grid search. To do this, we minimize the distance prediction error while keeping d small using knee-point detection. We obtain the hyperparameter values in Algorithm 1 through grid search. We use overlap threshold τ = 80%, α = β = 0.1, and k_1 = k_2 = 0.5 in our experiments. The convergence criterion is met in BOSHNAS when the change in performance is within 10^−4 for five iterations. We give details of the model training process in Appendix B.2.

Pre-training and Fine-tuning Models
We adapt our pre-training recipe from the one used in RoBERTa, proposed by Liu et al. (2019), with slight variations in order to reduce the training budget (details in Appendix B.2).

Ablation Study
We compare BOSHNAS against other popular techniques from the CNN space, namely Random Search (RS), ES, REINFORCE, GP-BO, and a recent state-of-the-art method, BANANAS. We present performance on the GLUE benchmark.
Figure 4 shows the best GLUE scores reached by the respective baseline NAS techniques along with BOSHNAS used with naive (i.e., feature-based one-hot) or Transformer2vec embeddings on a representative design space. We use the space in the first level of the hierarchy (i.e., with 9312 graphs, s = 2) and run all these algorithms in an active-learning scenario (all targeted homogeneous models form a subset of this space) over 50 runs for each algorithm. The plot highlights the fact that enhancing the richness of the design space enables the algorithms to search for more accurate models (6% improvement averaged across all models). We also see that Transformer2vec embeddings help NAS algorithms reach better-performing architectures (9% average improvement). Overall, BOSHNAS with the Transformer2vec embeddings performs the best in this representative design space, outperforming the state-of-the-art (i.e., BANANAS on naive embeddings) by 13%.
Figure 5(a) shows the best GLUE score reached by each baseline NAS algorithm against the number of models it trained. Again, we perform these runs on the representative design space described above, using the Transformer2vec encodings. As observed in the figure, BOSHNAS reaches the best GLUE score. Ablation analysis justifies the need for heteroscedastic modeling and second-order optimization (see Figure 5(b)). The heteroscedastic model forces the optimization of the training recipe when the framework approaches optimal architectural design decisions. Second-order gradients, on the other hand, help the search avoid local optima and saddle points, and also aid faster convergence.

Table 3: Comparison between FlexiBERT and baselines. We evaluate the models on the development set of the GLUE benchmark. We use Matthews correlation for CoLA, Spearman correlation for STS-B, and accuracy for the other tasks. We report MNLI on the matched set. We also include ablation models for BOSHNAS without second-order gradients (w/o S.) and without using the heteroscedastic model (w/o H.). Best (second-best) performance values are in boldface (underlined). *Xu et al. (2021) do not report the performance of NAS-BERT-10 on the WNLI dataset; we obtained it using an equivalent model in our design space. The FlexiBERT-Mini† model only optimizes performance on the first eight tasks for a fair comparison with NAS-BERT.
Table 3 shows the scores of the ablation models on the GLUE benchmarking tasks. We refer to the best model obtained from BOSHNAS in the Tiny-to-Mini space as FlexiBERT-Mini. Once we get the best architecture from the search process (using the same, albeit limited, compute budget for feasible search times), we pre-train and fine-tune it on a larger compute budget (details in Appendix B.2). According to the table, FlexiBERT-Mini outperforms the baseline, NAS-BERT (Xu et al., 2021), by 0.4% on the GLUE benchmark. Since NAS-BERT finds the higher-performing architecture while only considering the first eight GLUE tasks (i.e., without the WNLI dataset), for a fair comparison, we find a neighboring model in the FlexiBERT design space that only optimizes performance on the first eight tasks. We call this model FlexiBERT-Mini†. We see that although FlexiBERT-Mini† does not have the highest GLUE score, it generally outperforms NAS-BERT-10 by significant margins on the first eight tasks.

Table 4: Comparison between BERT-Mini and FlexiBERT-Mini on the SuperGLUE benchmark. For CB, we report macro-average F1. We report accuracy for the other tasks.
Figure 6 demonstrates that FlexiBERT pushes the performance frontier upwards relative to traditional homogeneous architectures. In other words, the best-performing models in the expanded (Tiny-to-Mini) space outperform traditional models for the same number of parameters. Here, the homogeneous models incorporate the same design decisions for all encoder layers, even with the expanded set of operations (i.e., including convolutional and LT-based attention operations). FlexiBERT-Mini has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. FlexiBERT achieves 3% higher performance than the best homogeneous model, while a FlexiBERT model with equivalent performance is 2.6× smaller.
Table 4 shows the performance of FlexiBERT-Mini on SuperGLUE (Wang et al., 2019), which contains more challenging tasks relative to those in the GLUE benchmark. FlexiBERT-Mini outperforms BERT-Mini on the tasks in SuperGLUE. We give details of the selected set of training hyperparameters in Appendix B.2. After running BOSHNAS for each level of the hierarchy, we obtain the respective best-performing models, whose model cards we present in Appendix B.5. From these best-performing models, we can extract the following rules that lead to high-performing transformer architectures:
• Models with DCT in the deeper layers are preferable for higher performance on the GLUE benchmark. Shallower layers prefer the traditional SDP-based attention heads.
• Models with more attention heads, but a smaller hidden dimension, are preferable in the deeper layers. On the other hand, fewer attention heads with higher hidden dimensions are preferable in shallower layers.
• Feed-forward networks with larger widths, but a smaller depth, are preferable in the deeper layers. Shallower layers prefer the opposite, i.e., smaller width and higher depth.
Using these guidelines, we extrapolate the model card for FlexiBERT-Mini to get the design decisions for FlexiBERT-Large, which is an equivalent counterpart of BERT-Large (Devlin et al., 2019). Appendix B.5 presents the approach for extrapolating the hyperparameter choices from FlexiBERT-Mini to obtain FlexiBERT-Large. We train FlexiBERT-Large with the larger compute budget (see Appendix B.2) and show its GLUE score in Table 5. FlexiBERT-Large outperforms the baseline RoBERTa by 0.6% on the entire GLUE benchmarking suite and AutoBERT-Zero Large by 5.7% when only considering the first eight tasks.
Just as FlexiBERT-Large is the BERT-Large-scale counterpart of FlexiBERT-Mini, we similarly form the BERT-Small and BERT-Base equivalents (Turc et al., 2019). Figure 7 presents the performance frontier of these FlexiBERT models against different baseline works.
FlexiBERT consistently outperforms the baselines for different constraints on model size, thanks to its search in a vast, heterogeneous, and flexible design space of architectures.

Conclusion
In this work, we presented FlexiBERT, a suite of heterogeneous and flexible transformer models. We characterized the effects of this expanded design space and proposed a novel Transformer2vec embedding scheme to train a surrogate model that searches the design space for high-performance models. We described a novel NAS algorithm, BOSHNAS, and showed that it outperforms the state-of-the-art by 13%. The FlexiBERT-Mini model searched in this design space has a GLUE score that is 8.9% higher than BERT-Mini's, while requiring 3% fewer parameters. It also outperforms the baseline, NAS-BERT-10, by 0.4%. A FlexiBERT model with performance equivalent to that of the best homogeneous model is 2.6× smaller. FlexiBERT-Large outperforms state-of-the-art models by at least 5.7% average accuracy on the first eight tasks in the GLUE benchmark.
These methods achieve functional improvement in pre-training. Other approaches include techniques such as denoising autoencoders (Lewis et al., 2020).
On the other hand, Khetan and Karnin (2020) consider optimizing the set of architectural design decisions for BERT: the number of encoder layers l, the size of hidden embeddings h, the number of attention heads a, the size of the hidden layer in the feed-forward network f, etc. However, their work is only concerned with pruning BERT and does not target optimization of accuracy over different tasks. Further, it has a limited search space consisting of only homogeneous models.

Appendix B. Experimental Details
We present the details of the experiments performed next.

B.1 Possible Compute Blocks
Based on the design space shown in Table 2, we consider all possible compute blocks, as presented next:
• For layer j, when the operation is SA, we have two or four heads among: h-128/SA-SDP, h-128/SA-WMA, h-256/SA-SDP, and h-256/SA-WMA. If the encoder layer has an LT operation, we have two or four heads among: h-128/LT-DFT, h-128/LT-DCT, h-256/LT-DFT, and h-256/LT-DCT, with the suffix denoting the type of LT operation. For a convolutional (DSC) operation, we have two or four heads among: h-128/DSC-5, h-128/DSC-9, h-256/DSC-5, and h-256/DSC-9, with the final integer referring to the kernel size.
• For layer j, the size of the hidden layer in the feed-forward network is either 512 or 1024. Also, the feed-forward network may either have just one hidden layer or a stack of three layers. At higher levels of the hierarchy in the hierarchical search framework (details in Section 3.2), all the layers in the stack of hidden layers have the same dimension until we relax this constraint in the last leg of the hierarchy. A sketch that enumerates these per-layer choices follows below.
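The sketch below enumerates such per-layer compute blocks with itertools; the listed choice sets paraphrase the description above and Table 2, so the exact counts should be treated as illustrative.

```python
from itertools import product

# Per-layer design choices paraphrased from Table 2 and the list above (illustrative).
HIDDEN = [128, 256]                                   # hidden dimension per layer
N_HEADS = [2, 4]                                      # number of operation heads
OPS = {'SA': ['SDP', 'WMA'], 'LT': ['DFT', 'DCT'], 'DSC': [5, 9]}
FF_HIDDEN = [512, 1024]                               # feed-forward hidden size
FF_STACK = [1, 3]                                     # feed-forward stack depth

def attention_blocks():
    """Enumerate the attention-side compute blocks for a single encoder layer."""
    for h, n, (op, params) in product(HIDDEN, N_HEADS, OPS.items()):
        for p in params:
            yield f'{n} x h-{h}/{op}-{p}'

blocks = list(attention_blocks())
print(len(blocks), blocks[:4])
# Each layer additionally picks a feed-forward hidden size and stack depth:
print(len(blocks) * len(FF_HIDDEN) * len(FF_STACK), 'per-layer configurations')
```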
Once we find the best models, we pre-train and fine-tune the selected models with a larger compute budget. For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning. We also fine-tune on each GLUE task for 10 epochs instead of 5 (further details given below). We executed this extended training process for the FlexiBERT-Mini and FlexiBERT-Large models.

Table 7: Hyperparameters used for fine-tuning FlexiBERT-Mini on the GLUE tasks, along with the size of the training set and the deviation in performance.
Table 6 shows the improvement in performance of FlexiBERT-Mini trained using knowledge transfer (where we transfer the weights from a nearby trained model) after additional training. When compared to the model directly fine-tuned after knowledge transfer, we see only a marginal improvement when we pre-train from scratch. This reaffirms the advantage of knowledge transfer: it reduces training time (see Appendix B.3) with a negligible loss in performance. This is a consequence of a high overlap threshold, i.e., 80%, which trades off a low performance loss against the probability of finding a pre-trained neighbor. Training with a more significant compute budget further improves performance on the GLUE benchmark, validating the importance of data size and diversity in pre-training (Liu et al., 2019). Running a full-fledged BOSHNAS on the larger design space (i.e., with layers from 2 to 24, Tiny-to-Large) would be an easy extension of this work.
While running BOSHNAS, we fine-tune our models on the nine GLUE tasks over five epochs with a batch size of 64, where we implement early stopping. We also run automatic hyperparameter tuning for the fine-tuning process using the Tree-structured Parzen Estimator (TPE) algorithm (Akiba et al., 2019). The learning rate is selected logarithmically at random in the [2 × 10^−5, 5 × 10^−4] range, and the batch size uniformly in {32, 64, 128}. Table 7 (Table 8) shows the best hyperparameters for fine-tuning on each GLUE (SuperGLUE) task selected using this auto-tuning technique. This hyperparameter optimization uses random initialization every time, which results in variation in performance each time the model is queried (see aleatoric uncertainty explained in Section 3.4).
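A sketch of this auto-tuning setup using Optuna's TPE sampler is given below; the objective is a placeholder for the actual fine-tuning run, and the toy scoring surface is an assumption.

```python
import optuna

def fine_tune_and_score(learning_rate: float, batch_size: int) -> float:
    """Placeholder for fine-tuning a candidate on one GLUE task and returning its
    validation score; a toy surface stands in for the real training run."""
    return -((learning_rate - 1e-4) ** 2) * 1e6 - 0.001 * abs(batch_size - 64)

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float('learning_rate', 2e-5, 5e-4, log=True)  # log-uniform range
    bs = trial.suggest_categorical('batch_size', [32, 64, 128])      # uniform choice
    return fine_tune_and_score(lr, bs)

study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)
```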
Since some tasks in the GLUE benchmark are very small, one expects a large deviation in performance as we change the training recipe. Table 7 also shows the deviation in performance on GLUE tasks (as reported in Table 3). We see a large variation for the smaller datasets and only a marginal deviation in performance for large datasets like MNLI, QNLI, and QQP. These deviations correspond to the saliency of aleatoric uncertainty with respect to the training-recipe hyperparameters. Our hyperparameter tuning method chooses the training recipe that results in the highest performance.
We have included baselines trained with the pre-training + fine-tuning procedure proposed by Turc et al. (2019) for like-for-like comparisons, and not the knowledge distillation counterparts (Xu et al., 2021). Nevertheless, FlexiBERT is orthogonal to (and thus can easily be combined with) knowledge distillation because FlexiBERT focuses on searching for the best architecture, while knowledge distillation focuses on better training of a given architecture.
All models were trained on NVIDIA A100 GPUs and 2.6 GHz AMD EPYC Rome processors. The entire process of running BOSHNAS for all levels of the hierarchy took around 300 GPU-days of training.

B.3 Knowledge Transfer
Recent works leverage knowledge transfer. However, these methods are restricted to long short-term memories and simple recurrent neural networks (Mazzawi et al., 2019). Wang et al. (2020) train a super-transformer and share its weights with smaller models. However, this is not feasible for diverse, heterogeneous, and flexible architectures. To the best of our knowledge, we are the first to propose the use of knowledge transfer in transformers, by comparing the computational graphs of nearby models to transfer weights. Furthermore, previous works only consider a static training recipe for all the models in the design space, an assumption we relax in our experiments. We directly fine-tune models for which nearby models are already pre-trained. We test for this using the biased overlap metric defined in Section 3.3. Figure 8 presents the time gains from knowledge transfer when we fine-tune on all GLUE tasks. Since we can directly fine-tune some percentage of models, thanks to their neighboring pre-trained models, we were able to speed up the overall training time by 38%.

B.4 Crossover between Transformer Models
We obtain new transformer models of the subsequent level in the hierarchy by taking a crossover between the best models in the previous level (which had s layers per stack) and their neighbors. We choose the stack configuration of the children from all unique hyperparameter values present in the parent models at the same depth. We show a simple example of this scheme in Figure 9. First, we compute the design space of permissible operation blocks for the layers in a stack by the product of the individual design choices of the parents for that stack. We then independently form these new layers with the new constraint of s/2 layers having the same choice of hyperparameter values. Expanding the design space in such a fashion retains the original hyperparameters that give good performance while also exploring the internal representations learned by combinations of the hyperparameters at the same level.
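The following Python sketch samples one child from such a crossover; the dictionary-based stack configurations and the random per-hyperparameter choice are illustrative simplifications of the scheme described above.

```python
import random

def crossover(parent_a, parent_b):
    """Sample one child from the crossover of two parent models (sketch). Each parent
    is a list of per-stack configurations; for every stack depth, the child picks each
    hyperparameter value from the union of the parents' values at that depth."""
    child = []
    for stack_a, stack_b in zip(parent_a, parent_b):
        child.append({key: random.choice(list({stack_a[key], stack_b[key]}))
                      for key in stack_a})
    return child

# Hypothetical parents with two stacks each (operation, hidden size, feed-forward size).
a = [{'op': 'SA-SDP', 'h': 128, 'f': 512}, {'op': 'LT-DFT', 'h': 256, 'f': 1024}]
b = [{'op': 'SA-WMA', 'h': 128, 'f': 1024}, {'op': 'LT-DFT', 'h': 128, 'f': 512}]
random.seed(0)
print(crossover(a, b))
```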

Figure 2: Block-level computational graph for BERT-Tiny in FlexiBERT. The projection layer implements an identity function since the hidden sizes of the input and output encoder layers are equal.

Figure 4: Bar plot comparing all NAS techniques with (a) naive embeddings and a design space of homogeneous models, (b) naive embeddings and an expanded design space of homogeneous and heterogeneous models, and (c) Transformer2vec (T2v) embeddings with the expanded design space. Plotted with 90% confidence intervals.

Figure 5: Performance results: (a) best GLUE score with trained models for NAS baselines and (b) ablation of BOSHNAS. Plotted with 90% confidence intervals.

Figure 6: Performance frontiers of FlexiBERT on an expanded design space (under the constraints defined in Table 2) and for traditional homogeneous models.

Figure 8: Bar plot showing the average time for training a transformer model (in GPU-hours) with and without knowledge transfer. (a) Pre-train + fine-tune: total training time. (b) Direct fine-tuning: training time for a pre-trained model. (c) Knowledge transfer: training using weight transfer from a trained nearby model gives a 38% speedup. Plotted with 90% confidence intervals.
Figure 9: Example of the crossover scheme between transformer models (see Appendix B.4).

Figure 10: FlexiBERT models obtained after running the BOSHNAS pipeline: (a) FlexiBERT-Mini, and its design choices extrapolated to obtain (b) FlexiBERT-Large.

Table 5: Comparison between FlexiBERT-Large (outside of the constraints defined in Table 2) and baselines on GLUE score. The GLUE* scores reported do not consider the WNLI dataset.

Table 6: Performance of FlexiBERT-Mini from BOSHNAS after knowledge transfer from a nearby trained model, and after pre-training from scratch along with a larger compute budget.

Table 8: Hyperparameters used for fine-tuning FlexiBERT-Mini on the SuperGLUE tasks.