Graphmax for Text Generation

In text generation, a large language model (LM) chooses each new word based only on its preceding context, using the softmax function. However, the co-occurrence statistics of words in a scene-specific corpus are also valuable for choosing the next word, as they help keep the topic of the generated text aligned with the current task. To fully exploit this co-occurrence information, we propose a graphmax function for task-specific text generation. Using graph-based regularization, graphmax enables the final word choice to be determined by both the global knowledge from the LM and the local knowledge from the scene-specific corpus. The traditional softmax function is regularized with a graph total variation (GTV) term, which incorporates the local knowledge into the LM and encourages the model to consider the statistical relationships between words in a scene-specific corpus. The proposed graphmax is versatile and can be readily plugged into any large pre-trained LM for text generation and machine translation. Through extensive experiments, we demonstrate that the new GTV-based regularization improves performance on various natural language processing tasks in comparison with existing methods. Moreover, through human experiments, we observe that participants can easily distinguish the text generated by graphmax from that generated by softmax.


Introduction
The softmax operator is a fundamental component of text generation models, particularly in deep neural language models (Sutskever et al., 2011; Radford et al., 2019). In these models, a context embedding $z$ is generated by the hidden layers of a deep neural network, such as a recurrent neural network (Sutskever et al., 2011) or a transformer (Radford et al., 2019). The output layer, which is typically a fully connected linear layer, is connected to the softmax function to compute a probability distribution for the next word choice.
The softmax is a continuous mapping that transforms an $n$-dimensional input vector onto an $(n-1)$-simplex. The commonly adopted setting of softmax for text generation imposes a strong hypothesis that the probability of the next word is globally continuous over all the possible words in a dictionary. In linguistics, the arrangement of words and phrases needs to comply with syntax or popular language rules to create fluent and meaningful sentences, which should match habitual human ways of expression. For example, when the context is a smartphone review on an e-commerce site, people prefer internet-style words to a formal style. In other words, the best selection of the next word can only be achieved by incorporating the human expression habits reflected by the current corpus. We refer to the human preference in word choices as local fluency information. However, the traditional softmax function fails to exploit such fluency knowledge.
The total variation (TV) performs well in modeling local spatial information and has been successfully applied to image processing tasks (Lellmann et al., 2009; Chambolle et al., 2010; Jia et al., 2021). A naive way of computing TV over 2D signal images is given by

$$\mathrm{TV}(x) = \sum_{i,j} \sqrt{|x_{i+1,j} - x_{i,j}|^2 + |x_{i,j+1} - x_{i,j}|^2},$$

where $x_{i,j}$ is a pixel value in a 2D image, and the subindices $i, j$ are the coordinates of this pixel (Rudin, Osher, & Fatemi, 1992). In image processing, it is shown to produce piecewise smoothness regularization within a bounded $\mathbb{R}^2$ space.
We extend the total variation to natural language processing (NLP) tasks to explore the concurrent words (or fluency) information. In particular, we define a graphical total variation for the 1D sequential text data, which can quantify the local word choice variation based on a scene-specific corpus. As shown in Figure 1, the language rules and special expression preferences for word order choices in the corpus can be statistically quantified by a weighted directed word concurrent graph $G = (V, A, C)$, where $V = \{v_i\}_{i=1}^{N}$ is a set of nodes (words), $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix, and $C$ is the corpus from which the graph is constructed. Suppose that a mapping $F : C \to G$ counts all the 2-gram tuples $(w_t, w_{t+1})$ of the sentences in $C$ and then maps them to the directed edges of graph $G$ as shown in Figure 1, where $w_i$ is a word in the dictionary $D$, corresponding to the vertex $v_i$ of $G$. In practice, $A_{i,j}$ is calculated by the mapping $F$ based on the frequency with which the corresponding 2-gram tuples $(w_t, w_{t+1})$ appear in all the sentences of the corpus $C$. Let $x$ denote a graph signal over $G = (V, A, C)$; we can then calculate the graph-shifted text signal as $Ax$ (Chen et al., 2014). We define the graph total variation (GTV) for the text data as $\|x - Ax\|_2^2$. In text generation, the graph signal $x$ can be a probability distribution for the choice of the next word.
As discussed earlier, the traditional softmax function links the dense output of a deep neural language model (LM) to a probability simplex. It is globally continuous but fails to incorporate the local task-specific preferences into an LM explicitly. As a remedy, we propose a GTV-regularized softmax function, called graphmax, which can be incorporated into any global pre-trained LM to improve the local fluency of the generated sentences. For example, graphmax can be integrated into GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020) to obtain a scene-based text generation model. The contributions of our work are three-fold:
1. We utilize GTV to capture the local style of scene-specific text data.
2. We propose an n-gram mixture language model, graphmax, which can be plugged into a global pre-trained LM for a scene-specific task.
3. We evaluate the performance of graphmax on text generation and machine translation tasks. The experimental results demonstrate that graphmax outperforms the traditional softmax in scene-specific text generation and machine translation.
[Figure 1. Left panel (Corpus): "I try to use the method suggested in his paper."; "I try to learn the method suggested by him."; "I will use the method suggested by him." Right panel: the weighted directed word graph built from these sentences.]
Figure 1: Construction of a weighted directed graph $G = (V, A, C)$ (right panel) from a corpus (left panel), where $V = \{v_i\}_{i=1}^{N}$ is a set of nodes corresponding to the words in the dictionary of the corpus, $A$ is an $N \times N$ weighted adjacency matrix, and $C$ is the corpus that graph $G$ is built upon. The directed edges denote the 2-grams of the corpus, where the weights are calculated based on the frequency of 2-grams in the corpus. A general n-gram model can be easily modeled by an n-order adjacency matrix $A^n$.
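The mapping from a corpus to the weighted directed graph can be illustrated with a short sketch. The code below counts 2-grams in the three example sentences of Figure 1 and stores the counts in an adjacency matrix. The function name `build_bigram_graph`, the whitespace tokenizer, and the alphabetical vocabulary order are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

import numpy as np

# Toy corpus: the three example sentences from the left panel of Figure 1.
corpus = [
    "I try to use the method suggested in his paper",
    "I try to learn the method suggested by him",
    "I will use the method suggested by him",
]


def build_bigram_graph(sentences):
    """Return (vocab, A, index), where A[i, j] counts how often word j follows word i."""
    tokenized = [s.split() for s in sentences]  # whitespace tokenizer (assumption)
    vocab = sorted({w for sent in tokenized for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    # Count every adjacent pair (w_t, w_{t+1}) in every sentence.
    counts = Counter(
        (sent[t], sent[t + 1]) for sent in tokenized for t in range(len(sent) - 1)
    )
    A = np.zeros((len(vocab), len(vocab)))
    for (w_prev, w_next), c in counts.items():
        A[index[w_prev], index[w_next]] = c  # directed edge w_prev -> w_next
    return vocab, A, index


vocab, A, index = build_bigram_graph(corpus)
print(A[index["the"], index["method"]])  # weight of the edge "the" -> "method"
```

On a real corpus the matrix is very sparse, so a sparse representation would likely be preferable to the dense array used here.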
The rest of the paper is organized as follows. In Section 2, we review the literature relevant to our method and present the architecture of graphmax for text generation. In Section 3, we propose to optimize the proposed model with the projected gradient descent, and Section 4 elaborates on the detailed algorithm and theoretical properties of graphmax. The experimental results are presented in Section 5, and Section 6 concludes with some remarks.

Methodology
In this section, we present the fundamental framework of graphmax. To establish a solid foundation, we begin with a literature review on the topic of softmax.

Literature Review on Softmax
The softmax function is fundamentally important for many applications in machine learning. It produces a probability distribution across multiple categories, enabling models' prediction capability, e.g., recognizing a digital handwritten image or generating the next word in a text sequence. However, the traditional softmax has limitations when applied to high-dimensional scenarios, such as text generation (Yang et al., 2018). To address this issue, Yang et al. (2018) proposed a mixture softmax function that improves the expressiveness of the softmax operator by replacing the single layer in the traditional softmax with an ensemble layer that has more parameters. In language modeling, however, a word can have multiple meanings depending on the context, which means that a single word may lead to multi-sense embeddings. Miao et al. (2019) proposed a kernelized Bayesian softmax function that enhances the expressiveness of the softmax operator by incorporating a multi-sense kernel function. This approach provides a more flexible and expressive framework for modeling word senses and contexts in language modeling tasks. Gao and Pavel (2017) provided a comprehensive summary and analysis of the properties of the softmax function using convex analysis and monotone operator theory. They demonstrated that the softmax function is a monotone gradient map of the log-sum-exp function. In text generation, the softmax function computes a probability distribution over the words in a dictionary and produces a high-dimensional sparse vector. Martins and Astudillo (2016) proposed a sparse softmax function that returns a sparse posterior distribution, where the loss function of the sparse softmax is analogous to the logistic loss. However, the existing works fail to consider the relationships among all components of the sparse vector. Overall, Gao and Pavel (2017) provided a valuable theoretical foundation for understanding the properties of the softmax function, while Martins and Astudillo (2016) offered a practical solution for handling the sparsity issue associated with softmax in text generation tasks.
To model the concurrent relationships among words, we propose a GTV-regularized softmax function where GTV characterizes the corpus-specific knowledge and improves the local fluency in text generation. In contrast, the total variation (TV), proposed by Rudin et al. (1992), is a spatial regularity that penalizes signals with excessive and possibly spurious local detail, which has been widely applied in computer vision and signal processing (Lellmann et al., 2009; Chambolle et al., 2010; Jia et al., 2021; Lellmann & Schnörr, 2011). For example, Lellmann et al. (2009) and Lellmann and Schnörr (2011) applied TV to multiclass image labeling, and Jia et al. (2021) proposed a 2D TV-regularized U-net for image segmentation.
The use of graphical techniques for signal processing has become increasingly popular in dealing with signals that have irregular structures, such as biological mechanisms, social networks, and citation networks. Recent studies by Chen et al. (2014) and Chen et al. (2015) have extended graphical signal recovery techniques to matrix completion (Liu et al., 2016) and semi-supervised learning (Lv et al., 2022), where TV over a graph is imposed on the object of matrix completion. Through investigation of the relevance of TV and graph energy in graph signal classification, Ahmed, Dare, and Boudraa (2017) found that TV is a compact and informative attribute for efficient graph discrimination while graph energy aims to quantify the complexity of the graph structure. These findings have important implications for understanding the relationship between signal processing and the underlying graph structure. In a related study, Raguet and Landrieu (2018) extended the cut-pursuit algorithm to the GTV regularization of functions with a separable nondifferentiable part.
Many existing works on GTV are focused on undirected graphs, while there has been increasing interest in directed graphs (Shi et al., 2019). In the context of text generation, the relationships between concurrent words are directed and weighted, which has motivated researchers to derive GTV over a weighted directed graph.

Conditional Text Generation
Deep LMs have been shown to be promising for conditional natural language generation with users' additional input. In controllable text generation tasks with different targets, different types of conditional constraints are often required, such as an image in image captioning (Anderson et al., 2017) or a style embedding vector in task-specific text generation (Ficler & Goldberg, 2017).
Traditional natural language generation methods rely on a large corpus, which is typically expensive to train. To address this issue, researchers have explored a new paradigm of pre-trained LMs, such as BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and BART (Lewis et al., 2020; Koto et al., 2020; Chen et al., 2022; Vougiouklis et al., 2020). By incorporating additional domain-specific codes (Keskar et al., 2019), such as sentiment labels (Dathathri et al., 2020) or attribute vectors (Yu, Yu, & Sagae, 2021), the goal is to modify the pre-trained LM with little fine-tuning cost. For example, the plug-and-play language model (PPLM) (Dathathri et al., 2020) builds a user-specified bag-of-words classifier on top of GPT-2 to increase the likelihood of the target attribute. Koncel-Kedziorski et al. (2019) proposed text generation from graph-based constraints, but their approach only feeds the knowledge graph embedding into a transformer-based model for text generation, which is an implicit way of using the graph. In contrast, our proposal explicitly regularizes the process of text generation by incorporating GTV over a weighted directed graph.

The Softmax Function
Given a context word sequence $w_{1:t-1} = (w_1, \dots, w_{t-1})$, at each decoder time step, the feature map of $(w_1, \dots, w_{t-1})$ is denoted as $z = \phi(w_{1:t-1})$, where $\phi(\cdot)$ is a well-trained large-scale LM, $z = (z_1, \dots, z_N) \in \mathbb{R}^N$, and $N$ is the cardinality of a dictionary. Typically, we use the softmax function to link $z$ with the prediction of the next word $w_t$. From the perspective of optimization, the softmax function corresponds to the minimizer of the following problem,

$$\min_{x}\; -z^\top x + \sum_{i=1}^{N} x_i \log x_i \quad \text{s.t.} \quad \mathbf{1}^\top x = 1, \tag{1}$$

where $x \in \mathbb{R}^N$ is a graph signal (Chen et al., 2014), and $\mathbf{1}$ is an $N$-vector of all ones. The global minimizer of (1) is denoted by $x^*$, whose $i$-th component is

$$x_i^* = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}, \tag{2}$$

where $z_i$ is the $i$-th component of $z$.
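As a brief check of the correspondence between the optimization problem and the softmax, the minimizer can be derived with a single Lagrange multiplier. The sketch below assumes the objective is the negative inner product with $z$ plus the negative-entropy term, which is the form implied by the gradient parts $-z$ and $\log x_t + 1$ discussed later in the paper; the multiplier $\nu$ is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{L}(x,\nu) &= -z^\top x + \sum_{i=1}^{N} x_i \log x_i + \nu\,(\mathbf{1}^\top x - 1),\\
\frac{\partial \mathcal{L}}{\partial x_i} &= -z_i + \log x_i + 1 + \nu = 0
\;\Longrightarrow\; x_i = \exp(z_i - 1 - \nu),\\
\mathbf{1}^\top x = 1 &\;\Longrightarrow\; \exp(-1-\nu) = \frac{1}{\sum_{j=1}^{N} \exp(z_j)}
\;\Longrightarrow\; x_i^* = \frac{\exp(z_i)}{\sum_{j=1}^{N}\exp(z_j)}.
\end{aligned}
```

Because every $x_i^*$ is strictly positive, the non-negativity constraint is inactive and does not need its own multiplier in this derivation.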

Graphmax: Graph Regularized Softmax
In text generation, an LM predicts the next word based on the history word sequence with a softmax function. Suppose the total number of words in a dictionary is $N$. An $N$-dimensional context representation vector $z$ can be taken as the input for the softmax mapping as shown in Equation (2). By constructing a directed graph $G = (V, A, C)$ of the words in the dictionary based on a large human corpus, we can take this graph $G$ as an elementary filtering operation that replaces a signal coefficient for each word with a weighted linear combination of coefficients at its neighboring nodes of $G$, where $V = \{v_i\}_{i=1}^{N}$ is the node set, $A$ is an $N \times N$ weighted adjacency matrix, and $C$ is the corpus that graph $G$ is built upon.
The smoothness of signals over $G$ can be quantified by the graph total variation (GTV),

$$\mathrm{GTV}(x) = \Big\| x - \frac{1}{\mu_{\max}} A x \Big\|_2^2,$$

where $\mu_{\max}$ denotes the largest eigenvalue of the adjacency matrix $A$. In practice, we can normalize the adjacency matrix $A$ to make its maximum eigenvalue $\mu_{\max} = 1$ to simplify the computation. Specifically, for a node $v_i$, its outgoing edges have weights $1/\deg(v_i)$, where $\deg(v_i)$ is the out-degree of $v_i$ (the number of outgoing edges). Via matrix normalization, we set $\tilde{A} = D^{-1} A$ (Schlichtkrull et al., 2018; Shi et al., 2019), where $D$ is a diagonal matrix with the $i$-th diagonal element $D_{ii} = \sum_j A_{ij} + \epsilon$ and $\epsilon$ is a small number to avoid division by zero. The normalization of the adjacency matrix ensures that the largest eigenvalue $\mu_{\max} = 1$. By imposing GTV on the optimization problem in Equation (1), we obtain a graph-regularized softmax,

$$\min_{x}\; -z^\top x + \sum_{i=1}^{N} x_i \log x_i + \lambda \|x - \tilde{A} x\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top x = 1, \tag{3}$$

where $\tilde{A}$ is a normalized adjacency matrix of the directed graph and $\lambda > 0$ is a tuning parameter. We name the solution to (3) graphmax. The GTV regularizer in (3) forces $x$ to be close to $\tilde{A}x$, where $\tilde{A}$ is the (normalized) adjacency matrix of the directed graph that contains local knowledge from the scene-specific corpus. Note that $\tilde{A}$ acts like a projection matrix that projects $x$ onto the space of $\tilde{A}$. If $\tilde{A} = I$, an identity matrix, the regularizer would vanish (equal to zero), which indicates that the components of $x$ are all independent so that no local information on scene-specific text is utilized. By forcing $x$ to be close to $\tilde{A}x$, the local scene-specific knowledge is incorporated in predicting the choice probabilities for the next word.
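The normalization and the GTV penalty can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `row_normalize` and `gtv` are hypothetical, the value of `eps` is arbitrary, and the penalty is taken to be the squared Euclidean norm of $x - \tilde{A}x$, which is the form implied by the gradient term $2\lambda(I - \tilde{A})^\top(I - \tilde{A})x$ used in the paper.

```python
import numpy as np


def row_normalize(A, eps=1e-8):
    """Compute A_tilde = D^{-1} A with D_ii = sum_j A_ij + eps."""
    d = A.sum(axis=1) + eps  # out-weight of each node, kept away from zero
    return A / d[:, None]


def gtv(x, A_tilde):
    """Squared-norm graph total variation ||x - A_tilde x||_2^2 (assumed form)."""
    r = x - A_tilde @ x
    return float(r @ r)


# Tiny example with two words.
A = np.array([[0.0, 2.0],
              [1.0, 1.0]])
A_tilde = row_normalize(A)
print(gtv(np.array([0.5, 0.5]), A_tilde))  # a constant signal is (almost) unpenalized
print(gtv(np.array([1.0, 0.0]), A_tilde))  # a peaked signal is penalized
```

With the identity matrix in place of `A_tilde`, `gtv` returns exactly zero for any signal, matching the remark that the regularizer vanishes when $\tilde{A} = I$.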

Optimization
It is difficult to solve Equation (3) with traditional methods, such as the gradient descent or Lagrangian method. Instead, we propose to optimize it with the projected gradient descent detailed as follows.

Projected Gradient Descent
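In the optimization problem of Equation (3), the feasible set $\Delta = \{x \mid \mathbf{1}^\top x = 1,\ 0 \le x_i \le 1\}$ is a probability simplex, which is a convex set. The graph-regularized objective function is

$$f(x) = -z^\top x + \sum_{i=1}^{N} x_i \log x_i + \lambda \|x - \tilde{A} x\|_2^2,$$

which can be shown to be strictly convex as follows. We examine the second-order condition of $f(x)$,

$$\nabla^2 f(x) = \mathrm{diag}\big(1/x_1, \dots, 1/x_N\big) + 2\lambda (I - \tilde{A})^\top (I - \tilde{A}). \tag{4}$$

Because $I$ is an identity matrix and $\tilde{A}$ is a normalized adjacency matrix of the directed graph $G$, $(I - \tilde{A})$ is non-singular. That is, $(I - \tilde{A})x = 0$ holds only if $x = 0$. Hence, the matrix $(I - \tilde{A})^\top (I - \tilde{A})$ is positive definite. Since the diagonal term $\mathrm{diag}(1/x_i)$ is also positive definite for $x_i > 0$, the Hessian matrix in (4) is positive definite as $\lambda > 0$.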
The optimization problem in Equation (3) can be solved by the projected gradient descent algorithm (Duchi et al., 2008; Jain & Kar, 2017), which involves two steps: a gradient descent step and a subsequent projection of the updated point onto the probability simplex. Specifically, the two steps are given by

$$a_{t+1} = x_t - \alpha \nabla f(x_t), \qquad x_{t+1} = \Pi_{\Delta}(a_{t+1}), \tag{5}$$

where $\alpha$ is a learning rate, $a_{t+1}$ is the intermediate point obtained after the gradient step at iteration $t + 1$, and $\Pi_{\Delta}(\cdot)$ is the projection operator onto $\Delta$.

Projection onto Probability Simplex
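In the case of $\Delta = \{x \mid \mathbf{1}^\top x = 1,\ 0 \le x_i \le 1\}$, the projection operation $\Pi_{\Delta}(\cdot)$ can be reformulated as an optimization problem,

$$\min_{x}\; \tfrac{1}{2}\|x - a\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top x = 1,\ x \ge 0, \tag{6}$$

which can be solved by the Lagrangian method (Duchi et al., 2008). By applying the standard KKT conditions for the Lagrangian of Equation (6), the $i$-th component of the optimal $x^*$ can be obtained as

$$x_i^* = \max\{a_i + \gamma,\ 0\}, \tag{7}$$

where $\gamma$ can be computed from a vector of temporary variables $b$,

$$\gamma = \frac{1}{\rho}\Big(1 - \sum_{j=1}^{\rho} b_j\Big), \qquad \rho = \max\Big\{1 \le j \le N : b_j + \frac{1}{j}\Big(1 - \sum_{k=1}^{j} b_k\Big) > 0\Big\}.$$

The vector $b = (b_1, \dots, b_N)$ can be easily obtained by applying a sort function $\mathrm{sort}(\cdot)$ to the vector $a$, so that $b_1 \ge \dots \ge b_N$.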
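The sort-based projection onto the probability simplex can be sketched as follows. The code follows the standard procedure of Duchi et al. (2008); the function name `project_to_simplex` and the sign convention $x_i = \max(a_i + \gamma, 0)$ are choices made here for illustration and may differ from the paper's own notation.

```python
import numpy as np


def project_to_simplex(a):
    """Euclidean projection of a 1-D array onto {x : sum(x) = 1, x >= 0}."""
    b = np.sort(a)[::-1]                    # components in descending order
    css = np.cumsum(b)                      # running sums of the sorted values
    j = np.arange(1, a.size + 1)
    cond = b + (1.0 - css) / j > 0          # candidate support sizes
    rho = j[cond][-1]                       # largest j satisfying the condition
    gamma = (1.0 - css[rho - 1]) / rho      # common shift applied to all components
    return np.maximum(a + gamma, 0.0)


print(project_to_simplex(np.array([2.0, 0.0])))
print(project_to_simplex(np.array([0.5, 0.5, 0.5])))
```

Sorting dominates the cost, so a single projection takes $O(N \log N)$ time for a vector of length $N$.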

Computation and Theoretical Properties
In this section, we will outline the procedure of the graphmax algorithm and subsequently delve into a comprehensive analysis of its theoretical properties.

Computational Complexity
Algorithm 1 outlines the procedure for the graphmax algorithm. In particular, step 3 computes the current gradient using gradient descent, and steps 4-7 correspond to the gradient projection $\Pi_{\Delta}(\cdot)$. The time complexity of the gradient projection onto the probability simplex is dominated by the cost of sorting the components of vector $a$, which is $O(N \log N)$. The maximum number of iterations is $T$, and thus the complexity of Algorithm 1 is $O(T N \log N)$. In practice, the value of $T$ is typically small. In our experiments, we set $T = 20$ and both the learning rate and the convergence threshold as $10^{-4}$, which are reasonable choices for most applications.
Algorithm 1
Input: Learning rate $\alpha$, maximum number of iteration steps $T$
Output: A near-optimal minimizer $x^*$ of the graphmax.
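The iteration described for Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the softmax initialization, the clipping constant `tiny` that guards against log(0), and the squared-norm form of the GTV penalty (inferred from the gradient term 2λ(I − Ã)ᵀ(I − Ã)x) are all assumptions.

```python
import numpy as np


def project_to_simplex(a):
    """Sort-based Euclidean projection onto the probability simplex."""
    b = np.sort(a)[::-1]
    css = np.cumsum(b)
    j = np.arange(1, a.size + 1)
    rho = j[b + (1.0 - css) / j > 0][-1]
    gamma = (1.0 - css[rho - 1]) / rho
    return np.maximum(a + gamma, 0.0)


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def graphmax(z, A_tilde, lam=1.0, alpha=1e-4, T=20, tol=1e-4, tiny=1e-12):
    """Projected gradient descent on -z.x + sum(x log x) + lam * ||x - A_tilde x||^2."""
    M = np.eye(z.size) - A_tilde
    x = softmax(z)  # starting point (assumption; the paper's choice is not shown)
    for _ in range(T):
        # Gradient: LM part, entropy part, and graph part.
        grad = -z + np.log(np.maximum(x, tiny)) + 1.0 + 2.0 * lam * (M.T @ (M @ x))
        x_new = project_to_simplex(x - alpha * grad)
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    return x


z = np.array([2.0, 1.0, 0.1])
A_tilde = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])
print(graphmax(z, A_tilde, lam=1.0, alpha=0.05, T=200, tol=1e-10))
```

With `lam=0` the softmax starting point is already stationary, so the function returns the ordinary softmax; this gives a quick sanity check of the gradient and the projection. Each iteration here also performs dense matrix-vector products with $I - \tilde{A}$, which a sparse representation would make cheaper for large vocabularies.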

How graphmax Works
Step 3 of Algorithm 1 serves as an interface between the pre-trained language model and the word concurrent graph. In this step, we calculate the gradient $\nabla f(x_t)$ of the objective function in Equation (3),

$$\nabla f(x_t) = -z + \log x_t + \mathbf{1} + 2\lambda (I - \tilde{A})^\top (I - \tilde{A})\, x_t. \tag{8}$$

The gradient consists of three distinct parts. The first part, $-z$, is the decoder output of a pre-trained LM, such as GPT-2. As explained in Section 3.1, we have $z = \phi(w_{1:t-1})$, where $\phi(\cdot)$ is the well-trained large-scale LM, so that $z$ carries the global linguistic knowledge from the pre-trained LM. It is important to note that the traditional softmax only receives $z$ as the input and produces a probability distribution as indicated by Equation (2). The second part in (8), $\log x_t + \mathbf{1}$, depends solely on $x_t$. The third part in (8), $2\lambda (I - \tilde{A})^\top (I - \tilde{A}) x_t$, is derived from the word concurrent graph, which is constructed from a scene-specific corpus (as depicted in Figure 1). This part represents the local knowledge relevant to the target task. The local knowledge is modeled as a second-order polynomial of the normalized adjacency matrix, denoted by $\tilde{A}$.
It is worth noting that $\tilde{A}$ and $\tilde{A}^2$ correspond to 2-gram and 3-gram models, respectively. Therefore, the graphmax with $(I - \tilde{A})^\top (I - \tilde{A})$ can be viewed as a (2,3)-gram mixture model with a penalty of $2\lambda$. Equation (8) demonstrates that the local knowledge and the global pre-trained LM operate in a plug-in mode. This property allows the local knowledge to be easily integrated into any pre-trained LM using the proposed graphmax. Unlike traditional conditional natural language generation models that modify $z$ with additional inputs, graphmax does not alter $z$. Therefore, there is no need to fine-tune the pre-trained model. To perform task-specific text generation, we can simply attach the graphmax module with local knowledge to a global pre-trained model.
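To make the "second-order polynomial" reading concrete, the quadratic form in the graph part of the gradient can be expanded term by term. This is a plain algebraic identity rather than an additional result from the paper. Note that the second-order term is the product $\tilde{A}^\top \tilde{A}$, which coincides with $\tilde{A}^2$ only when $\tilde{A}$ is symmetric.

```latex
(I-\tilde A)^\top (I-\tilde A) = I - \tilde A - \tilde A^\top + \tilde A^\top \tilde A,
\qquad
2\lambda (I-\tilde A)^\top (I-\tilde A)\,x_t
= 2\lambda\bigl(x_t - \tilde A x_t - \tilde A^\top x_t + \tilde A^\top \tilde A\, x_t\bigr).
```

The first-order terms involve single transitions between adjacent words, while the product term chains two transitions, which is the sense in which the regularizer mixes first-order and second-order graph information.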

Theoretical Properties
For the softmax defined in Equation ( 2), we include the following proposition for completeness and comparison.
The proof can be found in Martins and Astudillo (2016). The proposed graphmax possesses a similar property.
Proposition 2. For a feature map vector $z = (z_1, \dots, z_N) \in \mathbb{R}^N$, if $z_i \le z_j$, then the following inequalities hold, where $\mathrm{graphmax}_i(z)$ denotes the $i$-th component of $\mathrm{graphmax}(z)$.
The proof of Proposition 2 is provided in Appendix A.2. This proposition shows that the graphmax function shares similar properties with the softmax function. As is well known, the softmax function maps a vector of real numbers to a probability distribution over those numbers in a consistent manner, such that larger numbers correspond to larger probabilities. By comparing the two propositions, we observe that not only does the graphmax function maintain this consistency but it also has an upper bound.

Experiments
Experimental Settings. All models are implemented using PyTorch 1.7 on an Intel(R) Xeon(R) CPU E5-2680 v4 2.40GHz, Tesla K80 GPU, and 128G memory, based on the Ubuntu 16.04 platform. The parameters in Algorithm 1 are specified as follows: the maximum number of iteration steps for the graphmax algorithm is set as $T = 20$, the learning rate $\alpha$ is $10^{-4}$, and the convergence threshold is $10^{-4}$. Specifically, the iteration process stops when the $\ell_2$ norm of the difference between consecutive iterations, $\|x_{t+1} - x_t\|_2$, is less than $10^{-4}$. The reported results are averaged over 100 independent runs.
Generally, we set the baselines at their best configurations as reported in the respective papers. For example, we set PPLM with the sentiment label, KL-scale 0.1, GM-scale 0.95, and step size 0.05. We assign CTRL a review domain control code in the corresponding experiments.
Evaluation Metrics. We employ a hybrid approach for performance evaluation, combining both human and automatic metrics. Specifically, we utilize two well-known automatic metrics, BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002) and perplexity (PPL), for assessing the quality of text generation and machine translation. BLEU counts the matches of n-grams in the generated text with n-grams in the reference text. For instance, a 1-gram comparison would consider each individual token, whereas a 2-gram comparison would assess every pair of words. We employ BLEU-2, BLEU-3, BLEU-4, and BLEU-5 scores to measure the {2, 3, 4, 5}-gram matching between the generated text and the reference text, respectively.
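The two automatic metrics can be illustrated with a simplified sketch. The code below computes the clipped n-gram precision that underlies BLEU and the perplexity of a sequence from per-token probabilities. It is not a full BLEU implementation (there is no brevity penalty and no geometric averaging over n-gram orders), and the function names and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference (counts clipped)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / total


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


candidate = "the food is very good".split()
reference = "the food was very good".split()
print(modified_precision(candidate, reference, 1))  # 4 of 5 unigrams match
print(modified_precision(candidate, reference, 2))  # 2 of 4 bigrams match
print(perplexity([0.25, 0.25, 0.25, 0.25]))         # uniform over 4 choices
```

Higher n-gram precision indicates closer phrase-level agreement with the reference, while lower perplexity indicates that the model assigns higher probability to the observed tokens.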
Another important metric is sentence fluency, which reflects the naturalness of human expressions. However, assessing the fluency of a natural language generation system is challenging (Kann, Rothe, & Filippova, 2018). Therefore, we use human testers to evaluate the fluency of the generated text as a supplement.
Graph Construction. We conceptualize a corpus as a directed graph, wherein the vertices represent individual words, and the edges denote the sequential relationship between two adjacent words in a sentence. The direction of each edge is determined by the word link of a 2-gram. We then tally the frequency of the 2-grams appearing in our corpus as the weight of the corresponding edge. This graphical representation intuitively captures the natural flow of expression.
To construct these graphs, we utilize the datasets employed in our experimental analysis. For instance, we construct the word concurrent graph by leveraging datasets from food/restaurant reviews or e-commerce reviews sourced from Yelp. These datasets exhibit distinct characteristics that diverge from the typical structure of general text. Notably, the online review language tends to incorporate more colloquial phrases, shorter sentences, and internet slang. These unique features of review text present a convenient opportunity to compare the local fluency of our proposed model with that of other approaches.
The concurrent graph is constructed based on the statistics of 2-grams, which corresponds to a 2-gram model. However, this construction can be easily extended to an n-gram model by utilizing a higher-order adjacency matrix, denoted as $\tilde{A}^{n-1}$. As discussed in Section 4, the proposed graphmax acts as a {2, 3}-gram mixture model when the input is a first-order normalized adjacency matrix $\tilde{A}$. For an n-order input matrix $\tilde{A}^{n}$, graphmax models an $\{n + 1, n + 2, \dots, 2n + 1\}$-gram mixture. In practice, we can even augment the input matrix by replacing $\tilde{A}$ with $\tilde{A} + \tilde{A}^2 + \cdots + \tilde{A}^n$, which yields a more complex n-gram mixture model. This approach further captures higher-order word dependencies and enhances the ability to model complex language structures.
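The augmented input matrix can be computed by accumulating successive matrix powers. The sketch below is illustrative only: the function name `augmented_adjacency` is hypothetical, and dense NumPy arrays are used for brevity even though a vocabulary-sized matrix would normally be stored in sparse form.

```python
import numpy as np


def augmented_adjacency(A_tilde, n):
    """Return A_tilde + A_tilde^2 + ... + A_tilde^n."""
    power = np.eye(A_tilde.shape[0])
    total = np.zeros_like(A_tilde, dtype=float)
    for _ in range(n):
        power = power @ A_tilde  # next power of the adjacency matrix
        total += power
    return total


# Three words arranged in a cycle: 0 -> 1 -> 2 -> 0.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(augmented_adjacency(P, 2))  # one-step plus two-step transitions
```

The rows of the summed matrix no longer add up to one, so a further row normalization may be needed before it is used in place of $\tilde{A}$; whether the paper renormalizes at this point is not stated in the text.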

Convergence of graphmax
The core part of our model is the projected gradient descent, which is used to calculate the GTV-regularized softmax. Unlike the traditional softmax, which has a closed-form solution, the proposed graphmax needs to be computed iteratively. Fortunately, the computational cost of this iterative procedure is still affordable. Taking the largest dataset Amazon as an example, the corresponding directed graph contains a total of $N = 50527$ words, and the adjacency matrix $\tilde{A}$ is of size $50527 \times 50527$. Figure 2 shows that the graphmax converges in 20 iteration steps, where the X-axis represents the iterative steps $t$ and the Y-axis is the $\ell_2$ gap between two adjacent iterations $\|x_{t+1} - x_t\|_2$. Table 3 presents a comparison of the average time required for inferring the next word in five experimental scenarios, between LLAMA2 and LLAMA2+graphmax. It is worth noting that the time cost associated with graphmax is approximately 15-17 times higher than that of the original LLAMA2 implementation, which is consistent with the findings derived from the analysis of time complexity.

GPT-2
You do not want to read this book if you don't want to change. Shepherd Tebbit Reading is business. It's business in the eyes of God. Hertha Wise Sorry to end this, but this has been a hunch for a couple of months now. See more posts on wikipedia.com. Also read Martin Walker's The Last Dance. Updated 5:18 PM

CTRL
You do not want to read this book if you don't want to change. It is not a guide to your own life, it is not an answer to your own questions. It is a warning to others to take the time to read this book. It is not an easy book to follow, and you will be left with a lot of questions that are going to be difficult to answer. This book is not about being afraid of change. It is about being a true believer. I think you are probably going to find a lot of mistakes, and that's okay. In this book, we are going to learn a lot and learn how to change our life. Years ago during our fighting back before we even got a chance to fight back, we will never see a great American.

Table 1: Comparisons of book reviews generated by GPT-2, CTRL, PPLM, OPT and GPT-2+graphmax (denoted as 'New') on generating product reviews, with the local graph of the graphmax constructed from the Amazon review corpus.

General Text Generation
E-Commerce Product Review Generation. In the e-commerce product review generation experiment, we construct a graph with 4 million product reviews in the Amazon corpus (Zhang et al., 2015). Table 1 shows representative book reviews generated by the four baseline methods (GPT-2, CTRL, PPLM, OPT) as well as the proposed graphmax incorporated into GPT-2 (GPT-2+graphmax). From the perspective of narrative logic, PPLM, GPT-2, OPT, and our method perform better than CTRL. The sentences generated by CTRL often appear as paraphrases of a start sentence. On the other hand, GPT-2, OPT, and our method elaborate more on the topic and engage in further discussions. The book review generated by GPT-2 contains extensive information, such as another book's title and people related to the book, but such an abundance of multi-source information can be distracting. Similarly, although OPT may highly recommend a book, the basis for such a recommendation is not attributed to the book's content. In contrast, the proposed GPT-2 with graphmax attempts to gradually persuade the reader by focusing on the book's content. It utilizes sentences of varying lengths, which resembles genuine reviews.
[Caption fragment: "... (Touvron et al., 2023) and the local corpus Amazon."]
Table 2 shows the BLEU scores of the text generated by the baseline methods and our model.The relative improvement of GPT-2 with graphmax over the baselines is more significant in the BLEU-5 metric than the BLEU-2 metric.
Food Review Generation. This experiment constructs a graph based on the Yelp food review (McAuley & Leskovec, 2013; Zhang et al., 2015) dataset, which contains more than 700,000 food reviews.
GPT-2
Now I know why people review this place so highly, it's even ridiculous that a local show will automatically get chosen the greatest movie in the annals of writing. I mean even though it's written by the ones I love in all these years! Just take a glance. How many times have you seen that absolutely 100 dollar DVD and read the comments? That only proves that you are as mature as I am and as invested in The Cove as I am in Denzel Washington. Ridiculous. But I couldn't get a dentist just looking at them and thinking "well, the worst thing that could happen in a movie is the less you have to do it.

CTRL
Now I know why people review this place so highly. This is a place for good food, good music and amazing people. I was here for an informal "crawl" on a Thursday night, and I saw a young woman and some young guys in their late 20s, all sitting on the bench, drinking beer, with their feet dangling down in a pile of trash. We had been to a lot of great places here, but nothing beats a little party on an old-fashioned Saturday night and getting in to the groove and making some good friends.

PPLM
Now I know why people review this place so highly. There's always a lot of reviews here for anything, but it's not a place where we're talking about an interesting place. But I'm not sure. This article was originally published at the University of Massachusetts Press.

OPT
Now I know why people review this place so highly, Now I know why people review this place so highly, it's because it's a great place to stay. I've been staying here for years and I've never been disappointed. The rooms are clean and comfortable, the staff is friendly and helpful, and the location is great. I've stayed here for business and pleasure.

New
Now I know why people review this place so highly, for any reason-shear interest, get bored, try new things. However, I've been coming here for a few weeks now, and when I first opened the door and saw this bar, my Roast was completely filled with treats, salads, and soul food. I really enjoyed everything I've had in here for the past several days. The food came out in such a timely fashion that I stopped and have been keeping an eye on the Trolley Room Lounge bar (with its lights on) a little since I finished my morning. The food is very good, as I've been ordering fresh baked foods since my college days, I still just couldn't stop with homemade sweet potatoes. I thought the bakery was pretty good. Overall, the atmosphere was clean, laid back, and friendly, and it was friendly.

Table 5: The BLEU scores of GPT-2, CTRL, PPLM, LLAMA2, and OPT with or without graphmax on generating restaurant reviews (± standard deviation), averaged over five cross-validation folds.
Table 4 presents examples of the restaurant reviews generated by the baselines and the proposed model. In terms of narrative logic, GPT-2 wanders across too many topics, which makes it difficult to follow the main story-line, "a place". In contrast, CTRL, PPLM, OPT, and our GPT-2+graphmax method keep the text focused on the core story. The OPT output resembles a blog post authored by a seasoned food enthusiast and adopts a formal tone rather than that of an internet review. The proposed method (GPT-2+graphmax), by comparison, prefers a free style that is much closer to a real e-commerce review written by human users. Our model tends to use shorter phrases, such as "shear interest", "get bored", "try new things", and "laid back", and even repetitive statements, such as "and friendly, and it was friendly", which are more similar to real food/restaurant reviews on e-commerce sites. In other words, the fluency of the generated text approaches that of real reviews. The comparison between GPT-2 and our method (GPT-2+graphmax) suggests that the local knowledge obtained in Equation (8) is indeed important in text generation.
We report the BLEU-2 to BLEU-5 scores (Papineni et al., 2002) in Table 5 to evaluate the overall performance on food review generation. We set λ = 1.0. With graphmax (+graphmax), the BLEU-2 to BLEU-5 scores on generating long food reviews are consistently better than those of PPLM, CTRL, OPT, and GPT-2 without graphmax. For the LLAMA2 baseline, graphmax achieves better BLEU-3 and BLEU-5 but worse BLEU-4 than LLAMA2 without graphmax.
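As a rough illustration of how BLEU-n scores of this kind are computed, the following sketch implements a sentence-level cumulative BLEU-n with uniform weights and a brevity penalty, following the general form of Papineni et al. (2002). The whitespace tokenization, the single-reference setting, the absence of smoothing, and the function names `ngram_counts` and `bleu_n` are assumptions made for illustration; the exact BLEU implementation used in the experiments is not specified in this section.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_n(candidate, reference, max_n):
    """Cumulative sentence-level BLEU up to order max_n (no smoothing)."""
    cand, ref = candidate.split(), reference.split()  # assumed tokenization
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, a zero precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions with uniform weights.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Under these assumptions, BLEU-2 corresponds to `bleu_n(candidate, reference, 2)` and BLEU-5 to `bleu_n(candidate, reference, 5)`. A corpus-level score would instead pool the clipped counts over all sentences before taking the precisions.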
The left panel of Figure 3 shows the perplexity (PPL) of the text generated by the competing methods and our model, with the local graph constructed from the Amazon corpus. Unlike BLEU, perplexity is relatively low when the contribution of the regularized local knowledge (with graphmax) is small. The right panel of Figure 3 summarizes the perplexity of the generated food reviews. The results indicate that local knowledge improves perplexity for all the baseline models.
Comparisons with Other softmax Functions In this experiment, we compare the proposed graphmax with the original softmax, sparsemax (Martins & Astudillo, 2016), and kernel Bayesian softmax (kerBayesian softmax) (Miao et al., 2019) functions on food and e-commerce review generation. As shown in Figure 5, graphmax achieves the best performance on both the BLEU and perplexity metrics.
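To make the difference between two of the compared output functions concrete, the following sketch contrasts the standard softmax with sparsemax, using the sorting-based closed form of the simplex projection described by Martins & Astudillo (2016). It is only a minimal illustration on plain Python lists; graphmax and the kernel Bayesian softmax are not shown because their definitions are given elsewhere in the paper, and the function names are assumptions.

```python
import math


def softmax(z):
    """Standard softmax: every entry receives strictly positive probability."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(z_sorted, start=1):
        cumsum += v
        # The support size is the largest k with 1 + k * z_(k) > cumsum;
        # the threshold tau is recomputed each time the condition holds.
        if 1.0 + k * v > cumsum:
            tau = (cumsum - 1.0) / k
    # Entries below the threshold are set exactly to zero.
    return [max(v - tau, 0.0) for v in z]
```

Both functions return a vector on the simplex, but sparsemax can assign exactly zero probability to low-scoring words, whereas softmax never does.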
Visualization of x We visualize the vector calculated by the proposed graphmax and compare it with the corresponding result of the traditional softmax. For the context "I was so disappointed by this, and at this stage I was going to ...", we visualize the top 10 next words recommended by GPT2+softmax and GPT2+graphmax, respectively. As shown in Figure 6, the lists of words recommended by the two methods are slightly different. The next word predicted by softmax for this context is "give", while that predicted by graphmax is "be".
Figure 4: Searching λ using GPT-2 with the graphmax approach, considering both BLEU and perplexity as evaluation metrics. The results are presented in two panels: the upper panel shows the evaluation from a BLEU perspective, while the bottom panel presents the evaluation based on perplexity. The local graph used for parameter searching is constructed from the Amazon corpus in the left panels and the Yelp corpus in the right panels.

Machine Translation
To evaluate the performance of graphmax in machine translation, we conduct both monolingual and multilingual machine translation experiments. We set the parameter λ = 1.0 in the following experiments.
Monolingual Machine Translation We use BART (Lewis et al., 2020) as the base model and attach the proposed graphmax to it. We construct a local style graph from the target-language side of the WMT'16 corpus, and then plug this local word preference graph into BART to improve machine translation. Table 6 shows the translation results in both the Romanian → English and English → Romanian directions.
The upper panel of the table presents the BLEU scores for translating English into seven languages, while the lower panel displays the inverse translations. Our results demonstrate that mBART+graphmax surpasses the baseline model in most translation tasks.

The n-gram Mixture Model
As elaborated in Section 2.4 (e.g., Equation (3)) and Section 5, graphmax with a first-order concurrent graph Ã acts as a 2-gram model. We can extend it to a general n-gram model by replacing Ã with a higher-order adjacency matrix Ã^n. Moreover, instead of using only Ã, we can employ a mixture model consisting of Ã + Ã^2 + ... + Ã^n. This mixed adjacency matrix integrates more useful knowledge into graphmax. The results in Table 8 indicate that an n-gram mixture model achieves better performance in most of the experiments involving text generation and machine translation. This further corroborates the effectiveness of incorporating higher-order dependencies in the form of Ã + Ã^2 with graphmax.
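The mixed adjacency matrix can be sketched as follows on a toy corpus. This is a minimal illustration rather than the construction used in the experiments: the whitespace tokenization, the row normalization of the 2-gram counts, and the helper names `bigram_adjacency`, `matmul`, and `mixture` are assumptions, and the way the resulting matrix enters graphmax is defined in the earlier sections of the paper.

```python
from collections import Counter


def bigram_adjacency(sentences):
    """Row-normalized 2-gram counts as a directed, weighted adjacency matrix."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:]):
            counts[(index[a], index[b])] += 1  # edge a -> b
    n = len(vocab)
    A = [[0.0] * n for _ in range(n)]
    for (i, j), c in counts.items():
        A[i][j] = float(c)
    for row in A:
        total = sum(row)
        if total > 0:  # assumed normalization: outgoing weights sum to one
            for j in range(n):
                row[j] /= total
    return vocab, A


def matmul(X, Y):
    """Plain-Python product of two square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mixture(A, order):
    """Return A + A^2 + ... + A^order."""
    n = len(A)
    total = [row[:] for row in A]
    power = [row[:] for row in A]
    for _ in range(order - 1):
        power = matmul(power, A)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


vocab, A = bigram_adjacency(["the food was good", "the staff was friendly"])
M2 = mixture(A, 2)  # first- plus second-order links
```

In this toy corpus, "the" is never directly followed by "was", so the first-order matrix has no edge between them; the second-order term adds one through the paths "the food was" and "the staff was", which is the kind of longer-range dependency the mixture is meant to capture.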

Human Evaluation
To evaluate the fluency of the generated text, we further conduct a human evaluation based on the work of Ahn, Morbini, and Gordon (2016). Participants are presented with a set of food and restaurant reviews generated by GPT-2 (baseline) and our proposed model (GPT-2+graphmax). They are asked to rate the quality of the text from four perspectives: (1) text style, e.g., whether a machine-generated food review is written in social media slang or in a formal style; (2) word choice; (3) word order variation; (4) story-line, i.e., the completeness of the narrative logic of the generated text.
In particular, participants read the reviews randomly selected from the baseline and the proposed model, and then rate the text from the four perspectives on a 5-point Likert scale: from 1 (worst) to 5 (best).
We randomly select 50 food and restaurant reviews generated by GPT-2 and another 50 reviews generated by the proposed model. The 100 reviews are shuffled, and five participants (students) are invited to read and evaluate them independently. Moreover, we distribute the evaluation task to a wider range of participants through a questionnaire website³. A total of 27 participants complete the questionnaire, including teachers, older adults, and secondary school students. The 100 mixed reviews are shuffled and presented for rating to all 32 participants, resulting in 426 valid ratings. The overall rating results for the four evaluation categories, as well as their averages, are shown in Table 9. Our proposed model (GPT-2+graphmax) performs better than the baseline in controlling the text style and story-line coherence. Overall, our method generates reviews that are more in line with the social media style and have a more coherent narrative logic.

Conclusion
Motivated by the word concurrent relationships that describe the human preferences of expression, we construct a weighted directed word concurrent graph for scene-specific text generation. The nodes of the graph are the words that appear in the corpus, the directions of the edges are determined by the 2-grams of the corpus, and the corresponding weights are calculated from the frequencies of the 2-grams. To incorporate the concurrent information into the process of text generation, we propose a concurrent graph regularized softmax function, called graphmax, for next-word prediction. The proposed graphmax is an n-gram mixture model that can be readily plugged into a large-scale pre-trained LM. We apply graphmax to scene-specific review generation and machine translation, and the results from both tasks demonstrate that our method improves performance over existing methods using softmax. This suggests that, owing to its simplicity and versatility, graphmax should be used broadly in LMs.

Figure 3: Comparisons of perplexity using GPT-2, CTRL, PPLM, OPT, and LLAMA2 with or without (w/o) graphmax on generating product reviews, with the local graph constructed from the Amazon and Yelp corpuses.

Figure 6: Visualization of the top 10 words respectively recommended by GPT2+softmax and GPT2+graphmax in the context "I was so disappointed by this, and at this stage I was going to ...".
That means that you are going to find a lot of happiness.

PPLM
You do not want to read this book if you don't want to change. The first part is an explanation of what it is like to play with a computer and then to get your head around it. You might just want to play with a computer, or you might want to use a calculator or a phone to get a picture of what to do. I really think it's important that you understand the basics of your game.

OPT
You do not want to read this book if you don't want to change. This is a very inspiring book, which encourages you to change your life and become a much better person. If you're unhappy with the way you are, you should read this book! The author describes in the book several ways to improve yourself and your personality.

New
You do not want to read this book if you don't want to change. Blossom flows, thickly. Whatever you say it has become pretty clear. You are learning from the book and finding things that you must learn. You are learning from the criticism, the griping, and the washing up. What you learn and remember about it is that it is what we learned from it, and so what we learn will help us get back into this new energy of determination that helped us overcome the most difficult elements of battle.

Table 2: The BLEU scores of GPT-2, CTRL, PPLM, OPT, LLAMA2, and the proposed methods on generating product reviews (± standard deviation), averaged over five cross-validation folds. The local graph is constructed with the Amazon corpus, and the best results are highlighted in boldface.

Table 3: Comparison of the average inference time (in seconds), with the base model LLAMA2

Table 9: Results of human evaluation on text generation (standard deviations in brackets).