Removing Bias and Incentivizing Precision in Peer-grading

We study peer-grading with competitive graders who enjoy a higher utility when their peers get lower scores. We propose a new mechanism, PEQA, that incentivizes such graders through a score-assignment rule, which aggregates the final score from multiple peer-evaluations, and a grading performance score, which rewards performance in the peer-grading exercise. PEQA makes grader-bias irrelevant. Additionally, under PEQA, a peer-grader's utility increases monotonically with the reliability of her grading, irrespective of her competitiveness and of how her co-graders act. In a reasonably general class of score-assignment rules, PEQA uniquely satisfies this utility-reliability monotonicity. When grading is costly and costs are private information, a modified version of PEQA implements the socially optimal effort choices in an equilibrium of the peer-evaluation game. Data from our classroom experiments confirm our theoretical assumptions and show that PEQA outperforms the popular median mechanism.


Introduction
A peer-evaluation process aggregates assessments from peers to judge the quality of submitted work. Scientific communities use peer-evaluation for reviewing the quality of articles and grant proposals (Campanario, 1998). Coursera and EdX, which offer Massive Open Online Courses (MOOCs) to 94 million learners, use peer-grading to evaluate submitted assignments. Many in-person classes are also adopting it, and its growing popularity can be explained by three reasons. First, it simplifies and accelerates the evaluation and grading process. Second, it improves the learning outcomes of the participating students (Sadler and Good, 2006). Third, it easily scales to large classes.
When students are evaluated on a curve, students naturally care about their relative performance vis-à-vis peers. Even when evaluated on an absolute grading scale, students care about their relative performance due to the role it plays in admission into jobs or higher studies. This creates perverse incentives for peer-graders. Strong et al. (2004) find that peer-graders give consistently biased grades in peer-grading schemes. In an anonymous survey that we ran on the students of a reputed technical institute in India, 49% of the 549 respondents expected that their fellow students would grade aggressively to reduce the scores of others, and thereby try to improve their relative class-ranking.
We study the problem of incentivizing competitive and strategic peer-graders. In our model, students write an exam and then peer-grade each other's exams. Thus, every student has dual roles: (i) the student role, where she writes an exam that gets evaluated, and (ii) the grader role, where she evaluates others. Their total course-score is the sum of their own exam score (aggregated from peer-reports) and a score based on their peer-grading performance. To model competitive students, we assume that their utility is linearly increasing in their total course-score and linearly decreasing in their peers' total course-scores.
To model strategic grading, we adapt the PG$_1$ statistical peer-grading model of Piech et al. (2013) to a strategic environment. Our model assumes that each paper being peer-graded has a true score. Peer-graders choose the reliability (inverse of variance) of the independent, noisy signals that they observe about the true score. Choosing higher reliability results in observing a more accurate signal. Graders can then decide to add a bias to their observed signal while reporting their assessment. Graders who care about their relative success among peers might purposefully bias their evaluations. They may also choose to receive less reliable signals.
What is the set of desiderata one could ask for a mechanism in this setup? At a minimum, the mechanism should be able to overcome the perverse competitive incentives of biased or unreliable grading. To simplify, we initially assume that more reliable grading (lower variance) does not come at an extra cost to the peer-grader.
We propose a new mechanism, Peer Evaluation with Quality Assurance (PEQA), that ensures the following (Theorem 1): assigned scores and graders' utilities are bias-insensitive (defined in Definition 2), and higher reliability yields monotonically higher utility to the grader, despite her competitiveness and the actions of her co-graders (reliability monotonicity, Definition 3). PEQA uniquely satisfies this monotonic reliability-utility relation within a moderately general class of mechanisms (Theorem 2).

In Section 6, we address whether PEQA satisfies the more ambitious desideratum of implementing a "preferred level" of grading among competitive graders, while accounting for the cost of grading effort. We assume that students face an additional disutility (cost) from grading that increases with their reliability. How much effort should one ask students to exert? Reliability is desirable, but it might be prohibitively costly for students to spend all their time on grading! We define the net social utility (Equation (9)) of the game as the difference between the social benefit of high reliability and the aggregate cost of effort. Under this setup, we show that a modified version of PEQA implements Nash equilibria of the peer-grading game (with private costs) in which individuals spend the socially optimal level of effort (Theorem 3). The modified PEQA maintains the same ranking among the students as the original PEQA (Lemma 1).

How does the mechanism PEQA work? A small subset of the total number of papers (called probes) is evaluated by the teaching staff. Each grader is assigned K > 2 papers (of which K/2 are probes), and graders never grade their own papers. The peer-graders cannot tell the probes apart from the non-probes. PEQA compares the grader's and the teaching staff's evaluations of the probes to estimate each grader's bias and reliability. This requires two identifying assumptions: that the teaching staff can observe the true scores on the probe papers, and that the graders grade identically on probes and non-probes. The estimated grader-bias is subtracted from the peer-reports to de-bias the reports. PEQA's score-assignment function assigns a weighted average of the de-biased grader-reports, with the weights being the inverse square-root of the estimated grader-variance. Thus, reports from high-variance graders play a smaller role in the finally assigned score.
We allow students to raise regrading requests after seeing their score. The teaching staff regrade such papers and assign them the true score. We assume that such requests are raised only when the student knows that her initially assigned score was lower than the true score. The schematic diagram of the stages of PEQA is shown in Figure 1.

[Figure 1: Schematic diagram of the PEQA mechanism decomposed into four phases. Phase 0: collect papers. Phase 1: select and grade probe papers. Phase 2: assign each peer-grader K/2 non-probe papers and K/2 probe papers. Phase 3: estimate bias and reliability, aggregate scores, and compute peer-graders' performance scores. Any peer-grader's total score = final score on own paper + grading performance score; submitted scores are aggregated for each paper, and regrading requests are resolved to find the final score. A typical non-probe paper is denoted by j.]
To test some of our baseline assumptions and to see how easily our mechanism could be implemented in practice, we ran classroom experiments (Section 7). Students enrolled in a computing course were asked to peer-grade a weekly class-quiz. The scores assigned under PEQA were remarkably accurate: only 1 paper out of 41 received a wrong score.
Results from our PEQA sessions (Tables 1 to 3) confirm two of our assumptions. First, the bias and variance were indeed identical across probes and non-probes: subjects were not able to discern one from the other (Hypothesis 4). Second, grade-manipulations, whenever present, reduced scores instead of inflating them. This rejects the existence of collusive (i.e., the opposite of competitive) graders (Hypothesis 1). We ran a second competitive session under a Median mechanism, which is currently the most popular mechanism used in MOOCs. In our experiments, the PEQA mechanism outperformed the Median mechanism in terms of allocating accurate final scores (Hypothesis 3), and these differences were statistically significant. We have also developed a peer-grading platform, SwaGrader (swagrader.cse.iitk.ac.in), which uses PEQA as the main peer-grading algorithm and is being tested by instructors and students at the Indian Institute of Technology Kanpur.

Related Work
The existing research on peer-evaluation mechanisms can be broadly divided into three strands. The first strand of literature abstracts away from any strategic motives of the peer-evaluators. Instead of providing a mechanism to incentivize strategic evaluators, these works propose how the grader reports could be aggregated efficiently (Hamer et al., 2005; Cho and Schunn, 2007; Piech et al., 2013; Shah et al., 2013; Paré and Joordens, 2008; Kulkarni et al., 2014; De Alfaro and Shavlovsky, 2014; Raman and Joachims, 2014; Caragiannis et al., 2015; Wright et al., 2015).
The final strand consists of hybrid approaches in which the true quality of some of the peer-assessed material can be found, e.g., by the mechanism designer (the teaching staff, in the case of MOOCs) evaluating a part of the material herself. Graders are then rewarded for agreement with the designer's report (Jurca and Faltings, 2005; Dasgupta and Ghosh, 2013; Gao et al., 2016). Our mechanism also utilizes the feature that the true scores on a small subset of assignments can be revealed at a small cost. However, we additionally address new and practical features of the peer-grading problem: we allow for competitive graders, we solve the efficient allocation problem under costly grading, and we allow regrading requests.

Alon et al. (2011) and Holzman and Moulin (2013) study situations where peers have to choose a subset amongst themselves for a reward. The challenge there is to incentivize the peers to reveal their private information unselfishly. In particular, the goal is to guarantee that what peers report does not affect their chances of winning or getting selected. In these settings, there is no need to incentivize peers to gather information that is 'objective' (e.g., the true score on an exam) and verifiable at a cost. There is also no need to ensure that peers enjoy higher utility when their gathered information is more precise. Finally, peers are purely selfish: they do not care about who wins when they do not win themselves. Thus, by decoupling the reports from personal winning chances, the mechanism makes the peers indifferent between all reports.

Cai et al. (2015) consider a setting where data-sources (e.g., human labelers) can be paid monetarily for their estimates of $f(x_i)$ at points $x_i$ allocated to them. The end goal is to estimate an exogenously provided $f$ using a given estimator $\hat f$. Data-sources can observe a noisy version of $f(x_i)$, with the noise decreasing in their effort, and they maximize the difference between the payment and the cost of the effort. They show that under their VCG-like payment mechanism and the assumption of a "well-behaved" $\hat f$, the dominant strategy for a data-source is to reveal its observation correctly and always participate in the data-providing exercise. Cai et al. (2015)'s data-sources have no competitive preferences, unlike our graders. We also propose the optimal estimator $\hat f$ to use on the observed data, which they do not.
2 Peer-grading Mechanism

2.1 Definition

Each subject $i \in N = \{1, \ldots, n\}$ has written an exam and is also a participant in the peer-grading process. Thus $N = \{1, \ldots, n\}$ represents both the set of papers to be graded and the set of graders. We use $i$ as the index for a grader and $j$ as the index for a paper. For simplicity of exposition, we assume that each paper has only one question for evaluation.
Our mechanism instructs the teaching staff to evaluate a fixed number $\ell$ ($\ll n$) of these papers so that their true grades are known. These papers are called the probe papers. Let $G(j)$ denote the set of peer-graders of paper $j$ and $G^{-1}(i) := \{k \in N : i \in G(k)\}$ denote the set of papers assigned to evaluator $i$. The sets $P_i \subset G^{-1}(i)$ and $NP_i = G^{-1}(i) \setminus P_i$ denote, respectively, the probe and non-probe papers assigned to $i$. Both true and reported scores belong to $\mathbb{R}$. The co-graders of individual $i$ are $CG_i = \bigcup_{j \in NP_i} G(j) \setminus \{i\}$. We assume that each co-grader of $i$ grades at least one common non-probe paper with $i$.
Assuming that peer-reported scores are real numbers, a peer-grading mechanism $M$ is the tuple $\langle G, r, t \rangle$, where $G: N \to 2^N$ is the assignment function that maps papers to graders. The score-assignment function $r: \times_{j \in N} \mathbb{R}^{G(j)} \to \mathbb{R}^n$ has $j$th component $r_j(\cdot)$, the function assigning the final score of paper $j$ based on the scores reported by $G(j)$. The peer-grading performance score function $t: \times_{i \in N} \mathbb{R}^{G^{-1}(i)} \to \mathbb{R}^n$ has $i$th component $t_i(\cdot)$, the function that yields the peer-grading performance score of grader $i$. Since every student $i$ has dual roles in peer-grading, as explained in Section 1, $r_i$ and $t_i$ are the mechanism-assigned scores corresponding to her student and grader roles. For example, in a course that has 80 points on the exam and 20 points on peer-grading performance, a student might score $r_i = 60$ and $t_i = 15$ on those two respectively; her total course-score would be 75 out of 100.
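To make the three components concrete, here is a minimal sketch of the tuple $\langle G, r, t \rangle$ as Python data; the names (Mechanism, Paper, Grader, Reports) are illustrative, not from the paper.

```python
# A minimal sketch of the mechanism tuple <G, r, t> as Python callables.
from dataclasses import dataclass
from typing import Callable, Dict, Set

Paper = int
Grader = int
Reports = Dict[Paper, Dict[Grader, float]]  # reports[j][i] = grader i's score on paper j

@dataclass
class Mechanism:
    G: Dict[Paper, Set[Grader]]                   # G(j): the graders of paper j
    r: Callable[[Reports], Dict[Paper, float]]    # final score per paper
    t: Callable[[Reports], Dict[Grader, float]]   # performance score per grader

    def graded_by(self, i: Grader) -> Set[Paper]:
        """G^{-1}(i): the set of papers assigned to grader i."""
        return {j for j, graders in self.G.items() if i in graders}
```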

Model of the True and Reported Scores
We generalize the PG$_1$ model of true score, bias, and reliability (Piech et al., 2013) to a strategic environment. We make two major changes. First, we replace their assumptions of normality with a general distribution $F(\cdot)$ with support $(-\infty, \infty)$ and a differentiable density function $f(\cdot)$. We use $F(\mu, 1/\gamma)$ for such a distribution with mean $\mu$ and variance $1/\gamma$. Second, instead of assuming that bias and reliability are drawn randomly and independently from Normal and Gamma distributions respectively, we make each a strategic choice of the peer-graders. Subject to these changes, the following features of our model resemble the PG$_1$ model.
The true score $y_j$ for paper $j$ is distributed as $F(\mu, 1/\gamma)$, for all $j \in N$. This distribution is known from historical data of past examinations. Peer-graders do not see $y_j$; after they choose their reliability $\tau_i$, they observe an independent draw from $F(y_j, 1/\tau_i)$. The higher $1/\tau_i$ is, the noisier the draw. Graders then add a bias $b_i$ to the signal before reporting. $\tilde y_j^{(i)}$ is the reported score of paper $j$ by grader $i$. Conditional on the true score $y_j$, it is distributed as
$$\tilde y_j^{(i)} \,\big|\, y_j \;\sim\; F\!\left(y_j + b_i,\; 1/\tau_i\right),$$
where $b_i$ and $\tau_i$ are called the bias and reliability of $i$ respectively.

Independence across exams and graders: the conditional distributions of $i$'s and $k$'s reported scores on exams $j_1$ and $j_2$ are independent. Thus,
$$f\!\left(\tilde y_{j_1}^{(i)}, \tilde y_{j_2}^{(k)} \,\big|\, y_{j_1}, y_{j_2}\right) = f\!\left(\tilde y_{j_1}^{(i)} \,\big|\, y_{j_1}\right) f\!\left(\tilde y_{j_2}^{(k)} \,\big|\, y_{j_2}\right),$$
for all $i, k, j_1, j_2, y_{j_1}, y_{j_2}$ such that $i = k$ and $j_1 = j_2$ do not hold simultaneously. Thus, even for the same grader, the signals from different exams are conditionally independent; and even for the same exam, the signals received by different graders are conditionally independent.
We have used the same distribution $F$ for both the true scores $y_j$ and the scores observed by grader $i$, i.e., $\tilde y_j^{(i)}$, to keep the model similar to PG$_1$. However, this is not critical to our results. In particular, (a) we can have two different distributions for these two sets of random variables, and (b) the distribution of the observed score $\tilde y_j^{(i)}$ can vary with $i$. Neither change affects the main conclusions of this paper. The dynamics of the grading process are shown in Figure 2. Reliability is defined as the inverse of the noise variance. Bias originates from strategic manipulation or from non-strategic (generous or strict) grading habits. In this paper, we assume that the grader chooses her bias and reliability.
We assume that a grader grades all papers (probes and non-probes) with the same bias and reliability. This assumption is natural if the graders cannot tell the probes from the non-probes. We find support for this assumption in our experimental sessions: bias and reliability are indeed identical across probes and non-probes. We use the shorthand $\theta_i = (b_i, \tau_i) \in \mathbb{R} \times \mathbb{R}_{\geq 0}$ to denote grader $i$'s strategic choices.
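As an illustration of this reporting model, the following sketch simulates a single report. The normal choice of $F$ and all parameter values are assumptions made for the example only; the paper merely requires $F$ to have full support and a differentiable density.

```python
import numpy as np

rng = np.random.default_rng(0)

def report(y_j: float, b_i: float, tau_i: float) -> float:
    """Grader i's report on paper j: a draw from F(y_j + b_i, 1/tau_i).
    A normal F is used purely for illustration."""
    sigma_i = 1.0 / np.sqrt(tau_i)        # noise std dev = 1 / sqrt(reliability)
    return y_j + b_i + sigma_i * rng.standard_normal()

# Example: a harsh grader (b_i = -2) with fairly high reliability (tau_i = 4).
mu, gamma = 60.0, 1.0 / 100.0             # prior mean and precision of true scores
y = mu + rng.standard_normal() / np.sqrt(gamma)   # true score y_j ~ F(mu, 1/gamma)
print(report(y, b_i=-2.0, tau_i=4.0))
```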

Other primitives of our mechanism
We have already defined a general peer-grading mechanism in Section 2.1. In this section, we fine-tune the $\langle G, r, t \rangle$ functions for our proposed mechanism.
Paper assignment rule $G^*(\cdot)$. Every paper is graded by at least one grader, and every grader grades at least two probe papers and one non-probe paper. Thus, (a) $G^*(j) \neq \emptyset$ and $j \notin G^*(j)$, $\forall j \in N$; (b) $|P_i| \geq 2$, $\forall i \in N$; and (c) $NP_i \neq \emptyset$, $\forall i \in N$. The graders know the proportion of probe and non-probe papers assigned to them, but cannot tell them apart. A small checker for these constraints is sketched below.
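The checker below is a direct transcription of constraints (a)-(c); the function name and representation are ours, not the paper's.

```python
def valid_assignment(G, probes, n):
    """Check constraints (a)-(c) above for a candidate assignment G*.
    G[j] is the set of graders of paper j; papers and graders share ids 0..n-1;
    `probes` is the set of probe paper ids."""
    for j in range(n):
        if not G[j] or j in G[j]:          # (a): G*(j) nonempty, never own paper
            return False
    for i in range(n):
        assigned = {j for j in range(n) if i in G[j]}   # G^{-1}(i)
        if len(assigned & probes) < 2:      # (b): |P_i| >= 2
            return False
        if not (assigned - probes):         # (c): NP_i nonempty
            return False
    return True
```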

Grade assignment and performance scores
The mechanism compares the peer-graded scores ($\tilde y_j^{(i)}$) with the true scores ($y_j$) on the probe papers $P_i$ to statistically estimate the error parameters $\hat\theta_i = (\hat b_i, \hat\tau_i)$ of each grader $i$. The estimated parameters are used in assigning final scores to papers and performance scores to peer-graders.
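The sketch below shows one natural way to estimate $(\hat b_i, \hat\tau_i)$ from a grader's probe papers, using the moment estimators implied by the model. The paper's exact estimator is the one defined in its Section 2.3, so treat this as an assumption-laden illustration.

```python
import numpy as np

def estimate_params(reports: np.ndarray, true_scores: np.ndarray):
    """Estimate (b_hat, tau_hat) for one grader from her probe papers.
    reports[k] is her score on probe k; true_scores[k] is the staff score."""
    errors = reports - true_scores
    b_hat = errors.mean()                      # average over-/under-grading
    var_hat = np.mean((errors - b_hat) ** 2)   # variance of the de-biased errors
    tau_hat = 1.0 / var_hat if var_hat > 0 else np.inf   # reliability = 1/variance
    return b_hat, tau_hat

# A grader who grades about 2 points low, with moderate noise.
print(estimate_params(np.array([55.0, 70.0, 38.0, 62.0]),
                      np.array([58.0, 71.0, 41.0, 63.0])))
```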
Definition 1 (Score and Reward) We define the score-assignment rule and the social reward as follows.
The score-assignment function $r = (r_j : j \in N)$ is the inverse standard-deviation weighted de-biased mean (ISWDM) if, for every non-probe paper $j$, it assigns
$$r_j^*\!\left(\tilde y_j^{G(j)}, \hat\theta^{G(j)}\right) = \frac{\sqrt{\gamma}\,\mu + \sum_{i \in G(j)} \sqrt{\hat\tau_i}\left(\tilde y_j^{(i)} - \hat b_i\right)}{\sqrt{\gamma} + \sum_{i \in G(j)} \sqrt{\hat\tau_i}}, \qquad (1)$$
where $\tilde y_j^{(i)}$ is the evaluation by the $i$th peer-grader and $(\hat b_i, \hat\tau_i)$ are her estimated parameters. The score $r^*$ assigns the instructor-verified grade on every probe paper.

The social reward for paper $j$, at a score $r_j^*$ and true score $y_j$, is
$$W_j\!\left(\tilde y_j^{G(j)}, \hat\theta^{G(j)}; y_j\right) = R\!\left(r_j^*\!\left(\tilde y_j^{G(j)}, \hat\theta^{G(j)}\right), y_j\right),$$
where $\tilde y_j^{G(j)}$ is the vector of peer-evaluated scores reported on paper $j$, $\hat\theta^{G(j)}$ is the vector of estimated error-parameters for the relevant graders $G(j)$, and $R: \mathbb{R}^2 \to \mathbb{R}$ is a reward function that measures the closeness of the true score $y_j$ and the given score $r_j^*$. Formally, $R(x_1, y_1) < R(x_2, y_2)$ if $|x_1 - y_1| > |x_2 - y_2|$, for all $x_1, x_2, y_1, y_2 \in \mathbb{R}$. We assume that $R(x, x) = 0$ and $R(x, y) = R(y, x)$ for all $x, y \in \mathbb{R}$. One example of such a function is $R(x, y) = -(x - y)^2$, the negative squared error in assigned scores. The social reward at a score $r_j^*$ for paper $j$ without grader $i$, when the true score is $y_j$, is denoted by $W_j^{(-i)}$ and is defined as above with grader $i$'s report removed. The parameters $\gamma$, $\mu$, $b_i$, and $\tau_i$ are as given by the PG$_1$ model of Piech et al. (2013) (see Section 2.2).
We will use the shorthands $W_j^*$ and $W_j^{(-i)*}$ for the social rewards with and without grader $i$, respectively, when the arguments of these functions are clear from the context.
The ISWDM score-assignment function takes a weighted average of the prior mean $\mu$ and the de-biased reported scores (the estimated bias subtracted from the reported scores). De-biasing ensures that the biases of the graders do not affect the finally assigned grade. The weight is chosen to be the square-root of reliability, which is the inverse of the variance for that grader. The higher the estimated reliability, the higher the weight on that grader.
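A direct transcription of Equation (1) as reconstructed above looks as follows; the numeric example values are made up.

```python
import numpy as np

def iswdm(reports, b_hat, tau_hat, mu, gamma):
    """ISWDM score for one paper, per Equation (1): a weighted average of the
    prior mean mu and the de-biased reports, with weight sqrt(tau_hat_i) on
    grader i and sqrt(gamma) on the prior."""
    reports = np.asarray(reports, float)
    b_hat = np.asarray(b_hat, float)
    w = np.sqrt(np.asarray(tau_hat, float))   # inverse standard-deviation weights
    return ((np.sqrt(gamma) * mu + np.sum(w * (reports - b_hat)))
            / (np.sqrt(gamma) + np.sum(w)))

# The unreliable third grader (tau_hat = 0.25) gets the smallest weight.
print(iswdm(reports=[58.0, 61.0, 70.0], b_hat=[-2.0, 0.0, 1.0],
            tau_hat=[4.0, 4.0, 0.25], mu=60.0, gamma=0.01))
```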
Without incentive concerns, a statistician would have suggested a score-assignment function that minimizes the expected squared distance between the assigned score and the true score on exam $j$, conditional on the true bias and variance parameters; those true parameters could then be approximated by the estimated bias and variance. In Equation (26) of Appendix B, we show that under the strong distributional assumptions of Piech et al. (2013), such a score-assignment function on exam $j$ would come from the class of weighted average (WA) score-assignment functions:
$$r_j^{WA}\!\left(\tilde y_j^{G(j)}, \hat\theta^{G(j)}\right) = \frac{\lambda_0\, \mu + \sum_{i \in G(j)} \lambda_i\left(\tilde y_j^{(i)} - \hat b_i\right)}{\lambda_0 + \sum_{i \in G(j)} \lambda_i}, \qquad (3)$$
where $\lambda_0, \lambda_i \geq 0$, $\forall i \in N$, not all zero. In particular, the parameters turn out to be $\lambda_0 = \gamma$ and $\lambda_i = \hat\tau_i$, $\forall i \in N$ (note the difference with $\lambda_i = \sqrt{\hat\tau_i}$ in Equation (1); see Appendix B for details). Here, $\mu$ is the prior mean of all papers, and the term $(\tilde y_j^{(i)} - \hat b_i)$ is the de-biased score on paper $j$ from grader $i$. This is indeed the expected (social) reward maximizer (ERM), with the reward function $R(\cdot)$ being the negative quadratic function $R(x, y) = -(x - y)^2$. However, in Theorem 2 we will show that in the class of WA score-assignment functions, ISWDM uniquely satisfies certain desirable properties. Even though ISWDM is not exactly the ERM, it does not compromise the expected social reward ($W_j$) much (see Appendix C).

Regrading Requests
We consider peer-grading mechanisms that allow regrading requests. We assume that when a regrading request is raised, the instructor regrades the paper herself and assigns the true score to the paper. We also assume that students know the true scores on their own papers and only raise a regrading request when they expect it to raise their score.
Assumption 1 Student $j$ knows $y_j$ and raises a regrading request only if $r_j^* < y_j$.
In the next section, we lay down the peer-graders' incentive structure and the desirable properties of a mechanism.

Incentives and Design Desiderata
Individual Preferences. We assume that every individual i cares about (i) her total score (sum of her exam score r i and peer-grading performance score t i ), and (ii) the total scores of the other individuals. To model a competitive grader who cares about her relative performance in the class, we assume that her utility is increasing in (i), weakly decreasing in (ii). This assumption is consistent with the Strong et al. (2004) finding that peer-graders give biased grades in peer-grading schemes.
For agent $i$ in mechanism $M = \langle G, r, t \rangle$, the utility is given by
$$u_i = (r_i + t_i) - \sum_{j \in N \setminus \{i\}} w_{ij}\,(r_j + t_j), \qquad (5)$$
where $w_{ij} \geq 0$.
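As a sketch, the utility in Equation (5), as reconstructed above, can be computed as follows; all names and values are illustrative.

```python
def utility(i, r, t, w):
    """Competitive utility of grader i per Equation (5): own total score minus
    a weighted sum of peers' total scores. w[i][j] >= 0 are the
    other-regarding weights."""
    return (r[i] + t[i]) - sum(w[i][j] * (r[j] + t[j]) for j in r if j != i)

# A grader gains when her own scores rise and loses as her peers' scores rise.
r = {1: 60.0, 2: 70.0}; t = {1: 15.0, 2: 10.0}; w = {1: {2: 0.5}, 2: {1: 0.5}}
print(utility(1, r, t, w))   # 75 - 0.5 * 80 = 35.0
```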
In this section, we will assume that more reliable grading does not come at any extra cost for the peer-grader, and hence we exclude such a cost component from the utility expression. The objective here is to understand whether a peer-grading mechanism can reward more reliable grading monotonically, despite the presence of competitive preferences, and when increasing costs are not at play: we define the desirable properties accordingly.
One could have instead considered costs of grading to be increasing in reliability. We do this in Section 6, and the desiderata changes accordingly.
Note that a few uncertainties are resolved after grader $i$ chooses her decision variables $(b_i, \tau_i)$ and before $r^*$ and $t^*$ are computed by the mechanism: (i) grader $i$'s reported score $\tilde y_j^{(i)}$ on paper $j$ is realized from $F(y_j + b_i, 1/\tau_i)$; (ii) the parameters $(b_k, \tau_k)$ are chosen by each co-grader $k$ (i.e., the strategic uncertainty); (iii) the true score $y_j$ on paper $j$ realizes; and (iv) the score on paper $j$ reported by a co-grader $k$ is realized from $F(y_j + b_k, 1/\tau_k)$.

We define two desirable properties of peer-grading mechanisms. The properties consider grader $i$'s expected utility from the choice of strategies she makes. All expectations are taken only with respect to uncertainty (i), i.e., the distribution of $i$'s grade-evaluation process $(\tilde y_j^{(i)} \mid y_j) \sim F(y_j + b_i, 1/\tau_i)$. The properties hold for any ex-post realization of the other uncertainties (ii) to (iv), and no expectation is taken over them. This is why both properties are defined as ex-post.
Definition 2 (Ex-Post Bias Insensitivity (EPBI)) A peer-grading mechanism $M = \langle G, r, t \rangle$ is ex-post bias insensitive if the expected utility of participant $i$ is independent of the bias $b_i$. Bias independence holds irrespective of the biases and reliabilities of the other graders $j \neq i$, the true score $y_j$, and the reported scores of the other graders. Mathematically,
$$\mathbb{E}_{\tilde y^{(i)}}\!\left[u_i \mid b_i, \tau_i\right] = \mathbb{E}_{\tilde y^{(i)}}\!\left[u_i \mid b_i', \tau_i\right], \quad \forall\, b_i, b_i' \in \mathbb{R},\ \forall\, \tau_i.$$

Definition 3 (Ex-Post Reliability Monotonicity (EPRM)) A peer-grading mechanism $M = \langle G, r, t \rangle$ is ex-post reliability monotone if, for every grader, the expected utility is monotonically increasing in her reliability, irrespective of the biases and reliabilities chosen by the other graders, the realizations of the true scores, and the scores reported by the different graders. Mathematically,
$$\tau_i > \tau_i' \;\Longrightarrow\; \mathbb{E}_{\tilde y^{(i)}}\!\left[u_i \mid b_i, \tau_i\right] > \mathbb{E}_{\tilde y^{(i)}}\!\left[u_i \mid b_i, \tau_i'\right], \quad \forall\, b_i,\ \forall\, \tau_i, \tau_i'.$$

Both these properties are stronger than dominant-strategy versions of the above definitions. The utility depends on the realizations of random variables such as the true scores $y$ and the reported scores $\tilde y$. A dominant-strategy definition would only require the (in)equalities to be satisfied after taking an interim expectation over some relevant distribution of those variables. Our ex-post properties require them to be satisfied for every realization of these random variables.
We are now in a position to present the central mechanism of this paper.

The PEQA mechanism
Algorithm 1 PEQA
1: Inputs: (1) the parameters $\mu$ and $\gamma$ of the prior on $y_j$, $\forall j \in N$, which is distributed as $F(\mu, 1/\gamma)$; (2) the reported scores $\tilde y_N^{P}$ of the graders on the probe papers; and (3) the reported scores $\tilde y_N^{N \setminus P}$ on the non-probe papers.
2: Set the probe set $P$ with $|P| = \ell$, a pre-determined constant with $\ell \leq \frac{n}{K/2 + 1}$, where $K$ (even) is the number of papers assigned to each grader.
3: $G = G^*$: every grader $i \in N$ is assigned $K/2$ probe and $K/2$ non-probe papers, in such a way that every non-probe paper is assigned to at least $K/2$ and at most $K/2 + 1$ graders. This is always possible by assigning the $(n - \ell)$ non-probe papers to $(n - \ell)$ graders, with each paper assigned to exactly $K/2$ graders. The remaining $\ell$ graders can be assigned to the same $(n - \ell)$ papers arbitrarily such that these papers get at most one additional grader (since $\ell K/2 \leq n - \ell$). Note that this is the reason $\ell$ cannot be larger than $n/(K/2 + 1)$. Ensure that a grader does not get her own paper for evaluation.
4: Estimate $\hat b_i, \hat\tau_i$, $\forall i \in N$, as given in Section 2.3.
5: $r$: the score of paper $j$ is given by the ISWDM $r^*$ (Equation (1)).
6: At this stage, students may request regrading. The instructor learns the correct grade $y_j$ for the papers that came for regrading. For the other papers, $y_j = r_j^*$ is assumed.
7: $t$: the performance score to grader $i$ for grading paper $j \in NP_i$ is given by $t_i^j = \alpha\,(W_j^* - W_j^{(-i)*})$, where $\alpha > 0$ is a constant chosen at the designer's discretion. The total performance score to grader $i$ is therefore $t_i = \sum_{j \in NP_i} t_i^j$.

Algorithm 1 shows the detailed steps of PEQA. In short, the algorithm description specifies the
three functions of a peer-grading mechanism G, r, t as defined in Section 2.1. The papers are assigned to the graders in a specific way. The assigned score on a paper is a weighted average (with appropriately chosen weights). Finally, the grading performance score is the marginal contribution of the grader towards the social reward. In the next section, we present our results on PEQA.
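To illustrate Step 7, the sketch below computes the marginal-contribution performance scores $t_i^j = \alpha(W_j^* - W_j^{(-i)*})$ for one non-probe paper, with an inline ISWDM per Equation (1) as reconstructed above and the example reward $R(x, y) = -(x - y)^2$ from Definition 1. Grader ids and all values are hypothetical.

```python
import numpy as np

def iswdm(reports, b_hat, tau_hat, mu, gamma):
    w = np.sqrt(np.asarray(tau_hat, float))
    num = np.sqrt(gamma) * mu + np.sum(w * (np.asarray(reports, float)
                                            - np.asarray(b_hat, float)))
    return num / (np.sqrt(gamma) + np.sum(w))

def reward(x, y):
    return -(x - y) ** 2                     # example R from Definition 1

def performance_scores(reports, b_hat, tau_hat, y_j, mu, gamma, alpha):
    """t_i^j = alpha * (W_j^* - W_j^(-i)*) for one non-probe paper j (Step 7).
    Dicts are keyed by grader id; y_j is the paper's post-regrading score,
    which PEQA treats as the true score."""
    def W(excluded=None):
        ids = [i for i in reports if i != excluded]
        r_j = iswdm([reports[i] for i in ids], [b_hat[i] for i in ids],
                    [tau_hat[i] for i in ids], mu, gamma)
        return reward(r_j, y_j)
    W_star = W()                             # social reward with everyone
    return {i: alpha * (W_star - W(i)) for i in reports}

print(performance_scores(reports={1: 58.0, 2: 61.0, 3: 70.0},
                         b_hat={1: -2.0, 2: 0.0, 3: 1.0},
                         tau_hat={1: 4.0, 2: 4.0, 3: 0.25},
                         y_j=60.0, mu=60.0, gamma=0.01, alpha=1.0))
```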

Properties of PEQA
Our first result shows that PEQA satisfies both the properties mentioned in Section 3, as long as the subjects care more about their own scores than others' scores.

Theorem 1 PEQA is ex-post bias insensitive (EPBI). If, in addition, $\sum_{k \in N \setminus \{i\}} w_{ik} \leq 1$ for every grader $i \in N$, then PEQA is also ex-post reliability monotone (EPRM).
A direct consequence of this result is that a grader has no incentive to put a deliberate upward or downward bias in this competitive environment, and also finds it in her interest to maximize her reliability. All the $r_i$ terms in the utility expression (Equation (5)) are replaced by $\max\{r_i, y_i\}$ due to Assumption 1, because the instructor is assumed to give the correct score $y_i$ when a regrading request is received.
To make the proofs easily readable, we provide an intuition of the main ideas here. The complete details are available in Appendix A.
The EPBI result is driven by how the score-assignment function de-biases the grades through the estimated grader bias. Though the bias estimates from probes are noisy, in expectation they are correct and are identical across probes and non-probes. Thus grader $i$'s bias cannot lower others' assigned final scores. We show that bias also does not affect the post-regrading expected score $\max\{r_j^*(\cdot), y_j\}$. Thus biasing reports does not provide any competitive advantage. Her grading performance score depends only on the assigned final scores on the papers she graded, and hence it is unaffected by bias too. EPBI is independent of the condition on the $w_{ik}$'s.
Intuitively, two forces drive the EPRM result.
First, the link between $i$'s grading performance score and her marginal contribution to accurate grading plays a crucial role. A lower grading reliability of $i \in G(j)$ invariably lowers $i$'s marginal contribution to accurate score-assignment on paper $j$. This lowers $i$'s grading performance score and hence her total utility.

Second, the score-assignment function and our regrading assumption (Assumption 1) are crucial too. As mentioned previously, under our score-assignment function, grader $i$'s noisier grading leads to a noisier assigned grade on paper $j$. The noise moves the assigned grade above or below the true grade; the higher the noise, the larger the potential movement in either direction. Grader $i$ determines the magnitude of the noise, but not the direction in which the noise moves the assigned grade. By selectively asking for regrades, student $j$ keeps any undeserved high grades and reverses any low grades that result from the noise. Thus, $i$'s noisier grading ends up increasing $j$'s grade after regrading-requests. Since $i$ dislikes $j$ getting higher grades, this decreases $i$'s utility in expectation. Thus, $i$'s competitiveness also fuels her desire for accurate grading.

Deriving the EPRM condition requires a bound on the $w_{ik}$'s. This is because the choice of reliability of grader $i$ affects the final grades of whomever she grades, and the marginal contributions (thus, the grading performance scores) of her co-graders. We show that the condition on the $w_{ik}$'s is sufficient to ensure that the collective weight on others' grading performance scores never outweighs a competitive student's regard for her own performance score, irrespective of others' actions and noise. In the proof, we also show that the sufficient condition on the $w_{ik}$'s can be further weakened to a sum over only her co-graders. We keep the condition as stated in the theorem for simplicity.
The weight $\alpha$ that an instructor assigns in PEQA (see Step 7 of Algorithm 1) to the peer-grading performance score can vary across instructors. It is therefore desirable to have a score-assignment function that is robust to any choice of the weight $\alpha$ while retaining the two properties above.
Following our discussion around Equation (3), the next result shows why the ERM score-assignment function is not the right choice from the WA class in a world where reliability needs to be incentivized; rather, PEQA's score-assignment function, which weighs the de-biased scores by the inverse of the square root of the estimated variance, works better.

Theorem 2 A WA score-assignment function (Equation (3)) with any performance score function $t$ satisfies EPRM for every peer-grading performance score weight $\alpha > 0$, and for all realizations of $y_j$ and $\tilde y_j^{G(j)}$, if and only if $\lambda_i = c_i \sqrt{\hat\tau_i}$ for some constant $c_i > 0$, for every $i$.

Therefore, in the class of weighted average score-assignment functions, the ISWDM score-assignment function used by PEQA uniquely (up to constant multipliers) ensures EPRM for a flexible performance score weight $\alpha > 0$. This result shows why our score-assignment function is special, irrespective of the choice of grading performance scores.
At the risk of oversimplification, here is an intuition for how this result works. For the class of weighted average (WA) score-assignment functions, consider how the weights affect the post-regrading score:
$$\max\!\left\{r_j^{WA}, y_j\right\} = y_j + \max\!\left\{\frac{\lambda_0(\mu - y_j) + \sum_{i \in G(j)} \lambda_i\left(\tilde y_j^{(i)} - \hat b_i - y_j\right)}{\lambda_0 + \sum_{i \in G(j)} \lambda_i},\; 0\right\}.$$
The first term on the RHS is the true score $y_j$, which is independent of grader $i$'s actions. We focus on how grader $i$'s choices affect the numerator and the denominator of the second term, which is non-negative. The $(\tilde y_j^{(i)} - \hat b_i - y_j)$ terms are approximately a measure of the noise present in the signals that grader $i$ observed for paper $j$, which has variance $\sigma_i^2$. But $\lambda_i = c_i/\sigma_i$ uniquely makes the product $\lambda_i(\tilde y_j^{(i)} - \hat b_i - y_j)$ independent of grader $i$'s chosen $\sigma_i$, for all values of $\sigma_i$. The same holds for all her co-graders. Hence the numerator is independent of the variances of the graders, which is the first step of the proof.
The denominator is a sum of positive numbers. The choice $\lambda_i = c_i/\sigma_i$ guarantees that when $\sigma_i$ increases, the denominator shrinks and the whole fraction increases. Thus, noisier grading ends up increasing the post-regrading score $\max\{r_j^{WA}(\tilde y_j^{G(j)}, \hat\theta^{G(j)}),\, y_j\}$. In the proof, we formalize this intuition while accounting for the information available to grader $i$ when she contemplates how her actions affect post-regrading scores.
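The short derivation below records the cancellation that the proof formalizes, in the notation reconstructed above; the approximation step abstracts away the estimation noise in $\hat b_i$.

```latex
% Sketch: why \lambda_i = c_i/\sigma_i neutralizes grader i's influence on the
% numerator (n_{ij} denotes i's noise on paper j).
\begin{align*}
\tilde y_j^{(i)} - \hat b_i - y_j \;&\approx\; n_{ij} \;=\; \sigma_i\, m_{ij},
   \qquad m_{ij} \sim F(0,1),\\[2pt]
\lambda_i\bigl(\tilde y_j^{(i)} - \hat b_i - y_j\bigr)
   \;&\approx\; \frac{c_i}{\sigma_i}\,\sigma_i\, m_{ij} \;=\; c_i\, m_{ij},
\end{align*}
% which no longer depends on \sigma_i, while the denominator
% \lambda_0 + \sum_l \lambda_l still shrinks (so the fraction grows) as
% \sigma_i increases.
```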

Efficiency Under Costly Effort
In this section, we assume that increasing reliability is effort-intensive. Now that reliability is costly, what is a desirable level of reliability (effort) from a social point of view? To calculate the social utility, we sum the grading-accuracy of all exams and subtract the total effort-cost from it. We show that a modification of the grading performance score t * of PEQA implements the socially optimal level of costly effort.
We assume that the graders have a uniform weight on the other-regarding component of their utility ($w_{ij} = w$, $\forall i, j \in N$), and that this is common knowledge.
Costly effort. We assume that each exam has at least two questions, and that the grading difficulty depends on the question type. We also assume that all graders face the same effort-cost function $c_k$ while grading question $k$ of the exam. Hence, on every paper (answerscript), grading question $k$ has cost $c_k(\tau_{ik})$ for grader $i$, and we allow $c_k \neq c_{k'}$ for questions $k \neq k'$. We assume that graders observe $c_k$ while they grade question $k$, but the mechanism designer does not.
The grader can choose different reliabilities for different questions within a paper, but she grades any particular question at the same reliability across all papers: $\tau_{ik}$ remains the same for question $k$ on all the papers $i$ grades. Hence, grader $i$ chooses a reliability vector $\tau_i = (\tau_{ik},\, k \in Q)$, where $\tau_{ik}$ is the reliability specific to question $k$ of the paper and $Q$ is the set of all questions in a paper. The estimated reliability of grader $i$ for question $k$, $\hat\tau_{ik}$, is computed from her performance on the $k$th question of the probe papers. Reliability is bounded above, i.e., $\tau_{ik} \in [0, \bar\tau]$, $\forall i \in N, k \in Q$. We summarize our assumptions below.
1. The cost $c_k: [0, \bar\tau] \to \mathbb{R}_{\geq 0}$ is convex, increasing, and the same for all graders $i \in N$.
2. The course instructor does not know $c_k$; only the graders do.
Social utility of grading. For any question on a non-probe paper, we assume that the social planner (e.g., the instructor) cares about two things: (a) the accuracy of the final score (measured by the reward function $R(r_j^*, y_j)$), and (b) the total cost of grader effort. We presume that if the social planner knew the cost functions of grading, she would recommend a joint strategy profile $(\tau_i, \tau_{-i})$ that maximizes a linear combination of the reward and cost factors, which we call the social utility. Formally, the social utility of grading paper $j$ is
$$SU_j = \beta\, R\!\left(r_j^*, y_j\right) - \sum_{i \in G(j)} \sum_{k \in Q} c_k(\tau_{ik}), \qquad (9)$$
where $\beta > 0$ determines the relative weight between the two factors. The final social utility is the sum of this over all non-probe papers. The socially optimal effort for any question depends on its cost. When the cost is private information, it is impossible to dictate socially optimal effort centrally.
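A sketch of Equation (9) as reconstructed above, with hypothetical quadratic cost functions:

```python
def social_utility(r_star, y_j, taus, cost_fns, beta):
    """SU_j = beta * R(r*_j, y_j) - total grader effort cost on paper j.
    taus[i][k] is grader i's reliability on question k; cost_fns[k] is the
    (convex, increasing) cost of grading question k. All values are made up."""
    R = -(r_star - y_j) ** 2                  # example reward function
    cost = sum(cost_fns[k](tau_i[k]) for tau_i in taus for k in range(len(tau_i)))
    return beta * R - cost

# Two questions, three graders, each grading at reliability 2.0 per question.
cost_fns = [lambda t: 0.1 * t ** 2, lambda t: 0.2 * t ** 2]
print(social_utility(58.5, 60.0, [[2.0, 2.0]] * 3, cost_fns, beta=1.0))
```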
Aligning social and individual incentives. There are three challenges on the way to aligning social and individual incentives. We discuss these below along with their solutions.
First, an instructor cares about accuracy in grade-allocation, but why would peer-graders care about the same? PEQA's grading performance score solves this: it forces subjects to internalize accuracy in their decisions by paying each grader her marginal contribution to grading accuracy.

Second, each competitive grader wants lower scores for others as part of her other-regarding utility. This is clearly not aligned with the social utility of grading and becomes a source of externality. We solve this by suggesting a modified grading performance score below, which additionally compensates graders for any potential losses in their other-regarding utility component.

Third, the solution to the point above presents a new challenge. The other-regarding utility component differs across graders $i$, as their reference groups $N \setminus \{i\}$ naturally differ. Thus they will be compensated different amounts. Would this change the ordinal ranking of students in the class relative to PEQA? We show that the answer is no (Lemma 1).
Modified grading performance score. Let the post-regrading-request score be $g_i = \max\{r_i, y_i\}$.
We propose the modified grading performance score
$$\pi_i = t_i + w \sum_{j \in N \setminus \{i\}} \left(g_j + \pi_j\right),$$
where $t_i$ is the original PEQA grading performance score. The additional terms on the RHS compensate for the other-regarding component in grader $i$'s utility. Though this simplifies the net utility of grader $i$, the simplicity comes at a price: if $i$ and $j$ are co-graders, then $\pi_i$ has been described as a function of $\pi_j$ and vice versa! How is the designer supposed to decide the values of $\pi_i$ and $\pi_j$ given this interdependency? We show that $\pi_i$ has an alternative expression that is independent of the $\pi_j$'s:
$$\pi_i = \frac{t_i + w \sum_{j \in N \setminus \{i\}} g_j + w \bar\pi}{1 + w}, \qquad \text{where } \bar\pi = \frac{\bar t + w(n-1)\,\bar g}{1 - w(n-1)}, \quad \bar t = \sum_{i \in N} t_i, \quad \bar g = \sum_{i \in N} g_i.$$

The game of peer-grading. The modified PEQA mechanism induces a game among the peer-graders after all the answerscripts of the exam have been submitted. The players (the graders) choose their reliabilities as their strategies to maximize their utilities. Grader $i$'s utility is given by
$$u_i = g_i + \pi_i - w \sum_{j \in N \setminus \{i\}} (g_j + \pi_j) - \sum_{j \in G^{-1}(i)} \sum_{k \in Q} c_k(\tau_{ik}) \;=\; g_i + t_i - \sum_{j \in G^{-1}(i)} \sum_{k \in Q} c_k(\tau_{ik}),$$
which is common knowledge. Players simultaneously choose their reliability vectors $\tau_i$. The score-assignment and performance score functions, which map players' strategies to players' utilities, are also common knowledge. The following result shows that the modified performance score retains the same order of the scores as the original PEQA performance score.
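Before stating the order-invariance result, here is a quick numerical check of the closed form above: it computes $\pi$ directly and verifies the defining fixed-point equation. All values are made up, and $w(n-1) < 1$ is assumed so that $\bar\pi$ is well defined.

```python
import numpy as np

def modified_scores(t, g, w):
    """Closed-form pi_i for the fixed point pi_i = t_i + w * sum_{j != i}(g_j + pi_j),
    using the alternative expression above; requires w(n-1) < 1."""
    t, g = np.asarray(t, float), np.asarray(g, float)
    n = len(t)
    pi_bar = (t.sum() + w * (n - 1) * g.sum()) / (1 - w * (n - 1))
    return (t + w * (g.sum() - g) + w * pi_bar) / (1 + w)

t = [3.0, 5.0, 4.0]; g = [70.0, 65.0, 80.0]; w = 0.1
pi = modified_scores(t, g, w)
for i in range(3):   # verify the defining fixed-point equation for each grader
    rhs = t[i] + w * sum(g[j] + pi[j] for j in range(3) if j != i)
    assert abs(pi[i] - rhs) < 1e-9
print(pi)            # note: g_i + pi_i ranks students exactly as g_i + t_i (Lemma 1)
```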
Lemma 1 (Order Invariance) Fix a profile of player strategies and true scores in the peer-grading game. The modified PEQA performance score π retains the same order among the students as the original PEQA performance score t.
Proof: Using the alternative expression for $\pi_i$,
$$g_i + \pi_i > g_k + \pi_k \;\Longleftrightarrow\; g_i + \frac{t_i + w \sum_{j \in N \setminus \{i\}} g_j + w\bar\pi}{1+w} \;>\; g_k + \frac{t_k + w \sum_{j \in N \setminus \{k\}} g_j + w\bar\pi}{1+w} \;\Longleftrightarrow\; g_i + t_i > g_k + t_k,$$
where the last equivalence holds because each side equals $\big((g_\cdot + t_\cdot) + w\bar g + w\bar\pi\big)/(1+w)$ for the respective student. Hence the modified score ranks students exactly as $g_i + t_i$, the total score under the original PEQA.

The following result shows that the modified grading performance score $\pi := (\pi_i,\, i \in N)$ implements the socially optimal level of effort for every paper $j$ in a pure Nash equilibrium. The proportion of non-probe papers (which is fixed once the mechanism is announced) is denoted by $p_{NP}$.
Theorem 3 For every paper $j \in N$, if the course designer uses the modified grading score $\pi_i$ and sets $\alpha = \beta / p_{NP}$, then every maximizer of the expected social utility is a Pure Strategy Nash Equilibrium (PSNE) of the induced game among the graders of the paper.
Proof: The modified PEQA already compensates for the other-regarding component and makes it inconsequential. The residual performance score $t_i$ is the sum of the performance scores $t_i^j$ from each paper $j \in G^{-1}(i)$, where $t_i^j = \alpha\,(W_j^* - W_j^{(-i)*})$. Hence the part of $i$'s utility expression that depends on $\tau_i$ and relates to grading paper $j$ is (using the shorthand $c(\tau_i) \equiv \sum_{k \in Q} c_k(\tau_{ik})$)
$$\alpha\left(W_j^* - W_j^{(-i)*}\right) - c(\tau_i),$$
where $\tilde y_j^{G(j)}$ is the profile of all scores given by $G(j)$. Thus it depends on the biases and reliabilities of the co-graders, which are chosen strategically and simultaneously. Grader $i$ is uncertain whether paper $j$ is a probe or a non-probe paper, and only the latter provides a performance score. Since the proportion $p_{NP}$ of non-probes is announced by the mechanism, $i$ assigns probability $p_{NP}$ to any paper being a non-probe. For any choice of biases and reliabilities by all the graders of paper $j$, the expected reward to the mechanism designer is denoted by $\mathbb{E}\,R\big(r_j^*(\tilde y_j^{G(j)}, \hat\theta^{G(j)}), y_j\big)$. From the analysis of Theorem 1, we know that this function is independent of bias under PEQA. Therefore, for simplicity, we assume that every grader strategizes only on her reliability. To emphasize the strategic and simultaneous choice of reliability, we rewrite $R(\cdot)$ using the shorthand $\bar R(\tau_i, \tau_{-i})$. Hence, for any reliability profile chosen by the set of graders of paper $j$, the part of the expected utility of grader $i$ that depends on $\tau_i$ is
$$p_{NP}\,\alpha\, \bar R(\tau_i, \tau_{-i}) - c(\tau_i) = \beta\, \bar R(\tau_i, \tau_{-i}) - c(\tau_i).$$
Let the social optimum be attained at $\tau^* = (\tau_k^*,\, k \in G(j))$. Then it must be that
$$\beta\, \bar R(\tau_i^*, \tau_{-i}^*) - c(\tau_i^*) \;\geq\; \beta\, \bar R(\tau_i, \tau_{-i}^*) - c(\tau_i), \qquad \forall\, \tau_i,$$
because if any grader $i$ could increase her expected utility by choosing some other $\tau_i$, it is easy to show that this would contradict the optimality of the social utility at $\tau^*$. Thus, if all graders except $i$ choose $\tau_{-i}^*$, player $i$ can do no better than choosing $\tau_i^*$. Hence, $\tau^* = (\tau_k^*,\, k \in G(j))$ is a PSNE of this game.

As mentioned in the proof above, the property of EPBI is retained even in this setting with costly effort, since the change in the utility due to cost is independent of bias. In the next section, we present our experimental study, which tests some of the hypotheses made in the earlier results and verifies PEQA's practical usability.

Experimental study
One major objective of the experimental study was to understand the practical trade-offs between the theoretically desirable PEQA and a simple and widely used peer-grading mechanism. As the comparison candidate, we chose the median mechanism, where the score of a paper is the median of the scores given by the peer-graders evaluating that paper. This mechanism is used in practice for peer-grading in MOOCs; e.g., Coursera (https://www.coursera.org) uses it across multiple courses (Coursera, 2021).
Another objective of this study was to test two of our modeling assumptions: whether bias and reliability were indeed identical on probes and non-probes, and whether competitive ($w_{ij} \geq 0$) preferences are a good model of peer-grader behavior.
Finally, the theoretical desirability of PEQA is established under restrictive assumptions about the domain of true and given scores, player utilities, and strategies. These assumptions approximate reality rather than describe it. How well does PEQA perform in a real-life exercise, where the scores and signals come from a bounded interval, or when players' utilities are competitive but not necessarily linear?

Experimental Design
We ran two experimental sessions: one with the median scoring mechanism (27 students), another with the PEQA mechanism (42 students). We recruited subjects through two open-calls to undergraduate students enrolled in a computing course (Prog101). The open calls did not contain any particulars of the two sessions. Every student who signed up for participation was assigned to one of the two sessions.
The experimental environment is not an exact replication of the model assumptions, but rather a replication of how a real-life peer-grading scenario would look. In many classes, instructors grade on a curve: final numerical scores are converted to letter-grades (A to D) based on relative rankings. Grading on a curve creates a competitive classroom environment that we wanted to replicate. We told participants that their total score is the sum of their peer-evaluated score and grading performance score. We paid subjects by the relative ranking of their total scores in the class, in both sessions. The students who ranked in the first quartile of the total scores received M 650; the next three quartiles received M 450, M 250, and M 50 respectively. They also received a show-up fee of M 50, irrespective of their total score. The monetary payments were placeholders for grades A to D in a class that grades on a curve: high relative performance resulted in high rewards. In the median mechanism session, the grading performance score of all subjects was set to zero. The total-score ranking was identical to the peer-evaluated-score ranking. Thus, a subject could decrease others' scores on the peer-evaluation task to increase her relative ranking and payment.
The PEQA session used the PEQA assignment and grading performance scores. Thus, manipulation on the peer-evaluation task risked getting a lower performance and total score, which would result in a lower payment.
The instructions and incentive-scheme, included in Appendix D, were explained in detail before each of the sessions began. In both sessions, we used numerical examples in our explanation. For PEQA, we showed the relation between performance score and grading reliability through a graph and verbally summarized the monotonic relationship.
We conducted both sessions during the weekly Prog101 labs, which take place in a large computer lab. Our study lies at the intersection of lab and field experiments. We are interested in peer-grading behavior, and the students are our population of interest. In this study, we observed our population of interest in its naturally occurring environment, as in field experiments.
In both sessions, we asked subjects to peer-grade the same weekly class-quiz. We partitioned each quiz into three sub-quizzes (by treating one or two questions of the quiz as a sub-quiz), and divided each session into three rounds. In every round, the subjects were asked to peer-grade five sub-quizzes (each corresponding to one of five anonymous peers). At the end of each round, each subject saw: (a) how peers had evaluated her performance on the sub-quiz, (b) her assigned score (median-scoring or PEQA), and (c) how her co-graders that round had evaluated the sub-quiz.
Within every sub-quiz, some (and not all) of the questions were 'regradable'. The students could raise a regrading request for only those questions at the end of the session. In the PEQA sessions, only the regradable questions were incentivized by the grading performance score. The non-regradable questions used the same assignment function but did not have any grading performance score.
We also graded all the papers ourselves (the instructor graded all of them), and we considered these scores to be the true scores. The difference between mechanism assigned scores and true scores is a measure of the quality of these mechanisms.

Hypotheses and results
Bias is the difference between the true score and the peer-assigned score. It measures the average direction and magnitude of manipulation. PEQA assumes that bias is zero or positive: subjects generally do not manipulate scores upwards (i.e., do not collude). Our first hypothesis builds on this assumption.
Hypothesis 1 Score-manipulation is not collusive.
Our second hypothesis suggests that bias should be higher in the last round for two reasons. First, most repeated interactions have an end-game effect: selfish behavior unravels when no future interactions remain. Second, subjects who have experienced score-manipulation by others might retaliate as a punishment or reciprocal strategy in the later rounds of the treatment.
Hypothesis 2 Score-manipulation or bias peaks in the last round of the game.
In Tables 1 and 2, we summarize the bias in individual grading behavior in the three rounds of both treatments. To compare across questions and rounds, we normalize bias by the total score of the corresponding question.

[Table 1: Average bias from 3 rounds of grading under the Median mechanism, with 95% confidence intervals reported below the averages.]
In each round, every subject graded a regradable and a non-regradable question. The average bias is statistically identical to zero for the first two rounds, and significantly positive in the third round. This is true for both the regradable and non-regradable questions. Thus, the bias is either zero or positive, and we cannot reject Hypothesis 1. The average bias is also significantly higher in the third round, confirming Hypothesis 2. This holds for both regradable and non-regradable questions.

[Table 2: Average bias from 3 rounds of grading under PEQA.]

The low bias in grading in the first two rounds parallels the results on honest reporting from the "die-roll in person and report" studies (Mazar et al., 2008; Fischbacher and Föllmi-Heusi, 2013). In these studies, subjects roll a die privately, self-report the outcome, and get paid based on the report. Fischbacher and Föllmi-Heusi (2013) report that only 20% of people lie to the fullest extent, 39% choose to be honest, and a sizable proportion cheats only marginally. Lying aversion (Dufwenberg and Dufwenberg, 2018), caring about lie-credibility, and a notion of self-concept maintenance (Mazar et al., 2008) are potential reasons why people do not lie completely even under full anonymity.
How do the two mechanisms perform? The median assignment rule, due to its robustness to outliers, is immune to insincere grading as long as only a minority of graders are insincere. PEQA is bias invariant (EPBI), incentivizes effort, and should outperform the Median mechanism. We use the accuracy of the mechanism-assigned scores as a metric of relative performance. Given subjects graded most insincerely in the third round, we use this round to test Hypothesis 3.

Hypothesis 3 In the presence of strategic manipulation, the final scores assigned under PEQA are closer to the true scores than those assigned under median-scoring.
In Table 3, we present the means of the fractional difference and the squared fractional difference between the mechanism-assigned score and the true score. For the former, we calculate $d_j = (\text{true score}_j - \text{mechanism-assigned score}_j) / \text{total score}_j$ on student $j$'s third-round sub-quiz, and then take the average over all $j$. The latter is the average of $d_j^2$. We find the true score on an exam by grading it ourselves.

[Table 3: Accuracy of mechanism-assigned scores under the Median mechanism and PEQA.]

The average difference between true and mechanism-assigned scores is 14.8% under Median and only -1.2% under PEQA. The negative sign indicates that PEQA assigned slightly higher scores than the true score. Both the difference and the squared difference are significantly smaller under PEQA. Under the median mechanism, the difference and squared difference were equal because $d_j$ almost always took values of 0 or 1.
The median mechanism assigned a lower-than-true grade (a 1 instead of a 2) on 15% (4 out of 27) of the sub-quizzes. In comparison, the PEQA mechanism was almost always point-precise: only one sub-quiz (out of 41) was assigned a grade 0.5 points higher than the true score. The numbers of regrading requests in the median and PEQA sessions were 4/27 and 3/41 respectively, a difference that is statistically significant.
One of the crucial assumptions of PEQA was that bias and noise are invariant across probes and non-probes.

Hypothesis 4 Bias and noise are identical in probes and non-probes.
We tested whether bias differed across probe and non-probe questions in the PEQA session, pooling across the three rounds to maximize power. Our statistical tests fail to reject Hypothesis 4. In a t-test, we could not reject the equality of average bias across regradable probe and non-probe questions (p-value 0.19). The p-value was 0.16 when we ran the same test for the non-regradable questions.
We also tested whether the mean squared deviation (noise) differed across probe and non-probe questions in the PEQA session. We could not reject the equality of noise across probe and non-probe questions; the corresponding p-values were 0.86 and 0.31 for the regradable and non-regradable questions, respectively.
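An equality-of-means test of this sort can be run as below; the arrays are made-up stand-ins for the per-grader bias observations, not our experimental data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-grader bias observations on probe and non-probe questions.
bias_probe = np.array([0.02, -0.01, 0.05, 0.00, 0.03])
bias_nonprobe = np.array([0.01, 0.04, -0.02, 0.02, 0.00])

# Two-sided two-sample t-test; Welch's variant avoids assuming equal variances.
t_stat, p_value = stats.ttest_ind(bias_probe, bias_nonprobe, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # fail to reject if p is large
```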

Conclusion
We introduce a new mechanism, PEQA, that uses a score-assignment rule and grading performance scores to incentivize competitive graders. The rule and the performance score guarantee unbiased grades. They also guarantee that any grader's utility increases monotonically with her grading reliability, irrespective of her competitiveness and how her co-graders act. Our assignment rule is unique in its class to satisfy this utility-reliability monotonicity while allowing flexibility in how large performance scores need to be. When grading is costly, a special version of PEQA implements the socially optimal effort-choices in an equilibrium of the peer-evaluation game among co-graders. Finally, in our classroom experiments, PEQA outperforms the popular median mechanism.

Appendices

A Omitted Proofs
A.1 Proof of Theorem 1

By Assumption 1, the student knows her $y_j$ perfectly, and if $r_j^* \geq y_j$ she does not raise a regrading request. PEQA assumes $r_j^*$ to be the true score and computes the peer-grading performance score accordingly when there is no regrading request. The student asks for regrading only if $r_j^* < y_j$. The utility of grader $i$ after the regrading requests have been addressed is therefore (omitting the arguments of the functions in Equation (5) where they are understood)
$$u_i = \max\{r_i^*, y_i\} - \left(\sum_{j \in G^{-1}(i)} w_{ij}\, \max\{r_j^*(\cdot), y_j\} + \sum_{k \in CG_i} w_{ik}\, t_k\right) + t_i + \varphi(\cdot).$$
We decomposed the utility expression to gather together the terms that are affected by the choices $b_i, \tau_i$ of student $i$: (a) the exam scores of the papers graded by $i$ (first term in the parentheses), and (b) the peer-grading performance scores of the co-graders of $i$ (second term in the parentheses). The function $\varphi$ is the remaining part of $u_i$ that is independent of $b_i, \tau_i$. We prove that PEQA is EPBI and EPRM in four steps. First, we observe that the first term on the RHS is independent of the values of $b_i$ and $\tau_i$. In the second step, we show that each summand $\max\{r_j^*(\cdot), y_j\}$ in the first summation term is independent of $b_i$ and decreasing in $\tau_i$. The third step shows that $t_i$ is independent of $b_i$ and increasing in $\tau_i$, and the fourth step shows that this conclusion holds even for $t_i - \sum_{k \in CG_i} w_{ik}\, t_k$ under the sufficient condition of the theorem.
Step 1: $\max\{r_i^*(\cdot), y_i\}$ is independent of the values of $b_i$ and $\tau_i$. This is obvious, since student $i$ does not grade her own paper and hence has no control over the grade given by PEQA to her paper.
Step 2: The expected value of $\max\{r_j^*(\cdot), y_j\}$ is independent of $b_i$ and increasing in $\sigma_i$. Recall that the score-assignment function for PEQA is ISWDM (Definition 1):
$$r_j^*(\cdot) = \frac{\sqrt{\gamma}\,\mu + \sum_{l \in G(j)} \sqrt{\hat\tau_l}\left(\tilde y_j^{(l)} - \hat b_l\right)}{\sqrt{\gamma} + \sum_{l \in G(j)} \sqrt{\hat\tau_l}}.$$
The final grade after regrading is $\max\{r_j^*(\cdot), y_j\}$. Grader $i$'s estimated bias is given by
$$\hat b_i = \frac{1}{|P_i|} \sum_{k \in P_i} \left(\tilde y_k^{(i)} - y_k\right).$$
In PEQA, we use the same number $K/2$ as $|P_i|$ for all $i$; hence $x = K/2$ is a constant in our analysis.
Given our model of peer-reports, $\tilde y_k^{(i)} = y_k + b_i + n_{ik}$, where $n_{ik} \sim F(0, \sigma_i^2)$. Substituting these values, we get the expression for $r_j^*(\cdot) - y_j$. Note that $z_j = \sqrt{\gamma}(\mu - y_j)$ is an $F(0, 1)$ variable that is independent of all the other variables in the expression. In the following, we take the expectation of the term $\max\{r_j^*(\cdot) - y_j, 0\}$ w.r.t. $z_j$ and show that it is independent of $b_i$ and increasing in $\sigma_i = 1/\sqrt{\tau_i}$, which implies that, irrespective of the values of the other graders' biases and reliabilities, it is best for grader $i$ to reduce her $\sigma_i$ to increase this component of her utility (since the term enters the utility expression with a negative sign).
In the third equality, we substituted $v_j = z_j + \sum_{l \in G(j)} \sqrt{\hat\tau_l}\left(n_{lj} - \sum_{k \in P_l} \frac{n_{lk}}{x}\right)$, and in the fifth equality, we substituted $n_{ik} = m_{ik} \cdot \sigma_i$. Since $n_{ik} \sim F(0, \sigma_i^2)$, we get $m_{ik} \sim F(0, 1)$. Note that $f$ is the density of an $F(0, 1)$ random variable. Hence the whole expression within the integral is independent of $\sigma_i$. It is easy to see that the pre-multiplied term is increasing in $\sigma_i$. Hence, we conclude that the integral $I_j$ is independent of $b_i$ and increasing in $\sigma_i = 1/\sqrt{\tau_i}$.
Step 3: The expected value of $t_i^j$ is independent of $b_i$ and decreasing in $\sigma_i$. We assumed in Section 2 that the reward function is decreasing in the difference $|r_j^* - y_j|$, and the mechanism assigns the reward to be zero when $r_j^* \geq y_j$ (since $y_j = r_j^*$ is then assumed). Hence, we calculate the condition on $y_j$ under which the reward is non-zero.
Note that the RHS is independent of $\sigma_i$. Hence the limits of the integral over which the reward $R$ is non-zero are also independent of $\sigma_i$.
By definition, the $W_j^{(-i)*}$ component of the performance score is independent of the bias and reliability of grader $i$. Hence, we only consider the first component, which depends on the bias and reliability of grader $i$. We consider the integral w.r.t. $y_j$ to compute $t_i^j$, and we just showed that the limits of this integral are independent of $\sigma_i$. Hence, if we show that the reward function $R(r_j^*, y_j)$ is independent of $b_i$ and decreasing in $\sigma_i$, then we are done. Consider the argument of the reward function, $r_j^* - y_j$, expanded as in Step 2; in the last equality, we substituted $X_{-i} = \sqrt{\gamma} + \sum_{l \in G(j) \setminus \{i\}} \sqrt{\hat\tau_l}$ and, as before, $n_{ik} = m_{ik} \cdot \sigma_i$. Since $n_{ik} \sim F(0, \sigma_i^2)$, we get $m_{ik} \sim F(0, 1)$. We see that the absolute value of the resulting expression is independent of $b_i$ and increasing in $\sigma_i$. Hence $R(r_j^*, y_j)$ is independent of $b_i$ and decreasing in $\sigma_i$.
Step 4: $t_i^j - \sum_{k \in CG_i^j \setminus \{i\}} w_{ik}\, t_k^j$ is independent of $b_i$ and decreasing in $\sigma_i$ when $\sum_{k \in N \setminus \{i\}} w_{ik} \leq 1$. First, we show that $t_i^j - t_k^j$ is independent of $b_i$ and decreasing in $\sigma_i$. This is because $W_j^*$ cancels and the difference reduces to $\alpha\big(W_j^{(-k)*} - W_j^{(-i)*}\big)$. The second term is independent of $b_i$ and $\sigma_i$. The first term is independent of $b_i$ and decreasing in $\sigma_i$ by the same argument as Step 3, with the set of graders reduced to $N \setminus \{k\}$.
Observe that, in the utility of grader $i$, the difference of these two performance score terms appears as follows:
$$t_i^j - \sum_{k \in CG_i^j \setminus \{i\}} w_{ik}\, t_k^j = \left(1 - \sum_{k \in CG_i^j \setminus \{i\}} w_{ik}\right) t_i^j + \sum_{k \in CG_i^j \setminus \{i\}} w_{ik} \left(t_i^j - t_k^j\right).$$
Consider the terms in the parentheses on the RHS.
Both terms on the RHS are independent of $b_i$ and decreasing in $\sigma_i$, as we have already shown, and the coefficient of the first term is non-negative since $\sum_{k \in CG_i^j \setminus \{i\}} w_{ik} \leq \sum_{k \in N \setminus \{i\}} w_{ik} \leq 1$. (Note that the first inequality shows that the sufficient condition can be weakened to a bound over co-graders only, i.e., $\max_{i \in N} \max_{A \subset N \setminus \{i\},\, |A| \leq K/2} \sum_{k \in A} w_{ik} \leq 1$, since each paper has at most $K/2 + 1$ graders; this proves the version of the theorem with the weaker sufficient condition.) Combining all steps, we have shown that the expected utility of grader $i$, where the expectation is taken only w.r.t. her own grade-evaluation process, is independent of $b_i$ and decreasing in $\sigma_i$. Hence these two properties hold for any choice of actions by the other graders, and we have proved that PEQA is EPBI and EPRM.

A.2 Proof of Theorem 2
Consider the utility expression for agent $i$ with peer-grading performance score weight $\alpha$. As before, the first term on the RHS of the above expression is independent of $\sigma_i$. We will show that, for all realizations of the random variables $m_{ik}$, $n_k$, $v_j$, $\tilde{\tau}_\ell$, where $\ell \in G(j) \setminus \{i\}$, $j \in G^{-1}(i)$, $k \in P_i$, and for all $\alpha$, the utility is monotone decreasing in $\sigma_i$ if and only if $\lambda_i(\sigma_i) = c_i/\sigma_i$. For brevity of notation, we write $\lambda_i$ for this function when the argument is clear from the context. We will show that the second term is monotonically decreasing for all realizations of the random variables if and only if $\lambda_i(\sigma_i) = c_i/\sigma_i$. This completes the proof, since for any other choice of $\lambda_i$ the second term will be non-decreasing (its derivative non-negative) for some realization of the random variables. The derivative can be zero only when the weights are independent of the $\sigma_i$'s; but in that case the $t_i$'s are also independent of $\sigma_i$, and hence the third term also has zero derivative. When the realizations make the derivative positive, one can choose $\alpha$ small enough that the decay in the third term is always smaller than the increase in the second term, making the overall utility increasing in $\sigma_i$.
Define the following terms to shorten the forthcoming expressions.
Consider the second term in Equation (20). Using Equation (17), we reduce the expression for paper $j$ in the sum to the difference term and take its expectation with respect to $z_j = \lambda_0(\mu - y_j)$, which is a $\frac{\lambda_0}{\sqrt{\gamma}}\, F(0,1)$ random variable, to obtain an expression similar to Equation (18). We ignore the positive constant $\frac{\lambda_0}{\sqrt{\gamma}}$, as it plays no role in determining the sign of the variation.
To find the change with respect to $\sigma_i$, we take the partial derivative $\frac{\partial I_j}{\partial \sigma_i}$. Note that $K_1$ is positive, while $K_{2j}$ and $K_{3j}$ can take any sign. For the expression above to be positive for all values of the realized random variables, the following condition is necessary and sufficient. The expression inside the exponent is quadratic; we consider the exponent as follows.
Therefore, the resultant distribution is Gaussian. Now we are in a position to calculate $r_j^{ERM}(\tilde{y}_j^{G(j)}; \theta^{G(j)})$. The reward function is $R(x_j, y_j) = -(x_j - y_j)^2$, where $x_j$ is the estimated score and $y_j$ is the true score for paper $j$. The expected risk minimizer (ERM) score-assignment rule is given below.
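For concreteness, the Gaussian posterior referred to here takes the standard conjugate form (a textbook computation, assuming, as in our model, that reports follow $\tilde{y}_{lj} = y_j + b_l + n_{lj}$ with $n_{lj} \sim \mathcal{N}(0, 1/\tau_l)$ and prior $y_j \sim \mathcal{N}(\mu, 1/\gamma)$):
\[
\psi\big(y_j \,\big|\, \tilde{y}_j^{G(j)}; b^{G(j)}, \tau^{G(j)}\big)
= \mathcal{N}\!\left( \frac{\gamma\mu + \sum_{l \in G(j)} \tau_l\,(\tilde{y}_{lj} - b_l)}{\gamma + \sum_{l \in G(j)} \tau_l},\; \frac{1}{\gamma + \sum_{l \in G(j)} \tau_l} \right).
\]
The ERM rule computed next is the mean of this posterior.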
Here $\tilde{b}_i, \tilde{\tau}_i$ denote the estimated bias and reliability for every $i \in G(j)$. Let $g_j(x_j) = \int \psi\big(y_j \mid \tilde{y}_j^{G(j)}; \tilde{b}^{G(j)}, \tilde{\tau}^{G(j)}\big)\,(x_j - y_j)^2\, dy_j$. Hence we need to find the $x_j$ that minimizes $g_j(x_j)$. The first and second order conditions are given as follows.
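Written out, the two conditions amount to the following routine verification under the definitions above:
\[
\frac{\partial g_j}{\partial x_j} = 2\int \psi\big(y_j \mid \tilde{y}_j^{G(j)}; \tilde{b}^{G(j)}, \tilde{\tau}^{G(j)}\big)\,(x_j - y_j)\, dy_j = 0
\;\Longrightarrow\;
x_j = \int y_j\, \psi\big(y_j \mid \cdot\big)\, dy_j,
\qquad
\frac{\partial^2 g_j}{\partial x_j^2} = 2 > 0,
\]
so the minimizer is the posterior mean of $y_j$.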
The first and second order conditions show that the minimizer of $g_j$ is the posterior mean. The last equality follows from Equation (24). Replacing $\theta$ with the estimated parameters, i.e., $\tilde{\theta}$, we get the estimated score-assignment rule.

C Comparison of $W_j^{ERM}$ and $W_j^{ISWDM}$

Figure 3 shows the sub-optimality, i.e., $W_j^{ERM} - W_j^{ISWDM}$ as a fraction of $W_j^{ERM}$, with respect to increasing symmetric bias and reliability, where 'symmetric' means that every peer-grader has the same bias and reliability. We ran the simulation for the peer-grading model proposed by Piech et al. (2013) with parameters $\mu = 1$, $\gamma = 16$, and the reward function $R(r, y) = -(r - y)^2$. The simulation is repeated 100 times for each bias-reliability pair to obtain the statistical measures. It shows that the sub-optimality is small (roughly 20%), insensitive to bias, and monotonically decreasing in reliability.
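The following Monte Carlo sketch illustrates the comparison; it is not the simulation code used for Figure 3. It reads ISWDM as the inverse-square-root-of-variance weighted de-biased mean, matching the weighting described in Appendix D.2, with the prior weighted by $\sqrt{\gamma}$; the grids of bias and reliability values are illustrative.

```python
# Monte Carlo sketch of the W_ERM vs. W_ISWDM comparison (illustrative only).
# Model (PG1, Piech et al. 2013): true score y ~ N(mu, 1/gamma); grader l
# reports y + bias + noise with noise ~ N(0, sigma^2); reward R = -(r - y)^2.
import numpy as np

rng = np.random.default_rng(0)
MU, GAMMA = 1.0, 16.0

def rel_suboptimality(bias, sigma, n_graders=5, n_trials=100_000):
    y = rng.normal(MU, 1.0 / np.sqrt(GAMMA), n_trials)            # true scores
    reports = y[:, None] + bias + rng.normal(0.0, sigma, (n_trials, n_graders))
    tau = 1.0 / sigma**2                                          # reliability
    debiased_sum = (reports - bias).sum(axis=1)
    # ERM = posterior mean: precision-weighted de-biased reports plus prior.
    r_erm = (GAMMA * MU + tau * debiased_sum) / (GAMMA + n_graders * tau)
    # ISWDM: weights proportional to sqrt(tau) = 1/sigma; prior weight sqrt(gamma).
    w = np.sqrt(tau)
    r_iswdm = (np.sqrt(GAMMA) * MU + w * debiased_sum) / (np.sqrt(GAMMA) + n_graders * w)
    w_erm = -np.mean((r_erm - y) ** 2)
    w_iswdm = -np.mean((r_iswdm - y) ** 2)
    return (w_erm - w_iswdm) / abs(w_erm)     # relative sub-optimality

for sigma in (0.50, 0.25, 0.10):
    for bias in (0.0, 0.5, 1.0):
        print(f"sigma={sigma:4.2f}  bias={bias:3.1f}  "
              f"relative sub-optimality={rel_suboptimality(bias, sigma):6.3f}")
```

Because both rules de-bias the reports, the bias cancels exactly, which is why the sub-optimality is insensitive to bias; as reliability grows, both rules concentrate on the reports and the gap shrinks.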

D Instructions provided to the human subjects (Section 7)
The instructions for both mechanisms were as follows; we first present the instructions for the median-mechanism and the payment system used in this study. This is a study on peer-grading. You should read the following instructions carefully, as they would help you perform successfully in the study. In this study, each of you will be asked to grade the assignments of five anonymous students in this room. Similarly, your own assignment would be graded by five anonymous students from this room. Your peer-graded marks and your relative rank in this peer-grading exercise determine only your payment from this session. They will not be used to determine your actual score for your final grade in the course. The assignment score used towards your university grades will be provided to you by the instructor (i.e., tutors or myself) later.

D.1 Median Mechanism Instructions
Would I know whose exam papers I might be grading/correcting? You would not have this information. We will take maximal precautions to make sure that the grader's and the assignment-owner's identities are anonymous to each other during and after this session. Further, you would not know which other four participants are grading the same papers as you. Thus, this procedure is double-blind. We will provide you a solution manual to help you in the grading process. Follow the explanation of the questions and correct answers presented before the study. Please be respectful and encouraging in the grading process. Scores should reflect the learner's understanding of the assignment, and points should not be deducted for difficulties with language, differences in opinion, or the use of a different but correct methodology.
How are the final grades on my own assignment decided? All five peer-graders independently assign you grades on all of the questions (there are 5 in total, each worth 2 points). Then, for each question, your final grade is the median of those five grades. For example, if on the second round of peer-grading the five graders assign you 0, 1, 1.5, 2, and 2 respectively, then your final assigned grade on that question would be 1.5. We would calculate your grades on all the questions separately by the above median-method, and then aggregate those median grades from all the questions. For example, if there are five questions and the median grades on the questions are 0, 1, 1.5, 2, and 2 respectively, then the total grade on the assignment is 6.5.
How does one calculate the median of five numbers? Sort the numbers in increasing order; the third number (the middle one) is the median.
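The following short Python snippet checks the median rule on the worked example above (illustrative only; no such code was shown to participants):

```python
# A quick check of the median aggregation described above.
def median_of_five(grades):
    """The median of five numbers: the middle (third) value after sorting."""
    return sorted(grades)[2]

print(median_of_five([0, 1, 1.5, 2, 2]))  # -> 1.5, as in the example
# Summing the per-question medians 0, 1, 1.5, 2, 2 gives the total grade:
print(sum([0, 1, 1.5, 2, 2]))             # -> 6.5
```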
Can I dispute my peer-assigned grades? Yes, for certain questions you can, and for others you cannot. If you think your true grade is different from the grade that has been assigned to you on these questions, you can privately indicate that on a form that will be sent at the end of the peer-grading and that will immediately notify us. We would then reassign you the grade the teaching staff had previously assigned to your assignment. This whole process would be completed at the click of a button, and you would be shown your updated grade in a matter of seconds. Please note that once a dispute is lodged, your grade would become the teaching-staff-assigned grade, irrespective of whether that results in an increase or decrease over your original grade.
How are my payments decided? Every participant would get a show-up fee of M 50 for participating in and completing this session. You would also get an additional amount depending on your ranking in the pool of 'n' participants today. The ranking would be done in decreasing order of the final grades assigned to you all on the whole assignment. A ranking of x means that there are (x-1) other people who have a strictly higher grade than you. The additional amount would be M 650 for the top 25% (first quartile) of ranked students, M 450 for the next 25% (second quartile), M 250 for the third quartile, and M 50 for the bottom quartile. If a group of students with the same score spans two or more quartiles, then all of them get the average payment of those quartiles. E.g., suppose 7 students got the same marks, and 3 students are in the first quartile while 4 are in the second quartile; then all 7 get M 600 (the average of M 700 and M 500, the quartile totals including the show-up fee). Hence, in this study, the higher you are in the ranking based on your peers' judgment (and a potential review), the higher is your total payment.
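A sketch of this payment rule follows. The helper name and the tie-handling (a simple average of the spanned quartiles' totals, as in the worked example) are our reading of the instructions, not code used in the study:

```python
# Sketch of the quartile payment rule (illustrative reading, not study code).
SHOW_UP = 50
BONUS = [650, 450, 250, 50]                  # additional amount per quartile
TOTALS = [b + SHOW_UP for b in BONUS]        # 700 / 500 / 300 / 100

def payments(scores):
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])   # best score first
    pay = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and scores[order[j]] == scores[order[i]]:
            j += 1                                        # tie group [i, j)
        spanned = {min(k * 4 // n, 3) for k in range(i, j)}
        amount = sum(TOTALS[q] for q in spanned) / len(spanned)
        for k in range(i, j):
            pay[order[k]] = amount
        i = j
    return pay

# 28 participants: 7 tied students straddle Q1 (3 of them) and Q2 (4 of them).
scores = [10] * 4 + [9] * 7 + [8] * 17
print(payments(scores)[4])   # -> 600.0, matching the worked example
```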
How do the grades you submit affect your own payment? The grades you submit obviously do not affect your own grade, because you are never grading your own paper, but they can still affect your own payment. Your grading could potentially affect the grades of others, and that can change the relative rank between you and the person(s) you are grading. For example, when you assign someone a higher grade, that might change the median grade they are assigned, and thus move them to a higher rank relative to you. Similarly, when you give them a lower grade, it might move them to a lower rank relative to you. Both of these scenarios would affect the final payments of both you and the other person, as everyone is paid according to the final rankings.
Time-line for the study, in chronological order: Stage 0: The whole assignment to be graded is broken into 3 small parts that will be peer-graded in three stages. The total grade from the whole assignment determines your final ranking and payment. At this stage, you are expected to complete the questionnaire successfully.
Stage 1: Every one of you peer-grades the first part of the assignment of 5 of your peers. Therefore, for any question you are grading in this stage, you know that 4 other anonymous participants are also grading that question. Meanwhile, the first part of your own assignment is being peer-graded by 5 other participants. One part of these questions will have options for regrading, while the other part will not (this will be mentioned in the response sheet, but all regrading requests will be collected at the end of Stage 3).
Feedback Stage 1: For each paper you graded in Stage 1, we will show you the grades assigned by you and the 4 other anonymous graders. We will also show you how part 1 of your own assignment got graded by the assigned graders.
Stage 2: Similar to Stage 1, now part 2 of the assignment gets peer-graded, but the papers are now sent to a new random set of peer-graders. One part of these questions will have options for regrading, while the other part will not (this will be mentioned in the response sheet, but all regrading requests will be collected at the end of Stage 3).
Feedback Stage 2: Feedback for Stage 2 (similar to Feedback Stage 1) is shown.
Stage 3: Similar to Stage 2 (one part has regrading requests, the other does not), now part 3 of the assignment gets peer-graded.
Feedback Stage 3: Feedback for Stage 3 (similar to Stages 1 and 2) is sent to all students, along with their tentative total score. You may raise regrading requests for the parts that are regradable (as mentioned above). Any regrading requests that are lodged will be acted upon. Performance on the whole assignment is aggregated, and the final ranking and payments are sent via email. To finish the study, complete the survey that comes in the last email. The study then ends.
Is my data confidential? Yes, your data is completely confidential. Before observing and analyzing the collected data, we would remove every personal identifier from the data, so that no decision can be traced back to the individual who made it. The first practice example tests your understanding of the mechanism, i.e., how the peer-grading leads to your final grade, rank, and payment. You must complete this practice example with a score of 80% or more (i.e., correctly answer at least 4 questions out of 5). You will get one chance only, so please do this carefully. Failing this, you would be asked to leave this session with a M 20 reward.
Important: Please do not communicate with any other participants during this session. For the grading, open one file at a time, finish grading, submit the grade in the Google form, and then move on. Please remain seated even if you finish grading before time. If you have any questions, please raise your hand and one of us will come by to answer your query. Please use your university domain email ID throughout this session. Please remember your Google ID/password, since it may be needed for some form filling.

D.2 PEQA Instructions
Before you begin, please register yourself at: [registration link]. Submit the form only once. This is a study on peer-grading. In this study, each of you will be asked to grade five anonymous assignments. Similarly, your own assignment would be graded by a certain number of anonymous students from this room. Your peer-graded marks and your performance in the peer-grading exercise determine only your payment from this session. They will not be used to determine your actual score for your final grade in the course. The assignment score used towards your university grades will be provided to you by the instructor (i.e., tutors or myself) later.
Would I know whose exam papers I might be grading/correcting? You would not have this information. We will take maximal precautions to make sure that the grader's and the assignment-owner's identities are anonymous to each other during and after this session. Further, you would not know which other four participants are grading the same papers as you. Thus, this procedure is double-blind. We will provide you a solution manual to help you in the grading process. Follow the explanation of the questions and correct answers presented before the study. Please be respectful and encouraging in the grading process. Scores should reflect the learner's understanding of the assignment, and points should not be deducted for difficulties with language, differences in opinion, or the use of a different but correct methodology.
How are the final grades on my own assignment decided? Your peer-graders independently assign you grades on all of the questions. Then, for each question, your final grade is decided by running it through a new mechanism called PEQA (Peer Evaluation with Quality Assurance). This mechanism is designed to remove individual biases in grading, and to selectively weight and reward graders by how precise they are (details follow). We would calculate your grades on all the questions separately by the above method, and then aggregate those grades from all the questions. In each round, you will have some regradable and some non-regradable questions. For the regradable part, you will earn the peer-given score computed through PEQA and an additional PEQA reward for grading. For the non-regradable part, you will only receive the peer-given score computed through PEQA, with no additional reward for grading.
Can I dispute my peer-assigned grades? Yes, for certain questions you can, and for others you cannot. If you think your true grade is different from the grade that has been assigned to you on these questions, you can privately indicate that on a form that will be sent at the end of the peer-grading and that will immediately notify us. We would then reassign you the grade the teaching staff had previously assigned to your assignment. This whole process would be completed at the click of a button, and you would be shown your updated grade in a matter of seconds. Please note that once a dispute is lodged, your grade would become the teaching-staff-assigned grade, irrespective of whether that results in an increase or decrease over your original grade.
What is the PEQA mechanism? Let us describe PEQA briefly in the following two steps. Step 1, Probes: Out of the five questions you (a grader) grade, two are randomly assigned to be probes (the remaining three are non-probes). On the probe papers, we would directly assign the teaching-staff-assigned grades, and we would also use those grades to estimate your individual average deviation (or bias) and the variance of your deviations on the assignments you graded. We will do this for all the graders. For a grader who, on average, assigns a grade higher than the true grade, the estimated deviation would be negative, and otherwise it would be positive.
Step 2, Non-Probes: The non-probes would be graded using the information from (i) the assigned grades of all the graders, and (ii) the estimated average deviation (or bias) and variance of grading by the peer-graders in Step 1. The assigned scores would be "de-biased" using the information in (ii). Here is a numerical example that goes through these two steps. Suppose that, of the five questions you graded, the first two were randomly assigned as probes (this is for illustration only; the actual probes will be interspersed, and you won't know which ones are the probes). On the probe questions, your evaluation would be compared with the evaluation done by the course instructors (the true score), to calculate the average deviation in your grading. We would then use this to calculate the variance of your deviation.

[Table: for each probe question, the score you assigned, the true score, and the deviation; the final columns give your bias (the average deviation, here -.25) and the variance of your deviations.]

Suppose the (bias, variance) pairs of the other two graders, who are also grading question 4, are (.25, .05) and (-.5, .2) respectively. Suppose the scores they had assigned to the same Q4 were 3 and 2 respectively, while you gave 4 to that question.
Then, the final grade on Q4 (a typical non-probe question) would be calculated as follows (k1 and k2 are appropriately chosen constants):

assigned score = [ k1 + (1/√(your variance)) × (4 - .25) + (1/√.05) × (3 + .25) + (1/√.2) × (2 - .5) ] / [ k2 + 1/√(your variance) + 1/√.05 + 1/√.2 ]

When we assign the final grade on any non-probe question, we "de-bias" the reports from all the graders by removing each grader's bias, and we also selectively over-weight the information from the low-variance graders. We consider the inverse of the square root of your variance as your precision of grading, and use this precision to weight your assigned score on this paper. The accuracy of the mechanism-assigned score is given by -(assigned score - true score)^2. If you were not one of the graders, and the mechanism only assigned scores using the reports of the other graders, then:

assigned score without you = [ k1 + (1/√.05) × (3 + .25) + (1/√.2) × (2 - .5) ] / [ k2 + 1/√.05 + 1/√.2 ]
The new accuracy is -(assigned score without you - true score)^2. Now, your PEQA performance score from peer-grading question 4 would be calculated as the difference between the accuracy with you and the accuracy without you. This is intuitively equivalent to being paid for your relative contribution, within your group, towards making the final assigned grade accurate. The more accurate the assigned score is when you are included in the group of graders, the higher your performance score!
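The sketch below works through this two-step computation with the example's numbers. Your own variance (0.10), the true score (3.0), and the prior constants k1, k2 are illustrative stand-ins; the instructions leave them unspecified:

```python
# Numeric sketch of the PEQA assigned score and performance score (illustrative).
import math

def peqa_score(reports, biases, variances, k1=0.0, k2=0.0):
    """De-bias each report (add the estimated average deviation) and weight it
    by the grader's precision = 1 / sqrt(variance)."""
    num, den = k1, k2
    for r, b, v in zip(reports, biases, variances):
        w = 1.0 / math.sqrt(v)
        num += w * (r + b)
        den += w
    return num / den

reports   = [4.0, 3.0, 2.0]       # you, co-grader 1, co-grader 2 on Q4
biases    = [-0.25, 0.25, -0.5]   # estimated average deviations
variances = [0.10, 0.05, 0.20]    # your 0.10 is an assumed value

true_score  = 3.0                 # also assumed, for illustration
with_you    = peqa_score(reports, biases, variances)
without_you = peqa_score(reports[1:], biases[1:], variances[1:])
# Performance score = accuracy with you minus accuracy without you.
perf = -(with_you - true_score) ** 2 - (-(without_you - true_score) ** 2)
print(round(with_you, 3), round(without_you, 3), round(perf, 4))
```

With these numbers, including you moves the assigned score from about 2.67 to about 3.01, so your performance score is positive: your report made the group's estimate more accurate.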
The PEQA performance score on each question you have graded that is worth x points is assigned on the scale [0, x/2]. So, in round 1, where each regradable question is worth one point and you grade a total of 3 non-probe questions, the maximum PEQA performance score you could get is 3 × 0.5 = 1.5 and the minimum is 0.
These PEQA grades and performance scores have the following properties. Bias Invariance: Suppose you had instead reported grades of 3+x, 2.5+x, 3.5+x, 4+x, and 2+x on the questions, and thus had individual deviations shifted by x. This would have no effect on the PEQA performance scores, as the reports would be de-biased as described above. This is a mathematical property of the mechanism described.
With the new reported grades, your average deviation changes from -.25 to -.25 - x. The +x and -x cancel out in the expression of the assigned score, leaving it unchanged. Clearly, the assigned score without you also cannot change if your bias changes, so your expected PEQA performance score cannot change here! Precision Monotonicity: For every set of (bias, variance) values your co-graders might have, your expected PEQA performance score from the peer-grading task is monotonically increasing in your grading precision (precision is the inverse square root of your variance). This is a mathematical property that can be shown using calculus and statistics. Thus, the more precisely you evaluate a paper in the peer-grading task (or, alternatively, the lower your grading variance), the higher your peer-grading score.
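A standalone numeric check of the Bias Invariance cancellation, using the example's weights (your variance of 0.10 is again an assumed value):

```python
# Check: shifting your reports by +1 shifts your estimated bias by -1,
# and the two shifts cancel in the de-biased weighted mean.
import math
w = [1 / math.sqrt(v) for v in (0.10, 0.05, 0.20)]
score = lambda rep, b: sum(wi * (ri + bi) for wi, ri, bi in zip(w, rep, b)) / sum(w)
base    = score([4.0, 3.0, 2.0], [-0.25, 0.25, -0.5])
shifted = score([5.0, 3.0, 2.0], [-1.25, 0.25, -0.5])  # you report +1 higher
print(abs(base - shifted) < 1e-12)                     # True: +1 and -1 cancel
```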
Here is a graph that shows how the PEQA performance score changes with precision, for a grader who is grading alongside two other graders: one of the highest precision and one of the lowest precision.
How do you calculate -(assigned score - true score)^2? If there is no regrading request, then we assume that true score = assigned score, and this value is zero. If there are regrading requests, then we evaluate the paper ourselves and use the course-instructor-assigned score as the true score to calculate the value.
What is my consolidated score? Your consolidated score is the sum of (i) the score on your own assignment (combined from the regradable and non-regradable parts), and (ii) your PEQA performance score (peer-grading score). For example, if the peer-assigned score (computed via PEQA) on your own assignment is x and your peer-grading score is y, your consolidated score is x + y.
How are my payments decided? Every participant would get a show-up fee of M 50 for participating in and completing this session. You would also get an additional amount depending on your ranking in the pool of 'n' participants today, based on the consolidated score. The ranking would be done in decreasing order of the final grades (i.e., the consolidated scores) assigned to you all on the whole assignment. A ranking of x means that there are (x-1) other people who have a strictly higher consolidated score than you. The additional amount would be M 650 for the top 25% (first quartile) of ranked students, M 450 for the next 25% (second quartile), M 250 for the third quartile, and M 50 for the bottom quartile. If a group of students with the same score spans two or more quartiles, then all of them get the average payment of those quartiles. For example, suppose 13 students out of a population of 40 got 10/10; then all 13 get M (700 × 10 + 500 × 3)/13 ≈ M 654, and the next rank starts from 14. Hence, in this study, the higher your consolidated score, the higher your total payment.
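As a quick arithmetic check of the example above (note that this instruction set averages the quartile totals weighted by headcount, 10 students in the first quartile and 3 in the second, whereas the median-mechanism example used a simple average):

```python
# Count-weighted average of the spanned quartile totals (700 and 500).
print(round((700 * 10 + 500 * 3) / 13))   # -> 654
```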
How do the grades you submit affect your own payment? The grades you submit obviously do not affect your own grade, because you are never grading your own paper, but they can still affect your own payment, in two ways. 1) By affecting the grades of others: Your grading could potentially affect the grades of others, but only on questions chosen as non-probes, and consequently it can change the relative rank between you and the person(s) you are grading. For example, when you assign someone a higher/lower grade on a question chosen as a non-probe, that might change the PEQA-assigned quiz score (and thus the consolidated score) they receive, and thus affect the relative rankings. But note that the Bias Invariance property described above already tells you that a different bias would not change the expected quiz scores of any of your peers. 2) By affecting your peer-grading score: Assigning a higher/lower score on any question could change your payment in two ways. If this happened on a question chosen as a probe, we would estimate a different precision and bias for you, and a lower (respectively higher) precision would result in a lower (respectively higher) marginal impact of your peer-grading reports, and hence a lower (respectively higher) peer-grading score (and hence a lower consolidated score) for you. If this was a non-probe question instead, then you might change the peer-graded score on that paper, depending on how much weight we assign to your evaluation.
Is my data confidential? Yes, your data is completely confidential. Before observing and analyzing the collected data, we would remove every personal identifier from the data, so that no decision can be traced back to the individual who made it.
You would be given a questionnaire of three questions that tests your knowledge of how to calculate the median. Failure to answer at least two of those three questions correctly would disqualify you from participation in this study. In that case, you would be asked to leave this session with a M 20 reward. Important: Please do not communicate with any other participants during this session. For the grading, open one file at a time, finish grading, submit the grade in the Google form, and then move on. Please remain seated even if you finish grading before time. If you have any questions, please raise your hand and one of us will come by to answer your query. Please use your university domain email ID throughout this session. Please remember your Google ID/password, since it may be needed for some form filling.