Todd Farley’s Making the Grades: My Misadventures in the Standardized Testing Industry gets a review in the Washington Post. After rising from scorer to trainer to test writer, Farley concluded that standardized tests are “less a precise tool to assess students’ exact abilities than just a lucrative means to make indefinite and indistinct generalizations about them.”
Throughout his career, grade manipulation was the norm. He and other leaders would change scores or toss some out in order to achieve “reliability,” a measure of how frequently different readers scored a question the same way. Among scorers, he writes, “the questions were never about what a student response might have deserved; the questions were only about what score to give to ensure statistical agreement.”
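To make the "reliability" measure concrete, here is a minimal sketch of an exact-agreement rate: the share of responses on which two readers gave the same score. The function name `exact_agreement` and the sample scores are hypothetical, invented for illustration. Neither the review nor Farley says how any testing company actually computes the statistic.

```python
def exact_agreement(scores_a, scores_b):
    """Return the fraction of responses that two readers scored identically.

    Hypothetical helper for illustration only; not taken from Farley's book
    or from any scoring company's procedures.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both readers must score the same set of responses")
    if not scores_a:
        raise ValueError("need at least one scored response")
    # Count the responses where the two readers agree exactly.
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Made-up scores from two readers on the same five student responses.
reader_1 = [3, 2, 4, 1, 3]
reader_2 = [3, 2, 3, 1, 3]

print(exact_agreement(reader_1, reader_2))  # 4 of 5 match
```

Note that the number only records whether the readers matched each other. It says nothing about whether either score is what the response deserved, which is the gap Farley complains about.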
In a Christian Science Monitor column, “Standardized tests are not the answer,” Farley writes:
On one scoring project I managed, for instance, the government agency in charge passed down an edict stating that all scorers had to go through a remedial retraining (group discussions with their peers about scoring rubrics and training papers) after any work stoppage of 30 minutes or more, including their scheduled half-hour lunch break. The government agency in charge said such retrainings would help ensure the student responses were scored within the proper context of “psychometric rigor.”
The company avoided the time-consuming retraining sessions by cutting the lunch break to 29 minutes.
Farley’s entertaining book highlights the difficulties of evaluating thousands of student answers to questions that invariably turn out to be more ambiguous than the test writers thought.
In one test, elementary students read a passage about taste and answered a few questions, including naming their favorite food and identifying it as salty, sweet, bitter or sour. Scorers argued about what’s a food: water? dirt? grass? And is the kid who thinks pizza is sweet or bitter or sour necessarily wrong? Who knows what toppings are on that pizza?
Farley wants to leave grading to the teachers, who know their own students. That’s fine only if we want to give up on accountability measures. If Mrs. Chips says all her kids are proficient readers, we don’t know if that’s true or if Mrs. C has very low standards. It takes an independent test of some sort to analyze whether children have learned what the state has decided they should learn. It won’t be precise, but it doesn’t have to be if enough scores are aggregated. (If an open-ended test is used to decide whether an individual student passes to the next grade, then there has to be a second look to make sure it’s an accurate reflection of the student’s performance.)
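The aggregation point rests on a standard statistical result, sketched below under a strong assumption: that each score's error is independent and unbiased. In that case the standard error of an average shrinks with the square root of the number of scores. The function name and the error size of 1.0 score point are hypothetical, chosen only for illustration.

```python
import math


def standard_error_of_mean(score_sd, n_scores):
    """Standard error of an average of n_scores scores.

    Assumes each score carries independent, unbiased error with standard
    deviation score_sd. That assumption is for illustration; it is not a
    claim about any real scoring project.
    """
    if n_scores < 1:
        raise ValueError("need at least one score")
    return score_sd / math.sqrt(n_scores)


# How the uncertainty of an average shrinks as more scores are pooled,
# using a hypothetical per-score error of 1.0 point.
for n in (1, 25, 100, 400):
    print(n, round(standard_error_of_mean(1.0, n), 3))
```

Averaging only cancels random error. If scores are pushed in one direction, for example by adjusting them to reach agreement, that bias does not average out no matter how many scores are pooled.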
My reaction to the book was to wonder if open-ended questions that need to be scored by fallible humans are worth the cost. Farley describes scorers who don’t understand English idioms or just plain aren’t very bright. If that’s the way it is, why bother? Multiple-choice items can be scored quickly and very cheaply.