Who shall test the test readers?

Poorly trained part-timers determine test scores that loom so large in education, writes Todd Farley in a New York Times op-ed. The author of Making the Grades, Farley was hired to score fourth-grade, state-wide reading comprehension tests when he was a graduate student in 1994.

One of the tests I scored had students read a passage about bicycle safety. They were then instructed to draw a poster that illustrated a rule that was indicated in the text. We would award one point for a poster that included a correct rule and zero for a drawing that did not.

The first poster I saw was a drawing of a young cyclist, a helmet tightly attached to his head, flying his bike over a canal filled with flaming oil, his two arms waving wildly in the air. I stared at the response for minutes. Was this a picture of a helmet-wearing child who understood the basic rules of bike safety? Or was it meant to portray a youngster killing himself on two wheels?

Some fellow scorers wanted to give full marks for understanding bicycle safety; others wanted to give a zero.

I realized then — an epiphany confirmed over a decade and a half of experience in the testing industry — that the score any student would earn mostly depended on which temporary employee viewed his response.

This is why multiple-choice tests can be more reliable than subjectively graded tests that rely on drawing (or writing) skills to measure reading comprehension.

I have a review copy of Farley’s book, which I plan to read very soon — along with the four other review books waiting for me. Maybe today! Anyhow, I vowed not to mention other people’s books without promoting my own book, Our School: The Inspiring Story of Two Teachers, One Big Idea and the Charter School That Beat the Odds.

  1. There is something of a trade-off. The multiple-choice, machine-scored items do come up with much more reliable scores (that is, they will produce the same scores repeatedly, given the same conditions). It is sometimes more difficult to cram much validity into such questions, however. That is, the kinds of questions that fit into that format may not tell us as much as we would like about what a student actually understands or is able to apply–they may not adequately mimic real life.

    Throw in the expense of developing teams of highly reliable graders for those constructed-response items and you can see why most tests rely so heavily on multiple choice–although I think almost all of the tests include some form of essay or problem solving that extends beyond the four-choice item. I would also suggest that Todd Farley’s experience fifteen years ago may not exactly represent the state of the art today.

    Linda Darling-Hammond has thrown her hat into the ring of developing reliable tasks and evaluation in the realm of project-oriented assessment. It is certainly an attractive notion–and one with the potential to provide yet another type of data. But I hold no illusions that this might be widely embraced in real practice. I think there will be an initial phase of shock and surprise as teachers realize just how much work it is (as many special educators using similar methods for alternate assessment have already found), as well as discovering that it uncovers few diamonds in the rough–you know, the kids who really “know” but “just don’t test well.”

  2. The NZ external exam system has open-ended questions like this. The points are:
    1. The markers are teachers or retired teachers who volunteer and are paid. (My Mum used to do this when she was teaching, even though it meant a lot of extra work in the run-up to Christmas while she had three small kids, because she wanted the insight into the mistakes her students could be making.)
    2. The school exams are anonymous: you are assigned a number to write on the exams rather than your name, and the exams from each region are sent to a different region of the country to be marked, to reduce the odds of the marker recognising your handwriting.
    3. Markers are organised into groups, so if a question comes up that a marker is not sure how to answer they refer to the group, and then there’s a structure for raising questions further up the ladder.
    4. Exam papers are publicly released after the exam is taken, so any poorly written question is front-page news.
    5. The answer books are returned to the individual student, who can appeal their mark, for a fee.

    This still results in some variation in exam results; I much preferred the hard sciences and maths, where the answers were more objective. Even so, I remember that on my School C general science exam a question I had gotten right was most definitely marked wrong (I confirmed this with my physics teacher the next year). I didn’t appeal, as it was only half a mark and wouldn’t have made any difference to my letter grade. But the grades people got in exams typically corresponded with the grades in class.

  3. Andrew Bell says:

    Any good large-scale standardized test should have safeguards built in. Questions that cannot be reliably scored would be discarded. If you don’t know the statistics necessary to give a test, you probably shouldn’t be in the business.

    What is the point of such a high-stakes test? Why do we think a test is so darn important? (I think I asked this the other day). What happens to the student if they don’t do well? Do they not go to college? Do we hold them back a grade? Do they not graduate?

  4. Andrew–I think that the term “high stakes” is much overused. Many of the tests have no stakes at all for individual students. A few states use scores at a particular grade to either hold students back or to require summer remediation. The results of such consequences for students are underwhelming at best–harmful at worst. Some states attach stakes in various ways to a high school exam–requiring passage for graduation, or requiring that the score be included as a percentage of the grade for the class. The states using scores for graduation typically allow for multiple tries over several years–suggesting that the “bar” has been set fairly low–at about the 10th grade level.

    So–what are these high stakes we keep hearing about? Well, first and foremost, a school’s scores are made public–in particular to parents–who may opt to transfer their child to a better-performing school in some cases, or receive after-school tutoring (paid for from the school’s Title I dollars). Both of these options have been little utilized. A school must demonstrate a low level of performance for several years running before these options are even available.

    Next set of stakes–schools that continue to do poorly must enter an improvement-planning process. They must identify problems, choose solutions, implement them and track the results. Really horrific stuff, this.

    Now a school that does not evidence minimal improvement from this process (minimal being defined as an annual reduction of 10% of the students not scoring proficient in math or reading–in the aggregate and in subgroups as applicable), must continue with improvement planning and step up the improvement effort.
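    To make that 10% figure concrete, here is a small sketch (in Python, with a hypothetical school where 40% of students start out below proficient) of how the target compounds year over year. The function and numbers are illustrative, not drawn from any actual state formula:

    ```python
    def improvement_target(share_not_proficient, annual_reduction=0.10):
        """Maximum share of students who may remain non-proficient next
        year, under a rule requiring a 10% annual reduction in the
        non-proficient share (a cut in that share, not 10 percentage points)."""
        return share_not_proficient * (1 - annual_reduction)

    # Hypothetical school: 40% of students not proficient in reading.
    share = 0.40
    for year in range(1, 4):
        share = improvement_target(share)
        print(f"Year {year}: at most {share:.1%} not proficient")
    # Year 1: at most 36.0% not proficient
    # Year 2: at most 32.4% not proficient
    # Year 3: at most 29.2% not proficient
    ```

    Note how modest the bar is: because the 10% applies to the shrinking non-proficient share, the required improvement in year three is smaller in absolute terms than in year one.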

    Following five to seven years of failure to make minimal levels of improvement, more drastic measures are called for–which may include (not MUST include) new staffing, school reorganization, closing and re-opening as a charter, etc. Remember that during that five to seven years of not making progress a kindergartner has moved to middle school.

    I have seen restaurants open and close in the space of a summer. TV shows can be yanked after a single episode. In many states the court system allows a scant six months for a children’s services agency to return a child to their family of origin or make a permanency plan for them elsewhere.

    If five to seven years of failure to show a minimal amount of improvement produces too high a stake (and remember that the kids who move through without sufficient learning to succeed at the next level are the recipients of considerable “stakes”), exactly how much time and how little improvement ought we expect?