Not so standardized tests

Todd Farley’s Making the Grades: My Misadventures in the Standardized Testing Industry gets a review in the Washington Post.  After rising from scorer to trainer to test writer, Farley concluded that standardized tests are “less a precise tool to assess students’ exact abilities than just a lucrative means to make indefinite and indistinct generalizations about them.”

Throughout his career, grade manipulation was the norm. He and other leaders would change scores or toss some out in order to achieve “reliability,” a measure of how frequently different readers scored a question the same way. Among scorers, he writes, “the questions were never about what a student response might have deserved; the questions were only about what score to give to ensure statistical agreement.”

In a Christian Science Monitor column, “Standardized tests are not the answer,” Farley writes:

On one scoring project I managed, for instance, the government agency in charge passed down an edict stating that all scorers had to go through a remedial retraining (group discussions with their peers about scoring rubrics and training papers) after any work stoppage of 30 minutes or more, including their scheduled half-hour lunch break. The government agency in charge said such retrainings would help ensure the student responses were scored within the proper context of “psychometric rigor.”

The company avoided the time-consuming retraining sessions by cutting the lunch break to 29 minutes.

Farley’s entertaining book highlights the difficulties of evaluating thousands of student answers to questions that invariably turn out to be more ambiguous than the test writers thought.

In one test, elementary students read a passage about taste and answered a few questions, including naming their favorite food and identifying it as salty, sweet, bitter or sour. Scorers argued about what’s a food: water? dirt? grass? And is the kid who thinks pizza is sweet or bitter or sour necessarily wrong? Who knows what toppings are on that pizza?

Farley wants to leave grading to the teachers, who know their own students. That’s fine only if we want to give up on accountability measures. If Mrs. Chips says all her kids are proficient readers, we don’t know if that’s true or if Mrs. C has very low standards. It takes an independent test of some sort to analyze whether children have learned what the state has decided they should learn. It won’t be precise, but it doesn’t have to be if enough scores are aggregated. (If an open-ended test is used to decide whether an individual student passes to the next grade, then there has to be a second look to make sure it’s an accurate reflection of the student’s performance.)

My reaction to the book was to wonder if open-ended questions that need to be scored by fallible humans are worth the cost. Farley describes scorers who don’t understand English idioms or just plain aren’t very bright. If that’s the way it is, why bother? Multiple-choice items can be scored quickly and very cheaply.



  1. Psychometric Rigor would be a good band name.

    In my district, junior high math exams used to be scored district-wide, with all teachers crammed into a warehouse together. Seeing the difficulty so many had scoring the exams makes one realize one partial reason why scores were so low. The tests were the least of the issue.

  2. In NZ, when I went through, the external exam questions were released after the exam was sat, and the answer papers were returned to the relevant students after marking, and you could ask for a re-mark at a price.
    This created a strong, though not perfect, incentive to get the exam questions right and to mark them honestly. The year I sat School C English some of the questions were terribly written and this was front-page news the next day, very embarrassing for the exam board. (The board eventually agreed to mark any answer to two of the questions right.)

  3. As a parent of three school age children, I’d prefer to have accountability in my hands. I’ll contract with whatever testing company delivers what I think best for what I can afford.

  4. The joke is that short-answer responses were put on these tests in response to complaints about multiple choice questions–which are, hands down, the most reliable test questions, particularly in the verbal skills area.

  5. Cardinal Fang says:

    What about testing writing? Don’t we need students to actually write in order to see if they can write?

    On the other hand, the SAT writing test, for example, doesn’t seem to be a very good way of testing writing. But Joanne is right– grading might not be precise, but probably there are few cases where the graders can’t decide whether the student is a totally incapable writer or a skilled writer.

  6. Cal is right. Open (constructed) response questions were put in summative (accountability) assessment to respond to teachers’ complaints about the ‘mindlessness’ of multiple choice items. In reality the complaints were mindless, not the multiple choice items. Then other reasons to peddle open-ended items were promoted: testing should be like learning (why?); open-ended items will promote better instruction (yeah, right); open-ended items require more thoughtful responses (sometimes yes, more often than not they don’t). And a few more. What really drives teachers is the belief that their students are somehow “better” than tests indicate, and they are forever after that elusive assessment that will finally “prove” that the teachers are OK after all, and it is just the test that was wrongly pointing out that the kids didn’t learn.

    Writing is somewhat different and rather unique in academic learning and indeed one needs actually to write to assess it in full. Somewhat similar to performing arts or sports. But this is not true with almost any other academic skill.

    BTW — open ended questions are very helpful in classroom (formative) assessment. Just not in a summative one.

  7. In NZ there were open-ended writing questions. The marking was done by teachers and retired teachers (it was voluntary and teachers got paid extra for it; my mother used to do it, though she said her main motivation was to find out where her students were likely going wrong). The really bad questions in the English exam I sat were multi-choice.

    To reduce the risk of the marking being biased, you only put a number on your exam answer booklets, not your name, and the papers were sent to different parts of the country to be marked.

  8. palisadesk says:

    I’ve graded our tests on several occasions (they hire teachers to do it, or they used to; I am not sure if that is still the case). It was an interesting experience, to say the least. The most striking thing was that despite extensive training with anchor papers and exemplars, graders still varied widely on the final “level” assigned each test (I graded middle school mathematics and third grade reading). Often the levels given ranged from 1 (the lowest) to 4 (the highest) on the same paper.

    Zeev’s observations seem not to be true here, since our staff has consistently observed that the scores obtained are usually higher than the actual level the students perform at. Having been involved in the grading, I can see how the rubrics used can be constructed so that a weak performance will be graded higher than it deserves. Inter-rater reliability is only about .6, apparently, on this kind of open-response test.

    These tests are very expensive to produce (every year all the items are new) and to grade, since it is done manually. I am not convinced the results justify the expense, but I can see how they are easier to manipulate for political purposes.

  9. Don Bemont says:

    palisadesk’s experiences are nearly identical to my own.

    I have graded New York State English Regents exams for many years:

    Scores obtained are usually higher than the actual level students perform at, particularly in the middle three quintiles.

    Inter-rater reliability is shaky, although ours is not quite as poor as palisadesk reports.

    It would be an understatement to say that the test was custom made for political manipulation. To be honest, the results are corrupted so smoothly in so many ways that I have lost any faith I ever had in evaluations of schools and programs, based on standardized test results.

  10. I do not think that standardized testing is fair. I believe that it is very biased! I don’t understand how they can effectively measure students’ improvements with these tests. I think that having it be timed is the worst part. I consider myself an average person. I believe that I am intelligent and knowledgeable. I am not a great test taker, especially when tests are timed. It takes about 50% of my focus away. When I would take these tests, I would completely panic and worry about not getting through all the questions. This should never happen to a student. The tests should not be timed and not be biased according to race when wording questions. There are so many types of learning and these tests only cater to one.

  11. Stacy, non-standardised tests are even worse. If you give two kids non-standardised reading tests and the kids have different results, how can you tell if that’s because the kids have different reading abilities, or because the tests were different?

    As for timing, I think non-timed tests are cruel: conscientious students can drive themselves nuts checking and re-checking questions, reluctant to ever finish.

    I am not sure what you mean about so many types of learning and these tests only cater to one. What the tests measure is what was learnt, and what wasn’t, surely? Not how it was learnt.

    As for panicking over worrying about not getting through all the questions, why is this something that should never happen to a student? Aren’t we stronger if we confront our fears and learn how to deal with panic? One of my friends at high school was so terrified at having to give a speech during English class that she was crying during it, but she didn’t run away and hide, and now she’s getting the odd acting job. I had a summer job once at a secure unit for brain-injured people and got attacked a few times, and through this I learnt what I am like when the adrenaline hits, and when it goes, which was very useful when coping with my brother having a bad accident later in life, when I had to hand over my work commitments while internally panicking about his life.

    I am not sure what it means for test questions to be worded so as not to be biased about race. Test questions should not be written in a way that insults anyone racially, sexually, or otherwise. Test questions are often going to be culturally biased, e.g. most reading tests given in English-speaking countries measure the ability to read English and use the Roman alphabet, but I don’t think anyone could pass a non-culturally-biased reading test. But the evidence seems pretty clear that a normal baby of any race can learn any language, so I don’t see how reading tests are biased based on race.


  1. […] This post was mentioned on Twitter by kriley19 and Fred Roemer, USWorldClassMath. USWorldClassMath said: RT @kriley19: Joanne Jacobs: Not so standardized tests Full […]
