The exaggerated power of test scores

Test scores should be information at a teacher’s disposal, not information used to dispose of teachers.

The New York State Board of Regents has passed new regulations that allow scores on state standardized tests to account for up to 40 percent of a teacher’s evaluation. Before the vote, several researchers expressed their concerns in a letter. At least two board members spoke against the change, and three voted against it. Kathleen Cashin said that it would lead to even more reliance on test prep. Roger Tilles pointed out that districts that can’t afford to develop local assessments will be forced to use state assessments for the full 40 percent of the evaluation.

We need tests, including standardized tests. As a teacher, I want to know promptly how my students did on a given test. (Often the results don’t come back until the following year.) I would like to look at the questions and my students’ answers, instead of relying on diagnostic reports that tell me that such-and-such a student needs to work on “finding the main idea.”

The tests are one way of verifying that students have learned what they are supposed to learn. But they cannot be the only way, or even close. In English language arts, the tests can be especially misleading, as they are generally rather weak in “content.” That is, they do not presume that students have read anything in particular. They test generic skills–sometimes accurately, sometimes not.

They are even less reliable as indicators of teachers’ performance. For reasons that have been brought up again and again, reasons given by scholars, teachers, policymakers, and others, test scores should not decide a teacher’s fate or override human judgment. There are simply too many unstable factors–the tests themselves, the students’ lives, conditions on the day of the test–that make the scores inaccurate indicators of what a teacher is accomplishing.

In an op-ed in the New York Daily News, Arthur Goldstein points out that students’ efforts are not uniform: “For example, how much television does a student watch? … If my students don’t know how to read, haven’t been in school for the past six years or refuse to put a mark on a piece of paper, is it my fault? If a kid was dragged to the U.S. against his will and simply won’t learn English, should I be penalized?” (Having taught ESL, I have seen these situations.)

Value-added formulas are just as problematic as test scores, if not more so. They control for all sorts of factors, but the various controls create their own problems and distortions. Value-added ratings can provide useful information about schools, over time. But in teacher evaluations and tenure decisions, they should be regarded carefully and critically. And there should be room to “unpack” them–to figure out what a teacher’s rating might have been under this or that different condition.

All of this has been said, many times. Of course, human judgment is also fallible. “Multiple measures” can also be misleading. Don’t get me started on portfolios–it is often the teacher, not the student, who puts time and effort into these portfolios, and they may not reflect what a student can do independently.

What, then, should constitute teacher evaluations? Well, as in government, I prefer a system of thoughtful checks and balances. Consider test scores, but don’t give them too much power. Consider the principal’s judgment, but don’t let that override all else. Consider student work, but look carefully at it–don’t just check off items on a checklist. Consider a teacher’s lesson plans, assignments, and contributions to the school. Yes, this comes down to “multiple measures,” but the point isn’t just that they are multiple. The point is that each one is regarded carefully.

When one measure (especially a flawed one) is given too much power, it is bad for schools through and through. It tells teachers (and, indirectly, students) that exercising one’s judgment isn’t that valuable after all.

Comments

  1. MagisterGreen says:

    On a day when I read about 18 classrooms in DC having their standardized test scores tossed on suspicion of cheating (http://washingtonexaminer.com/local/dc/2011/05/dc-schools-investigate-security-breaches-2011-tests), by all means let’s make tests account for nearly half of the scale we use to determine a “good” teacher. Cause that won’t encourage more cheating. Not at all.

  2. Before we use test scores for any accountability–student and/or teacher and/or schools–let’s be sure the tests are valid–that is, that they actually measure what they say they measure. Check out my friend Marcia Kastner’s new book, TESTING THE TEST, on this… and her Commentary in Education Week (May 11). http://www.marciakastner.com

    http://marciakastner.com/ed-week-commentary_311.html

  3. > Consider test scores, but don’t give them too much power.

    Sounds about like the New York plan to me — state tests are given a weight of 20%, and local “measures of student achievement” (this can be a “goal setting process”) another 20%.

    If giving state tests a 20% weight is too much, how about 10%? Would you be happy if teachers were assessed 90% on everything else, and 10% by how well their students are improving on tests?

  4. Stuart,

    Now, with the new regulations, New York districts can opt to use state tests instead of local evaluations. In such cases, the state tests will account for 40 percent of a teacher’s evaluation.

  5. OK. So what’s the right number? You don’t seem to think it’s zero. But the mere possibility of somewhere between 20% and 40% is too high.

  6. Well, 40 percent is too high. I’d say 20-25 percent would be fair. But the test scores should also be teased apart and analyzed. One must take the trouble to make sense of them.

    I taught in a high-poverty school that actually did quite well on the tests (the curriculum had a lot to do with it, I think). But it had much more difficulty showing student “growth” among the advanced students. Part of this may have been due to the limitations of the tests themselves. Part of it may have been due to the limitations of Balanced Literacy (which was combined with the curriculum). It would be a mistake to attribute it to the teachers alone or mainly to the teachers.

    Someone should set up an online dummy database of students’ scores (for an imaginary district) and allow users to test various value-added formulas on it and tweak various conditions. It would take some work to set that up, but it would be enlightening.
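That dummy-database idea can be sketched in a few lines of Python. Everything below is invented for illustration (the district, the teachers, the numbers), and the “value-added” formula is deliberately naive (mean score gain per classroom), but it shows the kind of sandbox being proposed: generate a fake district, compute ratings, then tweak a condition and recompute.

```python
import random

random.seed(0)

# Invented "dummy district": each teacher has a small true effect, and each
# student's score gain is that effect plus much larger student-level noise.
def simulate_district(n_teachers=20, n_students=25, noise=10.0):
    district = []
    for _ in range(n_teachers):
        true_effect = random.gauss(0, 2)  # genuine teacher contribution
        students = []
        for _ in range(n_students):
            prior = random.gauss(650, 30)
            current = prior + true_effect + random.gauss(0, noise)
            students.append((prior, current))
        district.append((true_effect, students))
    return district

# A deliberately naive "value-added" formula: mean gain per classroom.
def mean_gain(students):
    return sum(current - prior for prior, current in students) / len(students)

district = simulate_district()
ratings = [mean_gain(students) for _, students in district]

# Tweak one condition (more student-level noise) and recompute.
noisy_ratings = [mean_gain(s) for _, s in simulate_district(noise=25.0)]
```

Ranking the same imaginary teachers under the two noise levels shows how easily an ordering can shuffle, which is exactly the kind of “tweaking” such a sandbox would make visible.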

  7. Stuart, I think there is a credible argument to be made that even 15 or 20% is too much when you consider that A. the value-added ratings will have as much as a 35%+ margin of error; B. the average principal/administrator likely has no idea what margin of error is or how to apply it, and I haven’t seen any concrete plans put forth to help them understand it; and C. the subjective formulas that would allegedly allow seamless apples-to-apples comparisons between the teacher in Pound Ridge with 100% white/Asian students living in $200,000+ income households and the teacher in Morris Heights with classes of 100% free lunch eligible and 50% ELL + special ed students haven’t been published (as far as I know).

    If I had more confidence in what it meant to be proficient on New York’s math and English exams, and if the statheads could come up with scores that are more reliable in the absence of years and years’ worth of data, I’d be the first to clamor for value-added to count 50% or more toward a teacher’s evaluation. Until that happens, though, the data is bound to be abused and will make it even more attractive for prospective teachers to avoid schools with challenging student populations.

  8. Stuart Buck says:

    > allegedly allow seamless apples-to-apples comparisons between the teacher in Pound Ridge with 100% white/Asian students living in $200,000+ income households and the teacher in Morris Heights with classes of 100% free lunch eligible and 50% ELL + special ed students haven’t been published (as far as I know).

    I don’t think any such thing is contemplated by value-added models. If a value-added equation controls for poverty, ELL status, etc., and especially prior test scores, then any teacher is effectively being compared only to teachers who have the same sort of mix of kids. In fact, in some value-added models, each school is examined separately, and the teachers therein are being compared only to each other, not to teachers from any other school.
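A minimal sketch of the comparison described here, with invented scores for a single school: regress current scores on prior scores (plain least squares with one predictor), then average each teacher’s residuals. Real value-added models add many more controls; this shows only the basic mechanic.

```python
# Ordinary least squares for one predictor: returns (slope, intercept).
def ols_slope_intercept(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

# Made-up (teacher, prior score, current score) rows for one school.
rows = [
    ("A", 600, 640), ("A", 620, 655), ("A", 640, 668),
    ("B", 610, 630), ("B", 630, 648), ("B", 650, 665),
]
priors = [r[1] for r in rows]
currents = [r[2] for r in rows]
slope, intercept = ols_slope_intercept(priors, currents)

# A teacher's "value added" is the average of her students' residuals:
# actual score minus the score predicted from the prior score alone.
residuals = {}
for teacher, prior, current in rows:
    predicted = intercept + slope * prior
    residuals.setdefault(teacher, []).append(current - predicted)

value_added = {t: sum(rs) / len(rs) for t, rs in residuals.items()}
```

Because the regression is fit within the school, each teacher is effectively compared only to colleagues facing similar students, which is the point being made above.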

  9. Roger Sweeny says:

    This argument always makes me uncomfortable. We teachers give tests all the time. Well over 40% of a student’s chances of graduating depend on tests. And yet we find it wrong when tests make up 40% (or less!) of our future.

  10. Roger Sweeny says:

    Tim,

    I’m sure rich Pound Ridge and poor Morris Heights classes would have different value added scores. But if I’m a principal trying to decide whether to retain a teacher, I’ll be comparing her value-addeds to the value-addeds of the other teachers in my building.

    If I’m in Pound Ridge, I’ll be comparing her to other teachers in Pound Ridge. If I’m in Morris Heights, I’ll be comparing her to other teachers in Morris Heights.

  11. Roger,

    In grades 3-8, the test scores have minimal consequences for students. In New York City, if they score a 2 (the scores are converted to 1, 2, 3, or 4), they may be promoted to the next grade, unless this has changed over the past year. If they don’t make the cut, they can still go to summer school, submit a portfolio, and pass.

    Besides that, it’s the student who takes the test, not the teacher. The consequences should be more serious for the student (but they are not).

    Even with the supposedly harder tests, I doubt it’s very hard to get a 2. And for many kids, as long as they go on to the next grade, they aren’t too concerned.

    Also, for students it really only matters how they score, not how much they “grow.” Teachers’ ratings, by contrast, are based on student “growth.” This is problematic in many ways, as growth may not be linear.

  12. I had several sophomores earn perfect scores on their state test this year. How do you show value added from that?

  13. tim-10-ber says:

    Won’t encourage low cut scores — the only grade that enables a kid to pass when they should have flunked…government education…ugh!!! Getting back test scores without the raw data and comparing the score grades to your grades has to be demoralizing when they are multiple letter grades apart. Enough…please!

  14. Oh, I know, tim-10-ber! Imagine how pissed my “advanced” scorers must be that they have low B’s! Totally friggin sucks. Down with The Man!

  15. Roger Sweeny says:

    Diana,

    I wasn’t clear. Teachers give their students tests all the time, usually tests that they or a colleague have made up. These tests have a lot to do with whether a student passes the course and goes on to the next. Most people want teachers to do something like this. They don’t want students passed on just because they are a year older. Teachers willingly use tests to determine the future of others. Something inside me winces when we argue that tests shouldn’t be used to determine our future. (I realize that one can, as the lawyers say, “distinguish” the two situations, but they have more in common than we want to admit.)

    I worry that we teachers make contradictory arguments. On the one hand, we say that teachers are the most important part of the education system; we deserve to be paid well, etc. On the other hand, we say that we can’t be held responsible when our students don’t learn; there are so many things that go into learning; it’s more what students bring to the classroom than what is done in the classroom; etc.

    Some people have objected to the use of value-added measures on the grounds that they are not stable. A teacher with high value-added one year has low value-added the next. He had better students the first year and worse the second. If this turns out to be common, if few teachers have consistently high or low value-addeds, it would indicate that most teachers are about as good (or bad) as one another. That would be interesting to know.

  16. Roger,

    I agree with you that teachers give students tests frequently and that these should have a lot to do with whether a student passes the course. I doubt that many would argue with you there. And I agree that teachers should be responsible for what students learn. But this has to be evaluated intelligently.

    No, if few teachers have consistently high or low value-addeds, it doesn’t necessarily mean that they are about as good or bad as one another. Teachers for a certain level or demographic group may be clustered close together in terms of their students’ scores, so that a minor fluctuation can send them from one percentile range to another. Teachers for another level might be farther apart. How close they’re clustered together may have little or nothing to do with how close the quality of their teaching is.

    A recent study by the Measures of Effective Teaching (MET) project, funded by Gates, found that the difference between top and bottom quartile teachers in math was much larger than the difference between top and bottom quartile teachers in English. The authors of the study suggested that “outside the early elementary grades when students are learning to read, teachers may have limited impacts on general reading comprehension.”

    Of course, English teachers teach much more than “general reading comprehension.” But “general reading comprehension” is what the state tests measure.
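The clustering point above can be illustrated with made-up scores (scaled to hundredths of a point, so 350 means 3.50 and the comparisons are exact integers): the same small nudge moves a teacher’s percentile rank dramatically in a tightly clustered group and not at all in a spread-out one.

```python
# Percentile rank of a value within a group: the share of the group
# strictly below it. All scores here are invented, in hundredths of a
# point (350 = 3.50), so the comparisons are exact.
def percentile_rank(value, values):
    below = sum(1 for v in values if v < value)
    return 100.0 * below / len(values)

clustered = [350 + i for i in range(11)]    # 3.50, 3.51, ..., 3.60
spread = [300 + 10 * i for i in range(11)]  # 3.00, 3.10, ..., 4.00

# Nudge the middle teacher's score down by 0.03 in each group.
for group in (clustered, spread):
    before = percentile_rank(group[5], group)
    after = percentile_rank(group[5] - 3, group)
    print(round(before, 1), "->", round(after, 1))
```

With these numbers, the clustered teacher falls from roughly the 45th percentile to the 18th, while the spread-out teacher’s rank does not move at all, even though both scores changed by the same tiny amount.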

  17. Stuart and Roger, thanks for setting me straight on how comparisons using value-added would be made. However, aren’t there subjective decisions made in the methodologies created to project where kids of various SES/ethnic backgrounds should be at a given point past their last test? I was led to believe that value-added formulas were very tweakable, proprietary sorts of things.

    And Roger, do the results of the tests that you give your kids have a 12-35% margin of error?

  18. Yes, it’s true. Tests can be badly written, and value added will always be somewhat unreliable, principals can have grudges, etc. It’s impossible to produce a perfect measure. It’s difficult to even produce a very good measure. All these complaints have merit.

    And the end effect is that even horrible teachers never get fired.

    Don’t let the perfect be the enemy of the good.

  19. > Test scores should be information at a teacher’s disposal, not information used to dispose of teachers.

    Kind of leaves the question hanging in the air, doesn’t it?

    What factors should be used to dispose of teachers? I assume teachers aren’t sacrosanct and teaching isn’t a sinecure, so there must be some measure or measures by which a teacher can be determined to be unfit to teach.

    Since there’s an infinite number of factors which oughtn’t to be used to determine whether a teacher should be disposed of, perhaps it would be more worthwhile to list the reasons which could lead to a teacher’s disposal.

  20. If I were a principal, there would be several things I’d notice that might lead to a teacher’s dismissal:

    1. lack of knowledge of the subject;
    2. inability to handle the class;
    3. lack of actual instruction in a lesson (for instance, telling students at the start of the lesson that they are to complete pages 61-63 in the textbook, and doing this over and over);
    4. other glaring problems.

    And I would make sure that these were chronic and not circumstantial (for instance, a result of a teacher being reassigned to a subject outside of his or her license area).

    Of course test scores would come into play as well. But there, too, I’d ask: what is going on here?

    And I would also want to ensure that the school had the appropriate discipline supports, so that a teacher who ended up with a particularly rough class one year would not be left in the lurch.

    Some cases are clear cut, and some are not.

  21. Stuart Buck says:

    The nice thing about standardized tests, though, is that they can provide a useful check on what’s really occurring.

    Suppose there’s a teacher that in the principal’s eyes is doing all the wrong things. But when the standardized test results come out, it turns out that while her kids started third grade not knowing how to do two-digit multiplication, or how to make change, or subtract three-digit numbers, now they’re all whizzes at those things. So maybe the teacher is doing something right after all.

    Conversely, suppose there’s a teacher who looks great on the one day a year that the principal shows up to evaluate her. But it turns out that at the end of a year, her kids haven’t made any progress — they started out not knowing two-digit multiplication, and they still don’t know it. Isn’t that good to know? Why should the kids’ failure to learn anything be the smallest portion of the teacher’s evaluation?

  22. Supersub says:

    Hey, when NY can put out tests without typos, poorly designed questions, or complicated grammar that makes English teachers wince, maybe then I will consider accepting value added scores. My school just did field testing for future test questions, and some looked like NY contracted them out to the same guys that write the Nigerian scam emails.

    Oh, and my other requirement is that my district allow me to teach the way I know is best. If I can’t design my own curriculum, then there is no way I will allow myself to be evaluated on the success of students taught with a curriculum that is substandard.

  23. Supersub says:

    Stuart

    I don’t think that anyone is saying that student performance should be ignored, just that creating a rigid statewide system is going to hurt the system more than help.

  24. Stuart,

    All of this is true. The test results can offer a counterbalance to the principal’s judgment and can draw attention to serious problems (or accomplishments).

    I prefer curricular tests to non-curricular tests. That, presumably, is where local tests come in. Of course it’s important to have some general test for everyone, but it should not be given inordinate weight.

    What you’re talking about–using tests to determine whether students have learned something specific, like two-digit multiplication–makes sense.

    What doesn’t make sense is comparing teacher A, whose students scored an average of 3.52, to teacher B, whose students scored an average of 3.62. Calling one more “effective” than the other, without further investigation, is very dangerous.

  25. So all power to dismiss teachers ought to reside with the principal?

    Hmmm. What *is* the problem with that idea? And what group of folks is likely to find fault with the idea?

    Oh yeah, teachers will start howling about principals playing favourites. Then, that proposal assumes competence, fairness, and professionalism on the part of the principal, the crucial word being “assumes”.

    All things considered, if teacher evaluations must be based on something, rather than the current situation, which evaluates teachers based on how much they annoy their superiors, then tests have a great deal to recommend them.

    And, not that it’s going to do much to assuage the fears of teachers, the same tests that can be used to evaluate teachers can also be used to evaluate principals.

    Does that make the idea a bit more palatable?

  26. I tend to agree with Roger about the apparent disparity between the way teachers grade, and the way they wish to be graded. I think the core of the issue is coming up with a “fair” set of tests.

    The interesting thing there, I think, is that most students would agree that many of the tests given to them by their teachers are not, in fact, fair measures of what they’ve learned. I know I had a lot of teachers who gave terrible exams, while I had others that gave amazing ones. Seems like teachers are caught in the same dynamic.

  27. Allen,

    One question here is: who should make the ultimate decision (about teacher ratings, tenure decisions, etc.)?

    Should it be the principal? Perhaps not, especially in borderline or questionable cases. Should it be more than one person? Perhaps, if this does not result in an excruciating, drawn-out process.

    I’d think it might be something like grad school–have a committee evaluate the candidates for tenure (or dismissal, or whatever it might be) on the basis of observations, local tests, state tests, student work, and lesson plans and other materials. The principal would be on that committee but could be overturned.

    Of course, there’s the danger that the committee would be “stacked.” But in any case, the final decision should involve human judgment, with safeguards against irrational or emotionally charged human judgment.

  28. Roger Sweeny says:

    Roger, do the results of the tests that you give your kids have a 12-35% margin of error?

    I’m not sure what exactly you mean by “margin of error,” but I know that my tests aren’t perfect measures of what my students know. Sometimes they seem to be off by a lot, e.g. a kid who generally does well gets a lousy score on one test. But I use them, and every teacher I know uses tests. They aren’t the only thing that goes into a grade, but they are a major part.

    A student who had mediocre test grades but was wonderful in everything else would get a higher mark than his test grades would indicate, and would certainly pass. On the other hand, really bad test grades would doom him. I gather that’s the way most proposed value-added systems would work. Mediocre value-addeds, coupled with great observations and other things, would not get a teacher fired. But consistently terrible value-addeds would. That does not seem unfair to me.

  29. Stuart Buck says:

    By the way, since you know Ravitch, tell her to release the video of the Gist exchange. Gist has already agreed to do so. See http://jaypgreene.com/2011/05/18/diane-release-the-tapes-day-1/

  30. Diane, I’ll answer your question with a question – who gives a damn if the kids get educated?

    The teachers? Some. Some, a great deal. Some, not at all. Most, somewhere along a bell curve in between.

    The principals? Some. Some, a great deal….you know where this is going.

    The point being that before we decide who ought to decide whether teachers continue to draw a pay check it might be a worthwhile exercise to determine who cares whether the kids get an education.

  31. SuperSub says:

    Heck, I’d rather give 100% of the power over firing to the principal than to some sort of value added system. The principal, even one that is a mad, power hungry dictator, can be held accountable by parents, the school board, or the teachers’ union.
    I have a lot less trust in the various test-makers and bureaucrats that stay locked in their offices…and who are hardly ever held accountable for their actions.

  32. Leroy Hartley says:

    For a teacher to be solely judged as being a good teacher or not based on test scores is very unfair to teachers. I believe all students can learn, but what about those that don’t want to learn? No matter how many “strategies” are in place, there will always be students who do not want to learn. Is that the teacher’s fault? Should they be held accountable for that? Where does parental involvement come into play with all of this? Are they not responsible for their child as well?