We need more tests, but what kind?

American Schools Need More Testing, Not Less, writes Ezekiel J. Emanuel in The New Republic. Students learn more when they take frequent, short tests.

A young neuroscientist named Andrew Butler has gone further, showing that testing can actually facilitate creative problem solving. In Butler’s research, undergraduates were given six prose passages of about 1,000 words each, filled with facts and concepts. (Fact: There are approximately 1,000 species of bats. Concept: how bats’ echolocation works.) He had the students simply study some of the passages; on others, he tested them repeatedly. Not only did his subjects demonstrate a better grasp of the tested material, but they also fared better when asked to take the concepts about which they’d been quizzed and apply them in completely new contexts—for example, by using what they’d learned about bat and bird wings to answer questions about airplane wings. When students had been tested on the passages, rather than just reading them, they got about 50 percent more of the answers correct. They were better at drawing inferences, thanks to the testing effect.

Only tests written by teachers are useful, responds Diane Ravitch. “Today’s standardized tests are useless.”

What he really admires, and appropriately so, are the regular weekly tests that he took in high school chemistry. His chemistry teacher Mr. Koontz knew what he had taught. He tested the students on what they had learned. He knew by the end of the day or over the weekend which students were keeping up and which ones were falling behind. He could act on that knowledge immediately to make sure that students understood what he thought he had taught and to explain it again to those who did not. He also learned whether to adjust his style of teaching to communicate the concepts and facts of chemistry more clearly to students. Mr. Koontz used the tests appropriately: to help his students.

Standardized exams are being used as “a ranking and rating system, one that gives carrots to teachers if their students do well but beats them with a stick (or fires them and closes their school) if they don’t,” Ravitch writes.

Most researchers say that teacher quality cannot be reliably measured by student test scores, because there are so many other variables that influence the scores, but the federal Department of Education is betting billions of dollars on it.

The job of writing, grading and analyzing tests belongs to “Mr. Koontz, not to Arne Duncan or Pearson or McGraw-Hill,” concludes Ravitch.

What is a typical sixth grader?

According to Meredith Kolodner at Insideschools, many principals and teachers have been raising concerns over the rubrics and scoring procedures for this year’s standardized tests in New York State.

Sometimes the rubrics (for the written portions of the tests) are ambiguous. Sometimes they work against good judgment. Sometimes the writing prompt itself puts students and scorers alike in a quandary.

Here’s an example of the last of these:

In addition, a listening passage about a kid who loved music asked students to write about how the child in the passage is like and unlike a “typical 6th grader.” Teachers debated what would lead to a high score: does a typical 6th grader really like music? Does a typical 6th grader attend an after-school program? Take the bus? There was no consensus on what details would be considered “meaningful and relevant examples,” as dictated by the scoring guide.

Assuming that the description is accurate, I wonder what the test makers had in mind. What is the point of asking students to compare a character to a “typical” sixth grader? Is there such a thing? Are children supposed to know (or care) what a “typical” sixth grader is?

In order to receive a high score, a student must fulfill all the requirements of the task. Here an intellectually advanced student could easily get sidetracked with definitions of “typical” and fail to write the essay as required.

Rubrics have inherent limitations; you can’t standardize good judgment. When applied on a massive scale, they become more limiting still. But they are here to stay, at least for now. Given that state of things, it’s all the more important to create good test questions. This, apparently, is not one.

I scored tests this year but signed a confidentiality agreement. I am not allowed to discuss what I saw on the tests or in student writing. Thus I am limiting myself to commenting on what others have reported. In the past, New York State tests were released to the public after they had been administered and scored. This is good practice; we should all have the opportunity to see and comment on them. After all, they presumably reflect what students are expected to learn.

Automated essay grading leads to more writing

My standard advice for learning how to write can be boiled down to six words: Read a lot. Write a lot. If brevity is essential, three words are enough: Write a lot. I can even make do with one word: Write.

So I’m sympathetic to the argument that students will write better if they write more, with feedback on their efforts. But teachers don’t have the time to read and respond to every draft of every paper.

Automated essay scoring lets teachers assign more writing and focus their own time on “higher order feedback,” argues Tom Vander Ark on Getting Smart. In response to an attack on scoring engines in the New York Times, Vander Ark summarizes and links to the case for automation.

Measurement is a friend to creativity, he writes in another post.

The online scoring engines use the same rubrics to score essays as human graders. Any ‘standardization’ of writing is not a function of the method of scoring but the nature of the prompt, i.e., if a state requires every 8th grader to write a five-paragraph essay every year it may lead to formulaic teaching—that’s a teaching issue driven by a testing issue, not a scoring issue.

People are sick of standardized tests “because most states are using old psychometric technology to administer inexpensive tests with little real performance assessment.”

. . . we’ve been using these tests for more than they were designed for—to hold schools accountable, to manage student matriculation, to evaluate teachers, and to improve instruction. But remember the state of the sector in the early 90s before state tests were widely used. There was no data, chronic failure was accepted, and the achievement gap was largely unrecognized. Measurement is key to improvement.

“Essay graders will soon be incorporated into word processors and will be used as commonly as spell-check,” Vander Ark predicts. Students will get more assessment to help them improve.
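For readers curious what a scoring engine actually does: commercial engines predict human scores from measurable features of the text. Here’s a toy sketch of the idea; the features and weights are my own inventions for illustration, not any vendor’s method.

```python
# A toy sketch of feature-based essay scoring. Real engines fit their
# weights by regression against thousands of human-scored essays; these
# features and weights are invented for illustration only.
import re

def extract_features(essay: str) -> dict:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "vocab_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

# Hypothetical weights a real engine would learn from training data.
WEIGHTS = {"word_count": 0.004, "avg_sentence_length": 0.05, "vocab_diversity": 2.0}

def score_essay(essay: str, max_score: int = 6) -> int:
    features = extract_features(essay)
    raw = sum(WEIGHTS[name] * value for name, value in features.items())
    return max(1, min(max_score, round(raw)))  # clamp to a 1-6 rubric scale
```

The point is not that counting words measures good writing; it’s that the engine applies the same arithmetic to every essay, which is where both its consistency and its blind spots come from.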

Update: Machines Shouldn’t Grade Student Writing — Yet, writes Dana Goldstein on Slate.

‘Star’ school shows signs of cheating

When test scores soared at a low-performing District of Columbia school, the principal and teachers collected bonuses. Crosby S. Noyes Education Campus was called one of D.C.’s “shining stars” and was named a National Blue Ribbon School. But cheating may explain Noyes’ apparent turnaround, reports USA Today.

In 2006, only 10% of Noyes’ students scored “proficient” or “advanced” in math on the standardized tests required by the federal No Child Left Behind law. Two years later, 58% achieved that level. The school showed similar gains in reading.

. . . Michelle Rhee, then chancellor of D.C. schools, took a special interest in Noyes. She touted the school, which now serves preschoolers through eighth-graders, as an example of how the sweeping changes she championed could transform even the lowest-performing Washington schools. Twice in three years, she rewarded Noyes’ staff for boosting scores: In 2008 and again in 2010, each teacher won an $8,000 bonus, and the principal won $10,000.

Noyes’ proficiency rates fell significantly in 2010.

“For the past three school years most of Noyes’ classrooms had extraordinarily high numbers of erasures on standardized tests,” reports USA Today. “The consistent pattern was that wrong answers were erased and changed to right ones.”

On the 2009 reading test, D.C. seventh graders averaged fewer than one wrong-to-right erasure apiece. At Noyes, seventh graders averaged 12.7. “The odds are better for winning the Powerball grand prize than having that many erasures by chance,” according to statisticians consulted by the newspaper.
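For the curious, here’s a back-of-the-envelope version of the statisticians’ claim. I’m assuming erasure counts follow a Poisson distribution with the district-wide mean of about one; that simplification is mine, since USA Today didn’t publish the consultants’ actual model.

```python
# Rough odds of a single student making 13+ wrong-to-right erasures by
# chance, if erasures follow a Poisson distribution with mean 1 (an
# assumed model; the newspaper's consultants did not publish theirs).
from math import exp, factorial

def poisson_tail(k: int, mean: float) -> float:
    """P(X >= k) for a Poisson random variable with the given mean."""
    return 1.0 - sum(exp(-mean) * mean**i / factorial(i) for i in range(k))

print(poisson_tail(13, 1.0))  # ~6e-11, rarer than a Powerball jackpot
# And that's one student. For an entire classroom to average 12.7 such
# erasures, many students would need extreme counts at once, making the
# joint probability smaller still.
```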

Cheaters prosper

When test scores seem too good to believe, they probably are, concludes a USA Today story on cheating on standardized tests.

Tip-offs: the same cohort of students earns very low scores in one grade, very high scores in the next grade and very low scores again the year after. Investigators also look for an unusually large number of erasures, with nearly all answers changed to the correct one.
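The first tip-off is simple enough to mechanize. Here’s a toy screening rule, with a threshold I invented for illustration; real auditors use far more careful statistics.

```python
# Flag cohorts whose average scores swing sharply up and then back down
# across consecutive grades. The 20-point threshold is invented for
# illustration, not drawn from any real auditing standard.
def suspicious_swing(grade_avgs: list[float], jump: float = 20.0) -> bool:
    """True if any score rises by `jump` points and then falls back by `jump`."""
    for a, b, c in zip(grade_avgs, grade_avgs[1:], grade_avgs[2:]):
        if b - a >= jump and b - c >= jump:
            return True
    return False

# A cohort averaging 35, then 62, then 38 across three years gets flagged:
print(suspicious_swing([35.0, 62.0, 38.0]))  # True
```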

In an Arizona State survey, more than half of teachers admitted to some form of cheating. The survey’s 19 ways to cheat included erasing incorrect answers and filling in correct ones, telling students to redo answers, giving students extra time and peeking at test questions in advance by “tubing” sealed exams.

Not so standardized tests

Todd Farley’s Making the Grades: My Misadventures in the Standardized Testing Industry gets a review in the Washington Post. After rising from scorer to trainer to test writer, Farley concluded that standardized tests are “less a precise tool to assess students’ exact abilities than just a lucrative means to make indefinite and indistinct generalizations about them.”

Throughout his career, grade manipulation was the norm. He and other leaders would change scores or toss some out in order to achieve “reliability,” a measure of how frequently different readers scored a question the same way. Among scorers, he writes, “the questions were never about what a student response might have deserved; the questions were only about what score to give to ensure statistical agreement.”
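For readers unfamiliar with the jargon: inter-rater reliability is usually reported as the rate at which different readers give the same response the same score, sometimes corrected for chance agreement with Cohen’s kappa. A minimal sketch, with made-up scores:

```python
# Exact agreement and Cohen's kappa for two readers scoring the same
# essays. Kappa discounts the agreement two readers would reach by
# luck alone, given their individual score distributions.
from collections import Counter

def agreement_and_kappa(scores_a: list[int], scores_b: list[int]) -> tuple[float, float]:
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    dist_a, dist_b = Counter(scores_a), Counter(scores_b)
    expected = sum(dist_a[s] / n * dist_b[s] / n for s in dist_a)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Two readers scoring ten essays on a 0-4 rubric (made-up data):
reader1 = [3, 2, 4, 1, 3, 2, 0, 4, 3, 2]
reader2 = [3, 2, 3, 1, 3, 2, 1, 4, 3, 3]
print(agreement_and_kappa(reader1, reader2))  # (0.7, 0.6)
```

Farley’s complaint, put in these terms, is that projects chased the agreement number rather than the right scores; two readers can agree perfectly and both be wrong.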

In a Christian Science Monitor column, Standardized tests are not the answer, Farley writes:

On one scoring project I managed, for instance, the government agency in charge passed down an edict stating that all scorers had to go through a remedial retraining (group discussions with their peers about scoring rubrics and training papers) after any work stoppage of 30 minutes or more, including their scheduled half-hour lunch break. The government agency in charge said such retrainings would help ensure the student responses were scored within the proper context of “psychometric rigor.”

The company avoided the time-consuming retraining sessions by cutting the lunch break to 29 minutes.

Farley’s entertaining book highlights the difficulties of evaluating thousands of student answers to questions that invariably turn out to be more ambiguous than the test writers thought.

In one test, elementary students read a passage about taste and answered a few questions, including naming their favorite food and identifying it as salty, sweet, bitter or sour. Scorers argued about what’s a food: water? dirt? grass? And is the kid who thinks pizza is sweet or bitter or sour necessarily wrong? Who knows what toppings are on that pizza?

Farley wants to leave grading to the teachers, who know their own students. That’s fine only if we want to give up on accountability measures. If Mrs. Chips says all her kids are proficient readers, we don’t know if that’s true or if Mrs. C has very low standards. It takes an independent test of some sort to determine whether children have learned what the state has decided they should learn. It won’t be precise, but it doesn’t have to be if enough scores are aggregated. (If an open-ended test is used to decide whether an individual student passes to the next grade, then there has to be a second look to make sure it’s an accurate reflection of the student’s performance.)
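The reason aggregation rescues an imprecise test is the standard error of the mean, which shrinks as one over the square root of the number of scores. A quick simulation makes the point; the “true” score and the noise level here are arbitrary assumptions.

```python
# Noisy individual scores still average out to a trustworthy group
# figure. The "true" score and the scorer noise are arbitrary values
# chosen for illustration.
import random

random.seed(0)
TRUE_SCORE = 70.0  # the score a perfectly reliable test would assign
NOISE = 10.0       # scorer-to-scorer swing on any single essay

def noisy_score() -> float:
    return TRUE_SCORE + random.gauss(0, NOISE)

for n in (1, 25, 400):
    avg = sum(noisy_score() for _ in range(n)) / n
    print(f"average of {n:>3} scores: {avg:.1f}")
# A single score can miss by ten points or more; an average of hundreds
# rarely misses by more than one.
```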

My reaction to the book was to wonder if open-ended questions that need to be scored by fallible humans are worth the cost. Farley describes scorers who don’t understand English idioms or just plain aren’t very bright. If that’s the way it is, why bother? Multiple-choice items can be scored quickly and very cheaply.

Who shall test the test readers?

Poorly trained part-timers determine test scores that loom so large in education, writes Todd Farley in a New York Times op-ed. The author of Making the Grades, Farley was hired to score statewide fourth-grade reading comprehension tests as a graduate student in 1994.

One of the tests I scored had students read a passage about bicycle safety. They were then instructed to draw a poster that illustrated a rule that was indicated in the text. We would award one point for a poster that included a correct rule and zero for a drawing that did not.

The first poster I saw was a drawing of a young cyclist, a helmet tightly attached to his head, flying his bike over a canal filled with flaming oil, his two arms waving wildly in the air. I stared at the response for minutes. Was this a picture of a helmet-wearing child who understood the basic rules of bike safety? Or was it meant to portray a youngster killing himself on two wheels?

Some fellow scorers wanted to give full marks for understanding bicycle safety; others wanted to give a zero.

I realized then — an epiphany confirmed over a decade and a half of experience in the testing industry — that the score any student would earn mostly depended on which temporary employee viewed his response.

This is why multiple-choice tests can be more reliable than subjectively graded tests that rely on drawing (or writing) skills to measure reading comprehension.

I have a review copy of Farley’s book, which I plan to read very soon — along with the four other review books waiting for me. Maybe today! Anyhow, I vowed not to mention other people’s books without promoting my own book, Our School: The Inspiring Story of Two Teachers, One Big Idea and the Charter School That Beat the Odds.

Take, take, take your test . . .

Kids are learning a test-prep version of Row, Row, Row Your Boat, writes Patti Hartigan, who got the lyrics from Ed Miller of the Alliance for Childhood.

Take, take, take your test
Follow all the rules
Go to bed and get some rest
Eat some good brain food.

Keep, keep, keep your desk
A neat and tidy spot
Wear smart clothes so you don’t feel
Too cold or too hot.

Bub, bub, bubble in
Answers carefully
Do the easy problems first
And hard ones finally.

It goes on.

Allegedly, a kindergartener told her after-school teacher the class was supposed to memorize the song. I have my doubts, since No Child Left Behind requires testing to start in third grade; some states test second graders, but none that I know of require standardized tests in first grade, much less in kindergarten. The kindergarten story could be a hoax, but the song itself is probably real, written for test-age elementary students.