The tests that can be computer scored

Over at the Curriculum Matters blog, Erik Robelen has a link-filled post about machine-scoring of essay tests entitled “Man vs. Computer: Who wins the essay-scoring challenge?”

It seems there was a study.

“The results demonstrated that overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genre,” says the study, co-authored by Mark Shermis, the dean of the University of Akron’s college of education, and Ben Hammer of Kaggle, a private firm that provides a platform for predictive modeling and analytics competitions.

There’s something odd going on here.

Barbara Chow, the education program director at Hewlett, said in the press release that she believes the results will encourage states to include a greater dose of writing in their state assessments.

And she believes this is good for education.

“The more we can use essays to assess what students have learned,” she said, “the greater likelihood they’ll master important academic content, critical thinking, and effective communication.”

Even if we grant that assessments help in mastering what they measure — something that I don’t think is clear in the absence of grade-like motivation — I can imagine that a school could spend the entire day doing nothing but assessing what’s learned through essays, and never actually get around to teaching anyone anything at all.  But let’s put aside the fact that the last quoted sentence is a blatant falsehood and focus on something else entirely.

The prevailing thought seems to be along the following lines: The test is a good test.  The fact that its essays get the same results from human readers and computer evaluation makes it better, because a machine-scorable essay is cheaper, and easier to deploy as an assessment.

But here’s another view: the fact that a machine can score your essays just as well as your human readers suggests that your human readers aren’t really doing a good job of reading the essays in the first place.  It suggests that having the essays you have on your test, and grading them in the way you do, is an utter waste of time, money, and effort.  The fact that you’re able to waste this time more cheaply by using a computer doesn’t transform it into a worthwhile activity.

If I were a testing agency, and someone established in a study that a computer program could grade my essays as well as my human graders, I’d be embarrassed, because it would now be public knowledge, proved by social science, that my essay tests weren’t really being read for substance and content all along, but instead were being assessed through some sort of cheap, easy algorithmic rubric — either by design or (less likely) through the laziness of my graders.  Of course, I don’t think anyone in the test industry is thinking of denying that students’ essays are assessed through a cheap, easy algorithmic rubric.  They’re issuing press releases.
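To make the “cheap, easy algorithmic rubric” charge concrete, here is a toy sketch. Everything in it is invented for illustration: the features, the weights, and the sample essays are not from the study or from any real scoring engine (real systems use many more features and trained weights). The point is only that a scorer built on shallow surface features can reward length and varied vocabulary without reading anything:

```python
# Toy "algorithmic rubric" scorer: shallow surface features only.
# All features, weights, and essays below are hypothetical illustrations.
import re

def surface_features(essay):
    """Extract shallow features a rubric-style scorer might use:
    word count, average word length, sentence count, unique-word ratio."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n = len(words) or 1
    return [
        len(words),                                # length
        sum(len(w) for w in words) / n,            # average word length
        len(sentences),                            # sentence count
        len(set(w.lower() for w in words)) / n,    # lexical variety
    ]

def toy_score(essay, weights=(0.01, 0.5, 0.1, 2.0), bias=0.0):
    """Weighted sum of surface features, clipped to a 1-6 scale.
    The weights are made up for illustration, not trained on anything."""
    raw = bias + sum(w * f for w, f in zip(weights, surface_features(essay)))
    return max(1.0, min(6.0, raw))

short = "Dogs are nice. I like dogs."
longer = ("Standardized essays reward length and varied vocabulary. "
          "A longer response with more distinct words scores higher "
          "under shallow feature rubrics, regardless of substance.")
```

Under this sketch, the longer, wordier essay outscores the short one regardless of whether either says anything true or interesting, which is exactly the worry about what human graders applying a tight rubric are (and are not) measuring.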

The fact that this is a selling point for the test-makers is all the proof I need to know that there are large chunks of the education establishment in this country that have no real interest in actually educating people to “think critically” or “communicate effectively”.

In fact, you might even think that learning these skills requires practice communicating with someone who thinks.  But what do I know?


  1. Michael, this is another thoughtful post and, by the way, I think you know a lot. However, as thought-provoking as your post is, it completely misses the most important mark. You contend, if I read right, that if a computer can score an essay as efficiently as a human, then the human must not be an efficient essay grader. (Of course, I realize that you question the entire process and rightly so.)

    The point I believe you fail to make — and by far the most critical point — is that the idea of scoring or grading an essay is, in and of itself, archaic and utterly ridiculous.

    A computer can be programmed to do anything, including place a number on an essay, based on a particular set of acceptable statements. This score is no more valuable than a teacher placing a score of 80/100 on a student’s essay. In fact, both are an insult to the student.

    Something a computer can’t do (sadly, many teachers aren’t very good at it either) is evaluate a student’s thinking, based on a written work. It’s important to consider how a student’s thought processes brought certain statements and questions to the paper. This evaluation should be followed by verbal or written feedback, involving questions for the student. A time for reflection should be provided, followed potentially by revision and resubmission of the work.

    This is teaching and learning. It’s not the teacher’s place to put an arbitrary number on a written work. It’s certainly not appropriate for a machine to do so.

    • Stacy in NJ says:

      Yes, but the point of a test essay isn’t to learn; it’s to evaluate prior learning. Nothing can replace the informed, thoughtful feedback of a skilled teacher in the process of learning to write reasonably well. The purpose of an essay test isn’t to further that process, but to evaluate the outcome of that process.

      • With all due respect, everything is about learning. It’s ongoing.

        • Stacy in NJ says:

Oh goodness, do you really believe that? Have you ever worked in the private sector?

        • Mark, you should read up on assessment. What we’ve got here is assessment OF learning. Not all assessment is assessment AS learning or FOR learning. Sometimes, you need to know whether a student got it to know whether the way they were taught worked, not just to improve the student, but to improve the teaching practice itself.

  2. Michael E. Lopez says:

    You contend, if I read right, that if a computer can score an essay as efficiently as a human, then the human must not be an efficient essay grader.


    My claim is about the actual performances of the humans, not the humans themselves.

    I am claiming that if a computer scores an essay the same as a human, then whatever the human is doing when they are “grading” that essay isn’t something worthwhile in the world of writing.

    • On that we can definitely agree. The way teachers evaluate learning is, in most cases, poorly done. The blame falls to the system we’re in; it teaches teachers to test, rather than to evaluate learning and to diagnose the problems that students have.

      Evaluation should always be a conversation, never a test.

      • Roger Sweeny says:

        Evaluation should always be a conversation, never a test.

        No doubt these are ignorant questions but,

        Can a teacher have 120 useful conversations with students in a day or two?

        What do the other students do while each conversation is going on?

        • Pair them up and have them engage in conversations about having a conversation with the teacher. Seriously, the idea is nice but wholly impractical. And I fail to see why a test or quiz can’t “evaluate learning.” I give oral quizzes and we grade them and discuss the answers in class. Calling on students to say why they answered the way they did can reveal a lot about what they’re getting or not getting. At least I’d like to think so.

          • I’d never embarrass my students by having them answer a test question in front of their peers.

            The problem with your quiz idea is that it gives in to the idea of assessment — a completely backward way of evaluating learning. Of course, it’s universally accepted, because so few people are bold enough to say it’s wrong.

            Also, I have conversations with my students all of the time. Do we meet everyday? Of course not. I also write narrative feedback on their work, in places where they can respond. Is this a lot of work? Absolutely. If we’re afraid of work, we’re as bad as the bureaucrats suggest that we are.

            Oh, and the suggestion that students can’t work independently or collaboratively while others converse with the teacher is one embraced by people who prefer control over cooperation — another issue but somewhat relevant here.

        • I only attempt to have evaluation conferences once each grading period. These take 2-3 days. The rest of my students work on the projects they’ve been constructing all year.

          I do leave meaningful feedback about learning for my students in a variety of ways, and I speak to them during class often.

          So, the conversation about learning is ongoing.

      • Peace Corps says:

        Well clearly I am almost worthless as a teacher, because for some of my classes I test weekly. I would like to test less frequently, or even not at all. I’ve found that my students are too cavalier with any assessment (evaluation, etc.) that is not a test.

        Maybe it is just different with math. Most of my students will learn the new skill, terms, or whatever the day I teach it, but I need them to remember what I’ve taught so they can use it later on. If I don’t test fairly often, they won’t make an effort to try to retain what they have learned.

        I would welcome any computer to grade any of my tests if it was proven to grade up to my standards. Wow, if I didn’t have to figure out what the kid was thinking when doing some problems (in order to give appropriate feedback), if a computer knew… I could have saved at least 2 hours of grading this morning alone.

        • Hopefully, you’ll read my book, ROLE Reversal (ASCD) when it comes out in February; it addresses a lot of what you say in greater detail than I can offer here. For now, I’ll offer these few responses.

          There is nothing wrong with a computer diagnostic tool that is used to help the teacher get a pulse on learning in a class. If 60% miss number 4, why did it happen? I do this often, when evaluating simple concepts that I’ll need for future scaffolds. I’d never use individual scores to assess students, though. The most obvious problem with multiple choice test questions is you never really know when a student guessed.

          Not to sound harsh, but if your students are “cavalier” with your assessments, this might be one more statement about the idea of assessing in the first place. I often hear math teachers say their subject is different; I’ve never understood that. Perhaps it’s like ELA teachers who say they have to teach grammar, even though their kids hate it. Sounds like a weakness to me. It’s easy to teach grammar, but it’s a lot tougher to create meaningful projects that integrate the skills necessary for good reading and writing.

          Welcoming a computer program to help you is fine. I would say, however, if you welcome it so you don’t have to provide feedback about learning, that’s something entirely different.

          All I’d ask is that you consider the possibility that your students might embrace math if it’s fun and if they don’t fear the punishment of a test grade.

          • Roger Sweeny says:

            Why is a test grade punishment?

          • Peace Corps says:

            Math is fun for me! I love math, that is why I teach it. But, I have yet to discover how to make it fun for everyone. (Although several students have repeatedly told me I am their favorite teacher, and that I am funny. They may say that to all their teachers.)

            This is so off topic. Maybe we can pick this up again somewhere more aligned with the original post.

          • Roger, any grade is a punishment.

          • Roger Sweeny says:

            Mark, I hope you don’t blow off your students like you blew me off.

          • As a veteran of too many tests to count, including those for licensure and master’s and doctoral comprehensives, I considered those hard-earned top grades to be a reward. Only poor grades, likely correlated to less effort, might be punishment.

          • Roger, I apologize. I certainly didn’t mean to blow you off. Cal implies I talk too much, so I was attempting to keep it short.

            Perhaps I wrongly assumed you had read one of my blogs, which clearly outlines my feelings on grades. The idea of replacing number and letter grades with narrative feedback is a major focus of my blog and my forthcoming book.

            Grades are a punishment, because they are subjective, judgmental and provide no useful feedback for students. There’s much more to it, but I’m not sure this is the thread for it.

            I hope this helps and, again, I apologize for the initial response.

  3. Todd Farley says:

    One thing to keep in mind about the computers in question: they CANNOT read. They literally have no idea what an essay says, what its ideas are, how trenchant, how funny, how well-written (or not).

    So, in effect, a study like this is saying there are computers that don’t know what an essay says that can assess it as accurately as can the overworked, underpaid temporary employees who currently read them at the rate of one essay every two minutes. Yes, we’re talking a couple of great options here…

    Just buy my book, why don’t you, if you want to see the testing industry in all its glory — “Making the Grades: My Misadventures in the Standardized Testing Industry.”

    • Neural networks don’t know anything either, yet they produce useful results.

      I think the practical question here is more relevant than the philosophical one.

  4. Good lord, can someone manage to stop Mark Barnes from boring the world endlessly with 8-12 posts on the fabulousness of his teaching? Surely one post is five hundred too many.

    But here’s another view: the fact that a machine can score your essays just as well as your human readers suggests that your human readers aren’t really doing a good job of reading the essays in the first place. It suggests that having the essays you have on your test, and grading them in the way you do, is an utter waste of time, money, and effort. The fact that you’re able to waste this time more cheaply by using a computer doesn’t transform it into a worthwhile activity.

    It is rare that I agree with you, so mark this day well.

    It’s not that human readers aren’t “doing a good job”, but rather that standardized essays are graded on a ferociously tight rubric that allows for little deviation. And yes, they are a waste of time.

  5. palisadesk says:

    Sounds like a lovers’ quarrel, how sweet. Methinks Cal and Mark are soul-mates. They are so much alike! Apart from, of course, the specifics of their opinions.

    • Stacy in NJ says:

      They’re different in that Cal is a real iconoclast and contrarian; Mark is a pseudo.

    I’m guessing that most of the posts here are uninformed by AI developments. Advanced computer programs are at the point where they can fool most people into thinking they are real. I don’t know what technology was used for this grading, but the fact that the discussions seem to revolve around algorithms shows a limited understanding of how decision-making programs have evolved. I’m more interested in how computers will continue to get better at this and be as strong as the best human efforts in a fraction of the time at a fraction of the cost.

    Yes, essay evaluation at mass scale now stinks. No, the fact that computers can do it as well as humans is not proof that it stinks.