Why Rising Test Scores May Not Equal Increased Student Learning
When state test scores go up, teachers and administrators often breathe a sigh of relief that their schools will not be declared failing under the No Child Left Behind (NCLB) Act. Others feel satisfaction that funding for their public schools, which makes up such a large part of their states' budgets, appears to be well spent, and that there is an increased likelihood of having an educated workforce to meet their community's future economic needs. The positive feelings associated with rising test scores are based on our interpretation of what that increase in scores means. A rise in state, district, or school test scores leads most people to infer that greater learning has taken place. Psychometricians have a technical term for the trustworthiness of such inferences: validity.
Validity is a complex judgment about the likelihood that a person’s interpretations of test scores are warranted, reasonable, sensible, or trustworthy. It seems patently obvious that if a test is well made, then rising scores for a student, school, or district indicate greater student learning, increased understanding, greater comprehension of the curriculum that has been assessed, and so forth. Although it seems obvious, life just isn’t that simple. In the era of NCLB there is reason to believe that those kinds of inferences may not be valid.
Here are some of the numerous reasons why rising test scores may not be related to increases in student learning:
1) Changing the passing score of a test so that more children are considered proficient. NCLB is bizarre in that it requires 100 percent of the children at a school to be "proficient" in reading and mathematics by 2014. But what does proficiency mean?
Setting the passing score for a test is almost pure politics, only slightly informed by statistics and psychometrics. The choice of a cut score for "passing" or "failing" a test, or for being labeled "proficient," "in need of improvement," or "exemplary," is value-laden. It is, in fact, a predominantly political decision. Think of it this way: If your students, teachers, and schools faced inevitably and inescapably being labeled as failures under a law that is impossible to meet, but could instead be seen as successful if you dropped the score at which students are classified as proficient, wouldn't you be tempted to do that? Virtually every state in America has done this, fiddling with its passing scores to make it appear that more children are proficient.
As a result, the increasing percentage of students declared to be proficient does not necessarily mean that students’ scores have actually risen. It may only mean that more students are classified as proficient because the score for entry into the category “proficient” has been lowered.
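To see how large this effect can be, here is a minimal sketch in Python. Everything in it is invented for illustration: the simulated score distribution and both cut points are hypothetical, not drawn from any state's actual test.

```python
import random

# Hypothetical illustration: the same score distribution produces very
# different "percent proficient" figures depending only on where the
# cut score is set. All numbers here are invented.
random.seed(1)
scores = [random.gauss(500, 100) for _ in range(10_000)]  # simulated scale scores

def percent_proficient(scores, cut):
    """Share of students at or above the cut score, as a percentage."""
    return 100 * sum(s >= cut for s in scores) / len(scores)

print(percent_proficient(scores, 550))  # stricter cut: roughly 31% "proficient"
print(percent_proficient(scores, 475))  # lowered cut:  roughly 60% "proficient"
```

Not a single student's score changes between the two lines; only the label attached to the same distribution of scores does.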
2) Engaging in excessive, perhaps unethical, and, in some cases, illegal test preparation, resulting in higher test scores, but not necessarily greater learning. Under NCLB, many schools and districts have chosen to markedly increase test-preparation activities to ensure that student scores go up. But the tests were built to assess learning under normal instructional conditions, not conditions in which students are drilled daily on tasks known to be on their state assessment, or tested daily or weekly with exams suspiciously like those the state uses to assess NCLB standards.
There is evidence from all over the country suggesting that it is not uncommon for 20-60 school days per year to be spent in test-preparation activities. Children can certainly be trained to answer questions a certain way if they are drilled enough on items like those that will appear on their test. And so their scores on the tests for which they were drilled will increase. But that is not education. It is training. Scores will go up, but it is less clear that any authentic learning has occurred.
3) Familiarity with the objectives and the items on a test results in increased scores every year. Teachers and administrators are not fools. If a test is given every year with the same objectives, built to the same curriculum standards, and uses many of the same items from one administration to the next, teachers and administrators come to know what will be on the test. Unless the testing company employed by the state is willing to change items frequently, test scores are likely to rise every year. Then we end up uncertain about the validity of our inference that students really learned more than the year before.
It is expensive to change test specifications (what the test will assess and the type of items it will use to do that), and it is also expensive to change test items frequently. So in most states that does not happen. Thus, inevitably, a larger and larger percentage of the children in a state begin to score above the average that was obtained the first time the test was given. The result is called the "Lake Wobegon Effect," named after Garrison Keillor's fictional Lake Wobegon, where "all the women are strong, all the men are good looking, and all the children are above average."
4) The test items are not tapping the knowledge we really want to assess. It is easy to understand why so many items on so many NCLB tests are multiple-choice, rather than essays or some kind of assessment of complex performance. Multiple-choice items are cheap to produce and score. By including lots of such items during any one testing session, a test becomes more reliable. That is, the scores obtained are more dependable as indicators of whatever it is that the multiple-choice items measure. The test is not more valid as an indicator of learning, but it is almost always a more accurate measurement tool. Essays, on the other hand, are expensive and time-consuming to score, and you can have only one or a few of them per test session. So an essay test may not be very reliable, and its scores are often less dependable as indicators of learning.
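The link between the number of items and reliability is usually described with the Spearman-Brown prophecy formula, a standard psychometric result. The sketch below applies it with invented reliability values, purely for illustration:

```python
# The Spearman-Brown prophecy formula: the standard psychometric estimate
# of how reliability changes when a test is lengthened by a factor k.
# The starting reliability of 0.60 below is invented for illustration.

def spearman_brown(rho: float, k: float) -> float:
    """Projected reliability of a test lengthened by factor k."""
    return k * rho / (1 + (k - 1) * rho)

# A 10-item multiple-choice section with reliability 0.60, expanded to
# 40 comparable items (k = 4), is projected at about 0.86:
print(round(spearman_brown(0.60, 4), 2))  # 0.86
```

More items of the same kind buy dependability, but only dependability in measuring whatever the items measure; as noted above, that is reliability, not validity.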
So when the multiple-choice format is relied on as heavily as it is in the NCLB testing programs, it is harder to be sure that deeper, more complex learning has taken place. Students can select the right answer from memory, but they may not understand the area being assessed as well as they should. They may be unable to evaluate the data related to some phenomenon, even though they know the name of the phenomenon and thus get the multiple-choice item assessing their knowledge in that area correct. Eliciting higher-level thinking is certainly possible with multiple-choice items, but it is not typical. Here is an actual item assessing reading and writing skills on the TAKS, the Texas Assessment of Knowledge and Skills. After reading a short story, the student is asked:
What change, if any, should be made in sentence 4?
A) Change civilization to civilazation …
If I wanted to know whether a child could write, I'd have the child write. If I wanted to know whether children understood the precarious life of some cultural group (which was what the story was about), I'd ask more complex questions than this: Why is the way of life of these people threatened? How would you convince these people to move and give up their centuries-old culture? And so forth. My point is that a series of select-type, multiple-choice items, requiring little in the way of complex cognition, does not ensure that learning of the type we desire in our youth is taking place. And this is true even when the scores go up.
5) Pushing out the score suppressors, keeping the score increasers. Because of the high stakes associated with so many state tests, administrators in many schools and districts have found ways to keep some students, the poorest-performing students, from taking those tests. In Birmingham, Alabama, administrators dropped over 500 students from the high schools just before state testing. In New York City, political leaders had to apologize for school policies that pushed out thousands of students.
Pushing out the weakest students helps to raise the scores at a school or in a district. School administrations drop or push out students through various means: they suspend certain children, or move them to another school mid-year so that their scores will not have to be counted. Children who are score suppressors are not liked by their teachers or school administrators. They are made to feel unwanted because they are unwanted.
On the other hand, score increasers are the high-performing students. Many schools have required these students to retake tests they had already passed, a second and even a third time, so that the average score of a class or school would go up. The Wall Street Journal reported that in one Ohio district, teachers and schools stopped identifying gifted children because they were afraid those students would be moved to a special class or school and their test scores would go with them. As a result, score increasers are deliberately (and immorally, and possibly illegally) kept from enrichment opportunities that the district intended to provide for them.
These efforts by schools and teachers make our interpretations of test scores problematic. We cannot trust the data we get. But, in fact, these activities have a much worse effect: They change the relationship between teachers and students from what is typically a caring one to an instrumental one. Under the high stakes associated with the tests used to satisfy the NCLB law, children are too often seen positively only if they can increase scores and they are too often seen negatively if they cannot. Their worth becomes their test score and that is a sad state of affairs.
6) Out-and-out cheating to make the scores go up. Thousands of cases of suspicious scores have been uncovered. There are companies that look for anomalies in test scoring, often finding incidents such as a low-scoring student who suddenly gets seven items right in a row, or a class in a low-performing school that suddenly outperforms classes in a neighboring high-performing school. These may or may not be instances of cheating, but several hundred such anomalies were found in connection with the NCLB tests in Texas, and the State Department of Education refused to investigate. This, of course, should not be a surprise, since it was discovered that the head of the State Department of Education had for many years been turning in false scores from the district in which she had previously been a superintendent.
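To give a rough sense of why such a streak raises a flag, here is a back-of-the-envelope sketch. The per-item success probability is an invented assumption, and real forensic analyses rely on far more sophisticated models than this simple independence calculation:

```python
# Back-of-the-envelope check on one anomaly described above: how likely
# is it that a weak student answers seven items in a row correctly?
# Assumes (unrealistically simply) independent items and an invented
# per-item success probability of 0.3 for a low-scoring student.

p_correct = 0.3
streak_length = 7
p_streak = p_correct ** streak_length

print(f"P(seven in a row) = {p_streak:.6f}")  # about 0.000219, roughly 1 in 4,600
```

One such streak proves little on its own; it is patterns of improbable results across classrooms and schools that make auditors look closer.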
Texas is the state that gave us the model and the advocates for NCLB, and it has therefore had more time to develop its record of cheating on the tests the state uses. For example, for almost a decade, Wesley Elementary School in Houston won accolades for teaching low-income students how to read. The school was even featured on an Oprah Winfrey show about schools that "defy the odds." But Wesley wasn't defying the odds at all; all the adults at the school were cheating. In 2003, Wesley's fifth graders performed in the top ten percent in the state on the Texas Assessment of Knowledge and Skills reading exams. The next year, as sixth graders at the Williams Middle School, the same students fell to the bottom ten percent in the state. Confronted with the data, the Wesley teachers admitted that cheating was standard operating procedure.
The point is that when there is so much pressure on school administrators and teachers to raise test scores, there will always be the possibility that people, and the indicators used to assess learning, will be corrupted. And that means that when scores go up, we need to be wary. We need to investigate thoroughly whether the rise in scores is a real indicator of learning and not some form of deception or cheating that turned low-performing children into high performers for just the week of testing.
Sadly, cheating in contemporary American schools and throughout American culture has become more acceptable. Among the many problems associated with this cultural shift is that the validity of the test scores associated with NCLB is harder to assess. We simply can no longer be certain that a rise in test scores means that more student learning is taking place.
David C. Berliner is co-author, with Sharon L. Nichols, of Collateral Damage: How High-Stakes Testing Corrupts America's Schools (2007). Some of the material and examples reported here come from that book. He is also co-author, with Bruce J. Biddle, of The Manufactured Crisis: Myths, Fraud, and the Attack on America's Public Schools (1995).
For more on cheating in contemporary American schools, see Callahan, D. (2004), The Cheating Culture: Why More Americans Are Doing Wrong to Get Ahead, and Nichols, S. L., & Berliner, D. C. (2006), "The Pressure to Cheat in a High-Stakes Testing Environment," in E. M. Anderman & T. B. Murdock (Eds.), The Psychology of Academic Cheating.