The support for high-stakes testing rests on a belief that a test score accurately measures what a student has learned. Therefore, many argue, standardized tests should be a major factor in evaluating teachers. Jeffrey Livingston, an economics professor at Bentley University, wondered what would happen if you offered students a financial reward of $90 to do well on a standardized test – a curiosity sparked in part by discussions with his wife, a high school teacher. Livingston and some colleagues set up a trial with some students in Chicago. They released the findings in May. Livingston recently spoke with NEA Today about how the results call into question the value placed on standardized tests.
What was the specific question you wanted the study to address, and how exactly did the program work?
Jeffrey Livingston: The concern we hoped to test is whether students in urban schools who are at risk of failing to meet state standards try as hard as they can on standardized tests when they have no personal stake in their score.
The experiment gives students incentives to improve their academic achievement over a two-month period, culminating with a chance to show what they have learned on an incentivized standardized test. We designed it to cover the same skills and knowledge that are tested on the official standardized test. We wanted to see the extent to which they show this same new knowledge on the test that the school administers.
Students took the official tests over one week in March and our incentivized tests the following week. Unless there is a difference in how hard students try on each exam, one would thus expect similar performance on the two exams. What we found is that students do much better on the test that is incentivized, but do not show the same gains on the official test.
There’s been quite a bit of research into these extrinsic motivators. How did your results compare with previous studies?
JL: Many similar experiments have been conducted, though there are differences – such as the outcome that is incentivized, the amount of the reward, and exactly what one has to do in order to earn the reward. So to some degree it is difficult to say why their results might differ.
However, most experiments which give incentives to improve grades or test scores have actually found very little impact. For example, Roland Fryer’s experiments which incentivize “outputs” such as grades and test scores have resulted in small or even negative effects (although none are statistically significantly different from zero effect).
In the experiment which was closest in spirit to this one, students in New York City were paid for their performance on interim tests which were similar to state assessments. Fourth graders could earn up to $25 per test and seventh graders could earn up to $50 per test. Fryer found that these incentives had no effect. Others who have studied different incentives for outputs typically find similarly modest effects. So, the effects we find for the large incentives – $90 for either the student, the parents, or the tutor – are extremely large compared to these other studies which incentivized outputs.
In your study, the difference in performance between the incentivized and non-incentivized tests was quite large. Did this surprise you?
JL: It did. Kids tend to have high “discount rates,” as the literature terms them. In other words, they are motivated by rewards that they get immediately but do not respond strongly to promises of rewards that they will not receive until much later. It could be that, in our study, students took the test more seriously when the test date arrived because they were reminded that they had a good chance to earn a lot of money if they performed well and that they would be paid right after the test was finished. When they took the official test the week before, they did not take it as seriously since they were not paid for their performance.
So based on the results of your test, what can we conclude or infer about standardized tests and the role they play in schools?
JL: The policy implications are crucial. The results suggest that students do not try as hard as they can on tests where they have nothing to gain personally. This calls into question the appropriateness of using standardized tests that have no impact on a student’s welfare as an evaluation of a student’s academic progress. Students may fail to show improvement merely because they have no incentive to show what they have learned, not because they are missing the requisite skills.
If students do not show what they know on standardized tests, then they are not a reliable measure of a teacher’s quality and should not be used as part of a teacher’s evaluation. I am very worried that teacher evaluation systems are using such tests and are arriving at incorrect decisions, potentially harming good teachers and rewarding bad ones.
Are you planning to extend the research to other groups of students?
JL: That is exactly the direction this research is now heading. We are in the process of conducting similar experiments that pay students based on standardized test results in more affluent schools and in schools internationally. Other types of students might be more intrinsically motivated to perform well on standardized tests. If so, the black-white achievement gap and the US-International gap, especially in mathematics, might not be as severe as we think.
For example, our early results show that students in Shanghai, China, which ranked first on the 2012 Programme for International Student Assessment (PISA) in mathematics, do not respond at all to treatment. Students there do just as well on a version of a PISA math test whether they are in the control group or the treatment group, presumably because they try hard on every test. But students in the U.S., which ranked 36th on the 2012 PISA in mathematics, do respond to the treatment.
This suggests that they don’t have the same intrinsic motivation to perform well on the PISA. When U.S. students do try as hard as they can in response to extrinsic financial incentives, the gap between students in the U.S. and Shanghai is greatly diminished. Much more work needs to be done, but it appears that the effort that different types of students put into standardized tests plays a huge role in understanding what these tests actually measure.