Friends Don’t Let Friends Misuse NAEP Data

As I’ve explained before, test scores from the National Assessment of Education Progress (NAEP) test are constantly misused and abused.  Alabama legislators have done this time and again.  The governor has as well.  The same goes for certain educators.

So when I came across the following on the blog of Dr. Morgan Polikoff, a professor and education researcher at the University of Southern California’s Rossier School of Education, I had to share in it’s entirety.  It should be required reading for every member of the Alabama legislature.

This was posted in October 2015 as we were about to be hit with results from the 2015 test cycle.

At some point in the next few weeks, the results from the 2015 administration of the National Assessment of Educational Progress (NAEP) will be released. I can all but guarantee you that the results will be misused and abused in ways that scream misNAEPery. My warning in advance is twofold. First, do not misuse these results yourself. Second, do not share or promote the misuse of these results by others who happen to agree with your policy predilections. This warning applies of course to academics, but also to policy advocates and, perhaps most importantly of all, to education journalists.

Here are some common types of misused or unhelpful NAEP analyses to look out for and avoid. I think this is pretty comprehensive, but let me know in the comments or on Twitter if I’ve forgotten anything.

  • Pre-post comparisons involving the whole nation or a handful of individual states to claim causal evidence for particular policies. This approach is used by both proponents and opponents of current reforms (including, sadly, our very own outgoing Secretary of Education). Simply put, while it’s possible to approach causal inference using NAEP data, that’s not accomplished by taking pre-post differences in a couple of states and calling it a day. You need to have sophisticated designs that look at changes in trends and levels and that attempt to poke as many holes as possible in their results before claiming a causal effect.
  • Cherry-picked analyses that focus only on certain subjects or grades rather than presenting the complete picture across subjects and grades. This is most often employed by folks with ideological agendas (using 12th grade data, typically), but it’s also used by prominent presidential candidates who want to argue their reforms worked. Simply put, if you’re going to present only some subjects and grades and not others, you need to offer a compelling rationale for why.
  • Correlational results that look at levels of NAEP scores and particular policies (e.g., states that have unions have higher NAEP scores, states that score better on some reformy charter school index have lower NAEP scores). It should be obvious why correlations of test score levels are not indicative of any kinds of causal effects given the tremendous demographic and structural differences across states that can’t be controlled in these naïve analyses.
  • Analyses that simply point to low proficiency levels on NAEP (spoiler alert: the results will show many kids are not proficient in all subjects and grades) to say that we’re a disaster zone and a) the whole system needs to be blown up or b) our recent policies clearly aren’t working.
  • (Edit, suggested by Ed Fuller) Analyses that primarily rely on percentages of students at various performance levels, instead of using the scale scores, which are readily available and provide much more information.
  • More generally, “research” that doesn’t even attempt to account for things like demographic changes in states over time (hint: these data are readily available, and analyses that account for demographic changes will almost certainly show more positive results than those that do not).

Having ruled out all of your favorite kinds of NAEP-related fun, what kind of NAEP reporting and analysis would I say is appropriate immediately after the results come out?

  • Descriptive summaries of trends in state average NAEP scores, not just across a two NAEP waves but across multiple waves, grades, and subjects. These might be used to generate hypotheses for future investigation but should not (ever (no really, never)) be used naively to claim some policies work and others don’t.
  • Analyses that look at trends for different subgroups and the narrowing or closing of gaps (while noting that some of the category definitions change over time).
  • Analyses that specifically point out that it’s probably too early to examine the impact of particular policies we’d like to evaluate and that even if we could, it’s more complicated than taking 2015 scores and subtracting 2013 scores and calling it a day.

The long and the short of it is that any stories that come out in the weeks after NAEP scores are released should be, at best, tentative and hypothesis-generating (as opposed to definitive and causal effect-claiming). And smart people should know better than to promote inappropriate uses of these data, because folks have been writing about this kind of misuse for quite a while now.   

Rather, the kind of NAEP analysis that we should be promoting is the kind that’s carefully done, that’s vetted by researchers, and that’s designed in a way that brings us much closer to the causal inferences we all want to make. It’s my hope that our work in the C-SAIL center will be of this type. But you can bet our results won’t be out the day the NAEP scores hit. That kind of thoughtful research designed to inform rather than mislead takes more than a day to put together (but hopefully not so much time that the results cannot inform subsequent policy decisions). It’s a delicate balance, for sure. But everyone’s goal, first and foremost, should be to get the answer right.

Leave a reply