What Happens When We Put Too Much Emphasis On Test Scores

Editor’s note:  Paul Thomas is a professor of education at Furman University in Greenville, SC.  He taught high school English in rural South Carolina before moving to higher education.  A contributor to many education journals, his work is always grounded in his experiences in the classroom.

In this article, which is lengthier and “deeper” than most we publish, he takes up a central question: What is the relationship among NAEP scores, educational policy, and classroom practice? He stresses how self-serving politicians and the media often misinterpret test results and make claims that cannot be substantiated.

“Annually, the media, public, and political leaders overreact to and misrepresent the release of SAT and ACT scores from across the US. Most notably, despite years of warnings from the College Board against the practice, many persist in ranking states by average state scores, ignoring that vastly different test-taking populations are being compared.

These media, public, and political reactions to SAT and ACT scores are premature and superficial, but the one recurring conclusion that would be fair to emphasize is that, as with all standardized test data, the most persistent correlations with these scores are the socio-economic status of the students’ families and the educational attainment of their parents.

Over many decades of test scores, in fact, educational policy and classroom practices have changed many times, and the consistency of those policies and practices has been significantly lacking and almost entirely unexamined.

For example, when test scores fell in California in the late 1980s and early 1990s, the media, public, and political leaders all blamed the state’s shift to whole language as the official reading policy.

This was a compelling narrative that proved to be premature and superficial—relying on the most basic assumptions of correlation. A more careful analysis exposed two powerful facts: California test scores were far more likely to have dropped because of drastic cuts to educational funding and a significant influx of English language learners, and (here is a very important point) even as whole language was the official reading policy of the state, few teachers were actually implementing whole language in their classrooms.

This last point cannot be emphasized enough: throughout the history of public education, because teaching is mostly a disempowered profession, one recurring phenomenon is that teachers often shut their doors and teach—claiming their professional autonomy by resisting official policy.

November 2019 has brought us a similar and predictable round of outlandish and unsupported claims about NAEP data. With reading scores trending downward since 2017, this round is characterized by sky-is-falling political histrionics and hollow fist-pounding that NAEP scores have proven policies a success or a failure (depending on the agenda).

If we slip back in time just a couple of decades, to when the George W. Bush administration heralded the “Texas miracle” as a template for No Child Left Behind, we can see a transition from state-based educational accountability to federal accountability. That moment in political history also raised the stakes on scientifically based educational policy and practice.

Specifically, the National Reading Panel was charged with identifying the highest quality research on effective reading programs and practices. But while the NRP touted its findings as scientific, many, including a member of the panel itself, discredited the quality of the findings and accurately cautioned against political misuse of those findings to drive policy.

Here is where our trip back in history may sound familiar during this current season of NAEP hand-wringing. Secretary of Education Margaret Spellings (2005-2009) announced that a seven-point jump in NAEP reading scores from 1999 to 2005 was proof that No Child Left Behind was working. The problem, however, was in the details:

When Secretary Spellings announced that test scores were proving NCLB a success, Gerald Bracey and Stephen Krashen exposed two possible problems with the claim: either Spellings did not understand basic statistics, or she was misleading the public for political gain. Krashen detailed the deception or ineptitude by confirming that the gain Spellings cited did occur from 1999 to 2005, a change of seven points. But he also showed how the scores rose wave by wave: 1999 = 212; 2000 = 213; 2002 = 219; 2003 = 218; 2005 = 219. The jump Spellings used to promote NCLB and Reading First occurred from 2000 to 2002, before Reading First was implemented. Krashen notes even more problems with claiming success for NCLB and Reading First, including:

“Bracey (2006) also notes that it is very unlikely that many Reading First children were included in the NAEP assessments in 2004 (and even 2005). NAEP is given to nine year olds, but RF is directed at grade three and lower. Many RF programs did not begin until late in 2003; in fact, Bracey notes that the application package for RF was not available until April, 2002.”
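To make Krashen’s arithmetic concrete, here is a minimal sketch, in Python, that computes the change between consecutive test waves using only the scores quoted above:

```python
# Wave-to-wave changes in the NAEP reading scores quoted above.
scores = {1999: 212, 2000: 213, 2002: 219, 2003: 218, 2005: 219}

years = sorted(scores)
for prev, curr in zip(years, years[1:]):
    print(f"{prev} -> {curr}: {scores[curr] - scores[prev]:+d} points")

# The total Spellings cited is real (+7 from 1999 to 2005), but the
# +6 jump falls between 2000 and 2002 -- before Reading First reached
# classrooms -- and scores are essentially flat from 2002 to 2005.
print(f"Total 1999 -> 2005: {scores[2005] - scores[1999]:+d} points")
```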

Jump ahead to the 2019 NAEP data release to hear Secretary of Education Betsy DeVos shout that the sky is falling and that public education needs more school choice—without a shred of scientific evidence establishing causal relationships of any kind among test data, educational policy, and classroom practice.

But an even better example has been unmasked by Gary Rubinstein, who discredits Louisiana’s “Chief of Change” John White for proclaiming that his educational policy changes caused the state’s NAEP gain in math:

So while, yes, Louisiana’s 8th grade math NAEP in 2017 was 267 and their 8th grade math NAEP in 2019 was 272 which was a five point gain in that two year period and while that was the highest gain over that two year period for any state, if you go back instead to their scores from 2007, way before their reform effort happened, you will find that in the 12 year period from 2007 to 2019, Louisiana did not lead the nation in 8th grade NAEP gains.  In fact, Louisiana went DOWN from a scale score of 272.39 in 2007 to a scale score of 271.64 in 2019 on that test.  

This means that in that 12 year period, they are 33rd in ‘growth’ (is it even fair to call negative growth ‘growth’?).  The issue was that from 2007 to 2015, Louisiana ranked second to last on ‘growth’ in 8th grade math.  Failing to mention that relevant detail when bragging about your growth from 2017 to 2019 is very sneaky.
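Rubinstein’s baseline point is easy to check with the numbers he quotes. Here is a minimal sketch using only those figures (the 2017 value is the rounded score given in the text, so the two-year gain prints as roughly +4.6 rather than the rounded +5):

```python
# Reported "growth" depends entirely on the baseline year you choose.
scores = {2007: 272.39, 2017: 267.0, 2019: 271.64}

def gain(start, end):
    """Scale-score change from the start wave to the end wave."""
    return scores[end] - scores[start]

print(f"2017 -> 2019: {gain(2017, 2019):+.2f}  (the window White touted)")
print(f"2007 -> 2019: {gain(2007, 2019):+.2f}  (the 12-year window: negative)")
```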

The media and public join right in with this political playbook that has persisted since the early 1980s: Claim that public education is failing; blame an ever-changing cause for that failure (low standards, public schools as monopolies, teacher quality, etc.); promote reform and change wrapped in “scientific evidence” and “research”; and then make unscientific claims of success (or yet more failure) based on simplistic correlation, while offering no credible or complex research to support those claims.

Here is the problem, then: What is the relationship among NAEP scores, educational policy, and classroom practice?

There are only a couple of fair responses.

First, 2019 NAEP data replicate a historical fact of standardized testing in the US—the strongest and most persistent correlations to that data are with the socio-economic status of the students, their families, and the states. When students or average state data do not conform to that norm, these are outliers that may or may not provide evidence for replication or scaling up. However, you must consider the next point as well.

Second, as Rubinstein shows, the best way to draw causal relationships among NAEP data, educational policy, and classroom practices is to use longitudinal data; I would recommend at least 20 years (reaching back to NCLB), but 30 years would take in a larger portion of the accountability era that began in the 1980s and was in wide application across almost all states by the 1990s.

The longitudinal data would then have to be aligned with each state’s educational policy in math and reading at the time of each round of NAEP testing.

As Bracey and Krashen cautioned, that alignment would have to account accurately for when each policy was implemented, allowing enough time to claim that the change could have affected the sample of students taking NAEP.
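As a rough illustration of that timing check, here is a sketch; the policy year and exposure window below are hypothetical, not real state data:

```python
# Which NAEP waves are late enough that tested students could actually
# have spent meaningful time under a given policy?
NAEP_WAVES = [2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019]

def plausible_waves(policy_start, years_of_exposure=2):
    """Return the NAEP waves in which tested students could have had at
    least years_of_exposure years under a policy begun in policy_start."""
    earliest = policy_start + years_of_exposure
    return [wave for wave in NAEP_WAVES if wave >= earliest]

# A hypothetical reading policy adopted in 2013 could plausibly show up
# in 2015 at the earliest; crediting it for 2013 scores repeats the
# Reading First mistake described above.
print(plausible_waves(2013))  # [2015, 2017, 2019]
```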

But that isn’t all, as complex and overwhelming as this process already is.

We must address the lesson from the so-called whole language collapse in California by documenting whether or not classroom practice implemented state policy with some measurable level of fidelity.

This process is a herculean task, and no one has had the time to examine 2019 NAEP data in any credible way to make valid causal claims about the scores and the impact of educational policy and classroom practice.

What seems fair, however, to acknowledge is that there is no decade over the past 100 years when the media, public, and political leaders deemed test scores successful, regardless of the myriad of changes to policies and practices.

Throughout the history of public education, moreover, both before and after the accountability era began, student achievement in the US has been mostly a reflection of socio-economic factors, and less about student effort, teacher quality, or any educational policies or practices.

If NAEP data mean anything, and I am prone to say they are much ado about nothing, we simply do not know what that is because we have chosen political rhetoric over the scientific process and research that could give us the answers.”

Another NAEP Wild Goose Chase

We have discussed the National Assessment of Educational Progress test many times and talked about the silliness that politicians read into the numbers.  How they constantly single out scores from just one year and treat them as if they came from a burning bush.  How they ignore the fact that NAEP is only about looking at trends over time and that no one set of scores tells us much when taken out of context.

Still, we have people like Senator Del Marsh, who tries to justify his idea of dumping the Alabama College & Career Ready standards based largely on Alabama’s NAEP scores.

And recently we told you how we have delayed implementation of new math standards because the governor wants us to see how Massachusetts, Minnesota, Wyoming, Virginia and New Jersey teach math.  Why?  Because they rank at the top of the list of 4th grade math scores.  But as I’ve pointed out before, all we are really doing is comparing apples to oranges because things in Alabama do not compare to things in Massachusetts, Minnesota, Wyoming, Virginia and New Jersey.

So let’s do something politicians seldom do–let’s look at NAEP scores over time and maybe see why we are chasing the wrong rabbits.

The first set of NAEP tests was given to 4th and 8th graders in reading and math across the country in 1992.  Surprise, surprise.  That year, scores in Minnesota, Massachusetts and New Jersey ranked in the top five of the entire nation.  So we are now studying states that have consistently been good at math, but that also spend much more money educating each student, have great cultural differences from Alabama, and have a long devotion to public education.

Where was Alabama in 4th grade math in 1992?  We were 11 points behind the national average while Massachusetts, Minnesota, Wyoming and New Jersey were all above the national average.

But guess what?  In the last NAEP testing, in 2017, Alabama had achieved a greater increase relative to the national average than any of these four states.  Since 1992, Alabama has increased 4th grade math scores more than Massachusetts, Minnesota, Wyoming or New Jersey.  Only Virginia has done better than Alabama from 1992 to 2017.

In fact, over the last 25 years only seven states have increased 4th grade math scores more than Alabama has.  Of those we are studying, Virginia is the only one ahead of us in this measure.

And we don’t have to go up north looking for states to study.  Mississippi has made more gain in 4th grade math than any other state in the country.  Also in the top seven, the only ones better than Alabama, are Florida, North Carolina, Tennessee and Louisiana.  Demographically we have far more in common with these Deep South states than we do with the ones we are studying.

The really CRITICAL measurement in education is GROWTH.  How much progress are you making with the resources and student population you have?  And in our infinite wisdom, we are now studying four states that have trailed Alabama in growth over 25 years.  How much sense does that make?
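To see why growth and level can tell opposite stories, consider this small sketch; every score in it is invented for illustration:

```python
# Ranking by one year's level versus ranking by change over time.
scores_1992 = {"State A": 218, "State B": 232, "State C": 221}
scores_2017 = {"State A": 241, "State B": 243, "State C": 236}

growth = {s: scores_2017[s] - scores_1992[s] for s in scores_1992}

by_level = sorted(scores_2017, key=scores_2017.get, reverse=True)
by_growth = sorted(growth, key=growth.get, reverse=True)

print("Ranked by 2017 level:  ", by_level)   # State B looks best...
print("Ranked by 1992-2017 growth:", by_growth)  # ...but State A grew the most
```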

However, we can comfort ourselves by remembering that the only people who pay attention to NAEP scores are politicians and some education bureaucrats.  Certainly not teachers and principals.  They are too busy working with their students to fret over basically meaningless numbers.


NAEP Scores Revisited

Time after time we hear state politicians try to prove a point by taking things out of context.  This is especially true of NAEP scores where they isolate one score for one year and pay no attention to growth that has occurred.  They pay no attention to where you began and how far you’ve come because that doesn’t suit their purpose.

And since we are now told that we have a fourth-grade “math crisis” it seems a good time to revisit a post of several months ago since it is just as true today as it was then.

Then we called it: Okay Governor, About Those NAEP Scores

Politicians love numbers.  They can twist them and bend them and take them out of context and omit bad ones, all in an effort to make the voter feel good, or convinced, or something.  Take numbers about economic development.  Sometime around the end of the year the governor will announce that we created _______ new jobs in Alabama in the last 12 months.  But you never hear one also say that at the same time we lost _____ jobs and that, overall, the number of people working is actually lower than it was 12 months ago.

Since he voted to hire Michael Sentance as state school chief on Aug. 11, Governor Bentley has mentioned NAEP scores in Massachusetts over and over again.  The implication is that since Sentance is from Massachusetts and their NAEP scores are better than ours, we will soon be just like they are.

Unfortunately, the governor does not see the big picture or put things in context, and consequently gives us a very distorted view.

First, what are NAEP scores?

NAEP stands for the National Assessment of Educational Progress, the supposed gold standard of tests.  These are given every two years across the country to 4th and 8th graders.  Students and schools are picked at random.  About 2,500 students in Alabama are tested, roughly 30 per school.  The test is 75 minutes long.  Some of those tested must be students with disabilities and English language learners.  They are told that they do not get a grade for taking this test.

These tests began in 1992.  Since then Alabama has narrowed the gap between our scores and national scores in 4th grade reading and math and in 8th grade reading and math.  So we have been improving faster than schools across the nation, but no one bothers to tell this story.

And comparing Alabama gains to those in Massachusetts shows that except for 8th grade math, we have matched them stride for stride.  So we have been growing our NAEP scores for more than 20 years at the same rate as the Bay State.  But no one bothers to mention this, because then the picture looks much different.

And here is even more interesting info the governor has not mentioned.  One of the most important measures in education is the “achievement gap” between poverty and non-poverty students and between white and African-American students.  When you look at 8th grade reading and math and 4th grade reading and math you see the “gap” in Alabama is LESS than the one in Massachusetts in every single case.

Yes, NAEP scores are higher in Massachusetts than in Alabama.  The last NAEP scores, from 2015, show they are number 1 in 4th grade reading and math and 8th grade math, and number 2 in 8th grade reading.  But has their growth been remarkable?  Not necessarily, considering they were tied for 3rd nationally in 4th grade reading in 1992, tied for 5th in 4th grade math, number 7 in 8th grade math, and number 4 in 8th grade reading in 1998 (as far back as scores go in this category).

Sure seems to me that when you study the numbers in their entirety and not just cherry pick them to suit your narrative, Alabama has been running just as fast as Massachusetts for more than two decades.

And governor, do you think a real education consists of ONLY math and reading?  What about the sciences, the arts, extracurricular activities?  Do you know any Alabama students going to college on a band scholarship or to be on a debate team, much less an athletic team?

Children are so much, much more than just data points on a damn graph.  And the totality of an education cannot be neatly wrapped up in a few numbers politicians use to convince the public they are right and all of us are wrong.  The sweat and struggle teachers go through daily with special needs children and those who come to school hungry and with an abscessed tooth cannot be adequately valued with one test given every other year to a handful of students.

And to worship at the altar of NAEP is not what education should be about.

It is said that repetition is a great teacher.  But it is unfortunate that we have to keep explaining this to political leaders.

Friends Don’t Let Friends Misuse NAEP Data

As I’ve explained before, test scores from the National Assessment of Educational Progress (NAEP) test are constantly misused and abused.  Alabama legislators have done this time and again.  The governor has as well.  The same goes for certain educators.

So when I came across the following on the blog of Dr. Morgan Polikoff, a professor and education researcher at the University of Southern California’s Rossier School of Education, I had to share it in its entirety.  It should be required reading for every member of the Alabama legislature.

This was posted in October 2015 as we were about to be hit with results from the 2015 test cycle.

At some point in the next few weeks, the results from the 2015 administration of the National Assessment of Educational Progress (NAEP) will be released. I can all but guarantee you that the results will be misused and abused in ways that scream misNAEPery. My warning in advance is twofold. First, do not misuse these results yourself. Second, do not share or promote the misuse of these results by others who happen to agree with your policy predilections. This warning applies of course to academics, but also to policy advocates and, perhaps most importantly of all, to education journalists.

Here are some common types of misused or unhelpful NAEP analyses to look out for and avoid. I think this is pretty comprehensive, but let me know in the comments or on Twitter if I’ve forgotten anything.

  • Pre-post comparisons involving the whole nation or a handful of individual states to claim causal evidence for particular policies. This approach is used by both proponents and opponents of current reforms (including, sadly, our very own outgoing Secretary of Education). Simply put, while it’s possible to approach causal inference using NAEP data, that’s not accomplished by taking pre-post differences in a couple of states and calling it a day. You need to have sophisticated designs that look at changes in trends and levels and that attempt to poke as many holes as possible in their results before claiming a causal effect.
  • Cherry-picked analyses that focus only on certain subjects or grades rather than presenting the complete picture across subjects and grades. This is most often employed by folks with ideological agendas (using 12th grade data, typically), but it’s also used by prominent presidential candidates who want to argue their reforms worked. Simply put, if you’re going to present only some subjects and grades and not others, you need to offer a compelling rationale for why.
  • Correlational results that look at levels of NAEP scores and particular policies (e.g., states that have unions have higher NAEP scores, states that score better on some reformy charter school index have lower NAEP scores). It should be obvious why correlations of test score levels are not indicative of any kinds of causal effects given the tremendous demographic and structural differences across states that can’t be controlled in these naïve analyses.
  • Analyses that simply point to low proficiency levels on NAEP (spoiler alert: the results will show many kids are not proficient in all subjects and grades) to say that we’re a disaster zone and a) the whole system needs to be blown up or b) our recent policies clearly aren’t working.
  • (Edit, suggested by Ed Fuller) Analyses that primarily rely on percentages of students at various performance levels, instead of using the scale scores, which are readily available and provide much more information.
  • More generally, “research” that doesn’t even attempt to account for things like demographic changes in states over time (hint: these data are readily available, and analyses that account for demographic changes will almost certainly show more positive results than those that do not).

Having ruled out all of your favorite kinds of NAEP-related fun, what kind of NAEP reporting and analysis would I say is appropriate immediately after the results come out?

  • Descriptive summaries of trends in state average NAEP scores, not just across two NAEP waves but across multiple waves, grades, and subjects. These might be used to generate hypotheses for future investigation but should not (ever (no really, never)) be used naively to claim some policies work and others don’t.
  • Analyses that look at trends for different subgroups and the narrowing or closing of gaps (while noting that some of the category definitions change over time).
  • Analyses that specifically point out that it’s probably too early to examine the impact of particular policies we’d like to evaluate and that even if we could, it’s more complicated than taking 2015 scores and subtracting 2013 scores and calling it a day.

The long and the short of it is that any stories that come out in the weeks after NAEP scores are released should be, at best, tentative and hypothesis-generating (as opposed to definitive and causal effect-claiming). And smart people should know better than to promote inappropriate uses of these data, because folks have been writing about this kind of misuse for quite a while now.   

Rather, the kind of NAEP analysis that we should be promoting is the kind that’s carefully done, that’s vetted by researchers, and that’s designed in a way that brings us much closer to the causal inferences we all want to make. It’s my hope that our work in the C-SAIL center will be of this type. But you can bet our results won’t be out the day the NAEP scores hit. That kind of thoughtful research designed to inform rather than mislead takes more than a day to put together (but hopefully not so much time that the results cannot inform subsequent policy decisions). It’s a delicate balance, for sure. But everyone’s goal, first and foremost, should be to get the answer right.
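To illustrate Polikoff’s first warning with a toy example, here is a minimal sketch; every number in it is invented:

```python
# A naive pre-post comparison can manufacture a "reform effect" out of
# a trend that predates the reform.
waves = [2009, 2011, 2013, 2015]
state_scores = [252, 255, 258, 261]  # hypothetical; "reform" adopted in 2013

naive = state_scores[-1] - state_scores[-2]
print(f"Naive pre-post (2013 -> 2015): +{naive} point 'reform effect'")

# The full series shows the same +3 every wave, long before the reform.
for i in range(1, len(waves)):
    print(f"{waves[i-1]} -> {waves[i]}: {state_scores[i] - state_scores[i-1]:+d}")
```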


Another Look At NAEP Scores

Several days ago I wrote about a senator taking to the floor of the senate to denounce the latest Alabama scores on the National Assessment of Educational Progress.  As I pointed out, he did not bother to look at trend lines and instead only singled out scores from one test year.

Most of us have tried to lose weight at some time.  (But for me, not lately.)  Let’s say that we’ve worked hard so far in 2016 and have now lost 20 pounds.  However, we had a wee bit too much to eat last Easter Sunday, and when we hopped on the scales Monday, we were one pound heavier than the day before.  Should we pitch a fit about just that one day and declare that our diet is a total and complete failure?

I suppose only if we are in the Alabama Senate and trying to make the numbers tell a pre-determined story (that our public schools are going backwards.)

So I looked at our NAEP scores again, and looked at them the way we should look at our diet: since the first of the year, not just on the Monday after a big Sunday dinner.  If you look at Alabama scores for 4th grade reading and math and 8th grade math back to 1992, and at 8th grade reading back to 2002 (as far back as info on the national NAEP site goes for 8th grade reading), you find that GAINS in Alabama have EXCEEDED national gains in all four cases.

In 4th grade math, we went up 23 points, nationally the gain was 21 points.  In 8th grade math, we went up 15 points, while the national increase was 14 points.  For 4th grade reading, Alabama increased 10 points, the national gain was 6 points.  And for 8th grade reading, we gained 6 points and the gain nationally was 3 points.
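For readers who like to see the arithmetic laid out, here is a minimal sketch tabulating the gains just quoted:

```python
# Alabama's NAEP point gains versus national gains, from each test's
# baseline year (1992, except 8th grade reading, which goes back to
# 2002) through 2015, as quoted above.
gains = {
    "4th grade math":    (23, 21),
    "8th grade math":    (15, 14),
    "4th grade reading": (10, 6),
    "8th grade reading": (6, 3),
}

for test, (alabama, nation) in gains.items():
    print(f"{test}: Alabama +{alabama}, Nation +{nation}, "
          f"Alabama ahead by {alabama - nation}")
```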

And here is something especially interesting.  The proposed RAISE/PREP Act says we will use something called the VAM (Value-Added Model) process to determine how good our teachers are and how we can adjust education to make more rapid gains.

The first VAM was created by Dr. William Sanders in Tennessee and was put into use in the Volunteer State in 1992.  The proponents of this very inexact methodology want us to believe it is the best thing since sliced bread.  And since Tennessee has been using it since 1992, they must be blowing our doors off down here in Alabama.

Well, not so fast.  The truth is that when you compare NAEP scores in Alabama to those in Tennessee, you see that reading scores for our 4th graders have risen more from 1992 to 2015 than those of their counterparts in Tennessee.  The same is true for 8th grade reading scores from 2002 to 2015.

Of course, this hardly fits the narrative of those non-educators wanting to tell our teachers and schools how to do things.  But then, facts can be troublesome at times and often get in the way of political agendas.