The National Center for Education Evaluation is part of the U.S. Department of Education. Its research cannot be taken lightly. (Unless perhaps you are a legislator in Alabama.)

Here are key points from a release, made public this month, about research on using test scores to evaluate teachers. Using test scores for teacher evaluation is a central part of the proposed RAISE legislation.

A new study sanctioned by the federal government indicates that using student test scores to evaluate teachers might be extremely ineffective. The study, titled “Analysis of the stability of teacher-level growth scores from the student growth percentile model,” looked at Nevada’s second-largest school district and found “that half or more of the variance in teacher scores from the model is due to random or otherwise unstable sources rather than to reliable information that could predict future performance.”

After analyzing almost 370 elementary students, the study found that using student test scores for high-stakes decisions about the effectiveness of teachers is unreliable and that districts should proceed with caution when doing so. Nevada was selected for the study because in 2009 it mandated a statewide growth model for school accountability, ultimately settling on a student growth percentile model, and in 2011 it expanded that model to educator evaluation.

After careful analysis, the study found “Nevada’s annual teacher-level growth scores, derived by applying the student growth percentile model to student scores from Nevada’s Criterion-Referenced Tests in math and reading, did not meet a level of stability that would traditionally be desired in scores used for high-stakes decisions about individuals.”

“Thus, as states examine properties of their estimates of teacher effectiveness and decision makers weigh how to incorporate teacher-level growth scores in teacher accountability policy, they may want to exercise caution and further investigate whether teacher-level growth scores are sufficiently stable for use in high-stakes decisions.”