Key points in this Outlook:
- Analyses of state accountability systems demonstrate that all accountability systems will fall short in certain dimensions.
- School reformers must first and foremost incorporate the lessons learned from No Child Left Behind into the implementation of the waiver applications.
- Policymakers ought to consider controlling for student demographics; establishing clear definitions of effective schools; carefully evaluating composite measures used to classify schools; and making adjustments to systems according to short-term analyses of the implementation process.
It is no exaggeration to say that standards-based reform has been a pillar of US education policy over the past several decades. The No Child Left Behind Act (NCLB), the 2001 reauthorization of the Elementary and Secondary Education Act (ESEA), created the first mandatory national accountability structure that held schools and districts responsible for student achievement. Despite its promise and initial widespread support, NCLB was fraught with problems.
As NCLB’s 2013–14 deadline of 100 percent proficiency looms, there has been increasing pressure on the federal government to update ESEA. Given Congress’s inaction, the US Department of Education (USDOE) implemented a waiver program in 2012 to provide states the opportunity to implement their own accountability systems.
According to theory, accountability incentives will motivate educators to align their behaviors with predetermined standards and goals. In the case of standards-based reform, accountability is intended to incentivize educators to engage in certain behaviors, including 1) teaching the content specified in content standards, 2) improving their instructional quality, and 3) focusing on the achievement of groups targeted by the policy. Accountability based on assessment results gives added weight to content standards: to the extent that the tests and standards represent good targets for what students should know and be able to do, accountability should drive improvement in student outcomes.
A number of studies over the past decade show that holding schools accountable for student performance has modest positive effects on student learning. However, studies also show some of the negative unintended consequences of accountability systems. These negative consequences have often been driven (or at least facilitated) by the poor design of NCLB-era accountability, as we discuss later.
This Outlook argues that accountability systems are only as good as the data on which they are based, which is one of the sobering lessons of the NCLB era. We therefore evaluate the approved ESEA waivers using four important professional standards of practice: construct validity, reliability, fairness, and transparency. These four criteria, while nonexhaustive, allow us to evaluate important conditions of accountability systems now, before full implementation.
In what follows, we first define these criteria and illustrate how they have played out under NCLB’s accountability system. We rate NCLB’s accountability on an A–F scale on each of the four dimensions. Next, we grade the waivers’ accountability systems using the same scale. We provide an “average” grade and also discuss examples of particularly strong and weak systems. In grading all systems, our reference is an imaginary ideal system that is transparent and fair and that accurately and reliably identifies schools’ contributions to students’ academic and social success.
The waivers provide a unique opportunity to learn from past mistakes and correct them more quickly than would be possible under the normal legislative process. At some point ESEA will be reauthorized, but the waiver period provides a unique window. State and federal policymakers should take advantage of this opportunity to develop a clear understanding of what has and has not worked and how these waivers compare to NCLB.
Dimensions of NCLB System Quality
Construct Validity: Grade = F. In general, construct validity refers to the extent to which a set of indicators actually measures what it purports to measure. Construct validity of accountability systems includes two dimensions. First, do the performance measures used in the accountability system adequately cover the full range of desired student outcomes? Second, are the inferences or decisions made on the basis of the performance measures appropriate? Typically, accountability policies rely on objective performance measures—primarily student test scores in math and English language arts (ELA)—as proxies for all the unmeasured goals of schooling.
NCLB illustrates how construct-validity problems can arise within accountability policies, earning it an F on this dimension (for our full grading criteria, see table 1). On the first construct-validity question, NCLB’s sole focus on math and ELA proficiency and graduation rates clearly ignores other important school outcomes. Indeed, this is a primary reason NCLB has led to a narrowing of the curriculum to those two subjects, to the exclusion of untested subjects.
On the second construct-validity question, NCLB’s performance measure (proficiency rate) does not account for schools’ contributions to student learning. The use of proficiency rates makes it difficult to measure school progress over time, and largely targets accountability on schools serving the most disadvantaged students. The few growth models allowed under NCLB suffer from the same faults as proficiency rates and do not appropriately credit schools for their improvement.
Reliability: Grade = B. Reliability is the consistency of a performance classification. In the context of accountability, this usually means the year-to-year stability of accountability classifications.
NCLB’s accountability system earns a B on reliability. Under NCLB, schools are primarily accountable for proficiency rates. These are highly reliable because they are strongly correlated with stable out-of-school factors. Yet schools that fall below the proficiency target are in practice accountable for changes in proficiency rates, which are highly unstable. To improve year-to-year stability, accountability systems sometimes use multiple years of data, though NCLB does not. Overall, NCLB’s accountability is fairly stable, insofar as schools that do not make Adequate Yearly Progress (AYP) in one year generally continue to fail.
Fairness: Grade = F. We define fairness in accountability systems as the extent to which performance ratings are systematically associated with features outside the control of schools. Thus, fair indicators are those that do not disproportionately target schools serving certain kinds of students (such as low-income, historically underrepresented groups). Performance classifications can be made fairer by changing schools’ comparison groups from absolute (meaning all schools) to conditional (for example, schools with similar student populations). A fair accountability system would be one that holds schools accountable for only the portion of student achievement they can control—the portion related to school policies and practices, not student characteristics.
While some pre-NCLB accountability systems used statistical adjustments to control for factors outside schools’ influence, such adjustments were rare. On fairness, NCLB’s AYP gets an F. Numerous studies show that NCLB accountability targets larger, more socioeconomically and ethnically diverse schools and those serving students from lower-achieving subgroups. Unless an accountability policy makes specific provisions to account for nonschool factors, the system will be somewhat unfair.
Transparency: Grade = C. We define transparency as the extent to which the process for creating accountability measures is clearly documented and the extent to which the measures themselves are clearly understandable. For transparent measures: 1) If indexes or weighted averages of multiple performance measures are used, then the weights should be clearly stated; 2) if schools are classified based on student assessments, then information about the error rates and quality of the assessments should be public; and 3) yearly reports to stakeholders should promote the valid interpretation of students’ assessment results and school classifications.
NCLB’s accountability measures are fairly transparent; proficiency rates are more straightforward than many other achievement measures. However, the lack of a common meaning for proficiency across states reduces transparency. Furthermore, there are numerous alternative methods to making AYP other than meeting the proficiency targets; these are not transparent, but they account for an increasingly large proportion of schools. Therefore, the transparency of NCLB’s AYP earns a C.
NCLB provisions remain in effect because the ESEA was not reauthorized in 2007 as scheduled. While waiting for reauthorization, Secretary of Education Arne Duncan offered states the opportunity to request flexibility (waivers) from certain NCLB mandates in exchange for pursuing comprehensive plans to reduce achievement gaps, improve instruction, and advance outcomes.
USDOE has identified four waiver principles. The principle relevant to school accountability is “differentiated recognition, accountability and support.” This section requires waiver states to describe plans for annual measurable objectives (AMOs), tested subjects and grades, and the use of subgroups. While AMOs must indicate a path toward improved proficiency, they are not required to be used for accountability.
The only required school accountability in the waivers is for reward, focus, and priority schools. The highest-performing or highest-progress schools are reward schools; these must not have large achievement gaps. Focus schools are Title I schools that contribute to a state’s achievement gap. Waiver states must identify 10 percent of Title I schools with the largest within-school gaps in test scores or graduation rates, as well as Title I high schools with graduation rates below 60 percent. Priority schools are the state’s lowest-performing schools; they must constitute at least 5 percent of the state’s Title I schools. A priority school may be identified based on the achievement or graduation rate of all students, or if it is implementing a School Improvement Grant (SIG) intervention model. In this Outlook, we examine the identification rules for priority and focus schools.
Dimensions of Approved Waivers
As of October 2013, 45 states and the District of Columbia have submitted ESEA flexibility requests. Of those requests, 43 have been approved and 3—Illinois, Iowa, and Wyoming—are under review. Five states—California, Montana, Nebraska, North Dakota, and Vermont—did not submit a waiver, were rejected, or withdrew their applications.
To evaluate the new accountability systems, we analyzed each approved request through multiple rounds of coding and analysis. What follows is a summary of our results pertaining to the four measurement dimensions. We exclude Washington state because its application did not describe the index to be used to identify priority or focus schools. Thus, we analyzed 42 applications. Importantly, most states identify priority and focus schools in multiple ways. For instance, half of the priority schools might be based on a composite index, while the other half might be based on graduation rates. Therefore, the numbers that follow often add up to more than 42.
Construct Validity: Grade = C. It is unquestionable that the waivers are, in aggregate, superior to NCLB’s AYP in construct validity. However, there is a wide range of quality in the waivers, and some state systems may not actually be any better. Thus, our overall grade for this measure is a C.
There are two main reasons why the identification of priority and focus schools in the waiver plans is, in aggregate, superior in construct validity to the way schools are identified under NCLB. First, many of the approved systems use nontest-based measures to identify schools, as shown in table 2. For priority schools, 26 states identify high schools using graduation rates (as is true under NCLB). And 23 states use a composite performance index to identify priority schools (for example, an A–F grade based on a combination of measures). Of these 23 indexes, 19 include graduation rates and 15 include other measures not based on state tests. The most common measure not based on state tests is a college- or career-ready indicator (12 of the 15), but states also include attendance, test participation rates, educator effectiveness, school climate, and opportunity-to-learn measures. Only Arkansas, New Hampshire, Pennsylvania, Wisconsin, and West Virginia use test scores alone in identifying priority schools. Most states also use at least one nontest-based measure for identifying focus schools: graduation rates in 20 states and a composite index including graduation rates or other nontest measures in 12 states.
Second, the test-based measures used are better than proficiency rates at identifying low-performing schools. For example, 20 states use a composite index that includes growth (the weight on the growth measure ranges from 14 percent for Kentucky’s high schools to 75 percent for Idaho’s elementary schools). Some other states count students as proficient if they have passed their growth target. The vast majority of growth measures use some variant of student growth percentiles.
These measures are closer to identifying schools’ unique contributions to student learning than NCLB’s proficiency rate measure. Even in states where achievement levels remain an important part of the identification system, 12 are moving from proficiency rates to allocating points along the achievement distribution. These systems are an improvement compared to NCLB because they incentivize schools to focus on all students. Finally, for focus classifications, 21 states use either a proficiency or graduation gap measure, directly targeting the identification and reduction of achievement gaps.
While these are promising signs, there are two main shortcomings. For one, because the waiver guidelines did not require states to include science assessment accountability, most states (28) are still using only math and ELA tests to identify schools. Of the 14 using other subjects for accountability, all 14 use science, 5 use history or social studies, and 1 uses other subjects. Still, many of these states are testing these subjects in only a few grades, and most give the preponderance of the weight to math and ELA. Thus, there will still be strong incentives to focus on math and ELA at the expense of nontested or low-stakes subjects.
Second, while some states include creative nontest measures in their indexes, these are mainly for high schools and rarely account for more than 30 percent of the total score. Again, this encourages educators to focus on test scores. Furthermore, many systems continue to emphasize proficiency rates, despite their poor quality as an indicator of school performance. Finally, some states such as North Carolina use composite indexes that are merely aggregate proficiency rates.
Overall, we conclude that the construct validity of most waiver states’ identification methods is better than under AYP, but shortcomings do exist. The strongest systems in construct validity include Massachusetts and Michigan, which use subjects other than math and ELA for accountability purposes, include nontest-based measures, and measure proficiency using points along the distribution. Arkansas and West Virginia have the weakest construct validity systems, as both use only math and ELA for accountability and rely largely on proficiency rates. These systems are not better than AYP.
Reliability: Grade = C-. Evaluating the reliability of priority and focus classifications is more difficult because the systems have not been implemented. On average, however, we expect the reliability of classifications to be somewhat lower in the waivers than under NCLB, earning the waivers a grade of C-.
On the one hand, almost all of the waiver states require their focus and priority schools to be labeled as such for two to three years. This ensures less year-to-year fluctuation in these classifications than would otherwise be the case. On the other hand, there are several reasons to think the waiver classifications will be less reliable than AYP. The first is that the priority and focus classifications are based on a fixed percentage of schools. Research shows that this kind of norm-referenced approach results in decreased reliability for schools near the cut score. There are likely not meaningful differences in performance between a fourth- and a sixth-percentile school; yet, under the waivers, one would fail and the other would not.
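The cut-score problem can be illustrated with a minimal sketch. The school names, the index scores, and the `bottom_share` helper below are all invented for illustration; no state's actual identification rule is reproduced here.

```python
def bottom_share(scores, share):
    """Return the names of the lowest-scoring `share` of schools.

    This is a norm-referenced cut: a fixed fraction of schools is
    flagged regardless of how far apart their scores actually are.
    """
    n_flagged = max(1, round(len(scores) * share))
    ranked = sorted(scores, key=scores.get)  # lowest score first
    return set(ranked[:n_flagged])


# Hypothetical index scores for 20 schools, spaced half a point apart.
scores = {f"school_{i:02d}": 40 + 0.5 * i for i in range(20)}

# Flag the bottom 5 percent, as a priority-style classification would.
priority = bottom_share(scores, 0.05)

# Difference between the flagged school and its nearest unflagged neighbor.
margin = scores["school_01"] - scores["school_00"]
```

In this made-up example, one school is flagged and its nearest neighbor is not, although the two differ by only half a point on the index. If measurement error is larger than that margin, the classification could easily flip from one year to the next.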
A second reason for decreased reliability is the use of growth models in composite indexes. With what is known about teacher-level measures of student growth, the year-to-year stability of school-level student growth measures is likely moderate at best. While states could use multiple years of data for their accountability measures, only 12 states chose to do so, and some of these states used multiple years of data for their status measures only (which marginally improves reliability). Even with multiple years of data, composite indexes that incorporate growth measures will be less stable than AYP.
In short, the use of growth data comes with an important tradeoff: it enhances construct validity but decreases reliability. The most reliable systems are those that depend on status measures of performance. Among states with more construct-valid measures, those using multiple years of data will have greater reliability.
Fairness: Grade = D+. The fairness of the waiver plans will likely be an improvement over the current AYP system; however, they will still suffer from similar biases against schools serving more students from historically low-performing subgroups due to heavy reliance on status-based measures of achievement. Thus, we give the waivers a D+. (Table 3 contains a set of fairness indicators.)
A primary problem with fairness is states’ continued reliance on proficiency rates for school identification. Even in states using composite indexes to identify priority schools, 1) in 16 states, proficiency rates or other status measures represent 50 percent or more of the index for at least some grades (and are included in the index in all states) and 2) all but 3 states identify low-proficiency SIG schools as priority schools. Proficiency rates also factor into the identification of focus schools: 25 states identify focus schools using either subgroup proficiency or a subgroup index based on proficiency rates and another 12 use their composite indexes, which have heavy status components.
Some states are also employing one of two types of proficiency gap measures for identifying focus schools. Within-school proficiency gaps measure the largest gaps between two subgroups in a school. These measures are likely unfair to diverse schools, but not as unfair as status measures. The second type of measure compares the performance of a subgroup in a school to a state average or target. Though these are called “gap” measures, they are actually subgroup status measures.
While diverse schools will be more likely to fail under almost any accountability system other than those that explicitly control for student demographics, there are some ways in which the approved waivers will decrease the diversity penalty. One way is in states using “super subgroups.” Super subgroups generally take two forms. One is a combination of subgroups based on demographics (17 states). For instance, Mississippi’s super subgroup includes all students in any traditionally low-performing NCLB subgroup. The other form is a subgroup of the lowest-performing students in a school (9 states). For instance, Michigan’s composite index includes a gap measure that compares the achievement of the top 30 percent to the bottom 30 percent in each school. Both approaches will reduce the diversity penalty, though the latter type will reduce it more since it is agnostic to demographics. While super subgroups may enhance fairness, some civil rights groups have expressed concern that the use of super subgroups will result in students from disadvantaged subgroups not receiving the attention and support they need.
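A performance-based super subgroup of this kind can be sketched in a few lines. The text says only that the top 30 percent is compared to the bottom 30 percent in each school; the use of a simple difference in mean scores, the function name, and the student scores below are assumptions made for illustration, not Michigan's actual formula.

```python
def top_bottom_gap(student_scores, share_pct=30):
    """Gap between the mean score of the top and bottom `share_pct` percent.

    The groups are defined by achievement alone, so the measure is
    agnostic to student demographics.
    """
    ranked = sorted(student_scores)
    k = max(1, len(ranked) * share_pct // 100)  # students in each group
    bottom_mean = sum(ranked[:k]) / k
    top_mean = sum(ranked[-k:]) / k
    return top_mean - bottom_mean


# Hypothetical scale scores for ten students in one school.
school_scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
gap = top_bottom_gap(school_scores)
```

Because every school has a top and a bottom 30 percent, a measure like this applies equally to schools with many demographic subgroups and to schools with few, which is why it reduces the diversity penalty.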
Composite indexes incorporating growth models will also be fairer than those based more heavily on status measures, because the correlations of growth measures to student characteristics are smaller. However, these correlations will not be zero, and some have expressed concerns about bias. It is possible to construct growth models that explicitly control for student demographics, but USDOE waiver guidelines prohibited this. Thus, while most states use student growth in their composite indexes, none of these models control for student demographics; these models will likely be biased by student characteristics unrelated to school practices and policies.
Transparency: Grade = C. On the surface, the new grading systems in place in most states seem even more transparent than NCLB’s AYP system. Many states use either an A–F or point-based index system that condenses multiple measures into an aggregate. These indexes, because of their familiar form, should be fairly interpretable by educators and the public.
Yet there are several problems that limit the transparency of these indexes. Perhaps the most glaring is that many states have composite indexes but do not use them to identify priority or focus schools, as demonstrated in table 4. Among the 38 states that have a composite index, 12 do not use it to identify priority schools and 18 do not use it to identify focus schools. Another 3 states use either a modification of their indexes or only some portion of their indexes to identify priority schools, and 10 use a modification for focus schools. For instance, to identify focus schools, Minnesota recalculates its regular index using only the historically low-performing subgroups. In these cases, the schools with the lowest composite index will not necessarily be the ones identified as priority or focus. In some states, there is a third method of measuring performance for the AMOs. Only 16 states determine either priority or focus schools using the same measure as for their AMOs.
Another limit on transparency is that many state indexes apply seemingly arbitrary weights to unrelated measures to arrive at a composite score. For instance, South Dakota’s School Performance Index is a 100-point scale. For elementary schools, 25 percent of the grade is based on proficiency rates, 25 percent on growth, 20 percent on attendance, 20 percent on educator effectiveness, and 10 percent on school climate. The points are allocated in proportion to the raw values. For instance, a school with 70 percent of educators rated “proficient” would receive 14 points for that component. Thus, while the 100-point index is conceptually transparent, it is not clear whether a school that scores 80, for example, is effective. Furthermore, it is not immediately apparent from the index what a school might do to improve its score.
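The arithmetic behind such an index can be written out as a minimal sketch. The weights and the 70 percent educator figure come from the description above; the other raw values, the names, and the assumption that every component is expressed as a simple percentage are hypothetical.

```python
# Elementary-school weights as described in the text (points out of 100).
WEIGHTS = {
    "proficiency": 25,
    "growth": 25,
    "attendance": 20,
    "educator_effectiveness": 20,
    "school_climate": 10,
}


def component_points(raw_percentages):
    """Allocate each component's points in proportion to its raw value (0-100)."""
    return {name: WEIGHTS[name] * raw_percentages[name] / 100 for name in WEIGHTS}


# Hypothetical school; only the 70 percent educator figure is from the text.
school = {
    "proficiency": 60,
    "growth": 50,
    "attendance": 95,
    "educator_effectiveness": 70,
    "school_climate": 80,
}

points = component_points(school)
total = sum(points.values())
```

Under these assumptions the educator-effectiveness component contributes 14 points, matching the example in the text, and the total is 68.5. The sketch also shows the interpretive problem: a one-point gain in the total can come from very different components, so the score alone says little about what a school should change.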
A final challenge in some indexes is in the calculation of the subcomponents using contingency tables to transform continuous variables for the purposes of inclusion in the index. These approaches are conceptually unclear, and states rarely offer rationales for their use. These approaches also appear to suffer from the potential “bubble-kid” problems associated with NCLB’s AYP system. Given all of these complications, we grade the average state waiver system a C on transparency.
Overall, the waivers provide a mixed bag of improvements over, and duplications of, NCLB’s problems. In many of the waivers, states have strengthened the construct validity of their accountability systems by using nontest measures and measures of student growth. These changes should capture more of the multidimensional nature of schooling, increasing the alignment between incentives and desired outcomes. State systems are also somewhat fairer than NCLB’s AYP, which may reduce some of the negative unintended consequences of accountability.
In most states, however, many of NCLB’s problems are duplicated. Indeed, the reliability of the waiver classifications may well be worse than NCLB’s AYP, and the transparency is probably a wash. While states were allowed to test additional subjects for accountability, only 14 of the states chose to do so, and all but perhaps Michigan maintain a focus on math and ELA. This decision may be motivated by concerns about cost or the amount of testing in schools. But given complaints about narrowing the curriculum to math and ELA, and given that states currently spend a fraction of 1 percent of education dollars on testing, it is surprising that many states did not take the opportunity to expand accountability to other content areas. Also, since proficiency rates are highly correlated with student demographics, the reliance on proficiency status to identify schools will continue to make schools serving students from historically disadvantaged groups disproportionately likely to fail, regardless of their effectiveness at improving student outcomes.
The waivers also offered an opportunity for states to incorporate growth measures into their accountability systems. However, the type and use of growth measures across the waivers varies dramatically. Further, USDOE’s decision to prohibit states from controlling for student demographics will create performance measures that are biased by factors unrelated to school policies and practices. Several states implementing growth measures are not using them to identify priority and focus schools. Finally, though states could increase the reliability of their growth measures by using multiple years of data, few made this choice.
It is likely that any accountability system will fall short on one or more dimensions. The waiver systems perfectly illustrate these tensions. While there were some positive changes over NCLB’s AYP, the net change is not as great as it could have been. Undoubtedly, political considerations shaped states’ choices. Still, several politically feasible changes would improve state systems and could mitigate the unintended consequences of the current waiver plans.
The first and most important policy recommendation is to incorporate the lessons learned from NCLB into the implementation of the waiver applications. For example, construct validity would be improved if states were to move away from using unadjusted proficiency rates to identify schools and added additional tested subjects to accountability. Since all states are required to test science, they should include science testing results in priority and focus determinations.
To further improve construct validity and fairness, USDOE ought to allow states to control for student demographics in school performance measures. Political pressures in states may push against different targets for different groups, but schools have had different targets for different groups for a decade through NCLB’s safe harbor provision, and many states opted to set USDOE-endorsed, subgroup-specific AMOs under the waivers. Thus, most states are already setting different targets for different groups. By excluding student demographics from performance measures, the system expects the same performance from all schools regardless of their student inputs, penalizing schools for factors they cannot control. This unfairness may contribute to unintended consequences such as teachers preferring to work in schools serving more affluent children.
Another way to improve construct validity is to move away from within-state achievement gaps and toward within-school or within-district gaps. This would in turn shift the focus away from low-performing subgroups and toward reducing the gap within a school or district, sending a clear message that all students within a school deserve attention and effort.
To improve the reliability of performance classifications, states should use multiple years of data for school performance measures, especially measures incorporating student growth. Reliability would also improve if states moved away from the arbitrary norm-referenced approach to identifying priority and focus schools encouraged by USDOE guidelines. Although setting the bar at the bottom 5 or 10 percent creates a more manageable sample size and reduces total costs, it also adds noise to the system. By design, these cutoffs send the message that the 10th percentile is failing but the 11th percentile is not, even though these schools may not meaningfully differ. Rather, schools may benefit from a clear operational definition of a low-performing school that is based on a set of performance criteria. Furthermore, research suggests that consequential accountability systems may be more likely than nonconsequential systems to raise student achievement. Based on the existing research, civil rights groups have expressed concern that if only a small share of schools is threatened by the priority and focus labels, then the waiver program may provide weaker incentives than NCLB.
To improve both transparency and construct validity of classification systems, states should carefully evaluate their composite measures. While A–F systems are on the surface transparent, the underlying design of these systems involves a great deal of arbitrariness that makes it difficult for educators and parents to understand performance. Keeping indicators separate may allow for a better understanding of the strengths and weaknesses of schools that can be used to tailor interventions.
Finally, states should conduct short-term analyses of the implementation of their waiver systems and make adjustments. NCLB’s issues were evident shortly after implementation, yet little was done to mitigate them.
None of these recommendations, on their own, will solve the challenges of school accountability. However, if policymakers follow them, accountability systems will have more validity, reliability, fairness, and transparency, thus reducing the unintended consequences of standards-based accountability in the future.
Morgan S. Polikoff (firstname.lastname@example.org) is an assistant professor at the University of Southern California, Andrew McEachin is an assistant professor at North Carolina State University (email@example.com), and Stephani L. Wrabel (firstname.lastname@example.org) and Matthew Duque (email@example.com) are PhD candidates at the University of Southern California.
This Outlook is based on a paper the authors published in Educational Researcher. See Morgan S. Polikoff et al., “The Waive of the Future? School Accountability in the Waiver Era,” Educational Researcher 43, no. 1 (January/February 2014): 45–54.
1. Robert Balfanz et al., “Are NCLB’s Measures, Incentives, and Improvement Strategies the Right Ones for the Nation’s Low-Performing High Schools?” American Educational Research Journal 44, no. 3 (2007): 559–93; Elizabeth Davidson et al., “Fifty Ways to Leave a Child Behind: Idiosyncrasies and Discrepancies in States’ Implementation of NCLB” (working paper no. 18988, National Bureau of Economic Research, April 2013), www.nber.org/papers/w18988.pdf?new_window=1; Andrew D. Ho, “The Problem with ‘Proficiency’: Limitations of Statistics and Policy Under No Child Left Behind,” Educational Researcher 37, no. 6 (2008): 351–60; Robert L. Linn and Carolyn Haug, “Stability of School-Building Accountability Scores and Gains,” Educational Evaluation and Policy Analysis 24, no. 1 (2002): 29–36; Morgan S. Polikoff and Stephani L. Wrabel, “When is 100% Not 100%? The Use of Safe Harbor to Make Adequate Yearly Progress,” Education Finance and Policy 8, no. 2 (2013): 251–70; and Andrew C. Porter et al., “The Effects of State Decisions about NCLB Adequate Yearly Progress Targets,” Educational Measurement: Issues and Practice 24, no. 4 (2005): 32–39.
2. David N. Figlio and Helen F. Ladd, “School Accountability and Student Achievement” in Handbook of Research in Education Finance and Policy, ed. Helen F. Ladd and Edward Fiske (New York, NY: Routledge, 2007); and Marshall S. Smith and Jennifer A. O’Day, “Systemic School Reform” in The Politics of Curriculum and Testing, ed. Susan H. Fuhrman and Betty Malen (New York: Falmer Press, 1990).
3. Martin Carnoy and Susanna Loeb, “Does External Accountability Affect Student Outcomes? A Cross-State Analysis,” Educational Evaluation and Policy Analysis 24, no. 4 (2002): 305–31; Thomas Dee, Brian Jacob, and Nathaniel L. Schwartz, “The Effects of NCLB on School Resources and Practices,” Educational Evaluation and Policy Analysis 35, no. 2 (2013): 252–79; Eric A. Hanushek and Margaret E. Raymond, “Does School Accountability Lead to Improved Student Performance?” Journal of Policy Analysis and Management 24, no. 2 (2005): 297–327; and Cecilia Elena Rouse et al., “Feeling the Florida Heat? How Low-Performing Schools Respond to Voucher and Accountability Pressure,” American Economic Journal: Economic Policy 5, no. 2 (2013): 251–81.
4. Balfanz et al., “NCLB’s Measures, Incentives, and Improvement Strategies;” Jennifer Booher-Jennings, “Below the Bubble: ‘Educational Triage’ and the Texas Accountability System,” American Educational Research Journal 42, no. 2 (2005): 231–68; Derek Neal and Diana Whitmore Schanzenbach, “Left Behind by Design: Proficiency Counts and Test-Based Accountability,” Review of Economics and Statistics 92, no. 2 (2010): 263–83.
5. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, Standards for Educational and Psychological Testing (Washington, DC: 1999); Eva L. Baker and Robert L. Linn, “Validity Issues for Accountability Systems,” in Redesigning Accountability Systems for Education, ed. Susan H. Fuhrman and Richard F. Elmore (New York: Teachers College Press, 2004); Thomas J. Kane and Douglas O. Staiger, “The Promise and Pitfalls of Using Imprecise School Accountability Measures,” Journal of Economic Perspectives 16, no. 4 (2002): 91–114; Robert L. Linn, “Assessments and Accountability,” Educational Researcher 29, no. 2 (2000): 4–16; and Robert L. Linn, “Accountability Models,” in Redesigning Accountability Systems for Education, ed. Susan H. Fuhrman and Richard F. Elmore (New York: Teachers College Press, 2004).
6. Ronald H. Heck, “Assessing School Achievement Progress: Comparing Alternative Approaches,” Educational Administration Quarterly 42, no. 5 (2006): 667–99; John M. Krieg and Paul Storer, “How Much Do Students Matter? Applying the Oaxaca Decomposition to Explain Determinants of Adequate Yearly Progress,” Contemporary Economic Policy 24, no. 4 (2006): 563–81; and Michael J. Weiss and Henry May, “A Policy Analysis of the Federal Growth Model Pilot Program’s Measures of School Performance: The Florida Case,” Education Finance and Policy 7, no. 1 (2012): 44–73.
7. Linn, “Accountability Models;” Kane and Staiger, “The Promise and Pitfalls of Using Imprecise School Accountability Measures.”
8. Weiss and May, “A Policy Analysis;” and Polikoff and Wrabel, “When Is 100% Not 100%?”
9. Linda Crocker and James Algina, Introduction to Classical and Modern Test Theory (Independence: Wadsworth Publishing Company, 2006).
10. Kane and Staiger, “The Promise and Pitfalls of Using Imprecise School Accountability Measures.”
11. Andrew McEachin and Morgan S. Polikoff, “We Are the 5%: Which Schools Would Be Held Accountable under a Proposed Revision of the Elementary and Secondary Education Act?” Educational Researcher 41, no. 7 (2012): 243–51.
12. Gadi Barlevy and Derek Neal, “Pay for Percentile,” American Economic Review 102, no. 5 (2012): 1805–31.
13. Charles Clotfelter and Helen F. Ladd, “Recognizing and Rewarding Success in Public Schools,” in Holding Schools Accountable: Performance-Based Reform in Education, ed. Helen F. Ladd (Washington, DC: Brookings Institution, 1996).
14. Balfanz et al., “NCLB’s Measures, Incentives, and Improvement Strategies;” Krieg and Storer, “How Much Do Students Matter?”; David P. Sims, “Can Failure Succeed? Using Racial Subgroup Rules to Analyze the Effect of School Accountability Failure on Student Performance,” Economics of Education Review 32 (2013): 262–74; and Wayne Riddle and Nancy Kober, State Policy Differences Greatly Impact AYP Numbers (Washington, DC: Center on Education Policy, 2011), 1–22.
15. Balfanz et al., “NCLB’s Measures, Incentives, and Improvement Strategies;” and Mark Ehlert et al., “Selecting Growth Measures for School and Teacher Evaluations” (working paper, Department of Economics, University of Missouri–Columbia, 2013), http://economics.missouri.edu/working-papers/2012/WP1210_koedel.pdf.
16. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, Standards for Educational and Psychological Testing.
17. Baker and Linn, “Validity Issues for Accountability Systems.”
18. US Department of Education, National Center for Education Statistics, Mapping 2005 State Proficiency Standards onto the NAEP Scales: Research and Development Report (Washington, DC, 2007).
19. Polikoff and Wrabel, “When Is 100% Not 100%?”
20. A student growth percentile (SGP) is a percentile rank of a student’s achievement growth conditional on at least one year of prior achievement. To use SGPs in accountability, states generally take the median SGP of a school’s or teacher’s students. See Damian W. Betebenner, A Technical Overview of the Student Growth Percentile Methodology: Student Growth Percentiles and Percentile Growth Trajectories/Projections (Dover, NH: National Center for the Improvement of Educational Assessment, 2011), www.nj.gov/education/njsmart/performance/SGP_Technical_Overview.pdf.
21. Dee, Jacob, and Schwartz, “The Effects of NCLB on School Resources and Practices.”
22. Kane and Staiger, “The Promise and Pitfalls of Using Imprecise School Accountability Measures.”
23. Dan Goldhaber and Michael Hansen, “Is It Just a Bad Class? Assessing the Long-Term Stability of Estimated Teacher Performance,” Economica 80, no. 318 (2013): 589–612; and Daniel F. McCaffrey et al., “The Intertemporal Variability of Teacher Effect Estimates,” Education Finance and Policy 4, no. 4 (2009): 572–606.
24. McEachin and Polikoff, “We Are the 5%.”
25. For example, see Rufina A. Hernández, Maintaining a Focus on Subgroups in an Era of Elementary and Secondary Education Act Waivers (Washington, DC: Campaign for High School Equity, 2013), www.highschoolequity.org/images/WaiversReport_R8.pdf.
26. Ehlert et al., “Selecting Growth Measures for School and Teacher Evaluations.”
27. Booher-Jennings, “Below the Bubble: ‘Educational Triage’ and the Texas Accountability System.”
28. Matthew M. Chingos, Strength in Numbers: State Spending on K–12 Assessment Systems (Washington, DC: Brown Center on Education Policy, Brookings Institution, 2012), www.brookings.edu/~/media/research/files/reports/2012/1/29%20cost%20of%20assessment%20chingos/11_assessment_chingos_final_new.pdf.
29. Polikoff and Wrabel, “When Is 100% Not 100%?”
30. Neal and Schanzenbach, “Left Behind by Design: Proficiency Counts and Test-Based Accountability.”
31. Hanushek and Raymond, “Does School Accountability Lead to Improved Student Performance?”
32. For example, see Hernández, Maintaining a Focus on Subgroups in an Era of Elementary and Secondary Education Act Waivers.