As the Obama administration’s evidence agenda rolls forward, it is beginning to stir countervailing views on the appropriate use of evidence. What counts as evidence in the field of social policy? How should it be used? How will the challenges of replicating evidence-based programs be overcome?
This article asks those questions of three different experts in the field of evidence-based policy, each with a distinct view on the evidence debate: Jon Baron from the Coalition for Evidence-Based Policy, Lisbeth Schorr of the Center for the Study of Social Policy, and David Muhlhausen of the Heritage Foundation.
Rise of the Randomized Controlled Trial (RCT)
If you were looking for someone outside government whose thinking on evidence seems to have gained significant traction with executive branch officials in both the Bush and Obama administrations, it would be difficult to find a better organization than the Coalition for Evidence-Based Policy, headed by Jon Baron.
The nonpartisan Coalition’s views are so well respected by the current administration that the group’s work was prominently cited in a 2012 White House memo to federal agencies on the use of evidence in their budget requests. The reference was to the use of administrative data for low-cost, rigorous evaluations, something that received a leg up early this year when OMB released a new memo endorsing its use.
Baron’s organization is known for advocating the use of randomized controlled trials (RCTs), where feasible, to evaluate the effectiveness of social policies. In RCTs, program participants are randomly assigned to one of two groups: a treatment group that receives the tested program services, or another group, called the control group, that receives standard services. Randomly assigning participants to one of the two groups ensures that they are comparable in both observable characteristics, such as age, income level, and educational background, and unobservable characteristics, such as motivation and family support. At the end of the study, comparisons of outcomes between the groups are used to determine whether the tested program has produced results.
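To make that logic concrete, here is a minimal sketch in Python using entirely made-up numbers: participants are assigned to treatment or control by a coin flip, a hypothetical program effect is applied, and the estimated impact is simply the difference in average outcomes between the two groups. The three-point “true effect” and all other figures are illustrative assumptions, not drawn from any actual study.

```python
# Minimal sketch (hypothetical data): random assignment and a
# difference-in-means comparison, the basic logic behind an RCT.
import random
import statistics

random.seed(42)

# Simulated participants: each has an unobserved "baseline" outcome level.
participants = [{"id": i, "baseline": random.gauss(50, 10)} for i in range(1000)]

# Random assignment: a coin flip puts each participant in treatment or control,
# so the two groups are comparable on observed and unobserved traits alike.
for p in participants:
    p["group"] = "treatment" if random.random() < 0.5 else "control"

# Hypothetical program effect: the treatment adds 3 points to the outcome.
TRUE_EFFECT = 3.0
for p in participants:
    noise = random.gauss(0, 5)
    p["outcome"] = p["baseline"] + noise + (TRUE_EFFECT if p["group"] == "treatment" else 0)

treated = [p["outcome"] for p in participants if p["group"] == "treatment"]
control = [p["outcome"] for p in participants if p["group"] == "control"]

# The estimated impact is the difference in average outcomes between groups.
estimate = statistics.mean(treated) - statistics.mean(control)
print(f"Estimated impact: {estimate:.2f} (true effect: {TRUE_EFFECT})")
```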
RCTs are sometimes called the “gold standard” for evaluating program effectiveness — although, as we shall see, there is some disagreement about this. Baron believes that large, well-conducted RCTs are needed because the results of other, more preliminary studies — including small, short-term RCTs and many non-randomized forms of evaluation — are too often not confirmed when they are subjected to more definitive RCT evaluations.
For example, in medicine, early evaluations suggesting that a new drug is effective are overturned by subsequent, more rigorous RCTs 50 to 80 percent of the time. Baron sees little reason to believe that evaluations of social programs are much different. According to a brief released by the Coalition earlier this month:
[I]n education policy, programs such as the Cognitive Tutor, Project CRISS, and LETRS teacher professional development – whose initial research findings were promising (e.g., met IES’s What Works Clearinghouse standards) – have unfortunately not been able to reproduce those findings in large replication RCTs sponsored by IES. In employment and training policy, positive initial findings for the Quantum Opportunity Program, and Center for Employment Training – programs once widely viewed as evidence based – have not been reproduced in replication RCTs sponsored by the Department of Labor. A similar pattern occurs across other diverse areas of policy and science where rigorous RCTs are carried out.
Baron concedes that RCTs are not possible in every situation. For example, it would not be legally allowable or desirable to assess veterans’ programs by withholding benefits from veterans who were otherwise entitled to them. Similarly, laws cannot generally be applied randomly to some people and not to others.
In such cases, Baron says that other types of studies can be conducted that are nearly as good. For example, potential program enrollees who are disqualified because their incomes are just above a defined income threshold can be compared to other enrollees whose incomes fall just below it. Such quasi-experimental designs that mimic random assignment can come close to RCTs.
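The threshold comparison Baron describes is often called a regression discontinuity design. The short sketch below, built on simulated and entirely hypothetical numbers, shows the idea: applicants just below an income cutoff (eligible) are compared with applicants just above it (ineligible), on the reasoning that the two groups are nearly identical apart from eligibility.

```python
# Minimal sketch (simulated data) of the threshold comparison described above,
# a quasi-experimental approach often called a regression discontinuity design.
# All numbers here are hypothetical.
import random
import statistics

random.seed(0)
CUTOFF = 30_000       # hypothetical income eligibility threshold
BANDWIDTH = 2_000     # compare only people within this band around the cutoff
TRUE_EFFECT = 4.0     # hypothetical program effect on the outcome

people = []
for _ in range(5_000):
    income = random.uniform(20_000, 40_000)
    eligible = income < CUTOFF
    # Outcome drifts slightly with income and includes noise; eligible
    # applicants also receive the program and its effect.
    outcome = 40 + income / 10_000 + random.gauss(0, 3) + (TRUE_EFFECT if eligible else 0)
    people.append((income, eligible, outcome))

just_below = [o for inc, e, o in people if e and inc >= CUTOFF - BANDWIDTH]
just_above = [o for inc, e, o in people if not e and inc <= CUTOFF + BANDWIDTH]

# Near the cutoff the two groups are nearly identical except for eligibility,
# so the difference in average outcomes approximates the program's impact.
# (A real analysis would also adjust for the underlying trend in income.)
estimate = statistics.mean(just_below) - statistics.mean(just_above)
print(f"Estimated impact near the cutoff: {estimate:.2f} (true effect: {TRUE_EFFECT})")
```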
But overall, he says, the most widely used evaluation methods are not as rigorous, and studies show that they can produce erroneous conclusions. The reasons are varied. Some studies rely on methods that examine changes in outcomes over time, after a service has been provided. While such changes can be indicative of a program’s impact, they are not considered solid evidence because there is no comparison group.
Other studies may include comparison groups, but the number of participants may be too small to be statistically reliable. In others, the methods used to create the two groups (for example, comparing participants who volunteer for a program to others who do not) may cause the two groups to differ in important (and sometimes unobservable) ways, skewing the results. Results may also be localized and only applicable to a particular geographic area or demographic group.
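The sample-size point can be illustrated with another small simulation, again using made-up numbers: when a program has no true effect at all, small comparison-group studies frequently produce what look like sizable impacts purely by chance, while large studies almost never do.

```python
# Minimal sketch (simulated data) of why small comparison groups are unreliable:
# for a program whose true effect is zero, small studies often show large
# apparent "impacts" purely by chance, while large studies rarely do.
import random
import statistics

random.seed(1)

def run_study(n_per_group):
    """Simulate one study of a program with no true effect."""
    treatment = [random.gauss(50, 10) for _ in range(n_per_group)]
    comparison = [random.gauss(50, 10) for _ in range(n_per_group)]
    return statistics.mean(treatment) - statistics.mean(comparison)

for n in (20, 2000):
    estimates = [run_study(n) for _ in range(1000)]
    big = sum(abs(e) > 3 for e in estimates)  # spurious "impacts" of 3+ points
    print(f"n={n:>4} per group: {big / 10:.1f}% of studies show a 3+ point effect by chance")
```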
“Such analyses can be valuable for identifying promising programs worthy of further study,” he said. “But they are not enough on their own to prove significant impact. That’s why corroboration in a more definitive study is needed.”
Baron is not suggesting that RCTs are infallible, however. All studies, including RCTs, can be vulnerable to poor measurement and poor statistical analysis, among other potential problems. That’s why Baron insists that RCTs need to be “well conducted.”
To help researchers avoid such errors, the Coalition has developed a checklist. The various federally supported research clearinghouses, such as the What Works Clearinghouse at the U.S. Department of Education, have similar checklists for evaluating existing research.
Baron’s views appear to have found substantial support in the Obama administration. It has already created several programs, such as the Social Innovation Fund and Investing in Innovation (i3) program, that allocate funding on the basis of evidence, with RCTs commonly judged to be the highest form. There are also several federal evaluation clearinghouses in addition to the one at the Department of Education, including Crime Solutions at the Department of Justice and the Clearinghouse of Labor Evaluation and Research (CLEAR) at the Department of Labor. Moreover, every year the White House has directed federal agencies to justify both their existing programs and new proposals with evidence as part of the annual budget process.
“The Obama administration is doing a whole bunch of things that, taken together, make it by far the most evidence-based administration ever,” Ron Haskins, a policy director with the Brookings Institution and former Republican congressional staffer, told The New York Times earlier this year. Haskins has a new book coming out in December (Show Me the Evidence: Obama’s Fight for Rigor and Results in Social Policy) that reviews the administration’s efforts in more detail.
In another article, Haskins wrote that while i3 and the other evidence-based initiatives have recognized other forms of evidence, “senior officials at OMB and in the agencies believed that the most reliable form of evidence on program effectiveness is provided by randomized control trials (RCTs).”
Baron agrees, but adds that the administration is simply recognizing a principle that is well-established in the scientific community. “Large, well-conducted RCTs are widely regarded as the strongest, most credible method of evaluating effectiveness by the Institute of Education Sciences, National Science Foundation, Congressional Budget Office, Food and Drug Administration, and other respected scientific bodies,” he said.
Support for RCTs is not limited to the administration, however. Rep. Paul Ryan (R-WI), the former vice-presidential candidate and a possible presidential contender in 2016, released an anti-poverty plan in July that stressed the need for new evidence for social programs. Seeing this, staff from the White House reached out to his office to explore possible areas of agreement. Congressional Republicans have also been supportive of social impact bonds, which proponents say will further increase the use of evidence-based programs.
Do We Need A Broader Evidence Base?
The administration’s and related congressional efforts to elevate RCTs above other methods of evaluating program effectiveness are not without their critics. One is Peter York, CEO of Algorhythm, an evaluation firm that works with nonprofits and public agencies. He questioned RCTs in a recent blog post on the Markets for Good website, calling them expensive, difficult to replicate, and potentially unethical, without being substantially more reliable than non-experimental observational studies.
Another prominent critic is Lisbeth Schorr, a former Harvard academic who is affiliated with the Center for the Study of Social Policy. In 2011, she co-authored a report with her colleague Frank Farrow, titled Expanding the Evidence Universe, which criticized what they see as an excessive reliance upon RCTs as the highest form of evidence. Their organization is hosting an event devoted to the topic next month.
Schorr agrees that RCTs can be valuable in some cases but argues that they pose problems of their own. One is that RCTs tend to be relatively simple and binary in their conclusions — does a program work, yes or no — with little explanation of how or why.
“RCTs are not useful in understanding interventions that are complex and interactive,” she said, “or that involve collaboration across professional and organizational boundaries, or that are not targeted only on individuals, but may aim at changing neighborhoods, norms, and systems.”
She says RCTs are also inadequate for judging initiatives that are tailored to a particular time, place, or population. This can create significant problems when efforts are made to scale up and replicate successful programs in other settings, where strict adherence to the original program model may not be appropriate.
This has contributed to the highly uneven track record of replication efforts, she said, even when substantial resources are invested in training and strenuous efforts are made to maintain fidelity to the original model.
Maintaining such fidelity is far from easy. According to Schorr:
Spreading an effective program is a resource-intensive job. At a minimum … it requires implementers to: identify and document the essential and adaptive program elements; create implementation and training guides that encourage consistency across sites; establish dedicated staff positions to support the replication effort; develop a network of strategic partnerships that spans from the local to the national level; establish a universal data collection system to monitor results; and establish standardized training and ongoing technical assistance.
Moreover, she said, the expense and time required to conduct RCTs means that there are relatively few of them and the ones that exist are often dated.
“The [Nurse-Family Partnership] is widely considered the most rigorously proven early childhood intervention,” she writes. “But to reap the benefits of being a ‘proven’ program, this model, designed three decades ago, might be considered frozen in time since it has not been adapted to take account of the explosion of knowledge in the last two decades.”
But she sees an alternative:
One way to expand the evidence base about “what works” is by identifying the commonalities among successful interventions aimed at similar goals. We would not have to rely on copying programs that may or may not work when brought to bear in different environments and with different populations, because good practices and programs can then be scaled up ever more strategically.
She cited two examples, one in juvenile justice and another for afterschool programs, that used this approach. She sees such methods as more adaptive and flexible than rigid programs implemented with strict fidelity, particularly if they are coupled with performance management practices that track outcomes in real time.
York agrees. Writing about RCTs, he argued that “comparative group designs do not lend themselves to the real-time learning and fast program adaptations demanded by the complex and tumultuous environment in which nonprofits operate today.”
The Conservative Case for More Flexibility
Another set of criticisms comes from the conservative end of the political spectrum. David Muhlhausen, a research fellow with the Heritage Foundation, testified at a House committee hearing on evidence-based policy last year.
He believes that more federal programs need large-scale, multi-site experimental impact evaluations. To date, only twenty-one have been conducted, one of which evaluated Head Start. He described that program in an article published earlier this year:
Head Start, a federal program that funds preschool initiatives for the poor, was based on a modest number of small-scale, randomized experiments showing positive cognitive outcomes associated with preschool intervention. These limited evaluations helped trigger expenditures of over $200 billion since 1965. Yet the scaled-up national program never underwent a thorough, scientifically rigorous evaluation of its effectiveness until Congress mandated a study in 1998. Even then, the publication of the study’s results (documenting the program’s effects as measured in children in kindergarten, first grade, and third grade) was delayed for four years after data collection was completed. When finally released, the results were disappointing, with almost all of the few, modest benefits associated with Head Start evaporating by kindergarten. It seems the program had been running for decades without achieving all that much.
While Muhlhausen supports the administration’s efforts to expand the use of evaluation, he is more skeptical of the administration’s follow through. “When they find out something doesn’t work, they don’t act on it or reduce funding for the program,” he said. “Head Start should be eliminated.”
The administration has made a production of looking for savings from failed programs, he said, but it has achieved very little in real savings. “It’s mostly for show.”
Like Schorr, Muhlhausen believes that replicating successful models is a serious challenge. But Muhlhausen’s answer is very different. He thinks the administration’s current approach is too top-down and too driven from Washington. Moreover, he said, efforts to faithfully replicate successful models are inherently focused on inputs rather than outcomes. He believes such efforts are ultimately doomed to fail.
Instead, he said, programs need to be block granted to give states and local governments maximum freedom to innovate. Accountability for results should be traced back to state and local elected officials, not bureaucrats in Washington.
Common Ground?
When asked, Baron agreed with some, but not all, of the points made by Schorr and Muhlhausen. He believes RCTs are more flexible than Schorr says. For example, he does not agree that the Nurse-Family Partnership is “frozen in time.” He says it has adapted its model over time through a continuous improvement process involving RCT testing, as described in a recent paper published in the peer-reviewed journal Pediatrics.
Moreover, recent advances in the allowable use of administrative data may make RCTs much easier and cheaper to conduct, with large samples and both short- and long-term follow-up. He points to a Coalition brief that describes examples of sizable, well-conducted RCTs costing between $50,000 and $320,000. He also cites a recent paper authored by Scott Cody and Andrew Asher at Mathematica Policy Research that describes the use of something called rapid-cycle evaluation, which can quickly determine the effectiveness of changes to existing programs.
Baron agrees with Muhlhausen that more evaluation is needed and that more flexibility to innovate is desirable, citing the success of federal waivers in TANF as a model. He is more skeptical, however, of Muhlhausen’s views on block granting and completely devolving programs to states and localities.
Nevertheless, Baron is encouraged by the across-the-board belief in the need for more evidence. He sees substantial agreement among the various points of view.