
Since Lipton, Martinson, and Wilks surveyed the effects of rehabilitation programs on recidivism, program evaluation has become much more sophisticated. A leading scholar in the field recently presented textbooks in education as an example:
the question that is sometimes left unanswered is, “Do the textbooks make a difference in children’s learning?” … What we want to do here is compare what actually happens with the textbooks to what would have happened without textbooks.^
In considering this type of question with respect to rehabilitating prisoners, Lipton, Martinson, and Wilks limited their review of rehabilitation programs to findings of evaluation research. Martinson described evaluation research as:
a special kind of research which was applied to criminal justice on a wide scale for the first time in California during the period immediately following World War II. This research is experimental – that is, offenders are often randomly allocated to treatment and nontreatment groups so that comparison can be made of outcomes.^
This experimental structure remains the foundation of scientific program evaluation. A leading scholar describes how such an experiment would be run for the hypothetical textbook program:
We argue that we need to follow the example of medicine and set up randomized experiments. Since resources are generally limited at the beginning of a program, it makes sense to select twice the number of people, or schools, and introduce the program to half the sample, randomly selected. In this way, we can be sure that those who benefited from the program are no different from those who did not. If we collect data on both groups and find a difference between those who were exposed and those who weren’t, we can conclude it’s the effect of the program. Everybody can then use this evidence to decide whether to take this program up in other contexts – the knowledge becomes a shared resource.^
Such an evaluation procedure provides a propitious structure for the competitive development of scientific treatment expertise. It has generated highly successful research programs. Unlike the received understanding of Martinson’s evaluation (“nothing works”), more recent program evaluation tends to produce specific, nuanced results that aren’t easily summarized in a slogan. Leading research programs do, however, show clear concern for poverty, inequality, and oppression.
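The logic of such a randomized evaluation is simple enough to state in a few lines of code. The following is a minimal sketch with simulated schools and invented numbers, not real data: half of a pool of schools is randomly assigned the program, and the difference in mean outcomes estimates the program’s effect.

```python
import random
import statistics

random.seed(0)

# Hypothetical pool: twice as many schools as the program can serve at first.
schools = list(range(200))
random.shuffle(schools)
treated, control = schools[:100], schools[100:]
treated_set = set(treated)

# Simulated test scores (illustrative numbers only):
# treated schools receive a small average boost.
outcomes = {s: random.gauss(60, 10) + (5 if s in treated_set else 0)
            for s in schools}

treated_mean = statistics.mean(outcomes[s] for s in treated)
control_mean = statistics.mean(outcomes[s] for s in control)

# Because assignment was random, the two groups differ only by chance,
# so the difference in means estimates the average effect of the program.
print(f"estimated program effect: {treated_mean - control_mean:.1f} points")
```

Random assignment is what licenses the final comparison: nothing but chance and the program itself distinguishes the two groups.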
Meaningfully interpreting and using program evaluations in new contexts is difficult. The Nobel Prize in Economic Sciences in 2000 was awarded for work that, among other contributions, discovered “evidence on the pervasiveness of heterogeneity and diversity in economic life.” This work emphasized carefully separating two questions:
(1) “What is the effect of a program in place on participants and nonparticipants compared to no program at all or some alternative program?”
This is what is now called the “treatment effect” problem. The second and the more ambitious question raised is
(2) “What is the likely effect of a new program or an old program applied to a new environment?”
The second question raises the same type of problems as arise from estimating the demand for a new good. Its answer usually requires structural estimation.^
The problem in practice is to distinguish between a “new program” and an “old program,” and the “same environment” and a “new environment.” The pervasiveness of heterogeneity and diversity in economic life underscores exactly this problem. Does new staff make for a new program? Does the passage of time produce a new environment? Such questions are crucial for evaluating program evaluations. An economist rationally answers such questions by calling for more economic research.
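A toy calculation shows why heterogeneity makes the second question harder than the first. Suppose, purely for illustration, that a program’s effect differs across two income subgroups. An experiment run in one environment then measures an average that need not transfer to an environment with a different subgroup mix; predicting the new average requires modeling the subgroup structure itself, which is the intuition behind structural estimation.

```python
# Illustrative numbers only: a program's effect differs by subgroup,
# so its average effect depends on the population it is applied to.
effect = {"low_income": 8.0, "high_income": 2.0}  # assumed subgroup effects

old_env = {"low_income": 0.7, "high_income": 0.3}  # evaluated environment
new_env = {"low_income": 0.2, "high_income": 0.8}  # proposed new environment

def average_effect(env):
    """Population-average effect given subgroup shares."""
    return sum(share * effect[group] for group, share in env.items())

print(f"average effect, old environment: {average_effect(old_env):.1f}")  # 6.2
print(f"average effect, new environment: {average_effect(new_env):.1f}")  # 3.2
```

Naively exporting the old environment’s 6.2-point average to the new environment would overstate the true 3.2-point effect by nearly a factor of two.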
Specific actions that can be easily measured daily are more amenable to program evaluation than are broad purposes realized over years. For example, a website owner might seek a greater number of visitors and a higher click-through rate on ads. Content and ad experiments, along with measurements of the resulting traffic and ad click-through rates, can be run cheaply and quickly. Some such local optimization steps, e.g. duping and exploiting visitors, can generate bad long-term results. Nonetheless, at least the instrumental, short-term effects of the experiments can be meaningfully measured.
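As a minimal sketch of how such measurements are analyzed, consider the standard two-proportion z-test applied to click counts from an A/B experiment; the figures below are invented for illustration.

```python
from math import sqrt, erfc

# Hypothetical daily counts from an A/B experiment on ad placement.
clicks_a, visits_a = 120, 4000  # current page
clicks_b, visits_b = 155, 4100  # variant page

p_a, p_b = clicks_a / visits_a, clicks_b / visits_b
p_pool = (clicks_a + clicks_b) / (visits_a + visits_b)

# Standard two-proportion z-test for a difference in click-through rates.
se = sqrt(p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b))
z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))  # two-sided, under the normal approximation

print(f"CTR A = {p_a:.2%}, CTR B = {p_b:.2%}, z = {z:.2f}, p = {p_value:.3f}")
```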
Treatment instruments are much more difficult to evaluate with respect to broad purposes realized over years. Reducing recidivism and improving education are purposes with a time horizon of years. Recidivism and education outcomes might be measured over years using incarceration and earnings records. Doing so would require highly sophisticated controls for changes in circumstances over years. Even if that could be done convincingly and generalizably, the purposes of reducing recidivism and improving education are not merely to keep persons out of jail and earning income. Free, knowledgeable persons are at the core of ideals of persons well-governed personally and collectively. Programs of punishment and education cannot be adequately evaluated using just feasible measurements of their instrumental effects.