Project Follow Through

The Follow Through evaluation
Project Follow Through was the largest and most expensive federally funded experiment in education ever conducted. The most extensive evaluation of Follow Through data covers the years 1968-1977; however, the program continued to receive funding from the government until 1995 (Egbert, 1981, p. 7). Follow Through was originally intended to be an extension of the government’s Head Start program, which delivered important educational, health, and social services to typically disadvantaged preschool children and their families. The function of Follow Through, therefore, was to provide a continuation of these services to students in their early elementary years.

In President Lyndon Johnson’s 1967 State of the Union address, he proposed 120 million dollars for the program (to serve approximately 200,000 children from disadvantaged backgrounds). However, when Congress approved funding for the project, it authorized only a fraction of that amount, merely 15 million dollars. This necessitated a change in strategy by the Office of Economic Opportunity (OEO), the government agency charged with oversight of the program (Egbert, 1981, pp. 3-4; Stebbins, et al., 1977, p. 2; Watkins, 1997, p. 4). In response, program administrators made the “brilliant decision… (to) convert Follow Through from a service program to a research and development program” (Evans, 1981, p. 5).

Follow Through planners felt that they were responding to an important challenge in the education of disadvantaged students. It was generally hypothesized that the mere provision of specific supports in the form of federal compensatory programs—such as Head Start and Title I— would result in increased academic achievement for disadvantaged children, if implemented faithfully by committed teachers. However, studies had shown that despite its successes, in general any gains that children made from Head Start (in measures of academic achievement) “faded out” during the first few years of elementary school (Maccoby & Zellner, 1970, p. 4; Stebbins, et al., 1977, p. 1). It was unclear to policy makers and others if the elementary school experience itself caused this phenomenon, or if specific approaches to instruction within schools were the problem. Follow Through intended to solve the problem by literally identifying what whole-school approaches to curriculum and instruction worked, and what did not. Subsequently, effective models were to be promulgated by the government as exemplars of innovative and proven methods of raising the academic achievement of historically disadvantaged students.

The sociopolitical context of Follow Through
Conceived and implemented in the midst of President Johnson’s War on Poverty campaign of the 1960s, Follow Through “came out of larger plan which attempted to lay out the causes and cures of poverty in American society on a systemic basis” (Evans, 1981, p. 2). In addition to the landmark Elementary and Secondary Education Act of 1965, other initiatives included economic policies designed to maintain high levels of employment and federally subsidized job training specifically targeted at people from disadvantaged backgrounds. These programs were implemented during the turbulent 1960s and 1970s, an era marked by the struggles and eventual enfranchisement of a number of formerly excluded constituencies “including African Americans, feminists, Hispanics, Native Americans, and parents of handicapped children” (Tyack & Cuban, 1995, p. 26; Rhine, 1981, p. 293).

Planned variation
In typical scientific experiments, treatment and control groups are selected through random assignment. Because Follow Through was an experiment designed to evaluate the efficacy of a number of interventions in local districts, districts chose the interventions they wanted implemented in their schools. This method of selecting a desired intervention among several candidates is called planned variation. One publication refers to it as “random innovation” (Rhine, 1981, p. 292), although there is nothing random about it. Planned variation was thought to be advantageous over random assignment for two reasons. First, it is incredibly difficult to conduct pure experimental research in education because there is nothing random about the communities in which children live, the nuances of the teachers who teach them, or the schools they attend. Second, Follow Through planners largely believed that the planned variation model would give local communities (e.g. community members, parents, administrators, and teachers) an element of local control over the implementation of their program (Elmore, 1977, pp. 187, 190; Rivlin, et al., 1975). In fact, Hill (1981) believed that programs like Follow Through “can be…permanent sources of advocacy pressure on behalf of the goals and beneficiaries of federal programs” (p. 7).

Goals
Follow Through, like Head Start, was enormous in scope, and designed to remedy the fact that “poor children tend to do poorly in school” (Stebbins, et al., 1977, p. xxxiii). Despite the cut in funding, it nevertheless served a substantial number of students. At its height, the program would grow to encompass twenty different sponsored interventions and approximately 352,000 Follow Through and comparison children in 178 projects nationwide (Egbert, 1981, p. 7; Stebbins, et al., 1977, pp. xix, 19). If one considers that funding for the last Follow Through sites ended in 1995, the project was indeed comprehensive in both depth and breadth.

In addition to identifying the most effective instructional practices and disseminating them to schools and districts, it was also hoped that Follow Through would help reduce the number of potentially conflicting federal intervention programs in schools, which was thought by some to be counterproductive and expensive (Hill, 1981, pp. -12, 20). Moreover, if models could be identified that were effective with needy children, these interventions could be staged in regular education classrooms as well (Hill, 1981, p. 20).

Program Administration
Because Follow Through came into existence through executive rather than legislative action, overall control of the program rested in President Lyndon Johnson’s Office of Economic Opportunity (OEO), which spearheaded Johnson’s War on Poverty policy. A major component of the policy was community involvement. The Community Action Program (CAP) was charged with fulfilling this function through the establishment of local agencies and programs that carried out various federally sponsored initiatives for disadvantaged populations. However, CAP (and, to some extent, OEO) fell into disrepute among legislators and others, because “it resulted in the political mobilization of the poor and the undermining of local government agencies” (Watkins, 1997, pp. 4-5). Follow Through was intended to be an extension of the Head Start community action program. Since Head Start was politically popular, a program associated with Head Start would put the OEO “back in the good graces of Congress” (p. 5). Although Follow Through, like Head Start, was initially intended as a social action program, the decision to transform it into a social experiment was not matched by a corresponding change in the congressional legislation (Evans, 1981, pp. 4-5). Head Start personnel remained involved in the design and implementation of Follow Through, although it appeared they were working at cross purposes with the planning group of the OEO, who viewed Follow Through as an empirical investigation (Egbert, 1981, p. 8). Much of what occurred during the planning stage, which Egbert (1981) describes “as a time of haste and confusion”, was an attempt to satisfy constituencies of both perspectives (p. 9).

Debates about purpose
Due largely to the sociocultural context in which Follow Through was born, planners deliberately structured the program to minimize the involvement of federal officials in the implementation effort (Watkins, 1997, p. 16; Elmore, 1977, p. 191). The more Follow Through could be perceived as a locally controlled effort, the better. OEO hoped “idealistically” that Follow Through could satisfy both empirical and social action purposes (Egbert, 1981, p. 4; Hill, 1981, pp. 8, 10).

It seems doubtful that any form of experiment could realistically and faithfully serve both aims. According to Hill (1981), true program evaluators should be “technical rather than political or programmatic, and their attitudes skeptical and dispassionate” (pp. 8-9). The planning group of OEO wanted a true empirical investigation that would determine the most effective models. Conversely, CAP and Head Start personnel advising the Follow Through planners viewed it as a social program. Thus, “neither set of constituent groups was fully satisfied with this solution” (Egbert, 1981, pp. 4-5).

Sponsors and models
If the planners of Follow Through had conflicting views on the real purpose of the program, the selection of sponsors was equally imprecise. Follow Through sponsors were an eclectic mix of individuals or groups conducting research on instructional methods. Some came from universities, including schools of education. Others were involved in private or grant-based research efforts (Watkins, 1997, p. 16). The selection method was unclear. According to Watkins (1997), “invitations were apparently extended to any group conducting research on instructional methods” (p. 16). Some of the sponsors had fairly well developed interventions based on theories of instruction. Others merely had ideas for what might constitute effective interventions. The sponsors also differed widely on the outcomes they expected as a result of their programs. Some sponsors had very specific goals which they believed would lead to very specific outcomes, such as improved literacy skills on measures of reading achievement. Others had more general goals, such as increased self-esteem, or heightened parental involvement in schooling. Most of the programs were in a very early stage of development and had not been extensively (or even moderately) field-tested or piloted. Some programs were so ambiguous that Elmore (1977) wrote that “most program developers were simply not clear what their programs would actually look like in a fully operational form” (p. 199). Many sponsors could not explain precisely which aspects of their models would lead to the stated outcome goals of the model.

Despite ambiguities among many of the models (and the minute shades of distinction between some models), the Follow Through literature classified models according to the degree of structure they offered and where they placed emphasis on learning.

The “degree of structure” (e.g. “low”, “medium”, or “high”) offered by a particular model is evidenced by how closely teachers were instructed to adhere to specific procedures, including: ways of arranging the classroom and delivering instruction, the degree of interaction between adults and children, the level of parental involvement, and so forth. Below are brief examples of two models that represent extremes of the spectrum.

Direct Instruction model. Developed by Siegfried Engelmann and Wesley Becker of the University of Oregon, Direct Instruction is scripted and specifies precisely what the teacher says and what the students’ responses should be. Moreover, the program designers carefully sequenced the instruction so that students do not progress to higher-order skills unless they have mastered prerequisite basic skills. There is a high degree of interaction between teachers and students, so the teacher receives continuous feedback about how well the students are doing and adjusts instruction accordingly. The program makes a specific distinction between on-task and off-task behavior: instruction is arranged so that students are fully engaged in learning (via frequent checks for understanding and praise from the teacher) the majority of the time. According to the program sponsors, anything presumed to be learned by students must first be taught by the teacher (Maccoby & Zellner, 1970, p. 8).

Bank Street model. The Bank Street model was developed by Elizabeth Gilkerson and Herbert Zimiles of the Bank Street College of Education in New York. In this model, the students themselves direct learning: they select what tasks they wish to engage in, alone or with peers. The teacher arranges the classroom in ways that the sponsors believe will create the conditions for successful learning: various objects and media are available for children to interact with, and the teacher acts as a facilitator, guiding students through activities. According to the program sponsors, students use previously learned knowledge to construct new knowledge, and, given a safe and stable environment, learning is a process that occurs naturally (Maccoby & Zellner, 1970, pp. 10-11).

In his evaluation of the operational facets of Follow Through, Elmore (1977) expressed concern that the shades of distinction among models in terms of structure made comparisons and final analysis among models problematic. Descriptions of the interventions came from the sponsors themselves. There was no other reliable source from which the program administrators could obtain information about them. Indeed, had they been able to see examples of the different models being implemented, they might have been able to ask clarifying questions in order to better distinguish between them, and for purposes of assessment.

Program models were also classified by where they place emphasis on learning, according to three educational orientations: basic skills, cognitive conceptual skills, and affective/cognitive behavior (also see Appendix A).


 * Basic Skills Models: Concerned primarily with the teaching of basic skills (e.g., the “elementary skills of vocabulary, arithmetic computation, spelling, and language” (Stebbins, et al., 1977, p. xxiii))
 * Cognitive Conceptual Skills Models: Emphasized so-called “higher-order thinking skills” and problem-solving skills (Stebbins, et al., 1977, p. xxiii)
 * Affective/Cognitive Skills Models: Focused on students’ affect (i.e., self-esteem), on the premise that feelings of positive self-worth lead to success in cognitive skills (Stebbins, et al., 1977, p. xxiv)

Despite the differences, there were points of agreement among all sponsors. First, sponsors agreed that their interventions should be developmentally appropriate; that is, models should take account of where students are in their development as learners. Second, everyone agreed that teaching and learning should be responsive to the needs of individual learners. Third, they agreed that all students, even those from the most disadvantaged backgrounds, could learn to the level of their more fortunate peers. Fourth, classroom management procedures that create an appropriate learning environment should be emphasized. Fifth, school should be a place where students experience both high self-esteem and academic success. Ironically, the last point of agreement, as far as Maccoby and Zellner (1970, pp. 23-25) were concerned, was that all interventions should have very clear objectives about the content and skills that students should know and be able to do. This last detail is worth noting for two reasons. First, the program outcome goals that were provided by sponsors appeared relatively broad. For example, the sponsors of the Tucson Early Education Model explain that “there is relatively less emphasis on which items are taught and on the transmission of specific content, and more emphasis on ‘learning to learn’” (Maccoby & Zellner, 1970, pp. 15-16). Likewise, teachers of the Cognitive Curriculum design their own approaches to instruction (including the specification of learning goals), with assistance from sponsors and fellow staff members (Maccoby & Zellner, 1970, pp. 20-21). Second, while the outcome goals might commonly be described as high levels of academic achievement or mastery of basic and higher-order thinking skills, exactly how students demonstrate these skills is missing in the Follow Through literature.
During sponsor meetings, there were several heated arguments between some sponsors about the degree of specificity to which they should link facets of their models to student outcomes or behaviors (Watkins, 1997, p. 17). Follow Through administrators could not investigate models more thoroughly because of limited time; indeed, only eight months separated the selection of the sponsored model approach and the start of the experiment. Because Congress had already reduced the program budget, there was legitimate concern among planners that a delay in implementation could be disastrous to the program (Elmore, 1977, p. 174). Another reality was simply the lack of alternate interventions. Because such a large-scale experiment in education had never been done before, the Office of Education had no arsenal of interventions to try out (Elmore, 1977, p. 186).

Selection of Follow Through communities
Selection of Follow Through implementation sites proceeded in concert with the selection of models. With the assistance of various state and federal education agencies, 100 communities were invited to apply to the program, based on criteria established by the OEO. According to Egbert (1981), 90 districts applied, of which 30 were chosen for participation in Follow Through. However, due to pressure from influential politicians, additional sites were later added; the inclusion of several additional districts appears to have been an attempt to satisfy a number of local political figures by including their communities in the program (Egbert, 1981, p. 9). While Elmore (1977) laments that sites could have been chosen with a greater degree of scientific rigor (e.g. stratified random sampling), this would have been impossible for at least two reasons. First, Follow Through administrators had the obligation to select a minimum number of sites with Head Start programs, because the ostensible purpose of Follow Through was to complement Head Start. Second, aside from political pressures, communities had to be willing participants in the process, ostensibly to preserve the fidelity of the implementations. On this point, Elmore (1977, p. 215) tends to agree.

Measurement instruments and analytical methods
Because of the range of models, a broad range of instruments was selected in order to measure the targeted outcomes of basic skills, affective behavior, and cognitive behavior. Adams and Engelmann (1996) write, “while critics have complained about test selection and have usually suggested more testing, the assessment effort in this study went well beyond any other educational study conducted before, or since” (p. 71). In all, 14 instruments were selected and administered at various times throughout a student’s participation in Follow Through. Three groups of students, known as Cohorts (i.e., Cohorts I, II, and III), were followed longitudinally from the time they entered Follow Through (e.g., Kindergarten or Grade 1) until they exited the program (Grade 3). While the Stebbins, et al. evaluation rates the instruments high in terms of reliability, some sponsors questioned the validity of the instruments in measuring the varied orientations of the models. Other critics (e.g., House, et al., 1978) have criticized the instruments as well. However, the evaluators believed that the instrument battery represented the “best compromise” given the range of models (Stebbins, et al., 1977, pp. 35, 43). Despite the relatively large number of students who participated in Follow Through, the evaluators imposed rigorous restrictions on the sample that was actually included in the statistical analysis. The comparison group (students from the community identified as not participating in Follow Through) was not subject to precisely the same restrictions as the control group, as long as they entered and exited school in the same districts and at the same time as Follow Through students.

Analytical methods
Due to the number of intervention sites and range of instruments, the analysis was complex and extensive. According to Watkins (1997, p. 32), there were over 2,000 comparisons between Follow Through and Non-Follow Through groups alone. In 1968, Stanford Research Institute (SRI) was awarded the contract for the Follow Through evaluation. However, due to a variety of factors (including, perhaps, SRI’s underestimation of the complexity involved in such a comprehensive analysis), Abt Associates, Inc. later inherited the evaluative duties in the summer of 1972. The summary of results, entitled Education as Experimentation: A Planned Variation Model (Stebbins, St. Pierre, Proper, Anderson, & Cerva), was published in 1977.

The empirical goal of the Follow Through evaluation was to determine which models were effective in raising student achievement in the three domains as evidenced by positive effects using the selected instruments. Within models, the evaluators compared performance on the various instruments between Follow Through (FT) and non-Follow Through (NFT) comparison groups at each site. Within groups, the evaluators averaged students’ scores on each measure (or outcome variable) in order to yield a “group” score. Thus, the group scores of FT students were compared to the group scores of NFT students. These scores were then adjusted using a statistical technique called analysis of covariance (ANCOVA; explained below). The difference between the FT and NFT students was then used to measure the effects of a given model (Watkins, 1997, pp. 32-33). Sites where models met the criterion for “educational effectiveness” were assigned a value of 1; negative effects were assigned -1; and null effects—“insignificant educationally, statistically, or both” (Wisler, et al., 1978, p. 176) —were assigned a zero.
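The site-level scoring described above can be sketched in a few lines of Python. This is only an illustration, not the evaluators' procedure: the function name `site_effect`, the quarter-standard-deviation threshold for "educational" significance, the 0.05 significance level, and the large-sample normal approximation are all assumptions made for the example, and the covariance adjustment is left out.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def site_effect(ft_scores, nft_scores, min_effect_sd=0.25, alpha=0.05):
    """Return +1, -1, or 0 for one site on one outcome measure.

    min_effect_sd and alpha are hypothetical cutoffs chosen for this
    sketch; the text above does not state the evaluators' criteria.
    """
    n_ft, n_nft = len(ft_scores), len(nft_scores)

    # Compare the FT "group" score with the NFT "group" score.
    diff = mean(ft_scores) - mean(nft_scores)

    # Pooled standard deviation, used to express the difference in SD units.
    pooled_var = (
        (n_ft - 1) * stdev(ft_scores) ** 2
        + (n_nft - 1) * stdev(nft_scores) ** 2
    ) / (n_ft + n_nft - 2)
    pooled_sd = sqrt(pooled_var)
    if pooled_sd == 0:
        return 0  # no variation at all, so no measurable effect

    # Assumed test: two-sided z-test using a normal approximation.
    std_err = pooled_sd * sqrt(1 / n_ft + 1 / n_nft)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / std_err))

    educationally_significant = abs(diff) / pooled_sd >= min_effect_sd
    statistically_significant = p_value < alpha

    # A null effect is "insignificant educationally, statistically, or both".
    if educationally_significant and statistically_significant:
        return 1 if diff > 0 else -1
    return 0
```

Summing the resulting +1, 0, and -1 values across a model's sites would give a rough tally of where that model helped, hurt, or made no detectable difference on a given measure.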

An important—and later controversial—statistical technique was employed by the evaluators in order to improve the integrity of the results. Because there were differences between treatment and comparison groups (e.g. the average score on an outcome measure for a NFT group might have been higher than the corresponding average score for a FT group), the evaluators employed a method known as Analysis of Covariance (ANCOVA) in order to adjust for these and other differences. According to Elmore (1977, pp. 329-330), adjusted results using the ANCOVA technique should be interpreted cautiously for two reasons. First, ANCOVA “is not a substitute for random assignment, but it has become a conventionally accepted technique for handling initial group differences in quasi-experimental data” (p. 329). Second, the larger the initial differences between treatment and control groups, the weaker the strength of the results (p. 329).
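To make the adjustment concrete, here is a minimal single-covariate sketch of the idea behind ANCOVA. It assumes one hypothetical covariate (a pretest score) and a common within-group slope; the function names and the data are invented for illustration, and the actual evaluation's covariates and model are not reproduced here.

```python
from statistics import mean


def within_group_sums(pretest, outcome):
    """Return (sum of cross-products, sum of squares) about the group means."""
    x_bar, y_bar = mean(pretest), mean(outcome)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pretest, outcome))
    sxx = sum((x - x_bar) ** 2 for x in pretest)
    return sxy, sxx


def ancova_adjusted_difference(ft_pre, ft_post, nft_pre, nft_post):
    """FT minus NFT outcome difference, adjusted for pretest differences."""
    sxy_ft, sxx_ft = within_group_sums(ft_pre, ft_post)
    sxy_nft, sxx_nft = within_group_sums(nft_pre, nft_post)

    # Pooled within-group slope of outcome on pretest (assumed equal in
    # both groups, which is the standard ANCOVA assumption).
    slope = (sxy_ft + sxy_nft) / (sxx_ft + sxx_nft)

    # Remove the part of the raw difference that the pretest gap predicts.
    raw_difference = mean(ft_post) - mean(nft_post)
    return raw_difference - slope * (mean(ft_pre) - mean(nft_pre))


# Invented data: the FT group starts lower on the pretest.
ft_pre, ft_post = [1, 2, 3], [7, 9, 11]
nft_pre, nft_post = [4, 5, 6], [8, 10, 12]
adjusted = ancova_adjusted_difference(ft_pre, ft_post, nft_pre, nft_post)
```

In this invented example the raw FT minus NFT difference is -1, but the adjusted difference is 5, because the FT group began three points lower on the pretest. The example also shows why large initial differences call for caution: the bigger the pretest gap, the more the adjusted result depends on the estimated slope rather than on the observed outcomes.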

Follow Through results
The results of Follow Through did not show how models that showed little or no effects could be improved. But they did show which models—even under the less than ideal conditions of the experiment—had some indications of success. The most notable critique of Follow Through (described in detail below) takes issue with the fact that the models which showed positive effects were largely basic skills models. Stebbins, et al. (1977, pp. xxiv-xxviii) reported the principal empirical findings of the experiment as follows:
 * “The effectiveness of each Follow Through model varied substantially from site group to site group” (p. xxiv)
 * “Models that emphasize basic skills succeed better than other models in helping children gain these skills” (p. xxv)
 * “Where models have put their primary emphasis elsewhere than on the basic skills, the children they served have tended to score lower on tests of these skills than they would have done without Follow Through” (p. xxvi)
 * “No type of model was notably more successful than the others in raising scores on cognitive conceptual skills” (p. xxvi)
 * “Models that emphasize basic skills produced better results on tests of self-concept than did other models” (p. xxvi)
 * “To the extent that Follow Through children have ‘caught up’ with their peers in arithmetic skills, they have tended to do it during the first two years of their involvement in the program” (p. xxvii)
 * “Follow Through has been relatively effective with the most disadvantaged children it has served” (p. xxviii)

Critiques
Wisler, et al. (1978), in their review of the Follow Through experience, said it was likely that no other educational data had been examined more extensively, excepting the landmark Equality of Educational Opportunity Survey (p. 177). At least three major reevaluations of the Follow Through data exist in the Follow Through literature: House, et al. (1978); Bereiter & Kurland (1981); and Kennedy (1981). All largely confirm the original statistical analysis conducted by Abt Associates. Generally, the consensus among most researchers is that structured models tended to perform better than unstructured ones (Evans, 1981, pp. 13-14), and that the Direct Instruction and Behavior Analysis models performed better on the instruments employed than did the other models (Rhine, 1981, p. 302; Wisler, et al., 1978, p. 180; Adams & Engelmann, 1996, p. 72). Most critiques of the Follow Through experiment have tended to focus on the operational and design problems that plagued the experiment (e.g., Elmore, 1977). In particular, these critiques note that there was more variation within a particular model than there was from model to model. This has largely been attributed to the difficulty of measuring the effectiveness of a particular implementation; the measures used were largely qualitative and anecdotal (Stebbins, et al., 1977). In some instances, sites were included in the analysis that had ceased to implement specific models, or the model sponsors had serious reservations about the way particular models were implemented (Engelmann, 1992; Adams & Engelmann, 1996).

The most vocal critique was the House, et al. (1978) reanalysis. The article, along with several rebuttals from the original evaluation team and other researchers, was published by the Harvard Educational Review in 1978. The authors were extremely dissatisfied with the evaluators’ pronouncement that the basic skills models outperformed the other models. They approach the critique on the assumption that basic skills are decidedly just that: basic. They imply that basic skills are only taught through “rote methods”, a phrase with a decidedly negative connotation (p. 137). Regarding the finding that “models that emphasize basic skills produced better results on tests of self-concept than did other models” (Stebbins, et al., 1977, p. xxvi), the authors question the efficacy of the self-esteem measures, implying, among other things, that young students cannot possibly have a concrete understanding of self-concept (pp. 138-139). While the article intended to review the operational design of the Follow Through evaluation, it instead appears to (1) refute the finding that cognitive-conceptual and affective-cognitive models were largely failures, and (2) unilaterally condemn the models that emphasize basic skills. The implication is that the goal of education should not be increased student achievement in solely basic skills, and that Follow Through would have been better employed to discover how measures of all three orientations could be made successful. Absent from the critique is the finding that, for third graders, only the Direct Instruction model demonstrated positive effects in all three domains, and that one of the remaining two models (Behavior Analysis; the other was the Parent Education model) that had positive effects in at least two domains was also a self-described “basic skills model” (Adams & Engelmann, 1996, p. 72).

Dissemination of results
In 1972, the OE created the Joint Dissemination Review Panel (JDRP) and the National Diffusion Network (NDN) to disseminate information about effective models to schools and districts nationwide (Watkins, 1997, p. 47; Rhine, 1981, p. 307). JDRP reviewed programs for effectiveness according to a mixture of empirical and holistic criteria. NDN was responsible for disseminating the results based on the recommendations of JDRP. Watkins (1997) criticizes the dissemination criteria for two reasons. First, the organizations identified programs for dissemination that were not part of the Follow Through experiment with no empirical validation. Second, JDRP and NDN endorsed programs that showed improvement in areas “such as self-concept, attitude, and mental or physical health (of students) …(or) if it has a positive impact on individuals other than students, for example if it results in improved instructional behavior of teachers” (p. 47), but did not raise students’ academic achievement. Thus, programs “that had been incapable of demonstrating improved academic performance in the Follow Through evaluation” were recommended for adoption by schools and districts. Watkins cites the former Commissioner of Education, Ernest Boyer, who wrote with dismay that “Since only one of the sponsors (Direct Instruction) was found to produce positive results more consistently than any of the others, it would be inappropriate and irresponsible to disseminate information on all of the models” (Watkins, 1997, p. 48).

Of course, it would have been ideal to have the kind of conclusiveness associated with laboratory experiments when we conduct social experiments in communities and schools. Andy B. Anderson (1975) wrote that “the idea of a controlled experiment has long been recognized as a goal worth pursuing in the social and behavioral sciences for the same obvious reason that made this mode of inquiry the predominant research strategy of the natural and physical sciences:  the controlled experiment permits the most unequivocal assessment of a variable’s influence on another variable” (p. 13). Particularly when experimentation is used as a tool for informing policy decisions (e.g., in recommending the efficacy of some instructional approaches with disadvantaged students over other, less effective interventions), the design should be of the highest degree of rigor possible. For a variety of reasons, Follow Through did not have the classic characteristics of a true experiment.

Operational and Design Issues
Lack of systematic selection of interventions and lack of specificity of treatment effects. Due to a variety of circumstances detailed earlier, the Follow Through programs were not systematically developed or selected according to any type of uniform criteria (Evans, 1981, pp. 6, 15). Given more time, sponsors may have been able to better identify the types of treatment effects that an observer might expect to occur under controlled conditions. More importantly, program sponsors might also have been required to show those specific facets of their interventions (e.g., particular pedagogical techniques) which would produce the intended effects. Despite these flaws, the sponsors agreed on being subject to the same evaluation instruments. Unfortunately, the instruments shed little light on what about the ineffective programs made them so unsuccessful. The converse is also true. Since structured programs tended to show better effects than the unstructured ones, efforts could certainly have been made to identify commonalities among the effective structured programs. With further funding, these shared characteristics could have informed the development of additional effective programs or made the ineffective approaches better. Unfortunately, funding was in fact reduced for those programs that were identified as successful in Follow Through, perhaps on the presumption that funding would be better diverted to investigating failed programs (Watkins, 1997). Programs that had no empirical validation at all were recommended for dissemination along with the successful models.

Lack of random assignment. Random assignment of subjects into treatment and control groups is the ideal method of attributing change in a sample to an intervention and not to some other effect (including the pre-existing capabilities of students, teachers, or school systems) (Evans, 1981, p. 15). However, for a variety of practical reasons, this procedure was not followed in Follow Through (Stebbins, et al., 1977, p. 11). Instead, sites were selected “opportunistically” (Watkins, 1997, p. 19), based on their readiness to participate in the evaluation and on their unique circumstances of need. As Stebbins, et al. (1977) point out, the treatment groups were often the neediest children. To randomly select some of the most disadvantaged children (many of whom had participated in Head Start prior to Follow Through) out of the evaluation would certainly have been perceived negatively by community members (p. 61). Stebbins, et al. (1977) also note that there were “considerable variations in the range of children served”; yet despite the presence of “many of the problems inherent in field social research…evaluations of these planned variations provides us with an opportunity to examine the educational strategies under real life conditions as opposed to contrived and tightly controlled laboratory conditions” (pp. 12-13).

Narrowness of instruments. Adams and Engelmann (1996, p. 71) note that many critics have suggested that more instruments should have been used in the Follow Through evaluation. Egbert (1981, p. 7) agrees with Adams and Engelmann (1996) that the data collection efforts were extensive. Although model sponsors agreed on a uniform set of instruments to evaluate the effectiveness of their models, they believed their programs achieved gains on more intrinsic, less measurable indicators of performance, such as increased self-worth or greater parental involvement. To the extent that these desired outcomes occurred, and benefited the lives of students in ways that might never be measurable through quantitative means, those aspects of many models were successful. Both the House, et al. critique (1978) and others (cited in Wisler, et al., 1978) express concerns about the inadequacy of the instruments used to measure self-esteem in the Follow Through evaluation (i.e., the Intellectual Achievement Responsibility Scale (IARS) and the Coopersmith Self-Esteem Inventory). But it was better, according to many researchers, to measure outcomes imperfectly than not to measure them at all (Wisler, et al., 1978, p. 173). Thus, while “perfect” measures of desired outcomes might never exist, one should not let the perfect be the enemy of the good; otherwise, one could call into question the value of conducting any experiment at all on the basis that some bias or imperfection exists.

Political and Philosophical Issues
Was Follow Through a social or scientific program? An inevitable conflict exists when one attempts to operationalize a federal program in education that possesses both service delivery and research and development objectives (Egbert, 1981, pp. 8-9). Rivlin, et al. (1975) point out that “the byzantine complexity of the public policymaking process makes the conduct of social experiments extremely difficult” (p. 24). Given the reduction in funding, the decision to evaluate the effectiveness of various interventions in an empirical experiment appears appropriate and straightforward. However, if the change is not reflected in Congressional legislation or communicated clearly at the local level, problems of implementation and conflicts with deeply held values inevitably result (Rivlin, et al., 1975, pp. 24-25; Watkins, 1997, pp. 13-15). Considerable evidence indicates confusion about the intent of the Follow Through evaluation at the administrative level (Maccoby & Zellner, 1970, p. 4; Elmore, 1977, pp. 182, 255; Egbert, 1981, pp. 4-5; Evans, 1981, pp. 5-6; House, 1981, pp. 14-15).

Issues of local control. The planned variation aspect of Follow Through was thought to be beneficial, and perhaps superior, compared with other forms of experimentation (e.g., selection of sites based on randomized assignment) because it would give local communities and schools an element of ownership integral to the successful implementation of the models (Watkins, 1997, p. 16; Elmore, 1977, pp. 190-191). Despite the planned variation design, local communities in many sites were nevertheless deeply critical of the program. In some ways, criticism of Follow Through had proceeded directly from Head Start. Ostensibly, the social service purpose and goals of the Head Start program were clearer than those of the Follow Through evaluation. Nevertheless, community leaders had felt that Head Start did not give enough decision-making responsibility to parents and community members (Egbert, 1981, pp. 1-3). Local interests wanted to make curricular decisions, including changing facets of some program models (Watkins, 1997, p. 25). Evans (1981, p. 16) cautioned that “educational communities and contexts vary”, which can have a direct effect on the implementation of a model. More problematic, however, are the assertions of Elmore (1977, p. 381) and Hill (1981, p. 16) that the Follow Through models interfered with local teaching methods and practices. As Elmore (1977) writes, “for Follow Through, the problem was how to implement program variations in a system where most day-to-day decisions about program content are made at the school or classroom level” (p. 381). Rhine, et al. (1981) suggest that it is difficult to get teachers to modify their behavior. And even if the objective of changing behavior is achieved, teachers feel little ownership of the model, which makes it a decidedly dubious investment. What inevitably seems to happen is that some teachers reject programs outright, while others “surrender to the program” (p. 62).

The “fact-value dichotomy.” Ernest House, co-author of the 1978 critique of the Follow Through evaluation, penned an article about what he calls the “fact-value dichotomy” in social experimentation and educational research: “the belief that facts refer to one thing and values refer to something totally different” (2001, pp. 312-313). House elucidates the writings of Donald Campbell, a researcher in the field of evaluation. According to Campbell, facts cannot exist outside the framework of one’s values because inevitably, an investigation that uncovers a certain fact is either consistent with the researcher’s internal values or against them. What results is a difficult choice: the researcher must either reject the fact, or modify his or her value to accommodate the fact. Campbell also believed, according to House, that values, as opposed to facts, could be chosen rationally. House agrees with Campbell’s assertion in part, but departs from Campbell in that he believes that facts and values cannot exist in isolation; rather, they “blend together in the conclusions of evaluation studies, and, indeed, blend together throughout evaluation studies” (p. 313). House suggests that the reader envision facts and values as existing on a continuum from “brute facts” to “bare values.” Accordingly, rarely do “fact claims” or “value claims” fall entirely at one end of the spectrum or the other. House provides examples: “Diamonds are harder than steel” might fall at the left of the spectrum, while “Cabernet is better than Chardonnay” falls to the right (p. 313). In conclusion, House proposes an entirely new method of empirical investigation called “deliberative democratic evaluation.” In it, evaluators arrive at “unbiased claims” through “inclusion of all relevant stakeholder perspectives, values, and interests in the study; extensive dialogue between the evaluator and stakeholders…and extensive deliberation to reach valid conclusions in the study” (p. 314). House decries the use of entirely rational methods when applied to evaluations; indeed, he recommends a degree of subjectivity, because evaluations like Follow Through cannot exist outside deeply held values (House, 1981, pp. 10, 20).

Hill (1981) writes: “There is seldom anyone at the local level whose commitment to an externally-imposed curricular innovation, planning process, or financial management scheme springs spontaneously from deeply held personal values” (p. 12). House argues that all decision-making that stems from evaluations in education should be the result of compromise. Watkins (1997, p. 60) argues that Follow Through resulted in a clash over values based on different beliefs about how children learn, which can be boiled down to “natural growth” or “unfolding” theories versus theories of “changing behavior.” Watkins asserts that most education experts today do not judge programs by their relative effectiveness with different student populations, but rather by their “congruence with prevailing philosophies of education” (p. 61).