Criteria for the assessment of clinical studies

A number of criteria have been proposed to aid in the evaluation of clinical studies. These criteria are used to designate interventions as "evidence-based practice" (EBP) or to assign them to a lesser category.

The criteria for "evidence-based practice" originally suggested by Chambless and Hollon on behalf of a task force of the American Psychological Association were very stringent. In order for a treatment to qualify as evidence-based, two studies using randomized designs had to be reported in a peer-reviewed journal, with one being a replication by an independent researcher.

Since that time, the definition of the term "evidence-based" has become looser. Some less challenging criteria have been suggested as a means of assessing the evidentiary foundation of mental health interventions, and some authors have downplayed the role of rigorous research in the evaluation of practice, stating that the EBP category must include the use of practice wisdom and family values. As other authors have noted, the present definition of "evidence-based practice" is a matter of considerable disagreement among social services practitioners [citation: Glasby, J., Walshe, K., & Harvey, G. (2007). What counts as 'evidence' in 'evidence-based practice'? Evidence & Policy, 3(3), 325-327; Sempik, J., Becker, S., & Bryman, A. (2007). The quality of research evidence in social policy: Consensus and dissension among researchers. Evidence & Policy, 3(3), 407-423.]

Summaries of proposed evaluative methods
It is not realistic to expect that all clinical studies will comply with the demanding criteria established by Chambless and Hollon. The criterion of randomization may be very difficult to meet. However, the goal of randomization is to control confounding variables, and other methods may achieve a measure of success in this. For this reason, the contribution of nonrandomized designs to outcome research has been recognized. A number of further sets of criteria have been proposed. In an effort to consider different levels of acceptability for research evidence, some authors have proposed taxonomies in which categories of research design were equated with the evidentiary power of the studies. Other writers have simply considered whether a research report did or did not meet standards.

Some of these (e.g., Saunders et al.) have received severe criticism. None of these methods appears to be perfect for the evaluation of mental health treatments, yet all have something to offer.

The Kaufman Best Practices Project
The Kaufman Best Practices Project can be summarized as follows: at least one randomized design is required for a treatment to be eligible for consideration, and nonrandomized designs are not considered acceptable. If multiple outcome studies have been undertaken, the overall weight of evidence must indicate efficacy. The full criteria are as follows:

"TO BE CONSIDERED A CANDIDATE FOR BEST PRACTICE, A TREATMENT PROTOCOL HAD TO MEET THE FOLLOWING CRITERIA CONCERNING ITS CLINICAL UTILITY: 1. The treatment has a sound theoretical basis in generally accepted psychological principles indicating that it would be effective in treating at least some problems known to be outcomes of child abuse.

2. The treatment is generally accepted in clinical practice as appropriate for use with abused children, their parents, and/or their families. Candidate practices were sought that met several minimum criteria.

3. A substantial clinical-anecdotal literature exists indicating the treatment’s value with abused children, their parents, and/or their families from a variety of cultural and ethnic backgrounds.

4. There is no clinical or empirical evidence, or theoretical basis indicating that the treatment constitutes a substantial risk of harm to those receiving it, compared to its likely benefits.

5. The treatment has at least one randomized, controlled treatment outcome study indicating its efficacy with abused children and/or their families.

6. If multiple treatment outcome studies have been conducted, the overall weight of evidence supports the efficacy of the treatment.

IN ADDITION TO THESE CRITERIA SUPPORTING THE EFFICACY OF THE TREATMENT, THE FOLLOWING CRITERIA ABOUT ITS TRANSPORTABILITY TO COMMON CLINICAL SETTINGS HAD TO BE MET:

7. The treatment has a book, manual, or other writings available to clinical professionals that specifies the components of the treatment protocol and describes how to conduct it.

8. The treatment can be delivered in common service delivery settings serving abused children and their families with a reasonable degree of treatment fidelity.

9. The treatment can be delivered by typical mental health professionals who have received a reasonable level of training and supervision in its use. Once these candidates were identified, the project focused on reviewing the support for each one, understanding their weaknesses in clinical or empirical support, and building consensus among the participants."

The nature of comparison groups is not specified, nor does this approach discuss confounding variables or the reliability and validity of outcome measures. Intervention fidelity (generally, the existence of a treatment manual) is a factor. The approach does not consider statistical methods, the question of missing data or attrition, or intention-to-treat analysis, nor does it examine blinding issues. The Kaufman approach does examine the theoretical background of an intervention, consider evidence of harm, and examine the appropriateness of a treatment for the setting and practitioners for which it is proposed.

Saunders, Berliner and Hanson
The approach suggested by Saunders, Berliner, and Hanson is a taxonomic approach. In order to be assessed under this method, an intervention must have a manual available. The only category that requires randomized clinical trials is category 1. This approach does not examine the nature of comparison groups, treatment of confounding variables, the reliability and validity of outcome measures, or whether blinding methods were used. Neither does the method consider the choice of statistical test, missing data or attrition, intention-to-treat analysis, or appropriateness of a treatment for the setting and practitioners involved. Theoretical background and evidence of harm are considered. This system uses the following categories:


 * Category 1: Well-supported, efficacious treatment.


 * Category 2: Supported and probably efficacious


 * Category 3: Supported and acceptable treatment


 * Category 4: Promising and acceptable treatments


 * Category 5: Novel and experimental treatments


 * Category 6: Concerning treatment

The full criteria are as follows:

Category 1: Well-supported, efficacious treatment
1. The treatment has a sound theoretical basis in generally accepted psychological principles.
2. A substantial clinical, anecdotal literature exists indicating the treatment’s efficacy with at-risk children and foster children.
3. The treatment is generally accepted in clinical practice for at-risk children and foster children.
4. There is no clinical or empirical evidence or theoretical basis indicating that the treatment constitutes a substantial risk of harm to those receiving it, compared to its likely benefits.
5. The treatment has a manual that clearly specifies the components and administration characteristics of the treatment that allows for replication.
6. At least two randomized, controlled outcome studies have demonstrated the treatment’s efficacy with at-risk children and foster children. This means the treatment was demonstrated to be better than placebo or no different or better than an already established treatment.
7. If multiple outcome studies have been conducted, the large majority of outcome studies support the efficacy of the treatment.

Category 2: Supported and probably efficacious
1. The treatment has a sound theoretical basis in generally accepted psychological principles.
2. A substantial clinical, anecdotal literature exists indicating the treatment’s efficacy with at-risk children and foster children.
3. The treatment is generally accepted in clinical practice for at-risk children and foster children.
4. There is no clinical or empirical evidence or theoretical basis indicating that the treatment constitutes a substantial risk of harm to those receiving it, compared to its likely benefits.
5. The treatment has a manual that clearly specifies the components and administration characteristics of the treatment that allows for implementation.
6. At least two studies utilizing some form of control without randomization (e.g., wait list, untreated group, placebo group) have established the treatment’s efficacy over the passage of time, efficacy over placebo, or found it to be comparable to or better than an already established treatment.
7. If multiple treatment outcome studies have been conducted, the overall weight of evidence supported the efficacy of the treatment.

Category 3: Supported and acceptable treatment
1. The treatment has a sound theoretical basis in generally accepted psychological principles.
2. A substantial clinical, anecdotal literature exists indicating the treatment’s efficacy with at-risk children and foster children.
3. The treatment is generally accepted in clinical practice for at-risk children and foster children.
4. There is no clinical or empirical evidence or theoretical basis indicating that the treatment constitutes a substantial risk of harm to those receiving it, compared to its likely benefits.
5. The treatment has a manual that clearly specifies the components and administration characteristics of the treatment that allows for replication.
6a. At least one group study (controlled or uncontrolled), or a series of single-subject studies, has demonstrated the efficacy of the treatment with at-risk children and foster children; or
6b. A treatment that has demonstrated efficacy with other populations has a sound theoretical basis for use with at-risk children and foster children, but has not been tested or used extensively with these populations.
7. If multiple treatment outcome studies have been conducted, the overall weight of evidence supported the efficacy of the treatment.

This system was specifically criticised by Gambrill (2006) in her critique of evidence-based systems, who stated: "It would be hard to create a hierarchy more likely to hide ineffective or harmful practices". She compared it with Gray's (2001) hierarchy.

Gray's Hierarchy
1. Intervention programs that have been critically tested and found to help clients.

2. Intervention programs that have not been critically tested and are not in a good experimental trial.

3. Intervention programs that have been critically tested and shown to harm clients.

4. Intervention programs of unknown effectiveness that are in a rigorous experimental trial.

Khan et al.
The standards set by Khan et al. differed in their requirements from the previous approaches. To summarize, this group preferred randomized designs but accepted nonrandomized approaches as well. They required that comparison groups be specified and that confounding variables be taken into consideration. Blinding measures were emphasized. Missing data and attrition were to be discussed, and an intention-to-treat analysis was to be done. However, no evidence about outcome measures was required, no manual was required, and there was no consideration of theoretical background, evidence of harm, or the appropriateness of the treatment for the setting and practitioners.

U.S. National Registry of Evidence-Based Practices and Programs criteria
A fourth approach is used by the U.S. National Registry of Evidence-Based Practices and Programs. This evaluative method prefers randomized designs but accepts nonrandomized studies. The criteria include the specification of comparison groups, the consideration of confounding variables, evidence for the validity and reliability of outcome measures, appropriateness of statistical measures, consideration of missing data and attrition, and the existence of a treatment manual. However, the NREPP standards do not require an intention-to-treat analysis, blinding designs, an examination of theoretical background, or consideration of evidence of harm or the appropriateness of a treatment for a setting or for practitioners.

U.S. Preventive Services Task Force
Systems to stratify evidence by quality have been developed, such as this one by the U.S. Preventive Services Task Force:
 * Level I: Evidence obtained from at least one properly designed randomized controlled trial.
 * Level II-1: Evidence obtained from well-designed controlled trials without randomization.
 * Level II-2: Evidence obtained from well-designed cohort or case-control analytic studies, preferably from more than one center or research group.
 * Level II-3: Evidence obtained from multiple time series with or without the intervention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence.
 * Level III: Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees.

The UK National Health Service uses a similar system with categories labelled A, B, C, and D.
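A stratification like the USPSTF's amounts to a lookup from study design to evidence level. The sketch below is a hypothetical illustration of that mapping only; the function name and the simplified design labels are assumptions for this example, not part of any official USPSTF tool:

```python
# Hypothetical sketch: mapping a (simplified) study-design label to a
# USPSTF-style evidence level. The labels and function are illustrative
# assumptions, not an official implementation.

def uspstf_level(design: str) -> str:
    """Return the USPSTF evidence level for a simplified design label."""
    if design == "randomized controlled trial":
        return "I"        # at least one properly designed RCT
    if design == "controlled trial":
        return "II-1"     # well-designed controlled trial, no randomization
    if design in ("cohort", "case-control"):
        return "II-2"     # well-designed analytic observational study
    if design == "time series":
        return "II-3"     # multiple time series, with or without intervention
    if design == "expert opinion":
        return "III"      # opinions of respected authorities
    raise ValueError(f"unrecognized design: {design}")

print(uspstf_level("randomized controlled trial"))  # I
print(uspstf_level("case-control"))                 # II-2
```

In practice, of course, level assignment also depends on study quality ("well-designed", "properly designed"), which no simple lookup can capture.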

Categories of recommendations
In guidelines and other publications, recommendations are classified according to the level of evidence on which they are based. The U.S. Preventive Service Task Force uses:
 * Level A: Recommendations are based on good and consistent scientific evidence.
 * Level B: Recommendations are based on limited or inconsistent scientific evidence.
 * Level C: Recommendations are based primarily on consensus and expert opinion.

This is a deliberate improvement over older styles of recommendation, in which it was less clear which parts of a guideline were most firmly established.

Oxford Centre for Evidence-based Medicine
The Oxford Centre for Evidence-based Medicine uses these "grades of recommendation", according to the study designs and critical appraisal of prevention, diagnosis, prognosis, therapy, and harm studies:


 * Level A: consistent Randomised Controlled Clinical Trial, Cohort Study, All or None, Clinical Decision Rule validated in different populations.


 * Level B: consistent Retrospective Cohort, Exploratory Cohort, Ecological Study, Outcomes Research, Case-Control Study; or extrapolations from level A studies.


 * Level C: Case-series Study or extrapolations from level B studies


 * Level D: Expert opinion without explicit critical appraisal, or based on physiology, bench research or first principles 

Levels of evidence from the Centre for Evidence-Based Medicine, Oxford, are listed below. For the most up-to-date levels of evidence, see the Centre's own pages.


 * Therapy/Prevention/Etiology/Harm:
 * 1a:	Systematic reviews (with homogeneity) of randomized controlled trials
 * 1a-:	Systematic review of randomized trials displaying worrisome heterogeneity
 * 1b:	Individual randomized controlled trials (with narrow confidence interval)
 * 1b-:	Individual randomized controlled trials (with a wide confidence interval)
 * 1c:	All or none randomized controlled trials
 * 2a:	Systematic reviews (with homogeneity) of cohort studies
 * 2a-:	Systematic reviews of cohort studies displaying worrisome heterogeneity
 * 2b:	Individual cohort study or low quality randomized controlled trials (<80% follow-up)
 * 2b-:	Individual cohort study or low quality randomized controlled trials (<80% follow-up / wide confidence interval)
 * 2c:	'Outcomes' Research; ecological studies
 * 3a:	Systematic review (with homogeneity) of case-control studies
 * 3a-:	Systematic review of case-control studies with worrisome heterogeneity
 * 3b:	Individual case-control study
 * 4:	Case-series (and poor quality cohort and case-control studies)
 * 5:	Expert opinion without explicit critical appraisal, or based on physiology, bench research or 'first principles'


 * Diagnosis:
 * 1a:	Systematic review (with homogeneity) of Level 1 diagnostic studies; or a clinical rule validated on a test set.
 * 1a-:	Systematic review of Level 1 diagnostic studies displaying worrisome heterogeneity
 * 1b:	Independent blind comparison of an appropriate spectrum of consecutive patients, all of whom have undergone both the diagnostic test and the reference standard; or a clinical decision rule not validated on a second set of patients
 * 1c:	Absolute SpPins And SnNouts (An Absolute SpPin is a diagnostic finding whose Specificity is so high that a Positive result rules-in the diagnosis. An Absolute SnNout is a diagnostic finding whose Sensitivity is so high that a Negative result rules-out the diagnosis).
 * 2a:	Systematic review (with homogeneity) of Level >2 diagnostic studies
 * 2a-:	Systematic review of Level >2 diagnostic studies displaying worrisome heterogeneity
 * 2b:	Any of: 1) independent blind or objective comparison; 2) study performed in a set of non-consecutive patients, or confined to a narrow spectrum of study individuals (or both), all of whom have undergone both the diagnostic test and the reference standard; 3) a diagnostic clinical rule not validated in a test set.
 * 3a:	Systematic review (with homogeneity) of case-control studies
 * 3a-:	Systematic review of case-control studies displaying worrisome heterogeneity
 * 4:	Any of: 1) reference standard was unobjective, unblinded or not independent; 2) positive and negative tests were verified using separate reference standards; 3) study was performed in an inappropriate spectrum of patients.
 * 5:	Expert opinion without explicit critical appraisal, or based on physiology, bench research or 'first principles'
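The SpPin and SnNout concepts in the diagnosis levels above rest on simple arithmetic over a 2x2 table of test results. The worked example below uses invented counts purely for illustration:

```python
# Worked example of the arithmetic behind SpPin / SnNout, using an
# invented 2x2 table of diagnostic test results (counts are illustrative).
tp, fp = 90, 2      # true positives, false positives
fn, tn = 10, 198    # false negatives, true negatives

# Sensitivity: proportion of diseased patients who test positive.
sensitivity = tp / (tp + fn)    # 90 / 100 = 0.90

# Specificity: proportion of disease-free patients who test negative.
specificity = tn / (tn + fp)    # 198 / 200 = 0.99

# SpPin: when specificity is this high, false positives are rare, so a
# positive result effectively rules the diagnosis IN. SnNout is the
# mirror image: a sensitivity near 1 means false negatives are rare,
# so a negative result rules the diagnosis OUT.
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Here the test would qualify as an approximate SpPin (specificity 0.99) but not as an SnNout, since its sensitivity of 0.90 still leaves 10% of diseased patients with a falsely negative result.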


 * Prognosis:
 * 1a:	Systematic review (with homogeneity) of inception cohort studies; or a clinical rule validated on a test set.
 * 1a-:	Systematic review of inception cohort studies displaying worrisome heterogeneity
 * 1b:	Individual inception cohort study with > 80% follow-up; or a clinical rule not validated on a second set of patients
 * 1c:	All or none case-series
 * 2a:	Systematic review (with homogeneity) of either retrospective cohort studies or untreated control groups in RCTs.
 * 2a-:	Systematic review of either retrospective cohort studies or untreated control groups in RCTs displaying worrisome heterogeneity
 * 2b:	Retrospective cohort study or follow-up of untreated control patients in an RCT; or clinical rule not validated in a test set.
 * 2c:	'Outcomes' research
 * 4:	Case-series (and poor quality prognostic cohort studies)
 * 5:	Expert opinion without explicit critical appraisal, or based on physiology, bench research or 'first principles'


 * Key to interpretation of practice guidelines
 * Agency for Healthcare Research and Quality:
 * A:	There is good research-based evidence to support the recommendation.
 * B:	There is fair research-based evidence to support the recommendation.
 * C:	The recommendation is based on expert opinion and panel consensus.
 * X:	There is evidence of harm from this intervention.


 * USPSTF Guide to Clinical Preventive Services:
 * A:	There is good evidence to support the recommendation that the condition be specifically considered in a periodic health examination.
 * B:	There is fair evidence to support the recommendation that the condition be specifically considered in a periodic health examination.
 * C:	There is insufficient evidence to recommend for or against the inclusion of the condition in a periodic health examination, but recommendations may be made on other grounds.
 * D:	There is fair evidence to support the recommendation that the condition be excluded from consideration in a periodic health examination.
 * E:	There is good evidence to support the recommendation that the condition be excluded from consideration in a periodic health examination.


 * University of Michigan Practice Guideline:
 * A:	Randomized controlled trials.
 * B:	Controlled trials, no randomization.
 * C:	Observational trials.
 * D:	Opinion of the expert panel.


 * Other guidelines:
 * A:	There is good research-based evidence to support the recommendation.
 * B:	There is fair research-based evidence to support the recommendation.
 * C:	The recommendation is based on expert opinion and panel consensus.
 * X:	There is evidence that the intervention is harmful.

"Extrapolations" are where data are used in a situation that has potentially clinically important differences from the original study situation. Other terms are explained elsewhere in the Centre’s pages.

Reporting issues
Evaluation of research on the basis of a published report can only be done if sufficient information is included in the report. Guidelines for the reporting of randomized studies have been suggested by Moher, and Des Jarlais and colleagues have outlined criteria for reports of nonrandomized studies.

Criticism of evidence-based approaches
Critics of evidence-based approaches maintain that good evidence is often deficient in many areas, that lack of evidence and lack of benefit are not the same, and that evidence-based medicine applies to populations, not necessarily to individuals. In The limits of evidence-based medicine, Tonelli argues that "the knowledge gained from clinical research does not directly answer the primary clinical question of what is best for the patient at hand." Tonelli suggests that proponents of evidence-based medicine discount the value of clinical experience.

Although evidence-based practice is quickly becoming the "gold standard" for clinical practice and treatment guidelines, there are a number of reasons why most current medical, psychological, and surgical practices do not have a strong literature base supporting them. First, in some cases conducting randomized controlled trials would be unethical (as in open-heart surgery), although observational studies are designed to address these problems to some degree. Second, certain groups (women, racial minorities, people with many co-morbid diseases) have historically been under-researched, and thus the literature is very sparse in areas that do not allow for generalizability. Third, the types of trials considered the 'gold standard' (i.e., randomized double-blind placebo-controlled trials) are very expensive, and thus funding sources play a role in what gets investigated. For example, the government funds a large number of preventive-medicine studies that endeavor to improve public health as a whole, while pharmaceutical companies fund studies intended to demonstrate the efficacy and safety of particular drugs. Fourth, the studies published in medical journals may not be representative of all the studies completed on a given topic (published and unpublished) or may be misleading due to conflicts of interest (i.e., publication bias). Thus the array of evidence available on particular therapies may not be well represented in the literature. Fifth, there is an enormous range in the quality of studies performed, making it difficult to generalize about the results.

Large randomized controlled trials are extraordinarily useful for examining discrete interventions for carefully defined conditions. The more complex the patient population, the conditions and diagnoses, and the intervention, the more difficult it is to separate the treatment effect from random variation. Because of this, a number of studies fail to obtain significant results, either because there is insufficient power to show a difference or because the groups are not well enough 'controlled'. Evidence-based medicine has been most practised when the intervention tested is a drug. Applying its methods to other forms of treatment may be harder, particularly those requiring the active participation of the patient, because blinding is more difficult.

In managed healthcare systems, evidence-based guidelines have been used as a basis for denying insurance coverage for some treatments, some of which are held by the physicians involved to be effective but for which randomized controlled trials have not yet been published.