Test reliability

Test reliability is established as part of the psychometric procedure of test standardization. A test is said to be reliable when it can be shown to produce consistent results under varying conditions.


 * Interrater reliability
 * Split-half reliability
 * Test-retest reliability

Classical test theory
Note that reliability is not, as is often suggested in textbooks, a fixed property of tests. It is a property of test scores that is relative to a particular population and is estimated from a particular sample. Test scores will not be equally reliable in every population or even in every sample. For instance, as is the case for any correlation, the reliability of test scores is lowered by restriction of range. Thus, IQ-test scores that are highly reliable in the general population will be less reliable in a population of college students and even less reliable in a sample of sophomores.

Also note that test scores are perfectly unreliable for any given individual $$i$$. As has been noted above, the true score is a constant at the level of the individual, which implies that it has zero variance, so that the ratio of true score variance to observed score variance, and hence reliability, is zero. In the classical test theory model, all observed variability in $$i$$'s scores is random error by definition (see Eq. 2). Classical test theory is therefore relevant only at the level of populations and samples, not at the level of individuals.
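The effect of restriction of range can be illustrated with a small simulation. This is only a sketch under stated assumptions: true scores are drawn from a normal distribution with mean 100 and standard deviation 15, errors are independent normal noise with standard deviation 5, and the restricted group is selected directly on the true score, which is a simplification. The function name `reliability` and all numeric values are hypothetical choices made for illustration.

```python
import random
from statistics import pvariance


def reliability(true_scores, observed_scores):
    """Ratio of true score variance to observed score variance."""
    return pvariance(true_scores) / pvariance(observed_scores)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Assumed model: observed score = true score + independent random error.
true_scores = [rng.gauss(100, 15) for _ in range(n)]
observed = [t + rng.gauss(0, 5) for t in true_scores]

# Reliability in the full simulated population.
full = reliability(true_scores, observed)

# Restrict the range: keep only individuals with a true score above 115.
pairs = [(t, x) for t, x in zip(true_scores, observed) if t > 115]
restricted = reliability([t for t, _ in pairs], [x for _, x in pairs])

print(f"general population: {full:.2f}")
print(f"restricted range:   {restricted:.2f}")
```

By construction the error variance is the same in both groups, so the drop in the restricted group comes entirely from the smaller true score variance.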

Reliability cannot be estimated directly, since that would require one to know the true scores, which according to classical test theory is impossible. However, estimates of reliability can be obtained by various means. One way of estimating reliability is by constructing a so-called parallel test. The fundamental property of a parallel test is that it yields the same true score and the same observed score variance as the original test for every individual. If we have parallel tests $$x$$ and $$x'$$, then, writing $$T_{i}$$ and $$X_{i}$$ for the true and observed scores of individual $$i$$ on $$x$$ and $$T'_{i}$$ and $$X'_{i}$$ for those on $$x'$$, this means that

$$T_{i} = T'_{i}$$

and

$$\sigma^{2}_{X_{i}} = \sigma^{2}_{X'_{i}}$$

Under these assumptions, it follows that the correlation between parallel test scores is equal to reliability (see Lord & Novick, 1968, Ch. 2, for a proof).
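The step from the parallel test assumptions to this result can be sketched as follows. The sketch assumes the usual classical decomposition of an observed score into a true score plus error, with errors uncorrelated with true scores and with each other. The symbols $$X$$, $$X'$$, $$T$$, $$E$$ and $$E'$$ are introduced here for illustration and may differ from the notation used elsewhere.

$$X = T + E, \qquad X' = T + E'$$

$$\sigma_{XX'} = \sigma_{TT} + \sigma_{TE'} + \sigma_{ET} + \sigma_{EE'} = \sigma^{2}_{T}$$

$$\rho_{XX'} = \frac{\sigma_{XX'}}{\sigma_{X}\sigma_{X'}} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{X}}$$

The last equality uses the equal observed score variances of parallel tests, so that $$\sigma_{X}\sigma_{X'} = \sigma^{2}_{X}$$. The right-hand side is the ratio of true score variance to observed score variance, that is, reliability.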

Using parallel tests to estimate reliability is cumbersome because parallel tests are very hard to come by. In practice the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's $${\alpha}$$. Consider a test consisting of $$k$$ items $$u_{j}$$, $$j=1,\ldots,k$$. The total test score is defined as the sum of the individual item scores, so that for individual $$i$$

$$X_{i} = \sum_{j=1}^{k} u_{ij}$$

Then Cronbach's alpha equals

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k}\sigma^{2}_{u_{j}}}{\sigma^{2}_{X}}\right)$$

Cronbach's $${\alpha}$$ can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, the reliability of test scores in a population is at least as high as the value of Cronbach's $${\alpha}$$ in that population. Because $${\alpha}$$ can be computed from a single administration of a test, this method is empirically feasible and, as a result, it is very popular among researchers.
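As an illustration of how Cronbach's $${\alpha}$$ is computed in practice, the following sketch applies the standard formula, $$k/(k-1)$$ times one minus the ratio of summed item variances to total score variance, to a small table of item scores. The function name `cronbach_alpha` and the example data are hypothetical. Population variances are used throughout; using sample variances consistently would give the same value, since the correction factor cancels in the ratio.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of scores.

    item_scores: list of rows, one row per individual,
    each row holding that individual's k item scores.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across individuals.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of the total score (sum of item scores) across individuals.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores for five individuals on three items.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
    [3, 3, 4],
]
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.3f}")
```

For these made-up data the items rise and fall together almost perfectly, so the resulting value is high, about .96.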

As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability. Reliability is supposed to say something about the general quality of the test scores in question. The general idea is that the higher reliability is, the better. Classical test theory does not say how high reliability is supposed to be. Too high a value for $${\alpha}$$, say over .9, is taken to indicate redundancy of items, and a value around .8 is commonly recommended for research. It must be noted that these 'criteria' are not based on formal arguments but are the result of convention. Whether they make any sense or not is unclear.