Item response theory

Item response theory (IRT) is a body of related psychometric theory that provides a foundation for scaling persons and items based on responses to assessment items. The central feature of IRT models is that they relate item responses to characteristics of individual persons and assessment items. Expressed in somewhat more technical terms, IRT models are functions relating person and item parameters to the probability of a discrete outcome, such as a correct response to an item. Among other things, as a body of theory, IRT provides a basis for estimating parameters, ascertaining how well data fits a model, and investigating the psychometric properties of assessments. Psychometricians apply IRT in order to achieve tasks such as developing and refining exams, maintaining banks of items for exams, and equating for the difficulties of successive versions of exams (for example, to allow comparisons between results over time).

IRT is often referred to as latent trait theory, strong true score theory, or modern mental test theory. The term latent is used to emphasise that discrete item responses are taken to be observable manifestations of a trait or attribute whose existence is hypothesized and must be inferred from the manifest responses. The other major body of psychometric theory of relevance to IRT is classical test theory. For tasks that can be accomplished using classical test theory, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical test theory.

Overview
IRT models are used as a basis for statistical estimation of parameters that represent the 'locations' of persons and items on a latent continuum or, more correctly, the magnitude of the latent trait attributable to the persons and items. For example, in attainment testing, estimates may be of the magnitude of a person's ability within a specific domain, such as reading comprehension. Once estimates of relevant parameters have been obtained, statistical tests are usually conducted to gauge the extent to which the parameters predict item responses given the model used. Stated somewhat differently, such tests are used to ascertain the degree to which the model and parameter estimates can account for the structure of and statistical patterns within the response data, either as a whole, or by considering specific subsets of the data such as response vectors pertaining to individual items or persons. This approach permits the central hypothesis represented by a particular model to be subjected to empirical testing, as well as providing information about the psychometric properties of a given assessment, and therefore also the quality of estimates.

From the perspective of more traditional approaches, such as classical test theory, an advantage of IRT is that it potentially provides information that enables a researcher to improve the reliability of an assessment. This is achieved through the extraction of more sophisticated information regarding the psychometric properties of individual assessment items. IRT is sometimes referred to using the word strong, as in strong true score theory, or modern, as in modern mental test theory, because IRT is a more recent body of theory and makes more explicit the hypotheses that are implicit within classical test theory.

Although the Rasch model for dichotomous data is often regarded as a particular IRT model, it has a formal property that distinguishes it from other IRT models. This property arises from the specification of uniform discrimination for all interactions between persons and items in the model. Relatedly, although models such as the 2 Parameter Logistic Model (2PLM) are often considered 'generalisations' of the Rasch model, this is not actually the case. The reason is that the 2PLM contains a discrimination parameter for each item, which inherently implies the hypothesis that discrimination is attributable to characteristics of items alone. In contrast, the structure of the Rasch model implies that discrimination arises as a consequence of interactions between persons and items (discrimination is required to be uniform across interactions between a particular class of persons and a particular class of items in an experimental context). Estimation problems that arise in the application of models such as the 2PLM do not arise in applications of the Rasch model (Wright, 1992). Thus, although the Rasch model is referred to below, this distinction should be kept in mind.

IRT models
Much of the literature on IRT centers around item response models. A given model constitutes a mathematized hypothesis that the probability of a discrete response to an item is a function of a person parameter (or, in the case of multidimensional item response theory, a vector of person parameters) and one or more item parameters. For example, in the 3 Parameter Logistic Model (3PLM), the probability of a correct response to an item i is:



$$p_i(\theta) = c_i + \frac{1-c_i}{1 + e^{-D a_i (\theta - b_i)}}$$

where $$\theta$$ is the person parameter and $$a_i$$, $$b_i$$, and $$c_i$$ are item parameters. The parameter $$b_i$$ represents the item location which, in the case of attainment testing, is referred to as the item difficulty. The item parameter $$a_i$$ represents the discrimination of the item: that is, the degree to which the item discriminates between persons in different regions on the latent continuum. This parameter characterizes the slope of the item characteristic curve (ICC; see below). For items such as multiple choice questions, the parameter $$c_i$$ is used in an attempt to account for the effects of guessing on the probability of a correct response.

This logistic model relates the level of the person parameter and item parameters to the probability of responding correctly. The constant D has the value 1.701, which rescales the logistic function to closely approximate the cumulative normal ogive; specifically, such that the probability differs by no more than 0.01 across the range of the function. The model was originally developed using the normal ogive, but the logistic model with this rescaling provides virtually the same model while greatly simplifying the computations involved in its application.
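
As a concrete illustration, the model can be evaluated directly. Below is a minimal Python sketch of the 3PLM; the item parameter values are hypothetical, chosen only for illustration:

```python
import math

def p_3pl(theta, a, b, c, D=1.701):
    """Probability of a correct response to an item under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, and a
# guessing floor typical of a four-option multiple choice item.
print(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.25))   # person at the item location: 0.625
print(p_3pl(theta=-3.0, a=1.0, b=0.0, c=0.25))  # low-ability person: close to c
```

A person located exactly at the item difficulty answers correctly with probability halfway between $$c_i$$ and 1, while for a very low ability person the probability approaches the guessing floor $$c_i$$.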

The graph that maps location on the latent continuum to probability for a given item, across levels of the trait, is called the item characteristic curve (ICC) or, less commonly, item response function.

The person parameter represents the magnitude of latent trait of the individual. The estimate of the person parameter is derived from the individual's total score on the assessment, which is a weighted score when the model contains item discrimination parameters. The latent trait is the human capacity or attribute measured by the test. It might be a cognitive ability, physical ability, skill, knowledge, attitude, personality characteristic, etc. In a one dimensional model such as the one above, this trait is analogous to a single factor in factor analysis. Individual items or individuals might have secondary factors but these are assumed to be mutually independent and collectively orthogonal.

The item parameters simply determine the shape of the ICC and in some cases may not have a direct interpretation. In this case, however, the parameters are commonly interpreted as follows. The parameter b is considered to reflect an item's difficulty. Note that this model scales the item's difficulty and the person's trait onto the same continuum. Thus, it is valid to talk about an item being about as hard as Person A's trait level or of a person's trait level being about the same as Item Y's difficulty, in the sense that successful performance of the task involved with an item reflects a specific level of ability. The parameter a reflects how steeply the ICC rises and thus indicates the degree to which the item distinguishes between the levels of trait of individuals across the continuum. The final parameter, c, is the asymptote of the ICC on the left-hand side. Thus it indicates the probability that very low ability individuals will get this item correct by chance.

This model requires a single trait dimension and a binary outcome; i.e., it is a dichotomous, one dimensional model. Another class of models applies to polytomous outcomes. For example, the Polytomous Rasch model is a generalisation of the Rasch model for dichotomous data. In addition, a class of functions models response data hypothesized to arise from multiple traits.

Information
One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And traditionally, it is measured using a single index defined in various ways, such as the ratio of true and observed score variance. This index is helpful in characterizing a test's average reliability, for example in order to compare two tests. But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the test's range, for example, generally have more error associated with them than scores closer to the middle of the range.

Item response theory advances the concept of item and test information to replace reliability. Information is also a function of the model parameters. For example, in terms of Fisher information, the item information supplied in the case of the Rasch model for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or,



$$I(\theta) = p_i(\theta)\, q_i(\theta).$$

The standard error of estimation (SE) at a given trait level is the reciprocal of the square root of the test information at that level:



$$\mbox{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}}.$$

Thus more information implies less error of measurement.
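
The relationship can be made concrete with a short sketch. The following Python code, assuming a hypothetical five-item Rasch test, computes the standard error at two trait levels:

```python
import math

def item_information(theta, b):
    """Rasch item information: p * q."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def standard_error(theta, difficulties):
    """SE(theta) = 1 / sqrt(test information), summing over item difficulties."""
    info = sum(item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical five-item test: the error is smallest where difficulties cluster.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(standard_error(0.0, difficulties))  # near the centre of the test
print(standard_error(3.0, difficulties))  # at the edge: a larger SE
```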

For other models, such as the two- and three-parameter models, the discrimination parameter plays an important role in the information function. The item information function for the two parameter model is



$$I(\theta) = a_i^2\, p_i(\theta)\, q_i(\theta).$$

In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range.

Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range. Because of local independence, item information functions are additive. Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.
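
A sketch of this additivity for the two parameter model, written in the logistic metric (without the D scaling constant) and using hypothetical (a, b) pairs:

```python
import math

def item_info_2pl(theta, a, b):
    """2PL item information: a^2 * p * q (logistic metric)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def test_info(theta, items):
    """Test information: the sum of the item informations (local independence)."""
    return sum(item_info_2pl(theta, a, b) for a, b in items)

# Hypothetical (a, b) pairs: a highly discriminating item, a weakly
# discriminating item, and a harder item.
items = [(1.8, 0.0), (0.6, 0.0), (1.0, 1.5)]
for theta in (-1.0, 0.0, 1.5):
    print(theta, test_info(theta, items))
```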

Scoring
After the model is fit to data, each person has a theta estimate. This estimate is their score on the exam. This "IRT score" is computed and interpreted very differently from traditional scores such as the number or percent correct. However, for most tests, the (linear) correlation between the theta estimate and a traditional score is very high (often .95 or more). A graph of IRT scores against traditional scores shows an ogive shape, implying that the IRT estimates separate individuals at the borders of the range more than in the middle.
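
When the model contains discrimination parameters, the theta estimate is typically obtained iteratively, for example by maximum likelihood. The following sketch applies Fisher scoring (a Newton-Raphson variant) to the two parameter model in the logistic metric; the item parameters and response pattern are hypothetical, and note that all-correct or all-incorrect patterns have no finite maximum likelihood estimate:

```python
import math

def p_2pl(theta, a, b):
    """2PL response probability (logistic metric)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(responses, items, tol=1e-6, max_iter=50):
    """Fisher-scoring maximum likelihood estimate of theta.

    responses: list of 0/1 item scores; items: matching list of (a, b) pairs.
    """
    theta = 0.0
    for _ in range(max_iter):
        # Score function (first derivative of the log-likelihood) and the
        # Fisher information at the current theta.
        score = sum(a * (u - p_2pl(theta, a, b))
                    for u, (a, b) in zip(responses, items))
        info = sum(a * a * p_2pl(theta, a, b) * (1.0 - p_2pl(theta, a, b))
                   for a, b in items)
        step = score / info
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical three-item response pattern: correct, correct, incorrect.
items = [(1.2, -0.5), (1.0, 0.0), (0.8, 1.0)]
print(ml_theta([1, 1, 0], items))
```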

It is worth noting the implications of IRT for test-takers. Tests are imprecise tools: the score achieved by an individual (the observed score) is always the true score obscured by some degree of error. This error may push the observed score higher or lower.

Also, nothing about these models refutes human development or improvement. A person may learn skills, knowledge or even so called "test-taking skills" which may translate to a higher true-score.

A comparison of classical and modern test theory
Classical test theory (CTT) and IRT are largely concerned with the same problems but employ very different methods. Although the two paradigms are generally consistent and complementary, there are a number of points of difference:


 * IRT makes stronger assumptions than CTT and in many cases provides correspondingly stronger findings; primarily, characterizations of error (see "Information and reliability" below). Of course, these results only hold when the assumptions of the IRT models are actually met.
 * Although CTT results have allowed important practical results, the model-based nature of IRT affords many advantages over analogous CTT findings.
 * CTT test scoring procedures have the advantage of being simple to compute (and to explain) whereas IRT scoring generally requires relatively complex estimation procedures (except for IRT models such as the Rasch model, where the number-correct score is a sufficient statistic for theta).
 * IRT provides several improvements in scaling items and people. The specifics depend upon the IRT model, but most models scale the difficulty of items and the ability of people on the same metric. Thus the difficulty of an item and the ability of a person can be meaningfully compared.
 * Another improvement provided by IRT is that the parameters of IRT models are generally not sample- or test-dependent, whereas true score is defined in CTT in the context of a specific test. Thus IRT provides significantly greater flexibility in situations where different samples or test forms are used. These IRT findings are foundational for computerized adaptive testing.
 * The discrimination parameter of an item is somewhat analogous to a point-biserial correlation. Specifically, when there are relatively equal numbers of persons with locations below and above the location of a given item on the latent trait, a point-biserial correlation will generally be greater when the item discrimination is greater (a simulation sketch illustrating this follows the list). There is, however, generally more uncertainty associated with a point-biserial correlation for items with extreme locations because there are relatively few persons in the correct and incorrect groups (for an attainment item).
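
The analogy in the last point can be illustrated by simulation. This sketch uses theta itself as a stand-in for the total score (to keep the example self-contained), simulates two parameter responses for hypothetical items, and requires Python 3.10+ for statistics.correlation:

```python
import math
import random
import statistics

random.seed(0)

def point_biserial(a, b, n=20000):
    """Correlate simulated 2PL item responses with theta (a stand-in for the
    total score here) for an item with discrimination a and location b."""
    thetas = [random.gauss(0.0, 1.0) for _ in range(n)]
    responses = [1.0 if random.random() < 1.0 / (1.0 + math.exp(-a * (t - b)))
                 else 0.0
                 for t in thetas]
    return statistics.correlation(thetas, responses)

# Same location, different discriminations: the correlation tracks a.
print(point_biserial(a=0.5, b=0.0))
print(point_biserial(a=2.0, b=0.0))
```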

Information and reliability
Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification. In place of reliability, IRT offers the test information function, which shows the degree of precision at different values of theta.

These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of ability by including carefully chosen items. For example, in a certification situation in which a test can only be passed or failed, where there is only a single "cut-score," and where the actual passing score is unimportant, a very efficient test can be developed by selecting only items that have high information near the cut-score. These items generally correspond to items whose difficulty is about the same as that of the cut-score.
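
A sketch of this selection strategy, assuming a hypothetical bank of calibrated two parameter items and a cut-score expressed on the theta scale:

```python
import math

def item_info_2pl(theta, a, b):
    """2PL item information: a^2 * p * q."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_items(bank, cut_score, n):
    """Pick the n bank items carrying the most information at the cut-score."""
    return sorted(bank, key=lambda ab: -item_info_2pl(cut_score, *ab))[:n]

# Hypothetical calibrated bank of (a, b) pairs with a cut-score at theta = 0.5:
# the discriminating items located near the cut-score are selected first.
bank = [(1.5, 0.4), (0.7, 0.5), (1.2, -1.0), (1.8, 0.6), (1.0, 2.0)]
print(select_items(bank, cut_score=0.5, n=3))
```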

Although IRT provides for a standard error of each estimate and an information function, it is also possible to obtain an index for a test as a whole that is directly analogous to Cronbach's alpha, called the separation index. To do so, it is necessary to begin with a decomposition of an IRT estimate into a true location and error, analogous to the decomposition of an observed score into a true score and error in CTT. Let


$$\hat{\theta} = \theta + \epsilon$$

where $$\theta$$ is the true location, and $$\epsilon$$ is the error associated with an estimate. Then $$\mbox{SE}(\theta)$$ is an estimate of the standard deviation of $$\epsilon$$ for a person with a given weighted score, and the separation index is obtained as follows:



$$R_\theta = \frac{\mbox{var}[\theta]}{\mbox{var}[\hat{\theta}]} = \frac{\mbox{var}[\hat{\theta}] - \mbox{var}[\epsilon]}{\mbox{var}[\hat{\theta}]}$$

where the mean squared standard error of the person estimates gives an estimate of the variance of the errors, $$\epsilon_n$$, across persons. The standard errors are normally produced as a by-product of the estimation process (see, for example, Rasch model estimation). The separation index is typically very close in value to Cronbach's alpha (Andrich, 1982).
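
A sketch of the computation, assuming hypothetical person estimates and standard errors produced by a calibration run:

```python
def separation_index(theta_hats, standard_errors):
    """Separation index: (observed variance - mean squared SE) / observed variance."""
    n = len(theta_hats)
    mean = sum(theta_hats) / n
    obs_var = sum((t - mean) ** 2 for t in theta_hats) / (n - 1)
    err_var = sum(se * se for se in standard_errors) / n  # mean squared SE
    return (obs_var - err_var) / obs_var

# Hypothetical theta estimates and their standard errors.
thetas = [-1.2, -0.4, 0.1, 0.6, 1.4]
ses = [0.45, 0.38, 0.36, 0.38, 0.47]
print(separation_index(thetas, ses))
```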

Additional reading
Many books have been written that address item response theory or contain IRT or IRT-like models. This is a partial list, focusing on texts that provide more depth.

 * Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum. This book summarizes much of Lord's IRT work, including chapters on the relationship between IRT and classical methods, fundamentals of IRT, estimation, and several advanced topics. Its estimation chapter is now dated in that it primarily discusses the joint maximum likelihood method rather than the marginal maximum likelihood method implemented by Darrell Bock and his colleagues.
 * Embretson, S. and Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum. This book is an accessible introduction to IRT, aimed, as the title says, at psychologists.
 * Van der Linden, W.J. & Hambleton, R.K. (Eds.) (1997). Handbook of modern item response theory. New York: Springer. This book provides a comprehensive overview of various popular IRT models and is well suited for persons who already have a basic understanding of IRT.
 * De Boeck, P., & Wilson, M. (Eds.) (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer. This volume provides an integrated introduction to item response models, mainly aimed at practitioners (researchers and graduate students).