When creating a test, one generally uses a subset of items to represent a larger construct.

Chapter 5 covered topics that rely on statistical analyses of data from educational and psychological measurements. These analyses are used to examine relationships among scores on one or more test forms, in the case of reliability, and among scores based on ratings from two or more judges, in the case of interrater reliability. Aside from coefficient alpha, all of the statistical analyses introduced so far focus on composite scores. Item analysis focuses instead on statistical analysis of the individual items that make up these composites.

As discussed in Chapter 4, test items are the most basic building blocks of an assessment instrument. Item analysis lets us investigate the quality of these individual building blocks, including how well each one contributes to the whole and supports the validity of our measurement.

This chapter extends concepts from Chapters 2 and 5 to analysis of item performance within a CTT framework. The chapter begins with an overview of item analysis, including some general guidelines for preparing for an item analysis, entering data, and assigning score values to individual items. Some commonly used item statistics are then introduced and demonstrated. Finally, two additional item-level analyses are discussed, differential item functioning analysis and option analysis.

Learning objectives

  1. Explain how item bias and measurement error negatively impact the quality of an item, and how item analysis, in general, can be used to address these issues.
  2. Describe general guidelines for collecting pilot data for item analysis, including how following these guidelines can improve item analysis results.
  3. Identify items that may have been keyed or scored incorrectly.
  4. Recode variables to reverse their scoring or keyed direction.
  5. Use the appropriate terms to describe the process of item analysis with cognitive vs noncognitive constructs.
  6. Calculate and interpret item difficulties and compare items in terms of difficulty.
  7. Calculate and interpret item discrimination indices, and describe what they represent and how they are used in item analysis.
  8. Describe the relationship between item difficulty and item discrimination and identify the practical implications of this relationship.
  9. Calculate and interpret alpha-if-item-deleted.
  10. Utilize item analysis to distinguish between items that function well in a set and items that do not.
  11. Remove items from an item set to achieve a target level of reliability.
  12. Evaluate selected-response options using option analysis.

In this chapter, we’ll run item and option analyses on PISA09 data using epmr, with results plotted, as usual, using ggplot2.

# R setup for this chapter
# Required packages are assumed to be installed,
# see chapter 1
library("epmr")
library("ggplot2")
# Functions we'll use in this chapter
# str() for checking the structure of an object
# recode() for recoding variables
# colMeans() for getting means by column
# istudy() from epmr for running an item analysis
# ostudy() from epmr for running an option analysis
# subset() for subsetting data
# na.omit() for removing cases with missing data

Preparing for item analysis

Item quality

As noted above, item analysis lets us examine the quality of individual test items. Information about individual item quality can help us determine whether or not an item is measuring the content and construct that it was written to measure, and whether or not it is doing so at the appropriate ability level. Because we are discussing item analysis here in the context of CTT, we’ll assume that there is a single construct of interest, perhaps assessed across multiple related content areas, and that individual items can contribute to or detract from our measurement of that construct by limiting or introducing construct-irrelevant variance in the form of bias and random measurement error.

Bias represents a systematic error with an influence on item performance that can be attributed to an interaction between examinees and some feature of the test. Bias in a test item leads examinees with a given background characteristic, aside from their ability, to perform better or worse on the item simply because of that background characteristic. For example, bias sometimes results from the use of scenarios or examples in an item that are more familiar to certain gender or ethnic groups. Differential familiarity with item content can make an item more relevant, engaging, and easily understood, and can then lead to differential performance, even for examinees of the same ability level. We identify such item bias primarily by using measures of item difficulty and differential item functioning (DIF), discussed below and again in Chapter 7.

Bias in a test item indicates that the item is measuring some other construct besides the construct of interest, where systematic differences on the other construct are interpreted as meaningful differences on the construct of interest. The result is a negative impact on the validity of test scores and the corresponding inferences and interpretations. Random measurement error, on the other hand, is not attributed to a specific identifiable source, such as a second construct. Instead, measurement error is inconsistency of measurement at the item level. An item that introduces measurement error detracts from the overall internal consistency of the measure, and this is detected in CTT, in part, using item analysis statistics.

Piloting

The goal in developing an instrument or scale is to identify bias and inconsistent measurement at the item level prior to administering a final version of our instrument. As we talk about item analysis, remember that the analysis itself is typically carried out in practice using pilot data. Pilot data are gathered prior to or while developing an instrument or scale. These data require at least a preliminary version of the educational or psychological measure. We’ve written some items for our measure, and we want to see how well they work.

Nunnally and Bernstein (1994) and others recommend that the initial pilot “pool” of candidate test items be 1.5 to 2 times as large as the final number of items needed. So, if you’re envisioning a test with 100 items on it, you should aim to pilot 150 to 200 items. This may not be feasible, but it is a best-case scenario, and should at least be followed in large-scale testing. By collecting data on up to twice as many items as we intend to actually use, we’re acknowledging that, despite our best efforts, many of our preliminary test items may be low quality, for example, biased or internally inconsistent, or may address different ability levels or content than intended.

An adequate sample size of test takers is essential if we hope to obtain item analysis results that generalize to the population of test takers. Nunnally and Bernstein (1994) recommend that data be collected on at least 300 individuals from the population of interest, or 5 times as many individuals as test items, whichever is larger. A more practical goal for smaller scale testing applications, such as classroom assessments, is 100 to 200 test takers. With smaller or non-representative samples, our item analysis results must be interpreted with caution. As with inferences made from other types of statistics, small samples more often lead to erroneous results. Keep in mind that every statistic discussed here has a standard error and confidence interval associated with it, whether it is directly examined or not. Note also that bias and measurement error arise in addition to this standard error, or sampling error, and we cannot identify bias in our test questions without representative data from our intended population. Thus, adequate sampling in the pilot study phase is critical.
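
As a quick illustration of these two rules of thumb, the short sketch below computes a pilot item pool and a minimum sample size for a hypothetical final test length. The numbers and object names here are ours, chosen only for demonstration; they are not part of epmr or PISA09.

# A minimal sketch of the piloting rules of thumb described above
# Hypothetical final test length
n_final <- 100
# Pilot pool of 1.5 to 2 times the final number of items
pool_range <- c(1.5, 2) * n_final
pool_range
# Sample size: at least 300 test takers, or 5 times the number
# of piloted items, whichever is larger
n_piloted <- max(pool_range)
max(300, 5 * n_piloted)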

The item analysis statistics discussed here are based on the CTT model of test performance. In Chapter 7 we’ll discuss the more complex item response theory (IRT) and its applications in item analysis.

Data entry

After piloting a set of items, raw item responses are organized into a data frame with test takers in rows and items in columns. The str() function is used here to summarize the structure of the unscored items on the PISA09 reading test. Each unscored item is coded in R as a factor with four to eight factor levels. Each factor level represents different information about a student’s response.

# Recreate the item name index and use it to check the
# structure of the unscored reading items
# The strict.width argument is optional, making sure the
# results fit in the console window
# Item names below are assumed; confirm against names(PISA09)
ritems <- c("r414q02", "r414q11", "r414q06", "r414q09",
  "r452q03", "r452q04", "r452q06", "r452q07", "r458q01",
  "r458q07", "r458q04")
str(PISA09[, ritems], strict.width = "cut")
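
Before item statistics can be computed, unscored factor responses like these are typically converted to numeric item scores. The sketch below shows one way dichotomous scoring could be done for a single hypothetical selected-response item; the response vector and key are made up for illustration and are not taken from PISA09.

# A minimal scoring sketch for a hypothetical item: responses
# are stored as a factor and scored 1 if they match the key
resp <- factor(c("a", "c", "b", "c", "d", NA),
  levels = c("a", "b", "c", "d"))
key <- "c"
scored <- ifelse(resp == key, 1, 0)
scored
# Missing responses remain NA here; depending on the scoring
# design, they could instead be set to 0 with
# scored[is.na(scored)] <- 0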
