Reliability of Test Scores and Test Items

Reliability

  • an umbrella term under which different types of score stability are assessed

  • suggests trustworthiness and stability

  • can pertain to stability of scores over time (test-retest), stability of item scores across items (internal consistency), or stability of ratings across judges, or raters, of a person, object, event, and so on (interrater reliability)

  • a quality of test scores that suggests they are sufficiently consistent and free from measurement error to be useful

  • the evaluation of score reliability involves a 2-step process that consists of (a) determining what possible sources of error may enter into test scores and (b) estimating the magnitude of those errors

Sources of Error in Psychological Testing

  • Error can enter into the scores of psychological tests for an enormous number of reasons, many of which are outside the purview of psychometric estimates of reliability.

  • Generally speaking, however, the errors that enter into test scores may be categorized as stemming from one or more of the following 3 sources:

    • the context in which the testing takes place

    • test taker

    • test itself

Sources of Measurement Error with Typical Reliability Coefficients Used to Estimate Them

Interscorer or Interrater Differences

  • arise whenever a test is scored with a degree of subjectivity

  • the label assigned to the error that may enter into scores whenever the element of subjectivity plays a part in scoring a test

  • it is assumed that different judges will not always assign the same exact scores or ratings to a given test performance even if:

    • the scoring directions specified in the test manual are explicit and detailed

    • the scorers are conscientious in applying those directions

  • it refers to variations in scores that stem from differences in the subjective judgement of the scorers

  • Scorer Reliability

Time Sampling Error

  • refers to the variability inherent in test scores as a function of the fact that they are obtained at one point in time rather than at another

  • whereas a certain amount of time sampling error is assumed to enter into all test scores, as a rule, one should expect less of it in the scores of tests that assess relatively stable traits

  • Test-Retest Reliability

Content Sampling Error

  • the term used to label the trait-irrelevant variability that can enter into test scores as a result of fortuitous factors related to the content of the specific items included in a test

  • Alternate-Form Reliability

    • To investigate this kind of reliability, 2 or more different forms of the test -- identical in purpose but differing in specific content -- need to be prepared and administered to the same group of subjects. The test takers’ scores on each of the versions are then correlated to obtain alternate-form reliability coefficients

  • Split-Half Reliability

    • Administer a test to a group of individuals and create 2 scores for each person by splitting the test into halves

Inter-Item Inconsistency

  • refers to error in scores that results from fluctuations in items across an entire test, as opposed to the content sampling error emanating from the particular configuration of items included in the test as a whole

  • Such inconsistencies can be due to a variety of factors, including content sampling error and content heterogeneity

  • Content Heterogeneity

    • results from the inclusion of items or sets of items that tap content knowledge or psychological functions that differ from those tapped by other items in the same test

    • can be checked using split-half reliability or inter-item consistency estimates

    • the 2 most frequently used formulas for calculating inter-item consistency are the Kuder-Richardson Formula 20 (KR-20) and coefficient alpha (α), also known as Cronbach’s alpha
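
The formulas themselves are compact: coefficient alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), and KR-20 is the same expression with the sum of p·q as the item-variance term for dichotomously scored (0/1) items. Below is a minimal NumPy sketch of both, assuming a score matrix with one row per examinee and one column per item; the function names are illustrative.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an (examinees x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0)               # variance of each item across examinees
    total_var = scores.sum(axis=1).var()         # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def kr20(scores):
    """KR-20: the special case of coefficient alpha for dichotomous (0/1) items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)                      # proportion of examinees passing each item
    total_var = scores.sum(axis=1).var()
    return (k / (k - 1)) * (1.0 - (p * (1 - p)).sum() / total_var)
```

For 0/1 data the two functions return the same value, since p(1 - p) is simply the variance of a dichotomous item.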

Tests for Reliability

Test-Retest Reliability

  • An estimate of reliability obtained by correlating pairs of scores from the same people on 2 different administrations of the same test

  • Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait
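
As an illustration, the computation itself is just a Pearson correlation between the 2 sets of scores; the score values below are hypothetical. The same correlation, computed between scores on 2 parallel forms rather than 2 administrations, gives the alternate-forms coefficient described in the next section.

```python
import numpy as np

# Hypothetical scores for the same 6 examinees on 2 administrations of one test.
time_1 = np.array([24, 31, 18, 27, 35, 22], dtype=float)
time_2 = np.array([26, 30, 20, 25, 36, 21], dtype=float)

# The test-retest coefficient is the Pearson r between the 2 sets of scores.
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest reliability: {r_test_retest:.2f}")
```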

Alternate-Forms Reliability

  • The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms coefficient of reliability, which is often termed the coefficient of equivalence

  • Alternate-Forms

    • simply different versions of a test that have been constructed so as to be parallel

Split-Half Reliability

  • An estimate of split-half reliability is obtained by correlating the 2 sets of scores obtained from equivalent halves of a single test administered once

  • It is a useful measure of reliability when it is impractical or undesirable to assess reliability with 2 tests or to administer a test twice (because of factors such as time and expense)

    • Step 1: Divide the test into equivalent halves

    • Step 2: Calculate a Pearson r between scores on the 2 halves of the test

    • Step 3: Adjust the half-test reliability using the Spearman-Brown formula
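
A minimal sketch of these 3 steps, assuming an examinee-by-item NumPy score matrix and using an odd-even item split as one common way to form equivalent halves (the function name is illustrative):

```python
import numpy as np

def split_half_reliability(scores):
    """Split-half reliability of an (examinees x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)            # Step 1: half-scores from items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)           #         and from items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]   # Step 2: Pearson r between the halves
    return (2 * r_half) / (1 + r_half)                # Step 3: Spearman-Brown adjustment to full length
```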

Inter-Item Consistency

  • refers to the degree of correlation among all the items on a scale

  • An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test.

Interrater Reliability

  • the degree of agreement or consistency between 2 or more scorers (judges or raters) with regard to a particular measure
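
As an illustrative sketch (the ratings below are hypothetical), 2 simple ways to quantify agreement between raters are the correlation between their scores and their proportion of exact agreement:

```python
import numpy as np

# Hypothetical ratings assigned by 2 judges to the same 8 test performances.
rater_a = np.array([4, 3, 5, 2, 4, 5, 3, 4])
rater_b = np.array([4, 3, 4, 2, 5, 5, 3, 4])

# One simple interrater index: the Pearson r between the 2 sets of ratings.
r_interrater = np.corrcoef(rater_a, rater_b)[0, 1]

# Another: the proportion of performances the judges score identically.
exact_agreement = np.mean(rater_a == rater_b)

print(f"interrater correlation: {r_interrater:.2f}")
print(f"exact agreement: {exact_agreement:.0%}")
```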

Heterogeneity

  • describes the degree to which a test measures different factors. A heterogeneous (nonhomogeneous) test is composed of items that measure more than one trait

What to Do When the Reliability of a Test Is Low

  • increase the number of items (see the Spearman-Brown sketch below)

  • conduct factor and item analysis
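
For the first option, the general Spearman-Brown formula can estimate how much reliability would rise if the test were lengthened with comparable items; the sketch below assumes the current reliability is known and that the added items behave like the existing ones.

```python
def spearman_brown(r_current, length_factor):
    """Estimated reliability if the test is lengthened by `length_factor`
    (e.g., 2.0 means doubling the number of comparable items)."""
    return (length_factor * r_current) / (1 + (length_factor - 1) * r_current)

# e.g., a test with reliability .60, doubled in length with comparable items:
print(f"{spearman_brown(0.60, 2.0):.2f}")   # -> 0.75
```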
