Reliability and Validity

The Concept of Reliability

  • Reliability: consistency in measurement

  • observed score = true score + error (X = T + E); a simulation sketch follows this list

  • Error: refers to the component of the observed score that does not have to do with the test taker’s true ability or trait being measured

  • Measurement Error

    • Random Error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (i.e., noise)

    • Systematic Error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
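
A minimal simulation sketch of the true-score model above, with entirely made-up numbers: each observed score is a fixed true score plus a constant systematic error (bias) and random error (noise). Random error tends to average out across repeated measurements; systematic error does not.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 50.0      # T: the test taker's hypothetical true standing
systematic_bias = 2.0  # systematic error: a constant shift in every measurement
noise = rng.normal(loc=0.0, scale=3.0, size=1000)  # random error with mean 0

# X = T + E, where E here combines the constant bias and the random noise
observed = true_score + systematic_bias + noise

print(observed[:5].round(1))     # individual observed scores bounce around
print(observed.mean().round(1))  # close to 52: random error cancels out, the bias remains
```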

Sources of Error

  • Test Construction: variation may exist within items on a test or between tests

  • Test Administration

    • sources of error may stem from the testing environment

    • test-taker variables such as pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication may also introduce error

    • examiner-related variables such as physical appearance and demeanor may play a role

  • Test Scoring and Interpretation

    • computerized scoring reduces error in test scoring, but many tests still require expert interpretation

    • subjectivity in scoring can enter into behavioral assessment

Reliability Estimates

Test-Retest Reliability

  • an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test (see the sketch after this list)

  • Most appropriate for variables that should be stable over time (e.g., personality) and not appropriate for variables expected to change over time (e.g., mood)

  • when the interval between administrations is longer than 6 months, the estimate of test-retest reliability is called the coefficient of stability

  • Carryover Effect: this effect occurs when the first testing session influences scores from the second session

  • Practice Effect: when a test is given a second time, test takers score better because they have sharpened their skills by having taken the test the first time
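
A minimal sketch of a test-retest estimate, assuming the same (hypothetical) five people take the same test twice; the reliability estimate is simply the Pearson correlation between the two sets of scores.

```python
import numpy as np

# Hypothetical scores for the same 5 people on two administrations of one test
time1 = np.array([ 98, 105, 112,  90, 120])
time2 = np.array([100, 103, 115,  92, 118])

# Test-retest reliability: Pearson r between the two administrations
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))  # values near 1.0 suggest consistent measurement
```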

Alternate-Forms

  • Coefficient of Equivalence: the degree of the relationship between various forms of a test

  • Alternate Forms

    • different versions of a test that have been constructed so as to be parallel

    • item content and difficulty are similar between tests

  • Reliability is checked by administering 2 forms of a test to the same group. Scores may be affected by error related to the state of test takers (e.g., practice, fatigue, etc.) or item sampling

Split-Half Reliability

  • obtained by correlating scores from 2 equivalent halves of a single test administered once

  • entails 3 steps (sketched in code after this list)

    1. divide the test into equivalent halves

    2. calculate a Pearson r between scores on the 2 halves of the test

    3. adjust the half-test reliability using the Spearman-Brown formula

    • Spearman-Brown Formula: allows a test developer or user to estimate internal consistency reliability from a correlation of 2 halves of a test
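
A minimal sketch of the three steps, using a made-up matrix of right/wrong item responses: split the items into odd- and even-numbered halves, correlate the half scores, then apply the Spearman-Brown correction r_SB = 2r / (1 + r) to estimate the reliability of the full-length test.

```python
import numpy as np

# Hypothetical responses: 6 people x 10 items scored 1 (correct) or 0 (incorrect);
# stronger test takers answer more items correctly, so the halves should correlate
items = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])

# Step 1: divide the test into equivalent halves (odd- vs. even-numbered items)
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the two halves
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: Spearman-Brown adjustment up to full test length
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```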

Other Methods of Estimating Internal Consistency

Inter-Item Consistency

  • the degree of relatedness of items on a test

  • able to gauge the homogeneity of a test

Kuder-Richardson Formula 20

  • statistic of choice for determining the inter-item consistency of dichotomous items (e.g., items scored right/wrong); see the sketch below
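
A minimal sketch of KR-20 for dichotomously scored items, with made-up responses: KR-20 = (k / (k - 1)) * (1 - sum(p * q) / var(total)), where p is the proportion passing each item and q = 1 - p.

```python
import numpy as np

# Hypothetical responses: 6 people x 5 items scored 1 (pass) or 0 (fail)
items = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])

k = items.shape[1]                         # number of items
p = items.mean(axis=0)                     # proportion passing each item
q = 1 - p                                  # proportion failing each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))
```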

KR-21

  • may be used if there is reason to assume that all test items have approximately the same degree of difficulty

Coefficient Alpha

  • mean of all possible split-half correlations, corrected by the Spearman-Brown formula

  • most popular approach for internal consistency

  • values range from 0 to 1

    • an alpha of about .90 or higher may indicate item redundancy rather than better measurement (see the sketch after this list)
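
A minimal sketch of coefficient alpha with made-up 1-5 ratings (items need not be dichotomous): alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores).

```python
import numpy as np

# Hypothetical 1-5 ratings: 6 people x 4 items
items = np.array([
    [5, 4, 5, 4],
    [4, 4, 4, 5],
    [3, 3, 4, 3],
    [3, 2, 3, 3],
    [2, 2, 1, 2],
    [1, 1, 2, 1],
])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))  # ranges from 0 to 1; values around .90+ may signal redundant items
```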

Measures of Inter-Scorer Reliability

Inter-Scorer Reliability

  • the degree of agreement or consistency between 2 or more scorers (judges or raters) with regard to a particular measure

  • also called interrater, interscorer, interobserver, or interjudge reliability

    • it is often used with behavioral measures

    • guards against biases or idiosyncrasies in scoring

    • Coefficient of Inter-Scorer Reliability

      • the scores from different raters are correlated with one another

    • Kappa Statistic

      • corrects observed agreement for the agreement expected by chance; often considered the best method for assessing the level of agreement among several observers (a two-rater sketch follows this list)
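
A minimal sketch of the kappa statistic for the simplest two-rater case, with made-up nominal codes (agreement among more than two observers uses extensions such as Fleiss' kappa): kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
import numpy as np

# Hypothetical behavior codes assigned by two raters to the same 10 observations
rater1 = np.array(["on", "off", "on", "on", "off", "on", "off", "on", "on", "off"])
rater2 = np.array(["on", "off", "on", "off", "off", "on", "off", "on", "on", "on"])

codes = np.unique(np.concatenate([rater1, rater2]))
p_o = np.mean(rater1 == rater2)  # observed proportion of agreement

# Chance agreement: product of the raters' marginal proportions, summed over codes
p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in codes)

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 1 = perfect agreement, 0 = chance-level agreement
```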

The Concept of Validity

  • Validity: a judgment or estimate of how well a test measures what it purports to measure in a particular context

  • Validation

    • the process of gathering and evaluating evidence about validity

    • both test developers and test users may play a role in the validation of a test

    • test users may validate a test with their own group of test takers (local validation)

  • 3 Categories

    • Content Validity: measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test

    • Criterion-Related Validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures

    • Construct Validity: a measure of validity that is arrived at by executing a comprehensive analysis of

      • how scores on the test relate to other test scores and measures

      • how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure

Face Validity

  • a judgment concerning how relevant the test items appear to be

  • if a test appears to measure what it purports to measure “on the face of it”, it could be said to be high in face validity

  • many self-report personality tests are high in face validity, whereas projective tests, such as the Rorschach, tend to be low in face validity (i.e., it is not apparent what is being measured)

  • a perceived lack of face validity may lead to a lack of confidence in the test measuring what it purports to measure

Content Validity

  • a judgment of how adequately a test samples behaviors representative of the universe of behavior that the test was designed to sample

  • Does the test adequately represent the content that should be included in the test?

  • Test Blueprint: a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.

  • if more than half the raters indicate that an item is essential, the item has at least some content validity (a content validity ratio sketch follows this list)

  • 2 Concepts Relevant to Content Validity

    • Construct Underrepresentation

      • describes the failure to capture important components of a construct

    • Construct-Irrelevant Variance

      • occurs when scores are influenced by factors irrelevant to the construct
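
The “more than half the raters” rule noted above matches Lawshe's content validity ratio, CVR = (n_e - N/2) / (N/2), where n_e is the number of expert raters calling an item essential and N is the total number of raters; the notes do not name the formula, so treat this sketch as an assumption, and the panel sizes below are made up.

```python
def content_validity_ratio(n_essential: int, n_raters: int) -> float:
    """Lawshe's CVR: 0 when exactly half the raters call the item essential,
    positive when more than half do, and 1.0 when all of them do."""
    half = n_raters / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 expert raters judging two items
print(content_validity_ratio(n_essential=7, n_raters=10))  # 0.4 -> some content validity
print(content_validity_ratio(n_essential=5, n_raters=10))  # 0.0 -> borderline
```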

Criterion-Related Validity

  • Criterion: the standard against which a test or a test score is evaluated

    • Characteristics

      • relevant for the matter at hand

      • valid for the purpose for which it is being used

      • uncontaminated: not part of the predictor

    • a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest

      • Concurrent Validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)

      • Predictive Validity: an index of the degree to which a test score predicts some criterion, or outcome, measure in the future

        • tests used for screening or selection (e.g., admissions tests) are typically evaluated in terms of their predictive validity (see the sketch after this list)
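
A minimal sketch of a criterion-related validity check with made-up data: the validity coefficient is the correlation between test scores and a criterion measure (collected at the same time for concurrent validity, or later for predictive validity), and the same data can feed a simple regression used to predict the criterion.

```python
import numpy as np

# Hypothetical selection-test scores and a criterion measured later (e.g., first-year GPA)
test_scores = np.array([45, 52, 60, 63, 70, 75, 80, 88])
criterion   = np.array([2.1, 2.4, 2.6, 3.0, 2.9, 3.3, 3.4, 3.8])

# Validity coefficient: correlation between the test and the criterion
validity_coefficient = np.corrcoef(test_scores, criterion)[0, 1]

# Simple regression line for predicting the criterion from a new test score
slope, intercept = np.polyfit(test_scores, criterion, deg=1)
predicted = slope * 68 + intercept

print(round(validity_coefficient, 2), round(predicted, 2))
```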

Construct Validity

  • the ability of a test to measure a theorized construct (e.g., intelligence, aggression, personality, etc.) that it purports to measure

  • if a test is a valid measure of a construct, high scorers and low scorers should behave as theorized

  • all types of validity evidence, including evidence from the content- and criterion-related varieties of validity, come under the umbrella of construct validity

  • Evidence

    • Homogeneity: how uniform a test is in measuring a single concept

    • Changes with Age: some constructs are expected to change over time (e.g., reading rate)

    • Pretest and Posttest Changes: test scores change as a result of some experience between a pretest and a posttest (e.g., therapy)

    • Distinct Groups: scores on a test vary in a predictable way as a function of membership in some group (e.g., scores on the Psychopathy Checklist for prisoners vs. civilians)

    • Convergence: scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, tests designed to measure the same (or similar) construct

    • Divergence (discriminant evidence): a validity coefficient showing little relationship between test scores and other variables with which scores on the test should not theoretically be correlated (see the sketch after this list)

    • Factor Analysis: a new test should load on a common factor with other tests of the same construct
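
A minimal sketch of convergent and discriminant (divergent) evidence with made-up scores: the test under validation should correlate substantially with an established test of the same construct and only weakly with a measure of a construct it should be unrelated to.

```python
import numpy as np

# Hypothetical scores for 8 people
new_test    = np.array([10, 14, 15, 18, 20, 23, 25, 30])  # test under construct validation
established = np.array([12, 15, 14, 19, 22, 24, 27, 31])  # older test of the same construct
unrelated   = np.array([ 3,  9,  4,  8,  2,  7,  5,  6])  # measure of a different construct

convergent = np.corrcoef(new_test, established)[0, 1]  # expected to be high
divergent  = np.corrcoef(new_test, unrelated)[0, 1]    # expected to be near zero

print(round(convergent, 2), round(divergent, 2))
```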
