Norms

What’s a Good Test?

  • Reliability

    • The consistency of the measuring tool; the precision with which the test measures and the extent to which error is present in measurements

  • Validity

    • The test measures what it purports to measure

  • Other Considerations

    • Administration, scoring, interpretation should be straightforward for trained examiners. A good test is a useful test that will ultimately benefit individual test takers or society at large.

Norms

  • Norm-Referenced Testing and Assessment

    • A method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker’s score and comparing it to scores of a group of test takers

    • The meaning of an individual test score is understood relative to other scores on the same test

  • Norms

    • the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores

  • A normative sample is the reference group to which test takers are compared

Sampling to Develop Norms

  • Standardization

    • the process of administering a test to a representative sample of test takers for the purpose of establishing norms

  • Sampling

    • test developers select a population, for which the test is intended, that has at least one common, observable characteristic

  • Stratified Sampling

    • sampling that includes different subgroups, or strata, from the population

  • Stratified-Random Sampling

    • stratified sampling in which members of each stratum are selected at random, so that every member has the same chance of being included in the sample

  • Purposive Sample

    • arbitrarily selecting a sample that is believed to be representative of the population

  • Incidental or Convenience Sample

    • a sample that is convenient or available for use; may not be representative of the population

      • generalization of findings from convenience samples must be made with caution
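
The stratified-random idea above can be sketched in a few lines of Python. This is only an illustration: the function name `stratified_random_sample`, the `fraction` parameter, and the urban/rural population are hypothetical and are not taken from any particular norming study.

```python
import random
from collections import defaultdict

def stratified_random_sample(population, stratum_of, fraction, seed=0):
    """Randomly draw the same fraction of members from every stratum."""
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for person in population:          # group people by their stratum label
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # random draw within the stratum
    return sample

# Hypothetical population: 60 urban and 40 rural test takers.
population = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]
sample = stratified_random_sample(population, lambda p: p[0], fraction=0.1)
```

Because the same fraction is drawn from each stratum, the sample keeps the 60/40 split of the population, which a convenience sample would not guarantee.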

Developing Norms

  • Having obtained a sample, test developers:

    • administer the test with a standard set of instructions

    • recommend a setting for test administration

    • collect and analyze data

    • summarize data using descriptive statistics including measures of central tendency and variability

    • provide a detailed description of the standardization sample itself
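
As a rough sketch of the "summarize data" step, the standard-library `statistics` module can produce the usual measures of central tendency and variability. The raw scores below are made up for illustration.

```python
import statistics

# Hypothetical raw scores from a small standardization sample.
raw_scores = [10, 14, 14, 16, 18, 19, 21]

summary = {
    "n": len(raw_scores),
    "mean": statistics.mean(raw_scores),      # central tendency
    "median": statistics.median(raw_scores),
    "mode": statistics.mode(raw_scores),
    "range": max(raw_scores) - min(raw_scores),  # variability
    "sd": statistics.stdev(raw_scores),       # sample standard deviation
}
```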

Types of Norms

  • Percentile

    • the percentage of people whose score on a test or measure falls below a particular raw score

    • a popular method for organizing test-related data because percentiles are easily calculated

  • Age Norms

    • average performance of different samples of test takers who were at various ages when the test was administered

  • Grade Norms

    • the average test performance of test takers in a given school grade

  • National Norms

    • derived from a normative sample that was nationally representative of the population at the time the norming study was conducted

  • Local Norms

    • provide normative information with respect to the local population’s performance on some test
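
A percentile norm can be computed directly from a normative sample. The sketch below assumes the "percentage scoring below" definition; some texts count half of the tied scores as well. The function name and the scores are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative sample scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical normative sample of ten raw scores.
norm_scores = [8, 10, 11, 12, 12, 13, 14, 15, 17, 18]
rank = percentile_rank(14, norm_scores)  # 6 of 10 scores fall below 14 -> 60.0
```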

Norm-Referenced vs Criterion-Referenced

  • Norm-Referenced Tests

    • involve comparing individuals to the normative group

  • Criterion-Referenced Tests

    • test takers are evaluated as to whether they meet a set standard
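
The contrast can be made concrete with one score interpreted both ways. This is a sketch under assumptions: the z score is used as one common norm-referenced index, and the norm group and the cut score of 75 are invented.

```python
import statistics

def norm_referenced(score, norm_scores):
    """z score: distance from the norm-group mean in standard-deviation units."""
    mean = statistics.mean(norm_scores)
    sd = statistics.pstdev(norm_scores)  # treats the norm group as the full reference set
    return (score - mean) / sd

def criterion_referenced(score, cut_score):
    """True if the test taker meets the set standard, regardless of other scores."""
    return score >= cut_score

norm_scores = [60, 70, 80, 90, 100]    # hypothetical normative group
z = norm_referenced(90, norm_scores)   # about 0.71 SD above the group mean
passed = criterion_referenced(90, 75)  # True: 90 meets the hypothetical cut score
```

The first result changes if the norm group changes; the second depends only on the standard.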

Culture and Inference

  • In selecting a test for use, responsible test users should research the test’s available norms to check how appropriate they are for use with the targeted test taker population.

  • When interpreting test results it helps to know about the culture and era of the test taker

  • It is important to conduct culturally informed assessment

PAP Code of Ethics

  • We develop tests and other assessment tools using current scientific findings and knowledge, appropriate psychometric properties, validation, and standardization procedures

Critiquing a Test

  • Purpose: What does the test measure overall? What assumptions is this based on?

  • Design: What are the individual constructs that it measures? What logical or theoretical assumptions underpin these constructs? What empirical basis is there for the constructs, i.e., how were they developed, and what evidence is there of this?

  • Bias, Validity, and Reliability: Examine the test for item, wording, and ordering biases. How would you test for content, construct, and predictive validity? Make a plan for one of these. Overall -- what are the main flaws and strengths of this test?

Ethics

  • Purpose: be ethical about why and how the test is given

  • Authenticity: the test must reflect “real life” psychology

  • Generalizability: when reporting results, be realistic about who these can be extended to

  • Subjectivity: be honest about how much personal judgment is included in the design of your test and results analysis

Mismatched Validity

  • Selecting assessment instruments involves complex questions

  • It is important to note that as the population, task, or circumstances change, the measures of validity, reliability, sensitivity, etc. will also tend to change

  • To determine whether tests are well-matched to the task, individual, and situation at hand, it is crucial that the psychologist ask a basic question at the outset: Why -- exactly -- am I conducting this assessment?

Confirmation Bias

  • We give preference to information that confirms our expectations

  • To help protect ourselves against this, it is useful to search actively for data that disconfirm our expectations, and to try out alternative interpretations of the available data

Understanding Standardized Tests

  • Standardized tests gain their power from their standardization.

  • When we change the instructions, or the test items themselves, or the way items are administered or scored, we depart from that standardization and our attempts to draw on the actuarial base become questionable

  • The professional conducting the assessment must be alert to situational factors, how they can threaten the assessment’s validity, and how to address them effectively

Perfect Conditions Fallacy

  • Especially when we’re hurried, we like to assume that “all is well,” that in fact “conditions are perfect”

  • If we don’t check, we may not discover that the person we’re assessing for a job, a custody hearing, a disability claim, a criminal case, asylum status, or a competency hearing took standardized psychological tests and completed other phases of formal assessment under conditions that significantly distorted the results

Financial Bias

  • a financial conflict of interest can subtly -- and sometimes not so subtly -- affect the ways in which we gather, interpret, and present even the most routine data

Ignoring Effects of Audio-Recording, Video-Recording, or the Presence of Third Party Observers

  • Ignoring these potential effects can create an extremely misleading assessment. Part of adequate preparation for an assessment that will involve recording or the presence of third parties is reviewing the relevant research and professional guidelines.

Uncertain Gatekeeping

  • Psychologists who conduct assessments are gatekeepers of sensitive information that may have profound and lasting effects on the life of the person who was assessed.

Theories in Psychological Testing

Classical Test Theory

  • main purpose is to recognize and develop the reliability of psychological tests and assessment

  • An individual’s observed score is the sum of a true score and an error score

X = T + E

  • The true component is due to true differences among persons, and the error part is an aggregate of variation due to sources of errors
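
A small simulation can illustrate X = T + E. It assumes, as CTT conventionally does, that error is random and unrelated to the true score, so that reliability can be read as the share of observed-score variance due to true scores. The means and standard deviations below are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical true scores (SD 10) and independent random errors (SD 5).
true_scores = [rng.gauss(50, 10) for _ in range(n)]
errors = [rng.gauss(0, 5) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

# Reliability as the proportion of observed variance that is true variance.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
```

With these settings the expected value is 100 / (100 + 25) = 0.80; the simulated figure should land close to that.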

Generalizability Theory

  • extends CTT by providing a framework for increasing measurement precision by estimating different sources of variations unique to particular testing or measurement conditions

Item Response Theory or Latent Traits Theory

  • to generate items that provide the maximum amount of information possible about the ability or trait levels of test takers who respond to them in one fashion or another

  • to give test takers items that are tailored to their ability or trait levels

  • to reduce the number of items needed to pinpoint any given test taker’s standing on the ability or latent trait while minimizing measurement error

  • reducing the number of items in a test
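
To illustrate item information and tailoring, the sketch below assumes a two-parameter logistic (2PL) model, which these notes do not specify. The item bank and its discrimination (a) and difficulty (b) values are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def most_informative_item(theta, items):
    """Name of the item that is most informative at the current trait estimate."""
    return max(items, key=lambda name: item_information(theta, *items[name]))

# Hypothetical item bank: name -> (discrimination a, difficulty b).
items = {"easy": (1.0, -1.5), "medium": (1.0, 0.0), "hard": (1.0, 1.5)}
next_item = most_informative_item(1.4, items)  # "hard": its difficulty is closest to 1.4
```

An item tells us most about test takers whose trait level is near its difficulty, which is why a tailored test can use fewer items.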

CTT vs IRT on a Matter of Test Length and Reliability

  • In terms of split-half reliability and the Spearman-Brown formula, CTT holds that, all other things being equal, a larger number of observations will produce more reliable results than a smaller number of observations

  • In IRT, item selection is optimally suited to each test taker’s level on the trait being assessed
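
The CTT claim about test length can be checked with the Spearman-Brown formula, r_new = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened. The function name and the correlations used are illustrative only.

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by n."""
    return n * r / (1 + (n - 1) * r)

# A half-test correlation of 0.70 stepped up to the full-length test.
full_length = spearman_brown(0.70, 2)   # about 0.82

# Halving a test whose reliability is 0.80 lowers the predicted reliability.
half_length = spearman_brown(0.80, 0.5)  # about 0.67
```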
