Literally every term that I had in my notes for AP Stats. Doesn't have formulas, just terms and definitions.
statistics
science of data, information gathered into useful figures
individuals
one singular object described by a set of data
variables
any characteristics measured or collected in a data set
categorical variable
places an individual into a category or group
quantitative variable
measures a specific numerical value that can be used for analysis
marginal distributions
the frequency distribution of one categorical variable among all individuals in a two-way table (the row or column totals)
conditional distributions
gives the frequency distribution of one variable among only the individuals in a specified group (one row or column of a two-way table)
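A small sketch of marginal and conditional distributions, assuming a made-up two-way table (the category names and counts are hypothetical, not from the notes):

```python
# Hypothetical two-way table: rows = grade level, columns = preferred subject.
table = {
    "junior": {"math": 30, "english": 20},
    "senior": {"math": 10, "english": 40},
}
subjects = ["math", "english"]

total = sum(sum(row.values()) for row in table.values())  # 100 individuals

# Marginal distribution of subject: column totals divided by the grand total.
marginal_subject = {
    s: sum(row[s] for row in table.values()) / total for s in subjects
}

# Conditional distribution of subject given grade == "senior":
# that row's counts divided by that row's total.
senior_total = sum(table["senior"].values())
conditional_senior = {s: table["senior"][s] / senior_total for s in subjects}

print(marginal_subject)    # {'math': 0.4, 'english': 0.6}
print(conditional_senior)  # {'math': 0.2, 'english': 0.8}
```

The two distributions differ here, which is what an association between the variables looks like in a table.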
association
relationship between two variables, knowing the value of one variable helps predict the value of the other
SOCS
shape, outliers, center, spread
shape
graph's shape, whether it's symmetric or skewed left/right
outliers
any unusual data that doesn't fit the overall pattern; by the 1.5 × IQR rule, values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)
center
described by mean or median, use median unless the data is symmetric
spread
variability in data, described by range, IQR, or standard deviation
unimodal
one mode
bimodal
two modes
multimodal
multiple modes
uniform
no distinct mode
resistance
how much a measure is influenced by extreme values, median is resistant, mean, s.d. and range are not
five-number summary
minimum, first quartile, median, third quartile, maximum
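A minimal sketch of the five-number summary, together with the 1.5 × IQR outlier check from the outliers card. It assumes the median-of-halves way of finding quartiles, which is one common convention (calculators and software may use others); the data and the function name `five_number_summary` are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the median position
    upper = data[half + n % 2:]   # skip the middle value when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

data = [2, 4, 5, 7, 8, 9, 10, 12, 30]  # made-up data

low, q1, med, q3, high = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print((low, q1, med, q3, high))  # (2, 4.5, 8, 11.0, 30)
print(outliers)                  # [30]
```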
standard deviation
typical (roughly average) distance of a value from the mean
variance (s^2)
average squared distance from the mean, the square of the standard deviation
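A quick sketch of variance and standard deviation using Python's standard `statistics` module, with made-up data. The "average squared distance" wording matches dividing by n (the population versions); the sample versions divide by n - 1 instead.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

center = mean(data)

# Population versions: divide the sum of squared deviations (32) by n = 8.
pop_var = pvariance(data)   # equals 4
pop_sd = pstdev(data)       # equals 2, the square root of the variance

# Sample versions: divide by n - 1 = 7 instead, so they come out a bit larger.
samp_var = variance(data)   # 32 / 7
samp_sd = stdev(data)

print(center, pop_var, pop_sd)
print(round(samp_var, 3), round(samp_sd, 3))
```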
skew
data skewed to the right have the "tail" of the data on the right and vice versa
z-score
how many standard deviations you are away from the mean
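As a worked example of the z-score, assuming a hypothetical test with mean 70 and standard deviation 8 (the helper name `z_score` is made up):

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Hypothetical test scores: mean 70, standard deviation 8.
print(z_score(86, 70, 8))  # 2.0  (two standard deviations above the mean)
print(z_score(66, 70, 8))  # -0.5 (half a standard deviation below the mean)
```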
percentiles
the nth percentile is the value with n% of the observations at or below it
adding/subtracting numbers from data
changes the mean and median but not the standard deviation, range, or IQR
multiplying/dividing data by numbers
changes range, mean, median, standard deviation
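The two cards above can be checked numerically. This is a small sketch with made-up data; `data_range` is a hypothetical helper.

```python
from statistics import mean, median, pstdev

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

data = [3, 5, 8, 10, 14]           # made-up data
shifted = [x + 10 for x in data]   # add 10 to every value
scaled = [x * 2 for x in data]     # multiply every value by 2

for name, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(name, mean(values), median(values),
          data_range(values), round(pstdev(values), 3))
```

Shifting moves the mean and median by 10 and leaves the range and standard deviation alone; scaling by 2 doubles all four.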
measures of center
mean, median (quartiles and percentiles describe position, not center)
measures of spread
standard deviation, range, IQR
density curve
a curve on or above the horizontal axis, total area underneath is equal to one, 100% of observations
normal distribution
shown by a normal density curve with the mean, median, and mode at the center of the curve
68-95-99.7 rule/empirical rule
68% of values within one standard deviation of the mean, 95% within two, 99.7% within three
way to describe normal distribution
N(μ, σ) a.k.a. (mean, standard deviation)
standard normal probabilities table
a table of areas under the standard normal curve, table entry for each value of z is the area under the curve to the left of the z-score
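In place of a printed table, the same areas can be looked up in code. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available, and also checks the 68-95-99.7 rule.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# A table entry is the area under the curve to the left of z.
print(round(std_normal.cdf(1.00), 4))   # 0.8413
print(round(std_normal.cdf(-1.96), 4))  # 0.025

# Area within k standard deviations of the mean: left area at k minus left area at -k.
for k in (1, 2, 3):
    area = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, round(area, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```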
central limit theorem
when n is large (a common rule of thumb is n greater than or equal to 30), the sampling distribution of the sample mean is approximately normal, whatever the shape of the population
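A simulation sketch of the central limit theorem. The population here is an assumption chosen for illustration: an exponential distribution with mean 1 and standard deviation 1, which is strongly right-skewed. The seed, sample size, and repetition count are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

n = 30        # size of each sample
reps = 5000   # number of samples drawn

# Record the mean of each sample drawn from the skewed population.
sample_means = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

center = mean(sample_means)   # should be close to the population mean, 1
spread = stdev(sample_means)  # should be close to 1 / sqrt(30)

# For a roughly normal distribution, about 68% of values fall within one s.d.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / reps

print(round(center, 3))
print(round(spread, 3), round(1 / sqrt(n), 3))
print(round(within_one_sd, 3))
```

Even though single observations are skewed, the sample means pile up symmetrically around 1.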
bivariate data
quantitative data that has two variables, often represented w/ a scatterplot
explanatory variables
independent variable, used to explain or to predict changes in values of another variable
response variable
dependent variable, measures the outcome in response to the explanatory variable
correlation
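measures the direction and strength of the linear relationship between two quantitative variables; the r-value ranges from -1 to 1, negative for a downward "slope" and positive for an upward one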
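A minimal sketch of computing r by hand from deviations around the means. The data (hours studied vs. quiz score) and the function name `correlation` are made up for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]   # made-up explanatory values
score = [2, 4, 5, 4, 5]   # made-up response values

print(round(correlation(hours, score), 4))  # 0.7746
print(correlation([1, 2, 3], [6, 4, 2]))    # -1.0 (perfect negative line)
```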
DOFS
direction, outliers, form, strength - all things that should be addressed when describing the relationship between two quantitative variables on a scatterplot
regression line
a line describing how the response variable changes as the explanatory variable changes
least-squares regression
the line that makes the sum of the squared residuals as small as possible
extrapolation
a pitfall of statistical prediction, use of a regression line to predict values outside the data interval
residual
the difference between the observed value of the response variable and the value predicted by the regression, outliers have a large residual
influential point
any point that, if removed, changes the relationship/regression significantly, a regression will change up or down if you remove/replace influential points
LSRL
least squares regression line
residual plot
a graph showing the residuals on the vertical axis and the explanatory variable on the horizontal axis, a random scatter with no pattern suggests a linear model is appropriate
coefficient of determination
r^2, tells what percent of the variation in the response variable is explained by the regression line
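The regression cards above (least-squares line, residual, r^2) can be tied together in one small sketch. The data and the helper name `least_squares_line` are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

hours = [1, 2, 3, 4, 5]   # made-up explanatory values
score = [2, 4, 5, 4, 5]   # made-up response values

slope, intercept = least_squares_line(hours, score)

# Residual = observed value minus predicted value.
predicted = [intercept + slope * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(score, predicted)]

# r^2 = 1 - (unexplained variation) / (total variation in the response).
y_bar = sum(score) / len(score)
sse = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in score)
r_squared = 1 - sse / sst

print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
print([round(e, 2) for e in residuals])      # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(round(r_squared, 2))                   # 0.6
```

Here the line explains about 60% of the variation in the scores, and the residuals sum to zero, as they always do for a least-squares line.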
linear transformations
preserve linear relationship, e.g. adding, subtracting, multiplying, or dividing by a constant
non-linear transformations
don't preserve linear relationship, e.g. roots, exponents, logarithms
goal of transformations
to make the relationship between the variables more linear so a regression line fits better
procedure for transformations
conduct standard regression
construct residual plot; transform the data if the plot has a pattern
evaluate r^2
choose transformation method
transform one or both variables
conduct another regression analysis
find the new r^2 and it should be higher than the original r^2
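A sketch of the procedure above on made-up data that grows exponentially, assuming a log transformation of the response is the method chosen. The helper `r_squared` is hypothetical and simply squares the correlation.

```python
from math import log10

def r_squared(xs, ys):
    """Square of the correlation between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

x = [0, 1, 2, 3, 4, 5, 6]
y = [3 * 2 ** xi for xi in x]   # made-up exponential growth: 3, 6, 12, ...

# Standard regression on the raw data: the curve hurts the fit.
r2_original = r_squared(x, y)

# Transform the response with a logarithm, then regress again.
log_y = [log10(v) for v in y]
r2_transformed = r_squared(x, log_y)

print(round(r2_original, 2))     # 0.77
print(round(r2_transformed, 2))  # 1.0
```

Because log(3 * 2^x) is exactly linear in x, the transformed data fall on a straight line; real data would only get closer to one.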
population
the entire collection of objects or individuals about which information is desired
sample
a subset of the population being studied
census
a survey that collects information from every member of a population
observational study
observe and measure variables
experiment
manipulate variables & see results
inferential statistics
statistical data from a sample that are used to draw conclusions about the entire population
things well-designed surveys do
define the population, tell what the researcher wants to measure, show how members of the sample are chosen, represent the population accurately in the sample, are free from bias
sampling errors
biased design, convenience sample, voluntary response, undercoverage, nonresponse, response bias
bias
consistently over- or underestimating the value being measured in a survey, low bias corresponds to high accuracy
convenience
choosing only individuals for a survey who are easy to access
voluntary response
when sample members are allowed to volunteer
undercoverage
when members of the population are represented inadequately in a sample
nonresponse
when an individual selected for the sample can't be contacted or refuses to participate
representative sampling
a group or set chosen to replicate characteristics of a larger population
random sample
a group or set chosen in a random manner that allows for each member of the population to have an equal chance at being selected
SRS
simple random sample, every individual has the same chance of being selected and every possible sample has the same chance of being selected
stratified random sampling
divide population into subgroups called strata then conduct SRS from each subgroup
systematic sampling
random starting point, then every kth individual after that; not a true SRS because not every possible sample is equally likely
cluster sampling
divide the population into groups (clusters), randomly select whole clusters, and sample every individual in the chosen clusters
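A side-by-side sketch of the four sampling methods above, using Python's `random` module. The population (12 students in 3 homerooms) is hypothetical, and the sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 12 students grouped into 3 homerooms.
population = {
    "room_A": ["A1", "A2", "A3", "A4"],
    "room_B": ["B1", "B2", "B3", "B4"],
    "room_C": ["C1", "C2", "C3", "C4"],
}
everyone = [s for room in population.values() for s in room]

# SRS: every group of 6 students is equally likely to be the sample.
srs = random.sample(everyone, 6)

# Stratified: treat each homeroom as a stratum and take an SRS of 2 from each.
stratified = [s for room in population.values() for s in random.sample(room, 2)]

# Cluster: randomly pick 2 whole homerooms and take everyone in them.
chosen_rooms = random.sample(list(population), 2)
cluster = [s for room in chosen_rooms for s in population[room]]

# Systematic: random starting point, then every 3rd student on the list.
k = 3
start = random.randrange(k)
systematic = everyone[start::k]

print(srs)
print(stratified)
print(cluster)
print(systematic)
```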
experiments
observe responses to variables, administer a treatment to observe response, attempt to determine causation
observational studies
only observe responses, don't attempt to influence responses
key aspects of experimental design
replication, randomization, control
replication
assigning treatment to many experimental units to lower variability
randomization
using chance to assign subjects to groups, e.g. with a random number table or random number generator
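A minimal sketch of random assignment with a random number generator, assuming 10 hypothetical subjects split evenly between a treatment group and a control group:

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

subjects = [f"subject_{i}" for i in range(1, 11)]  # hypothetical subjects

# Put the subjects in a random order, then split the list in half.
shuffled = random.sample(subjects, k=len(subjects))
treatment = shuffled[:5]
control = shuffled[5:]

print(treatment)
print(control)
```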
control
control for confounding variables or outside influences
blinding
not telling subjects or researcher what treatments subjects are receiving to eliminate bias
sample survey
analyzes data from a subset of a population, can be costly & time-consuming
cross-sectional study
analyzes data from a certain point in time, focused type of sample survey
blocking
method of dividing subjects into subgroups called blocks so the variability within blocks is less than the variability between blocks, controls for variables you know (randomization controls for variables you don't know)
matched pairs design
experimental method where subjects are grouped into pairs based on a blocking variable, one is control and one is treatment
lurking variable
variables other than independent/dependent variables that may affect experimental outcomes
placebo effect
a subject's positive response to receiving a placebo when no treatment has actually been applied
confounding variables
variables related to the explanatory variable that also affect the response variable, so their effects can't be separated from the explanatory variable's
causation
cause and effect relationship between or among variables, experiments can determine causation
ethical research
participants must give informed consent, data collected must be private if it's personal info, risks to participants must be minimized
simpson's paradox
a paradox in statistics in which a trend appears in different groups of data but is reversed when those groups are combined
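A numeric sketch of the paradox, using illustrative counts of successes out of attempts for two treatments, split by case difficulty. The group and treatment labels are hypothetical.

```python
# Illustrative counts: (successes, attempts) for each treatment within each group.
results = {
    "easy cases": {"A": (81, 87), "B": (234, 270)},
    "hard cases": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, attempts):
    return successes / attempts

# Within each group, treatment A has the higher success rate.
for group, by_treatment in results.items():
    print(group, {t: round(rate(*counts), 3) for t, counts in by_treatment.items()})

# Combine the groups by adding up successes and attempts per treatment.
combined = {}
for t in ("A", "B"):
    successes = sum(results[g][t][0] for g in results)
    attempts = sum(results[g][t][1] for g in results)
    combined[t] = rate(successes, attempts)

# Overall, treatment B now looks better: the trend has reversed.
print({t: round(r, 3) for t, r in combined.items()})  # {'A': 0.78, 'B': 0.826}
```

The reversal happens because A was mostly used on hard cases and B mostly on easy ones, so case difficulty is mixed up with the treatment comparison.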
probability
the likelihood that an event will occur, the mathematics of chance
law of large numbers
as we observe more and more repetitions of any chance process, the proportion of times a specific outcome occurs approaches a single value
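A simulation sketch of the law of large numbers, assuming a fair coin. The seed and the checkpoints are arbitrary.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

checkpoints = [10, 100, 1_000, 10_000, 100_000]
heads = 0
flips = 0
proportions = {}

for target in checkpoints:
    # Keep flipping until we reach the next checkpoint.
    while flips < target:
        heads += random.random() < 0.5   # True counts as 1
        flips += 1
    proportions[target] = heads / flips

for n, p in proportions.items():
    print(n, round(p, 4))
```

The proportion of heads can wander early on but settles near 0.5 as the number of flips grows.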
empirical probability
probability estimated from the results of actual trials
theoretical probability
probability calculated with a formula, not based on observed sample
independent event
when the occurrence of one event does not change the probability that the other event will happen
probability rules
always between zero and 1
the probabilities of all possible outcomes of a trial must sum to 1
complement rule: the probability of an event occurring is one minus the probability that it doesn't occur
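The three rules above can be checked on a simple model. This sketch assumes a fair six-sided die and uses exact fractions.

```python
from fractions import Fraction

# Assumed model: a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Rule 1: every probability is between 0 and 1.
all_valid = all(0 <= p <= 1 for p in die.values())

# Rule 2: the probabilities of all outcomes sum to 1.
total = sum(die.values())

# Complement rule: P(not six) = 1 - P(six).
p_six = die[6]
p_not_six = 1 - p_six

print(all_valid, total)  # True 1
print(p_not_six)         # 5/6
```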
event
any outcome or collection of outcomes that is a subset of the sample space
determining the number of outcomes in a sample space
raise the number of outcomes in one trial to the power of the total number of trials (when every trial has the same number of outcomes)
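A quick check of that counting rule by listing the sample space outright, assuming 3 coin flips and 2 die rolls as the examples:

```python
from itertools import product

# Three coin flips: 2 outcomes per flip, 3 flips.
coin = ["H", "T"]
coin_space = list(product(coin, repeat=3))
print(len(coin_space), len(coin) ** 3)  # 8 8

# Two die rolls: 6 outcomes per roll, 2 rolls.
die = range(1, 7)
dice_space = list(product(die, repeat=2))
print(len(dice_space), 6 ** 2)          # 36 36
```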
simulation
a random process of numerous trials used to estimate probability and imitate chance behavior
steps to conduct a simulation
state the problem
state the assumptions
describe the process for one repetition, including possible outcomes, assigned representations, & measured variables
simulate many repetitions
state conclusions
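The steps above can be sketched in code. The problem chosen here is an assumption for illustration: estimating the probability of rolling at least one six in four rolls of a fair die. The function name, seed, and repetition count are arbitrary.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

# Problem: estimate P(at least one six in four rolls of a die).
# Assumptions: the die is fair and the rolls are independent.

def at_least_one_six(rolls=4):
    """One repetition: roll the die `rolls` times, report whether any roll is a 6."""
    return any(random.randint(1, 6) == 6 for _ in range(rolls))

# Simulate many repetitions and record how often the event happens.
reps = 20_000
hits = sum(at_least_one_six() for _ in range(reps))
estimate = hits / reps

# Conclusion: compare the estimate with the exact answer from the complement rule.
exact = 1 - (5 / 6) ** 4

print(round(estimate, 3), round(exact, 3))
```

The estimate should land close to the exact value of about 0.518, and more repetitions tighten it further.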
complement
the event that a given event does not occur, its probability is one minus the probability of the event
disjoint/mutually exclusive
two events have no outcomes in common
conditional probability
the probability of an event given another event has occurred
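A worked sketch of conditional probability by counting outcomes, assuming two rolls of a fair die. It uses the relationship P(A given B) = P(A and B) / P(B); the event and helper names are made up.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (first roll, second roll) pairs.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a true/false test on one outcome."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def sum_is_8(r):
    return r[0] + r[1] == 8

def first_is_even(r):
    return r[0] % 2 == 0

p_b = prob(first_is_even)                                     # 18/36
p_a_and_b = prob(lambda r: sum_is_8(r) and first_is_even(r))  # 3/36
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)     # 1/6
print(prob(sum_is_8))  # 5/36
```

Since P(sum is 8 given first roll is even) differs from P(sum is 8), these two events are not independent.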