stats exam one

design
plan how to obtain the data
description
summarize the data with graphs and numerical summaries
inference
use data from a random and representative sample to draw conclusions about the population of interest
representative sample
random
two basic types of statistical inference
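confidence intervals and significance tests
confidence intervals
a range of plausible values for a population parameter, computed from the sample and reported with a confidence level (for example, 90% or 95%)
significance tests
used to judge whether the data provide enough evidence to reject a hypothesis about the population; a test can reject a hypothesis but cannot prove one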
subjects
persons, animals, or objects in our study/experiment
variables
the characteristics that we measure on each subject; they can take on different values for each individual
population
all subjects of interest
sample
subjects for whom we have data
random sampling
each member of the population has the same chance of being included in the sample; such samples tend to be representative of the population
parameters
numerical summary of the population
statistics
numerical summary of the sample
categorical variables
place each observation into a group; usually summarized by the percentage of observations in each group; displayed with pie charts and bar charts
quantitative variables
take on numerical values
discrete variables
take values from a set of separate numbers, such as 0, 1, 2, 3, ... (typically counts)
continuous variables
have an infinite continuum of possible values that form an interval, even though we are sometimes limited in our ability to measure them
measures of center
mean, median, mode
mean
the average of all observations; denoted x̅
median
the observation right in the middle; denoted M; 50% of observations lie on either side
mode
the most frequently occurring value; the x value or category with the highest bar
measures of variability
measures of spread or dispersion
range
max - min
variance and standard deviation
measures of spread around the mean; particularly useful for bell-shaped and symmetric distributions
variance
the average squared deviation from the mean; denoted s^2; its units are those of the original data squared, so we take the square root before interpreting
standard deviation
the square root of the variance; denoted s; its units are the same as those of the original data
n
number of observations in the data set
x1,x2,x3....,xn
first, second, third,...., last observation
Σ
summation notation
mean
x̅ = (x1 + x2 + ... + xn)/n
variance
s^2 = Σ (xi - x̅)^2 / (n - 1)
standard deviation
s = √(s^2)
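A minimal sketch of the mean, variance, and standard deviation formulas above, written out by hand in Python. The data values and the function names are made up for illustration; the standard library's `statistics` module is used only as a cross-check.

```python
import math
import statistics

def mean(xs):
    # x-bar = (x1 + x2 + ... + xn) / n
    return sum(xs) / len(xs)

def variance(xs):
    # s^2 = sum of squared deviations from the mean, divided by n - 1
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    # s = square root of the variance
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(mean(data), variance(data), std_dev(data))

# The hand-written versions should agree with the standard library,
# which also divides by n - 1 for sample variance.
print(math.isclose(variance(data), statistics.variance(data)))
print(math.isclose(std_dev(data), statistics.stdev(data)))
```

Dividing by n - 1 rather than n is what makes this the sample variance described on the cards.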
why n-1 in denominator?
it represents the number of degrees of freedom: the number of independent quantities we are adding up in the numerator
interpreting the standard deviation
the larger the standard deviation, the more spread out the data set is; s can never be negative; s can only be zero if there is no variability in the data; s is strongly affected by outliers; s works best for bell-shaped and symmetric distributions
empirical rule
in any bell-shaped and symmetric distribution, approximately 68% of observations fall within one standard deviation of the mean, 95% within two, and 99.7% within three
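As a rough check of the 68 / 95 / 99.7 figures, the sketch below computes the exact proportions for a normal distribution, which is the idealized bell-shaped case. It uses `statistics.NormalDist` from the Python standard library (3.8+); the helper name `within` is an illustrative assumption.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal curve

def within(k):
    # Proportion of a normal distribution within k standard deviations of the mean
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} stdev: {within(k):.4f}")
```

The results come out close to 0.68, 0.95, and 0.997, which is why the rule is stated as an approximation.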
lower quartile (q1)
25% of observations below it, 75% above; the 25th percentile
median quartile (q2)
50% below, 50% above; the 50th percentile
upper quartile (q3)
75% below, 25% above; the 75th percentile
q1 will be
the median of the lower half
q3 will be
the median of the upper half
interquartile range
measures the spread of the central 50% of the data; IQR = q3 - q1
five number summary of positions
minimum, q1, median, q3, maximum
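The quartile cards above ("median of the lower half", "median of the upper half") can be sketched as follows. One assumption is labeled in the code: when n is odd, the overall median is left out of both halves. Other conventions exist and software packages may give slightly different quartiles. The data values are made up.

```python
from statistics import median

def five_number_summary(xs):
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # Assumption: for odd n, exclude the middle observation from both halves
    upper = s[half + 1:] if n % 2 else s[half:]
    return min(s), median(lower), median(s), median(upper), max(s)

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # hypothetical observations

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # spread of the central 50% of the data

print(minimum, q1, med, q3, maximum)
print("IQR:", iqr)
```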
boxplots
graphs based on the five number summary; the box contains the central 50% of the data; the line crossing the box represents the median; the whiskers extend to the max and min
leaf unit
in a stem-and-leaf display, the first column gives the cumulative counts from the top and bottom, and the line with the parentheses contains the median; the second column contains the stems and the rest contain the leaves
0.10
decimal between stem and leaf; example: stem 2 with leaves 01 means 2.0 and 2.1
1.0
no decimal; example: stem 2 with leaves 01 means 20 and 21
10.0
add a zero after the leaf; example: stem 2 with leaves 01 means 200 and 210
0.01
move the decimal to the left of the stem; example: stem 2 with leaves 01 means 0.20 and 0.21
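The four leaf-unit cards above follow one rule: glue the stem and leaf together as digits, then multiply by the leaf unit. A small sketch, with `decode_row` as a hypothetical helper name:

```python
def decode_row(stem, leaves, leaf_unit):
    # Each value = (stem * 10 + leaf) * leaf_unit.
    # `leaves` is a string of single-digit leaves, e.g. "01".
    # Rounding only hides floating-point noise such as 2.1000000000000001.
    return [round((stem * 10 + int(leaf)) * leaf_unit, 10) for leaf in leaves]

for unit in (0.1, 1.0, 10.0, 0.01):
    print(unit, decode_row(2, "01", unit))
```

Running it reproduces the examples on the cards: 2.0 and 2.1, 20 and 21, 200 and 210, 0.20 and 0.21.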
scatterplots
plot of y vs x; two quantitative variables measured on the same individual
x
explanatory variable (independent variable)
y
response variable (dependent variable)
interpreting scatterplots
direction: positive or negative? linear trend? how strong? any outliers?
correlation
summarizes the direction and strength of the straight line relationship between x and y
variables and correlation
two variables have the same correlation regardless of which one is called the explanatory or the response
r
symbol to represent correlation coefficient
r is always between
-1 and +1 with no units
interpretation of correlation
positive/negative strong/weak
do outliers have a strong effect on r?
yes
straight lines
all points are exactly on the line; the line extends forever in both directions; equation: y = mx + b
regression line
the points are scattered; the regression line is the line that best fits through the middle of the points; predicted values are ŷ
what is a regression line used for?
to predict the response variable y for a particular value of x
regression equation
ŷ = a + bx
interpreting the regression line
b is the slope: the average or predicted change in y for a one-unit change in x; a is the y-intercept: the point where the regression line crosses the y axis, i.e. the predicted value of y when x = 0; the intercept is needed to complete the equation, but we only interpret it if x = 0 makes sense and is close to the observed values of x
residuals
the prediction error for each observation; the vertical distance from the point to the line
residual equation
the difference between the observed and predicted values of y: y - ŷ
least squares regression method
the line goes through the middle of the points; the sum of the residuals is zero, and the line minimizes the sum of squared residuals (prediction errors)
least squares regression line passes through the point
(x̅, y̅)
least squares regression formulas
ŷ = a + bx; b = r(sy/sx); a = y̅ - b(x̅)
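A minimal sketch of the least squares formulas above, using a small made-up data set. The cards do not give a formula for r; the code uses the standard definition r = Σ(x - x̅)(y - y̅) / ((n - 1) sx sy), and the function name `least_squares` is an illustrative assumption.

```python
import math

def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator)
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation coefficient (standard definition, not stated on the cards)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b, r

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses

a, b, r = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # observed minus predicted

print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")
print("sum of residuals:", round(sum(residuals), 10))
print("prediction at x-bar:", round(a + b * (sum(xs) / len(xs)), 10))
```

The last two lines illustrate two properties from the cards: the residuals sum to zero, and the line passes through the point of means.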
extrapolation
predicting outside the observed range using the regression equation
influential outliers
points that have an x value far away from the rest and fall far from the trend that the rest of the data follow
correlation does not imply
causation
graphs for categorical variables
bar charts and pie charts
contingency tables
both the explanatory and response variables are categorical; the table displays counts and frequencies
conditional proportions
find percentages by dividing each cell count by the total number of observations in their group
Simpson's paradox
the direction of an association between two categorical variables can be reversed if we include a third variable and re-analyze the data
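The reversal is easier to see with numbers. The sketch below uses illustrative counts (successes, total) for two treatments, split by a third variable with two levels; the counts and names are assumptions chosen to show the effect, not data from these notes.

```python
# Illustrative counts: (successes, total) for each treatment within each level
# of a third variable (here labeled "small" and "large" cases).
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(treatment):
    # Ignore the third variable: pool the counts across its levels
    successes = sum(s for s, _ in counts[treatment].values())
    total = sum(t for _, t in counts[treatment].values())
    return successes / total

for level in ("small", "large"):
    a_rate = rate(*counts["A"][level])
    b_rate = rate(*counts["B"][level])
    print(f"{level}: A={a_rate:.3f}  B={b_rate:.3f}")

print(f"overall: A={overall_rate('A'):.3f}  B={overall_rate('B'):.3f}")
```

Within each level A has the higher success rate, yet B looks better once the levels are pooled, because A was applied mostly to the harder "large" cases.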
relative risk
the ratio of two proportions
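A short sketch tying together contingency tables, conditional proportions, and relative risk. The 2x2 table is hypothetical, as are the function and variable names.

```python
# Hypothetical contingency table: rows are the explanatory variable,
# columns are the response variable, cells are counts.
table = {
    "treatment": {"event": 10, "no event": 90},
    "placebo": {"event": 20, "no event": 80},
}

def conditional_proportion(table, row, outcome):
    # Cell count divided by the total number of observations in that row's group
    counts = table[row]
    return counts[outcome] / sum(counts.values())

p_treatment = conditional_proportion(table, "treatment", "event")
p_placebo = conditional_proportion(table, "placebo", "event")

# Relative risk: the ratio of the two conditional proportions
relative_risk = p_placebo / p_treatment

print(p_treatment, p_placebo, relative_risk)
```

With these counts the event is twice as likely in the placebo group as in the treatment group.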
statistical inference
a statement about the population based on a random and representative sample and includes a measure of how confident we are in the statement
experiments
the researcher assigns subjects to certain experimental treatments
observational studies
researcher does nothing to the subjects but observe x and y
random samples (probability samples)
GOOD, subjects chosen by chance, so they are REPRESENTATIVE of the entire population of interest
simple random sample (srs)
every set of n individuals has an equal chance to be the sample actually selected
sampling frame
a list of individuals from whom the sample is drawn
non probability samples
rely on easy and inexpensive ways to collect data, such as apps, social media, or people with accounts on websites; biased because the sample is not representative of the population
biased samples
bad, systematically favor certain outcomes so they are not representative of the population of interest
examples of biased samples
volunteer samples, such as polls, questionnaires, and call-ins; convenience samples, such as those taken in classes or public settings
sample surveys
personal interview, telephone interview, questionnaires
margin of error equation
approximately 1/√n, where n is the sample size
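A quick sketch of how the approximate margin of error shrinks as the sample size grows. The function name and the sample sizes are illustrative assumptions.

```python
import math

def approx_margin_of_error(n):
    # Rough margin of error for a sample of size n: 1 / sqrt(n)
    return 1 / math.sqrt(n)

for n in (100, 400, 1000, 2500):
    print(n, f"{approx_margin_of_error(n):.1%}")
```

Because of the square root, quadrupling the sample size only halves the margin of error.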
sources of potential bias in sample surveys
undercoverage, nonresponse bias, response bias, wording of questions
undercoverage
a sampling scheme that biases the sample in a way that gives a part of the population less representation than it has in the population
nonresponse bias
bias introduced to a sample when a large fraction of those sampled fails to respond
response bias
tendency of subjects to systematically respond to a stimulus in a particular way
wording of questions
confusing or leading questions can introduce bias
response variable
variable that we measure, so we can draw conclusions about it
experimental units
individuals (subjects) involved in the experiment
treatments
experimental conditions given to the subjects
control of variability
to avoid lurking variables and confounding effects, make sure the conditions are as similar as possible for all groups, except for the factors being studied
comparative experiment
compare two or more groups to eliminate confounding
placebo
a "dummy" treatment that can have no physical effect
placebo effect
people really get better with a dummy treatment
control group
the group that receives the placebo; helps determine the true effect of the treatment
blind study
when the subjects don't know if they are receiving the treatment or a placebo
double blind study
neither the subject nor the person administering the treatment knows whether it is the treatment or the placebo
random samples
use a mechanical method to select subjects and assign them to treatments
advantages of using random samples
can use probability to analyze the results; avoids sample bias
replications increase
confidence in conclusion
number of replications
number of experimental units that get each treatment
statistically significant differences
differences not due to chance; differences between two or more treatments are called significant if they are too large to be attributed to chance
multifactor experiments
experiments that have more than one factor, each with several levels
factors
categorical explanatory variables in an experiment
matched pair designs
a more advanced form of control of variability, used when there are two treatment groups
matched experimental units
each unit is matched with another one on every possible confounder you can think of; one unit from each pair is randomly assigned to each treatment, and the two are compared only against each other
cross over design
each person can serve as their own perfect match, and receive two treatments in random order
blocked design
the same idea as matched pairs, extended to three or more treatments; each set of matched experimental units is called a block
cross sectional studies
sample surveys that just want to take a snapshot of the population at the current time
case control studies
retrospective studies in which we match each case with a control and then ask questions about the explanatory variable
prospective studies
forward looking, and follow subjects into the future
institutional review board (IRB)
a group that oversees experiments to ensure that the risks to human subjects are minimal to none and that the benefits exceed the risks; required for human subject experiments
random phenomenon
we cannot predict the next outcome; however, after many outcomes a predictable pattern appears: order that only emerges in the long run
probability of an outcome
the proportion of time the outcome would occur in a very long series of independent trials
independent trials
the outcome of any one trial is not affected by the outcome of any other
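The "long series of independent trials" idea can be sketched with a simulated fair coin. The function name, the seed, and the trial counts are illustrative assumptions; a fixed seed is used only so that runs are repeatable.

```python
import random

def proportion_of_heads(n_trials, seed=0):
    # Each flip is an independent trial with probability 0.5 of heads
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_trials))
    return heads / n_trials

for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```

For small n the proportion can be far from 0.5; as n grows it settles near 0.5, which is the sense in which probability is a long-run proportion.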
subjective probabilities
sometimes necessary when there is no "long series" of independent trials; determined using all the information available
sample space
the set of all possible outcomes
event
an outcome or group of outcomes; a subset of the sample space
probabilities are between
0 and 1; the probabilities of all outcomes in the sample space must add up to 1
complement
the rest of the sample space besides event a
intersection
the overlap of two events where outcomes that are in both a and b are listed
union
consists of all outcomes that are in a or b or both
disjoint events
if two events a and b have no outcomes in common
complement formula
P(A^c) = 1 - P(A)
intersection formula
P(A∩B) = P(A and B)
union formula
P(A∪B) = P(A or B) = P(A) + P(B) - P(A∩B)
disjoint formula
P(A∪B) = P(A) + P(B)
independence
if two events are independent, knowledge about one event tells us nothing about the other event
multiplication rule
if a and b are independent events, then the probability of a and b is P(A and B) = P(A) x P(B)
conditional probability formula
P(A | B) = P(A and B) / P(B)
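The probability rules above can be checked by listing a sample space outright. The sketch below uses two fair dice (36 equally likely outcomes) and exact fractions; the events A, B, and C are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die)
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    # Equally likely outcomes: P(event) = outcomes in event / outcomes in sample space
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] == 6}       # first die shows 6
B = {o for o in sample_space if sum(o) >= 10}    # total is at least 10
C = {o for o in sample_space if o[1] % 2 == 0}   # second die is even

# Complement: P(A^c) = 1 - P(A)
print(prob(sample_space - A) == 1 - prob(A))

# Union: P(A or B) = P(A) + P(B) - P(A and B)
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))

# Conditional probability: P(A | B) = P(A and B) / P(B)
print("P(A | B) =", prob(A & B) / prob(B))

# Multiplication rule holds for independent events (A and C) but not for A and B
print(prob(A & C) == prob(A) * prob(C))
print(prob(A & B) == prob(A) * prob(B))
```

Here knowing that the total is at least 10 changes the chance that the first die is a 6, so A and B are not independent, while the second die being even tells us nothing about the first die.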
random variable
a numerical measurement of the outcome of a random phenomenon
μ
population mean
σ
population standard deviation
continuous random variable
one whose possible values form an interval, so it has an infinite continuum of possible outcomes
smooth curve
represents an infinite number of people in the population
probabilities can be represented as
areas under a density curve
area under density curve must equal
1
notation of normal distribution
X ~ N(μ, σ)
z score formula
z = (x - μ)/σ
positive z scores
values above the mean
negative z scores
values below the mean
z score
tells you how many standard deviations above or below the mean an observation x is
standard normal distribution
a normal distribution of z scores
standard normal distribution notation
Z~N(0,1)
cumulative probabilities
areas under the normal curve to the left of z
finding normal probabilities
(1) determine what the problem is asking for in terms of x; (2) standardize x by using the z score; (3) draw the curve and shade the area asked for in the problem; (4) determine the area by using the z table
finding the value of x given a proportion
(1) draw a picture with the given area shaded on it; (2) look up the cumulative area in the middle of the z table and read the margins to find the z-score; (3) convert the z-score back to x
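A sketch of the standardize-then-look-up procedure, with `statistics.NormalDist` from the Python standard library standing in for the z table. The mean, standard deviation, and x value are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical population mean and standard deviation
x = 130               # hypothetical observation

# Standardize: z = (x - mu) / sigma
z = (x - mu) / sigma

# Cumulative probability: area under the standard normal curve to the left of z
p_below = NormalDist().cdf(z)

# Area to the right uses the complement rule
p_above = 1 - p_below

print(f"z = {z}, P(X < {x}) = {p_below:.4f}, P(X > {x}) = {p_above:.4f}")
```

A printed z table gives the same cumulative area, rounded to four decimal places.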
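The reverse lookup can be sketched the same way: start from a cumulative area, find the z-score, then undo the standardization with x = μ + zσ. `NormalDist.inv_cdf` plays the role of reading the z table from the inside out; the mean, standard deviation, and target proportion are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical population mean and standard deviation
area_below = 0.90     # hypothetical proportion: we want the 90th percentile

# z-score with the given cumulative area to its left
z = NormalDist().inv_cdf(area_below)

# Undo the z-score formula: x = mu + z * sigma
x = mu + z * sigma

print(f"z = {z:.4f}, x = {x:.2f}")
```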