153 Terms

design

plan how to obtain the data

description

summarize the data with graphs and numerical summaries

inference

use data from a random and representative sample to draw conclusions about the population of interest

representative sample

random

two basic types of statistical inference

confidence intervals
significance tests

confidence intervals

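a range of plausible values for a population parameter, computed from a sample with a stated level of confidence (for example 90% or 95%) that it contains the true population value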

significance tests

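used to decide whether the data provide enough evidence to reject a hypothesis about the population; a test can reject a hypothesis but never proves one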

subjects

persons, animals, or objects in our study/experiment

variables

the characteristics that we measure on each subject
they can take on different values for each individual

population

all subjects of interest

sample

subjects for whom we have data

random sampling

each member of the population has the same chance of being included in the sample
random samples tend to be representative of the population

parameters

numerical summary of the population

statistics

numerical summary of the sample

categorical variables

place each observation into groups and they are usually summarized by the percentage of observations in each group
use pie and bar charts

quantitative variables

take on numerical values

discrete variables

take values from a set of separate numbers, such as 0, 1, 2, 3, ... (typically counts)

continuous variables

have infinitely many possible values that form an interval, even though we are sometimes limited in our ability to measure them

measures of center

mean
median
mode

mean

the average of all observations
x bar

median

observation right in the middle
M
50% of observations on either side

mode

most frequently occurring value
x value or category with highest bar

measures of variability

measures of spread or dispersion

range

max - min

variance and standard deviation

measures of spread around the mean
particularly useful for bell shaped and symmetric distributions

variance

average squared deviation from the mean
s^2
units of measurement are those of the original data squared
we need to take the square root before interpretation

standard deviation

square root of the variance
s
units of measurement are the same as those of the original data

n

number of observations in the data set

x1,x2,x3....,xn

first, second, third,...., last observation

Σ

summation notation

mean

x̄ = (x1 + x2 + ... + xn) / n

variance

s^2 = Σ (xi - x̄)^2 / (n - 1)

standard deviation

s = √(s^2)
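
A minimal Python sketch of the mean, variance, and standard deviation formulas, using only the standard library. The function names and the data list are made up for illustration.

```python
import math

def sample_mean(xs):
    """Mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Variance: sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(sample_variance(xs))

# Made-up observations, not from any real study
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_mean(data), sample_variance(data), sample_sd(data))
```

For this data the mean is 5 and the squared deviations add up to 32, so the variance is 32/7.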

why n-1 in denominator?

it represents the number of degrees of freedom
the number of independent quantities we are adding up in the numerator

interpreting the standard deviation

the larger the standard deviation, the more spread out the data set is
s can never be negative
s can only be zero if there is no variability in the data
s is very much affected by outliers
s works best for bell-shaped and symmetric distributions

empirical rule

in any bell shaped and symmetric distribution you will find approximately
68% of observations within one stdev of the mean
95% within two stdev
99.7% within three stdev
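
As a rough check of these percentages, the sketch below computes the area of a normal curve within k standard deviations of the mean. It assumes an exactly normal distribution, which real bell-shaped data only approximate; `proportion_within` is a name chosen for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))
```

The three values come out close to 0.68, 0.95, and 0.997.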

lower quartile (q1)

25% observations below it
75% above
25th percentile

median quartile (q2)

50% below
50% above
50th percentile

upper quartile (q3)

75% below
25% above
75th percentile

q1 will be

the median of the lower half

q3 will be

the median of the upper half

interquartile range

measures of the spread of the central 50% of the data
q3-q1

five number summary of positions

minimum, q1, median, q3, maximum
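
A small sketch of the five number summary, with q1 and q3 computed as the medians of the lower and upper halves. It assumes the convention that the overall median is left out of both halves when n is odd; software packages may use slightly different quartile rules. The data values are made up.

```python
from statistics import median

def five_number_summary(xs):
    """Return (minimum, q1, median, q3, maximum); needs at least 2 observations."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For odd n, skip the middle observation so it belongs to neither half
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return (s[0], median(lower), median(s), median(upper), s[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # made-up observations
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range
print(minimum, q1, med, q3, maximum, iqr)
```

Here q1 = 3.5, the median is 7, q3 = 11, and the interquartile range is 7.5.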

boxplots

graphs based on the five number summary
the box contains the central 50% of the data
line crossing the box represents the median
whiskers extend to max and min

leaf unit

first column represents the cumulative counts from top and bottom
the line with the parentheses contains the median
second column contains the stems and the rest contain the leaves

0.10

decimal point between stem and leaf
example: stem 2 with leaves 0 and 1 means 2.0 and 2.1

1.0

no decimal
example from the first row 2 and 01 mean 20 and 21

10.0

add a zero after the leaf
2 01 means 200 and 210

0.01

move decimal left of stem
2 01 means 0.20 and 0.21
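
The four leaf unit cases follow one rule: join the stem and the leaf digit, then multiply by the leaf unit. A minimal sketch, assuming single-digit leaves (`decode_leaf` is a hypothetical helper name):

```python
def decode_leaf(stem, leaf, leaf_unit):
    """Value represented by one leaf digit on a given stem."""
    return (stem * 10 + leaf) * leaf_unit

# Stem 2 with leaves 0 and 1 under each leaf unit
for unit in (0.10, 1.0, 10.0, 0.01):
    print(unit, decode_leaf(2, 0, unit), decode_leaf(2, 1, unit))
```

Printed values may show small floating-point rounding for the decimal leaf units.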

scatterplots

plot of y vs x
two quantitative variables measured on the same individual

x

explanatory variable
independent variable

y

response variable
dependent variable

interpreting scatterplots

direction: positive or negative?
linear trend? how strong?
any outliers?

correlation

summarizes the direction and strength of the straight line relationship between x and y

variables and correlation

two variables have the same correlation regardless of which one is called the explanatory or the response

r

symbol to represent correlation coefficient

r is always between

-1 and +1 with no units
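
A short sketch of how r can be computed from paired data, using the standard deviation-product form of the Pearson correlation. The function name and the data are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up paired data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = correlation(x, y)
print(r, correlation(y, x))
```

For this data r is 0.8, and swapping x and y gives the same value.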

interpretation of correlation

positive/negative
strong/weak

do outliers have a strong effect on r?

yes

straight lines

all points are exactly on the line
line extends forever in both directions
equation: y=mx+b

regression line

points are scattered
the line that best fits through the middle of the points
predicted values are ŷ

what is a regression line used for?

to predict the response variable y for a particular value of x

regression equation

ŷ = a + bx

interpreting the regression line

b is the slope: the average or predicted change in y for a one-unit change in x
a is the y-intercept: the point where the regression line crosses the y-axis
the y-intercept is the predicted value of y when x = 0; it is needed to complete the equation, but we only interpret it if x = 0 makes sense and is close to the observed values of x

residuals

the prediction errors for each observation
the vertical distance from the point to the line

residual equation

difference between the observed and predicted values of y
y - ŷ

least squares regression method

goes through the middle of the points
sum of residuals is zero
also minimizes the sum of squared residuals or prediction errors

least squares regression line passes through the point

(x-bar, y-bar)

least squares regression formulas

y-hat = a + bx
b = r × (sy / sx)
a = y-bar - b(x-bar)
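
A minimal sketch that applies these formulas to made-up data and then checks that the residuals add up to zero. The function name `least_squares_line` and the data are assumptions for this example.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation r from the deviations and the two standard deviations
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# Made-up paired data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
a, b = least_squares_line(x, y)

# Residual = observed y minus predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b, sum(residuals))
```

For this data the slope is about 0.8 and the intercept about 0.6; the residual sum is zero up to floating-point rounding.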

extrapolation

predicting outside the observed range using the regression equation

influential outliers

points that have an x value far away from the rest and fall far from the trend that the rest of the data follow

correlation does not imply

causation

graphs for categorical variables

bar charts and pie charts

contingency tables

both explanatory and response variables are categorical
display counts and frequencies on the table

conditional proportions

find percentages by dividing each cell count by the total number of observations in their group
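
A small sketch of conditional proportions for a contingency table stored as nested dictionaries. The table, its labels, and the counts are invented purely for illustration.

```python
# Rows are groups of the explanatory variable, columns are response categories.
# Counts are made up, not from a real study.
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "placebo": {"improved": 15, "not improved": 35},
}

def conditional_proportions(table):
    """Divide each cell count by the total for its row (group)."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {category: count / total for category, count in row.items()}
    return result

props = conditional_proportions(table)
print(props)
```

In this made-up table, 60% of the treatment group improved compared with 30% of the placebo group.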

Simpson's paradox

the direction of an association between two categorical variables can be reversed if we include a third variable and re-analyze the data
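
An illustration with invented counts: treatment A has the higher success rate within each subgroup, yet treatment B has the higher rate once the subgroups are combined. The subgroup names and numbers are hypothetical.

```python
# (successes, total) for each treatment within each subgroup; made-up counts
counts = {
    "mild cases": {"A": (9, 10), "B": (80, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}

def subgroup_rate(group, treatment):
    successes, total = counts[group][treatment]
    return successes / total

def overall_rate(treatment):
    """Success rate after ignoring the third variable (case severity)."""
    successes = sum(counts[g][treatment][0] for g in counts)
    total = sum(counts[g][treatment][1] for g in counts)
    return successes / total

for group in counts:
    print(group, subgroup_rate(group, "A"), subgroup_rate(group, "B"))
print("overall", overall_rate("A"), overall_rate("B"))
```

The reversal happens because treatment A was mostly given to severe cases, which have a low success rate under either treatment.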

relative risk

the ratio of two proportions
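
A minimal sketch of the calculation, with hypothetical counts (`relative_risk` and its parameter names are chosen for this example):

```python
def relative_risk(events_1, n_1, events_2, n_2):
    """Ratio of the proportion in group 1 to the proportion in group 2."""
    p1 = events_1 / n_1
    p2 = events_2 / n_2
    return p1 / p2

# Made-up counts: 30 of 200 in group 1 and 10 of 200 in group 2 had the outcome
rr = relative_risk(30, 200, 10, 200)
print(rr)
```

A relative risk near 3 here means the outcome was about three times as common in group 1 as in group 2.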

statistical inference

a statement about the population based on a random and representative sample and includes a measure of how confident we are in the statement

experiments

the researcher assigns subjects to certain experimental treatments

observational studies

researcher does nothing to the subjects but observe x and y

random samples (probability samples)

GOOD: subjects are chosen by chance, so the sample tends to be REPRESENTATIVE of the entire population of interest

simple random sample (srs)

every set of n individuals has an equal chance to be the sample actually selected

sampling frame

a list of individuals from whom the sample is drawn
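
A sketch of drawing a simple random sample from a sampling frame with Python's standard `random` module. The frame of 100 labeled students is hypothetical, and the seed is fixed only so the example is repeatable.

```python
import random

def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n distinct individuals; every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, n)

# Hypothetical sampling frame: a list of 100 labeled individuals
frame = [f"student_{i}" for i in range(1, 101)]
sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```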

non probability samples

instead rely on easy and inexpensive methods to collect data, such as apps, social media, or people with accounts on websites
they are biased because they are not representative of the population

biased samples

bad, systematically favor certain outcomes so they are not representative of the population of interest

examples of biased samples

volunteer sample like polls, questionnaires, call ins
convenience samples that occur in classes or public settings

sample surveys

personal interview, telephone interview, questionnaires

margin of error equation

approximately 1/√n, where n is the sample size (a rough 95% margin of error for a sample proportion)
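
A quick sketch of this approximation, showing how the margin of error shrinks as the sample size grows (the sample sizes are arbitrary examples):

```python
import math

def margin_of_error(n):
    """Approximate margin of error for a sample of size n: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

for n in (100, 400, 1600):
    print(n, margin_of_error(n))
```

Quadrupling the sample size only halves the margin of error: 100 gives about 10 percentage points, 400 about 5, and 1600 about 2.5.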

sources of potential bias in sample surveys

undercoverage, nonresponse bias, response bias, wording of questions

undercoverage

a sampling scheme that biases the sample in a way that gives a part of the population less representation than it has in the population

nonresponse bias

bias introduced to a sample when a large fraction of those sampled fails to respond

response bias

tendency of subjects to systematically respond to a stimulus in a particular way

wording of questions

confusing or leading questions can introduce bias

response variable

variable that we measure, so we can draw conclusions about it

experimental units

individuals (subjects) involved in the experiment

treatments

experimental conditions given to the subjects

control of variability

to avoid lurking variables and confounding effects, keep the conditions as similar as possible on all variables except the factors being studied

comparative experiment

compare two or more groups to eliminate confounding

placebo

a "dummy" treatment that can have no physical effect

placebo effect

people really get better with a dummy treatment

control group

the group that receives the placebo
helps determine the true effect of the treatment

blind study

when the subjects don't know if they are receiving the treatment or a placebo

double blind study

neither the subject nor the person administering the treatment knows whether it is the treatment or the placebo

random samples

use a mechanical method to select subjects and assign them to treatments

advantages of using random samples

can use probability to analyze the results based on random samples
avoids sample bias

replications increase

confidence in conclusion

number of replications

number of experimental units that get each treatment

statistically significant differences

differences unlikely to be due to chance alone
differences between two or more treatments are called significant if they are too large to be attributed to chance

multifactor experiments

experiments that have more than one factor each with several levels

factors

categorical explanatory variables in an experiment

matched pair designs

a more advanced form of control of variability, used when there are two treatment groups

matched experimental units

each experimental unit is matched with another one on every possible confounder you can think of
one person from each pair gets randomly assigned to one treatment and they are compared only against each other

cross over design

each person can serve as their own perfect match, and receive two treatments in random order

blocked design

the same idea as matched pairs, extended to three or more treatments
each set of matched experimental units is called a block

cross sectional studies

sample surveys that just want to take a snapshot of the population at the current time

case control studies

retrospective studies in which we match each case with a control and then ask questions about the explanatory variable

prospective studies

forward looking, and follow subjects into the future

institutional review board (IRB)

a group that oversees experiments to ensure that the risks to human subjects are minimal to none and that the benefits exceed the risks
required for human subject experiments

random phenomenon

we cannot predict the next outcome; however, after many outcomes a predictable pattern appears
order that only emerges in the long run

probability of an outcome

the proportion of time the outcome would occur in a very long series of independent trials

independent trials

the outcome of any one trial is not affected by the outcome of any other
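
A simulation sketch of the long-run idea: the proportion of successes in independent trials settles near the true probability as the number of trials grows. The function name, the default probability of 0.5, and the fixed seed are assumptions for this example.

```python
import random

def long_run_proportion(n_trials, p=0.5, seed=1):
    """Proportion of successes in n_trials independent trials with success probability p."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        # Each trial uses a fresh random draw, so trials do not affect each other
        if rng.random() < p:
            successes += 1
    return successes / n_trials

for n in (10, 1_000, 100_000):
    print(n, long_run_proportion(n))
```

Short runs can land far from 0.5, while the longest run should be very close to it.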

subjective probabilities

sometimes necessary when there is no "long series" of independent trials
determined using all info available

sample space

the set of all possible outcomes

event

an outcome or group of outcomes
a subset of the sample space

probabilities are between

0 and 1
the probabilities of all outcomes in the sample space must add up to 1

complement

the rest of the sample space besides event a

intersection

the overlap of two events where outcomes that are in both a and b are listed

union

consists of all outcomes that are in a or b or both

disjoint events

if two events a and b have no outcomes in common

complement formula

P(A^c) = 1 - P(A)

intersection formula

P(A∩B) = P(A and B)

union formula

P(A∪B) = P(A or B) = P(A) + P(B) - P(A∩B)

disjoint formula

P(A∪B) = P(A) + P(B)

independence

if two events are independent, knowledge about one event tells us nothing about the other event

multiplication rule

if a and b are independent events, then the probability of a and b is P(A and B) = P(A) x P(B)

conditional probability formula

P(A | B) = P(A and B) / P(B)
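
The probability rules can be checked on a small example. This sketch assumes one roll of a fair die, so all six outcomes are equally likely; the events A, B, and C are chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}   # even number
B = {4, 5, 6}   # greater than 3
C = {1, 2}      # less than 3

p_complement = 1 - prob(A)                     # complement formula
p_intersection = prob(A & B)                   # P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)      # union formula
p_a_given_b = prob(A & B) / prob(B)            # conditional probability

# Multiplication rule holds only for independent events
a_c_independent = prob(A & C) == prob(A) * prob(C)
a_b_independent = prob(A & B) == prob(A) * prob(B)

print(p_complement, p_intersection, p_union, p_a_given_b)
print(a_c_independent, a_b_independent)
```

Here A and C are independent, while A and B are not: knowing the roll is greater than 3 raises the probability of an even number from 1/2 to 2/3.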

random variable

a numerical measurement of the outcome of a random phenomenon

μ

population mean

σ

population standard deviation

continuous random variable

one whose possible values form an interval, so it has infinitely many possible values

smooth curve

represents an infinite number of people in the population

probabilities can be represented as

areas under a density curve

area under density curve must equal

1

notation of normal distribution

X ~ N(μ, σ)

z score formula

z = (x - μ)/σ

positive z scores

values above the mean

negative z scores

values below the mean

z score

tells you how many standard deviations above or below the mean an observation x is
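
A tiny sketch of the z score calculation. The mean of 100 and standard deviation of 15 are hypothetical values picked for the example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical population: mean 100, standard deviation 15
print(z_score(130, 100, 15))
print(z_score(85, 100, 15))
```

An observation of 130 is two standard deviations above the mean, and 85 is one standard deviation below it.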

standard normal distribution

the normal distribution with mean 0 and standard deviation 1, which is the distribution of z scores

standard normal distribution notation

Z~N(0,1)

cumulative probabilities

areas under the normal curve to the left of z

finding normal probabilities

determine what the problem is asking for in terms of x
standardize x by using the z score
draw the curve and shade the area asked for in the problem
determine the area by using the z table
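
The same steps can be sketched in Python, with `statistics.NormalDist` standing in for the z table. The distribution X ~ N(100, 15) and the cutoffs 130 and 85 are hypothetical.

```python
from statistics import NormalDist

# Hypothetical example: X ~ N(100, 15)
mu, sigma = 100, 15
x = 130

# Step 2: standardize x
z = (x - mu) / sigma

# Step 4: cumulative area to the left of z (what a z table gives)
standard_normal = NormalDist()
p_below = standard_normal.cdf(z)

# Areas to the right and between two values come from cumulative areas
p_above = 1 - p_below
p_between = standard_normal.cdf(z) - standard_normal.cdf((85 - mu) / sigma)

print(z, p_below, p_above, p_between)
```

For these numbers z is 2, so about 97.7% of the area lies below 130 and about 2.3% above it.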

finding the value of x given a proportion

draw a picture with the area given shaded on it
look up the cumulative area in the middle of the z table and look at the margins to find the z-score
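
Going from an area back to x can be sketched the same way: `NormalDist.inv_cdf` plays the role of the reverse z table lookup, and x is recovered by rearranging the z score formula to x = μ + zσ. The values μ = 100, σ = 15, and the area 0.90 are hypothetical.

```python
from statistics import NormalDist

# Hypothetical example: find the 90th percentile of X ~ N(100, 15)
mu, sigma = 100, 15
area_to_left = 0.90

# z score whose cumulative area is 0.90 (reverse z table lookup)
z = NormalDist().inv_cdf(area_to_left)

# Undo the standardization: x = mu + z * sigma
x = mu + z * sigma

print(z, x)
```

The z score is about 1.28, which puts the 90th percentile a little above 119.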