**Observational study** - no random assignment; only correlation can be established, causation cannot be inferred

**Experimental study** - random assignment is used; causation may be inferred

**Response variable** - the variable being studied

**Explanatory variable** - a variable which explains changes in the response variable

**Confounding variable** - a variable which explains correlation between the response and explanatory variables

**Blocking variable** - a variable used to split the data into categories (blocks) when it is suspected the variable may affect the response variable

- Variables can be either **numerical** or **categorical**
- Numerical variables can be either **discrete** or **continuous**
- Categorical variables may be **ordinal**, in which case the categories have a natural ordering

**Simple random sampling** - randomly sample from the whole population

**Stratified sampling** - split the population into strata, then randomly sample from each stratum

**Cluster sampling** - split the population into clusters, randomly select some clusters, then randomly sample from those clusters

Stratified sampling is used when we want to ensure we gather statistics for a particular group.

Cluster sampling is used to reduce the number of samples required.

Common sources of sampling bias:

**Convenience sample** - only easily accessible individuals are sampled

**Non-response** - only a (possibly non-representative) fraction of those sampled respond

**Voluntary response** - only people who volunteer, often those with strong opinions, respond

**Control** - control for differences between treatment groups under study

**Randomisation** - randomly assign subjects to treatment groups

**Replication** - studies should be replicable; a single study should use a large sample

**Block** - account for other variables which might impact the response variable

**Mean** - arithmetic average

**Median** - the "middle" number, or the average of the middle 2 if there is an even number of values

**Mode** - the most frequent value
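These three measures of centre can be computed directly with Python's standard `statistics` module (the data here is made up for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean = statistics.mean(data)      # arithmetic average: 30 / 6 = 5
median = statistics.median(data)  # even count, so average of middle two: (3 + 5) / 2 = 4.0
mode = statistics.mode(data)      # most frequent value: 3

print(mean, median, mode)
```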

**Interquartile range (IQR)**: `Q3 - Q1`

**Range**: `max - min`

**Sample variance**:

`s^2 = (sum_(i=1)^n (x_i - bar x)^2) / (n - 1)`

**Sample standard deviation**:

`s = sqrt(s^2) = sqrt((sum_(i=1)^n (x_i - bar x)^2) / (n - 1))`
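A quick sketch checking the variance formula against `statistics.variance`, which uses the same `n - 1` denominator (made-up data):

```python
import math
import statistics

data = [4.0, 8.0, 6.0, 2.0]  # hypothetical data
n = len(data)
x_bar = sum(data) / n

# Sample variance with the n - 1 denominator, as in the formula above
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)

# The standard library agrees
print(s2, statistics.variance(data))
```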

The Z-score is the number of standard deviations a value is from the mean

`Z=(x−μ)/σ`

Sample statistics taken from a population will always vary from the actual population value

The sampling variability of the mean is quantified by the **standard error**:

`SE = σ / sqrt(n)`

When standard deviation (σ) is not known, sample standard deviation s is used to estimate standard error.

`SE = s / sqrt(n)`

`bar x ~ N(mean = μ, SE = σ / sqrt(n))`

Conditions:

- Sample observations must be independent. If sampling without replacement, then n must be < 10% of the population
- The population is normal, or n is large (> 30 approx)
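The effect of sample size on SE can be seen numerically (σ and n are hypothetical):

```python
import math

sigma = 15.0  # hypothetical population standard deviation
n = 36        # sample size

se = sigma / math.sqrt(n)  # SE = σ / sqrt(n) = 15 / 6 = 2.5

# Quadrupling the sample size halves the standard error
assert sigma / math.sqrt(4 * n) == se / 2
print(se)
```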

A confidence interval gives a margin of error for a point estimate from a sample

`point estimate ± margin of error`

The margin of error is half the width of the confidence interval

Margin of error = `z_r * SE = (z_r * σ) / sqrt(n)`

Where `z_r` is the Z-score of the cut-off point for the desired confidence level
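A minimal sketch of a 95% confidence interval for a mean, using `statistics.NormalDist` to look up the critical Z-score (the sample values are hypothetical):

```python
import math
from statistics import NormalDist

x_bar = 50.0  # hypothetical sample mean
sigma = 10.0  # known population standard deviation
n = 100

z = NormalDist().inv_cdf(0.975)  # critical Z for 95% confidence, ≈ 1.96
moe = z * sigma / math.sqrt(n)   # margin of error = z_r * SE
ci = (x_bar - moe, x_bar + moe)
print(ci)  # ≈ (48.04, 51.96)
```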

WKU/David Neal - Z–scores and Confidence Intervals

**Null Hypothesis** - *H0* - the skeptical perspective

**Alternative Hypothesis** - *Ha* - the point of view under consideration

Alternative hypothesis is one of:

**One Sided** - `μ < null value` or `μ > null value`

**Two Sided** - `μ ≠ null value`

UCLA - What are the differences between one-tailed and two-tailed tests?

Notes on hypothesis construction:

- Always construct hypotheses about population parameters, not sample statistics

**Type 1 error** - rejecting a true null hypothesis

**Type 2 error** - failing to reject a false null hypothesis

Write **significance level** as α

α = P(type 1 error)

α = P(rejecting a true null hypothesis)

`p−value = P(observed or more extreme sample statistic | H0 is true)`

**p-value < significance level** - reject the null hypothesis

**p-value > significance level** - fail to reject the null hypothesis

`Z = (sample statistic - null value) / SE`

e.g. for the mean:

`Z = (bar x - mu) / (SE)`
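Putting the pieces together, a two-sided Z-test for a mean might look like this (all numbers hypothetical):

```python
import math
from statistics import NormalDist

mu_0 = 100.0   # null value
x_bar = 103.0  # observed sample mean
sigma = 12.0   # known population standard deviation
n = 64

se = sigma / math.sqrt(n)  # 12 / 8 = 1.5
z = (x_bar - mu_0) / se    # (103 - 100) / 1.5 = 2.0

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_value)  # 2.0, ≈ 0.0455 -> reject H0 at α = 0.05
```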


*Used when the sample size is small*

- Similar to normal distribution but with fat tails - more probability of values being distant from the centre.
- Used for confidence intervals / hypothesis testing on a single mean

The t-distribution is a family of distributions with a `degrees of freedom` parameter.

`degrees of freedom = n-1`

`CI = bar x ± t_(df)^** * SE`

`T_(df) = (bar x - mu) / (SE)`
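A sketch of the one-sample T statistic on made-up data. The standard library has no t-distribution CDF, so the resulting statistic would be compared against a t-table with `df` degrees of freedom (or `scipy.stats` if available):

```python
import math
import statistics

data = [5.2, 4.9, 5.4, 5.0, 4.8]  # hypothetical sample
mu_0 = 5.0                        # null value

n = len(data)
df = n - 1                        # degrees of freedom = n - 1
x_bar = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)  # SE = s / sqrt(n)
t = (x_bar - mu_0) / se

# Compare t against t_df critical values from a t-table
print(df, t)
```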

The t-distribution can be used to test the significance of the difference between two sample means when the population standard deviation is unknown

*Conditions:*

- Sampled observations are independent within groups
- The two groups should be independent
- Skew / sample size - low skew is needed; otherwise a large sample size is required

`df = min(n_1 - 1, n_2 - 1)`

`SE_((bar x_1 - bar x_2)) = sqrt( s_1^2 / n_1 + s_2^2 / n_2 )`

`CI = (bar x_1 - bar x_2) ± t_(df)^** * SE`

`T_(df) = ((bar x_1 - bar x_2) - (mu_1 - mu_2)) / (SE)`
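The two-sample formulas above, computed on hypothetical groups:

```python
import math
import statistics

g1 = [12.0, 15.0, 11.0, 14.0]      # hypothetical group 1
g2 = [10.0, 9.0, 12.0, 8.0, 11.0]  # hypothetical group 2

n1, n2 = len(g1), len(g2)
s1, s2 = statistics.stdev(g1), statistics.stdev(g2)

df = min(n1 - 1, n2 - 1)                 # conservative degrees of freedom
se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference in means
t = (statistics.mean(g1) - statistics.mean(g2)) / se  # null: mu_1 - mu_2 = 0
print(df, t)
```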

*For paired data, analyse the per-pair differences:*

`df = n_diff - 1`

`SE_diff = s_diff / sqrt(n_diff)`

`CI = bar x_diff ± t_df^** * SE`

`T_df = (bar x_diff - mu_diff) / (SE)`
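For paired data the same machinery applies to the differences; a sketch with made-up before/after measurements:

```python
import math
import statistics

before = [200.0, 185.0, 210.0, 190.0]  # hypothetical paired measurements
after_ = [195.0, 183.0, 202.0, 188.0]

diffs = [b - a for b, a in zip(before, after_)]  # [5.0, 2.0, 8.0, 2.0]
n_diff = len(diffs)
df = n_diff - 1
se = statistics.stdev(diffs) / math.sqrt(n_diff)  # SE_diff
t = statistics.mean(diffs) / se                   # null: mu_diff = 0
print(df, t)
```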

Used to compare more than two groups

- Independence - within and between groups (unless using repeated measures ANOVA)
- nearly normal distributions
- constant variance across groups - homoscedastic

| | Degrees of Freedom | Sum of Squares | Mean Squares | F Value | Pr(>F) |
|---|---|---|---|---|---|
| Group | df_G | SSG | MSG | F | p-value |
| Error (Residuals) | df_E | SSE | MSE | | |
| Total | df_T | SST | | | |

`SST = sum_(i=1)^n (y_i - bar y)^2`

- y_i - value of the response variable for each observation
- ȳ - grand mean of the response variable

`SSG = sum_(j=1)^k n_j * (bar y_j - bar y)^2`

- n_j - number of observations in group j
- ȳ_j - mean of the response variable for group j
- ȳ - grand mean of the response variable

`SSE = SST - SSG`

`df_T = n - 1`

`df_G = k - 1`

`df_E = df_T - df_G`

`MSG = SSG / df_G`

`MSE = SSE / df_E`

`F = MSG / MSE`
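The whole ANOVA computation chains together; a worked sketch on three small hypothetical groups:

```python
import statistics

# Hypothetical data: k = 3 groups, n = 9 observations in total
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]

all_vals = [y for g in groups for y in g]
n, k = len(all_vals), len(groups)
grand_mean = statistics.mean(all_vals)  # 5.0

sst = sum((y - grand_mean) ** 2 for y in all_vals)                          # 60.0
ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)  # 54.0
sse = sst - ssg                                                             # 6.0

df_g, df_e = k - 1, n - k  # df_E = df_T - df_G = (n - 1) - (k - 1) = n - k
msg, mse = ssg / df_g, sse / df_e
f = msg / mse
print(f)  # 27.0
```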

When making multiple pairwise comparisons, the significance level is adjusted (Bonferroni correction):

`alpha^** = alpha / K` where `K = (k(k-1)) / 2` is the number of pairwise comparisons

`SE = sqrt((MSE)/n_1 + (MSE)/n_2)`

`df = df_E`

Distribution of sample proportions

`hat p ~ N(mean = p, SE = sqrt((p(1-p))/n))`

Conditions:

- Independence - random sample. If sampling without replacement, n < 10% of the population
- Sample size / skew - at least 10 successes and 10 failures

point estimate ± margin of error

`hat p +- z^** SE_{hat p}`

`SE_{hat p} = sqrt((hat p(1- hat p))/n)`
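A sketch of a 95% confidence interval for a proportion, with hypothetical counts:

```python
import math
from statistics import NormalDist

successes, n = 60, 100  # hypothetical counts
p_hat = successes / n

# Success-failure condition: at least 10 successes and 10 failures
assert successes >= 10 and n - successes >= 10

se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE of the sample proportion
z = NormalDist().inv_cdf(0.975)          # critical Z for 95% confidence
ci = (p_hat - z * se, p_hat + z * se)
print(ci)  # ≈ (0.504, 0.696)
```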

**Conditions**

- Linearity - relationship between explanatory and response
- Nearly normal residuals
- Constant variability

*Fan-shaped residuals plot* - as the value of the explanatory variable increases, the variability of the response variable increases. This fails the conditions for linear regression

**Strength of fit**

`R^2`

- Square of the correlation coefficient
- The percentage of variability in the response variable explained by the model
- Always between 0 and 1
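R² as the squared correlation coefficient, computed from scratch on made-up points:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical explanatory values
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical response values

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Correlation coefficient r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² * Σ(y - ȳ)²)
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
r = num / den

r2 = r ** 2  # fraction of variability in ys explained by a linear model on xs
print(r2)    # ≈ 0.6
```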

© Will Robertson - wjsrobertson@gmail.com