**Observational study** - no random assignment; only correlation can be established, causation cannot be inferred

**Experimental study** - random assignment is used; causation may be inferred

**Response variable** - the variable being studied

**Explanatory variable** - a variable which explains changes in the response variable

**Confounding variable** - a variable which explains correlation between the response and explanatory variables

**Blocking variable** - a variable used to split the data into categories (blocks) when it is suspected the variable may affect the response variable

- Variables can be either **numerical** or **categorical**
- Numerical variables can be either **discrete** or **continuous**
- Categorical variables may be **ordinal**, in which case the categories have a natural ordering

**Simple random sampling** - randomly sample from the whole population

**Stratified sampling** - split the population into strata, then randomly sample from each stratum

**Cluster sampling** - split the population into clusters, randomly select some clusters, then randomly sample from those clusters

Stratified sampling is used when we want to ensure we gather statistics for a particular group.

Cluster sampling is used to reduce the number of samples required.

Common sources of sampling bias:

**Convenience sample** - only easily accessible individuals are sampled

**Non-response** - only a (possibly non-representative) fraction of those sampled respond

**Voluntary response** - only people who volunteer, often those with strong opinions, respond

**Control** - control for differences between treatment groups under study

**Randomisation** - randomly assign subjects to treatment groups

**Replication** - studies should be replicable; a single study should use a large sample

**Block** - account for other variables which might impact the response variable

**Mean** - arithmetic average

**Median** - the "middle" number, or the average of the middle 2 if there is an even number of values

**Mode** - the most frequent value
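These three measures of centre can be computed directly with Python's standard `statistics` module (the data here is made up for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean = statistics.mean(data)      # arithmetic average: 30 / 6 = 5
median = statistics.median(data)  # even count, so average of middle two: (3 + 5) / 2 = 4.0
mode = statistics.mode(data)      # most frequent value: 3

print(mean, median, mode)
```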

**Interquartile range (IQR)**: `Q3 - Q1`

**Range**: `max - min`

**Sample variance**:

`s^2 = (sum_(i=1)^n (x_i - bar x)^2) / (n - 1)`

**Sample standard deviation**:

`s = sqrt(s^2) = sqrt((sum_(i=1)^n (x_i - bar x)^2) / (n - 1))`
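A quick sketch checking the variance formula against `statistics.variance`, which uses the same `n - 1` denominator (made-up data):

```python
import math
import statistics

data = [4.0, 8.0, 6.0, 2.0]  # hypothetical data
n = len(data)
x_bar = sum(data) / n

# Sample variance with the n - 1 denominator, as in the formula above
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)

# The standard library agrees
print(s2, statistics.variance(data))
```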

The Z-score is the number of standard deviations a value is from the mean

`Z=(x−μ)/σ`

Sample statistics taken from a population will always vary from the actual population value

The sampling variability of the mean is quantified by the **standard error**:

`SE = σ / sqrt(n)`

When standard deviation (σ) is not known, sample standard deviation s is used to estimate standard error.

`SE = s / sqrt(n)`

`bar x ~ N(mean = μ, SE = σ / sqrt(n))`

Conditions:

- Sample observations must be independent. If sampling without replacement, then n must be < 10% of the population
- The population is normal, or n is large (> 30 approx)
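The effect of sample size on SE can be seen numerically (σ and n are hypothetical):

```python
import math

sigma = 15.0  # hypothetical population standard deviation
n = 36        # sample size

se = sigma / math.sqrt(n)  # SE = σ / sqrt(n) = 15 / 6 = 2.5

# Quadrupling the sample size halves the standard error
assert sigma / math.sqrt(4 * n) == se / 2
print(se)
```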

A confidence interval gives a margin of error for a point estimate from a sample

`point estimate ± margin of error`

The margin of error is half the width of the confidence interval

Margin of error = `z_r * SE = (z_r * σ) / sqrt(n)`

Where `z_r` is the Z-score of the cut-off point for the desired confidence level
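A minimal sketch of a 95% confidence interval for a mean, using `statistics.NormalDist` to look up the critical Z-score (the sample values are hypothetical):

```python
import math
from statistics import NormalDist

x_bar = 50.0  # hypothetical sample mean
sigma = 10.0  # known population standard deviation
n = 100

z = NormalDist().inv_cdf(0.975)  # critical Z for 95% confidence, ≈ 1.96
moe = z * sigma / math.sqrt(n)   # margin of error = z_r * SE
ci = (x_bar - moe, x_bar + moe)
print(ci)  # ≈ (48.04, 51.96)
```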

WKU/David Neal - Z–scores and Confidence Intervals

**Null Hypothesis** - *H0* - the skeptical perspective

**Alternative Hypothesis** - *Ha* - the point of view under consideration

Alternative hypothesis is one of:

**One Sided** - `μ < null value` or `μ > null value`

**Two Sided** - `μ ≠ null value`

UCLA - What are the differences between one-tailed and two-tailed tests?

Notes on hypothesis construction:

- Always construct hypotheses about population parameters, not sample statistics

**Type 1 error** - rejecting a true null hypothesis

**Type 2 error** - failing to reject a false null hypothesis

Write **significance level** as α

α = P(type 1 error)

α = P(rejecting a true null hypothesis)

`p−value = P(observed or more extreme sample statistic | H0 is true)`

**p-value < significance level** - reject the null hypothesis

**p-value > significance level** - fail to reject the null hypothesis

`Z = (sample statistic - null value) / SE`

e.g. for the mean:

`Z = (bar x - mu) / (SE)`
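Putting the pieces together, a two-sided Z-test for a mean might look like this (all numbers hypothetical):

```python
import math
from statistics import NormalDist

mu_0 = 100.0   # null value
x_bar = 103.0  # observed sample mean
sigma = 12.0   # known population standard deviation
n = 64

se = sigma / math.sqrt(n)  # 12 / 8 = 1.5
z = (x_bar - mu_0) / se    # (103 - 100) / 1.5 = 2.0

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_value)  # 2.0, ≈ 0.0455 -> reject H0 at α = 0.05
```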


*Used when the sample size is small*

- Similar to normal distribution but with fat tails - more probability of values being distant from the centre.
- Used for confidence intervals / hypothesis testing on a single mean

The t-distribution is a family of distributions with a `degrees of freedom` parameter.

`degrees of freedom = n-1`

`CI = bar x ± t_(df)^** * SE`

`T_(df) = (bar x - mu) / (SE)`
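A sketch of the one-sample T statistic on made-up data. The standard library has no t-distribution CDF, so the resulting statistic would be compared against a t-table with `df` degrees of freedom (or `scipy.stats` if available):

```python
import math
import statistics

data = [5.2, 4.9, 5.4, 5.0, 4.8]  # hypothetical sample
mu_0 = 5.0                        # null value

n = len(data)
df = n - 1                        # degrees of freedom = n - 1
x_bar = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)  # SE = s / sqrt(n)
t = (x_bar - mu_0) / se

# Compare t against t_df critical values from a t-table
print(df, t)
```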

The t-distribution can be used to test the significance of the difference between two sample means when the population standard deviation is unknown

*Conditions:*

- Sampled observations are independent within groups
- The two groups should be independent
- Skew / sample size - low skew is needed; otherwise a large sample size is required

`df = min(n_1 - 1, n_2 - 1)`

`SE_((bar x_1 - bar x_2)) = sqrt( s_1^2 / n_1 + s_2^2 / n_2 )`

`CI = (bar x_1 - bar x_2) ± t_(df)^** * SE`

`T_(df) = ((bar x_1 - bar x_2) - (mu_1 - mu_2)) / (SE)`
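The two-sample formulas above, computed on hypothetical groups:

```python
import math
import statistics

g1 = [12.0, 15.0, 11.0, 14.0]      # hypothetical group 1
g2 = [10.0, 9.0, 12.0, 8.0, 11.0]  # hypothetical group 2

n1, n2 = len(g1), len(g2)
s1, s2 = statistics.stdev(g1), statistics.stdev(g2)

df = min(n1 - 1, n2 - 1)                 # conservative degrees of freedom
se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference in means
t = (statistics.mean(g1) - statistics.mean(g2)) / se  # null: mu_1 - mu_2 = 0
print(df, t)
```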

*For paired data, analyse the per-pair differences:*

`df = n_diff - 1`

`SE_diff = s_diff / sqrt(n_diff)`

`CI = bar x_diff ± t_df^** * SE`

`T_df = (bar x_diff - mu_diff) / (SE)`
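For paired data the same machinery applies to the differences; a sketch with made-up before/after measurements:

```python
import math
import statistics

before = [200.0, 185.0, 210.0, 190.0]  # hypothetical paired measurements
after_ = [195.0, 183.0, 202.0, 188.0]

diffs = [b - a for b, a in zip(before, after_)]  # [5.0, 2.0, 8.0, 2.0]
n_diff = len(diffs)
df = n_diff - 1
se = statistics.stdev(diffs) / math.sqrt(n_diff)  # SE_diff
t = statistics.mean(diffs) / se                   # null: mu_diff = 0
print(df, t)
```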

Used to compare more than two groups

- Independence - within and between groups (unless using repeated measures ANOVA)
- nearly normal distributions
- constant variance across groups - homoscedastic

| | Degrees of Freedom | Sum of Squares | Mean Squares | F Value | Pr(>F) |
|---|---|---|---|---|---|
| Group | df_G | SSG | MSG | F | p-value |
| Error (Residuals) | df_E | SSE | MSE | | |
| Total | df_T | SST | | | |

`SST = sum_(i=1)^n (y_i - bar y)^2`

- y_i - value of the response variable for each observation
- ȳ - grand mean of the response variable

`SSG = sum_(j=1)^k n_j * (bar y_j - bar y)^2`

- n_j - number of observations in group j
- ȳ_j - mean of the response variable for group j
- ȳ - grand mean of the response variable

`SSE = SST - SSG`

`df_T = n - 1`

`df_G = k - 1`

`df_E = df_T - df_G`

`MSG = SSG / df_G`

`MSE = SSE / df_E`

`F = MSG / MSE`
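The whole ANOVA computation chains together; a worked sketch on three small hypothetical groups:

```python
import statistics

# Hypothetical data: k = 3 groups, n = 9 observations in total
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]

all_vals = [y for g in groups for y in g]
n, k = len(all_vals), len(groups)
grand_mean = statistics.mean(all_vals)  # 5.0

sst = sum((y - grand_mean) ** 2 for y in all_vals)                          # 60.0
ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)  # 54.0
sse = sst - ssg                                                             # 6.0

df_g, df_e = k - 1, n - k  # df_E = df_T - df_G = (n - 1) - (k - 1) = n - k
msg, mse = ssg / df_g, sse / df_e
f = msg / mse
print(f)  # 27.0
```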

When making multiple pairwise comparisons, the significance level is adjusted (Bonferroni correction):

`alpha^** = alpha / K` where `K = (k(k-1)) / 2` is the number of pairwise comparisons

`SE = sqrt((MSE)/n_1 + (MSE)/n_2)`

`df = df_E`

Distribution of sample proportions

`hat p ~ N(mean = p, SE = sqrt((p(1-p))/n))`

Conditions:

- Independence - random sample. If sampling without replacement, n < 10% of the population
- Sample size / skew - at least 10 successes and 10 failures

point estimate ± margin of error

`hat p +- z^** SE_{hat p}`

`SE_{hat p} = sqrt((hat p(1- hat p))/n)`
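A sketch of a 95% confidence interval for a proportion, with hypothetical counts:

```python
import math
from statistics import NormalDist

successes, n = 60, 100  # hypothetical counts
p_hat = successes / n

# Success-failure condition: at least 10 successes and 10 failures
assert successes >= 10 and n - successes >= 10

se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE of the sample proportion
z = NormalDist().inv_cdf(0.975)          # critical Z for 95% confidence
ci = (p_hat - z * se, p_hat + z * se)
print(ci)  # ≈ (0.504, 0.696)
```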

**Conditions**

- Linearity - relationship between explanatory and response
- Nearly normal residuals
- Constant variability

*Fan-shaped residuals plot* - as the value of the explanatory variable increases, the variability of the response variable increases. This fails the conditions for linear regression

**Strength of fit**

`R^2`

- Square of the correlation coefficient
- The percentage of variability in the response variable explained by the model
- Always between 0 and 1
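R² as the squared correlation coefficient, computed from scratch on made-up points:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical explanatory values
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical response values

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Correlation coefficient r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² * Σ(y - ȳ)²)
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
r = num / den

r2 = r ** 2  # fraction of variability in ys explained by a linear model on xs
print(r2)    # ≈ 0.6
```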

© Will Robertson - wjsrobertson@gmail.com