**Observational study** - no random assignment; only correlation can be established, causation cannot be inferred

**Experimental study** - random assignment is used; causation may be inferred

**Response variable** - the variable being studied

**Explanatory variable** - a variable which explains changes in the response variable

**Confounding variable** - a variable which explains correlation between the response and explanatory variables

**Blocking variable** - a variable used to split the data into categories (blocks) when it is suspected the variable may affect the response variable

- Variables can be either **numerical** or **categorical**
- Numerical variables can be either **discrete** or **continuous**
- Categorical variables may be **ordinal**, in which case the categories have a natural ordering

**Simple random sampling** - randomly sample from the whole population

**Stratified sampling** - split the population into strata, then randomly sample from each stratum

**Cluster sampling** - split the population into clusters, randomly select some clusters, then randomly sample from those clusters

Stratified sampling is used when we want to ensure we gather statistics for a particular group.

Cluster sampling is used to reduce the number of samples required.

Common sources of sampling bias:

**Convenience sample** - only easily accessible individuals are sampled

**Non-response** - only a (possibly non-representative) fraction of those sampled respond

**Voluntary response** - only people who volunteer, often those with strong opinions, respond

**Control** - control for differences between treatment groups under study

**Randomisation** - randomly assign subjects to treatment groups

**Replication** - studies should be replicable; a single study should use a large sample

**Block** - account for other variables which might impact the response variable

**Mean** - arithmetic average

**Median** - the "middle" number, or the average of the middle 2 if there is an even number of values

**Mode** - the most frequent value
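These three measures of centre can be computed directly with Python's standard `statistics` module (the data here is made up for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean = statistics.mean(data)      # arithmetic average: 30 / 6 = 5
median = statistics.median(data)  # even count, so average of middle two: (3 + 5) / 2 = 4.0
mode = statistics.mode(data)      # most frequent value: 3

print(mean, median, mode)
```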

**Interquartile range (IQR)**: `Q3 - Q1`

**Range**: `max - min`

**Sample variance**:

`s^2 = (sum_(i=1)^n (x_i - bar x)^2) / (n - 1)`

**Sample standard deviation**:

`s = sqrt(s^2) = sqrt((sum_(i=1)^n (x_i - bar x)^2) / (n - 1))`
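A quick sketch checking the variance formula against `statistics.variance`, which uses the same `n - 1` denominator (made-up data):

```python
import math
import statistics

data = [4.0, 8.0, 6.0, 2.0]  # hypothetical data
n = len(data)
x_bar = sum(data) / n

# Sample variance with the n - 1 denominator, as in the formula above
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)

# The standard library agrees
print(s2, statistics.variance(data))
```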

The Z-score is the number of standard deviations a value is from the mean

`Z=(x−μ)/σ`

Sample statistics taken from a population will always vary from the actual population value

The sampling variability of the mean is quantified by the **standard error**:

`SE = σ / sqrt(n)`

When standard deviation (σ) is not known, sample standard deviation s is used to estimate standard error.

`SE = s / sqrt(n)`

`bar x ~ N(mean = μ, SE = σ / sqrt(n))`

Conditions:

- Sample observations must be independent. If sampling without replacement, then n must be < 10% of the population
- The population is normal, or n is large (> 30 approx)
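The effect of sample size on SE can be seen numerically (σ and n are hypothetical):

```python
import math

sigma = 15.0  # hypothetical population standard deviation
n = 36        # sample size

se = sigma / math.sqrt(n)  # SE = σ / sqrt(n) = 15 / 6 = 2.5

# Quadrupling the sample size halves the standard error
assert sigma / math.sqrt(4 * n) == se / 2
print(se)
```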

A confidence interval gives a margin of error for a point estimate from a sample

`point estimate ± margin of error`

The margin of error is half the width of the confidence interval

Margin of error = `z_r * SE = (z_r * σ) / sqrt(n)`

Where `z_r` is the Z-score of the cut-off point for the desired confidence level
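A minimal sketch of a 95% confidence interval for a mean, using `statistics.NormalDist` to look up the critical Z-score (the sample values are hypothetical):

```python
import math
from statistics import NormalDist

x_bar = 50.0  # hypothetical sample mean
sigma = 10.0  # known population standard deviation
n = 100

z = NormalDist().inv_cdf(0.975)  # critical Z for 95% confidence, ≈ 1.96
moe = z * sigma / math.sqrt(n)   # margin of error = z_r * SE
ci = (x_bar - moe, x_bar + moe)
print(ci)  # ≈ (48.04, 51.96)
```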

WKU/David Neal - Z–scores and Confidence Intervals

**Null Hypothesis** - *H0* - the skeptical perspective

**Alternative Hypothesis** - *Ha* - the point of view under consideration

Alternative hypothesis is one of:

**One Sided** - `μ < null value` or `μ > null value`

**Two Sided** - `μ ≠ null value`

UCLA - What are the differences between one-tailed and two-tailed tests?

Notes on hypothesis construction:

- Always construct hypotheses about population parameters, not sample statistics

**Type 1 error** - rejecting a true null hypothesis

**Type 2 error** - failing to reject a false null hypothesis

Write **significance level** as α

α = P(type 1 error)

α = P(rejecting a true null hypothesis)

`p−value = P(observed or more extreme sample statistic | H0 is true)`

**p-value < significance level** - reject the null hypothesis

**p-value > significance level** - fail to reject the null hypothesis

`Z = (sample statistic - null value) / SE`

e.g. for the mean:

`Z = (bar x - mu) / (SE)`
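Putting the pieces together, a two-sided Z-test for a mean might look like this (all numbers hypothetical):

```python
import math
from statistics import NormalDist

mu_0 = 100.0   # null value
x_bar = 103.0  # observed sample mean
sigma = 12.0   # known population standard deviation
n = 64

se = sigma / math.sqrt(n)  # 12 / 8 = 1.5
z = (x_bar - mu_0) / se    # (103 - 100) / 1.5 = 2.0

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_value)  # 2.0, ≈ 0.0455 -> reject H0 at α = 0.05
```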


*Used when the sample size is small*

- Similar to normal distribution but with fat tails - more probability of values being distant from the centre.
- Used for confidence intervals / hypothesis testing on a single mean

The t-distribution is a family of distributions with a `degrees of freedom` parameter.

`degrees of freedom = n-1`

`CI = bar x ± t_(df)^** * SE`

`T_(df) = (bar x - mu) / (SE)`
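A sketch of the one-sample T statistic on made-up data. The standard library has no t-distribution CDF, so the resulting statistic would be compared against a t-table with `df` degrees of freedom (or `scipy.stats` if available):

```python
import math
import statistics

data = [5.2, 4.9, 5.4, 5.0, 4.8]  # hypothetical sample
mu_0 = 5.0                        # null value

n = len(data)
df = n - 1                        # degrees of freedom = n - 1
x_bar = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)  # SE = s / sqrt(n)
t = (x_bar - mu_0) / se

# Compare t against t_df critical values from a t-table
print(df, t)
```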

The t-distribution can be used to test the significance of the difference between two sample means when the population standard deviation is unknown

*Conditions:*

- Sampled observations are independent within groups
- The two groups should be independent
- Skew / sample size - low skew is needed; otherwise a large sample size is required

`df = min(n_1 - 1, n_2 - 1)`

`SE_((bar x_1 - bar x_2)) = sqrt( s_1^2 / n_1 + s_2^2 / n_2 )`

`CI = (bar x_1 - bar x_2) ± t_(df)^** * SE`

`T_(df) = ((bar x_1 - bar x_2) - (mu_1 - mu_2)) / (SE)`
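The two-sample formulas above, computed on hypothetical groups:

```python
import math
import statistics

g1 = [12.0, 15.0, 11.0, 14.0]      # hypothetical group 1
g2 = [10.0, 9.0, 12.0, 8.0, 11.0]  # hypothetical group 2

n1, n2 = len(g1), len(g2)
s1, s2 = statistics.stdev(g1), statistics.stdev(g2)

df = min(n1 - 1, n2 - 1)                 # conservative degrees of freedom
se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference in means
t = (statistics.mean(g1) - statistics.mean(g2)) / se  # null: mu_1 - mu_2 = 0
print(df, t)
```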

*For paired data, analyse the per-pair differences:*

`df = n_diff - 1`

`SE_diff = s_diff / sqrt(n_diff)`

`CI = bar x_diff ± t_df^** * SE`

`T_df = (bar x_diff - mu_diff) / (SE)`
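For paired data the same machinery applies to the differences; a sketch with made-up before/after measurements:

```python
import math
import statistics

before = [200.0, 185.0, 210.0, 190.0]  # hypothetical paired measurements
after_ = [195.0, 183.0, 202.0, 188.0]

diffs = [b - a for b, a in zip(before, after_)]  # [5.0, 2.0, 8.0, 2.0]
n_diff = len(diffs)
df = n_diff - 1
se = statistics.stdev(diffs) / math.sqrt(n_diff)  # SE_diff
t = statistics.mean(diffs) / se                   # null: mu_diff = 0
print(df, t)
```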

Used to compare more than two groups

- Independence - within and between groups (unless using repeated measures ANOVA)
- nearly normal distributions
- constant variance across groups - homoscedastic

| | Degrees of Freedom | Sum of Squares | Mean Squares | F Value | Pr(>F) |
|---|---|---|---|---|---|
| Group | df_G | SSG | MSG | F | p-value |
| Error (Residuals) | df_E | SSE | MSE | | |
| Total | df_T | SST | | | |

`SST = sum_(i=1)^n (y_i - bar y)^2`

- y_i - value of the response variable for each observation
- ȳ - grand mean of the response variable

`SSG = sum_(j=1)^k n_j * (bar y_j - bar y)^2`

- n_j - number of observations in group j
- ȳ_j - mean of the response variable for group j
- ȳ - grand mean of the response variable

`SSE = SST - SSG`

`df_T = n - 1`

`df_G = k - 1`

`df_E = df_T - df_G`

`MSG = SSG / df_G`

`MSE = SSE / df_E`

`F = MSG / MSE`
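The whole ANOVA computation chains together; a worked sketch on three small hypothetical groups:

```python
import statistics

# Hypothetical data: k = 3 groups, n = 9 observations in total
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]

all_vals = [y for g in groups for y in g]
n, k = len(all_vals), len(groups)
grand_mean = statistics.mean(all_vals)  # 5.0

sst = sum((y - grand_mean) ** 2 for y in all_vals)                          # 60.0
ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)  # 54.0
sse = sst - ssg                                                             # 6.0

df_g, df_e = k - 1, n - k  # df_E = df_T - df_G = (n - 1) - (k - 1) = n - k
msg, mse = ssg / df_g, sse / df_e
f = msg / mse
print(f)  # 27.0
```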

When making multiple pairwise comparisons, the significance level is adjusted (Bonferroni correction):

`alpha^** = alpha / K` where `K = (k(k-1)) / 2` is the number of pairwise comparisons

`SE = sqrt((MSE)/n_1 + (MSE)/n_2)`

`df = df_E`

Distribution of sample proportions

`hat p ~ N(mean = p, SE = sqrt((p(1-p))/n))`

Conditions:

- Independence - random sample. If sampling without replacement, n < 10% of the population
- Sample size / skew - at least 10 successes and 10 failures

point estimate ± margin of error

`hat p +- z^** SE_{hat p}`

`SE_{hat p} = sqrt((hat p(1- hat p))/n)`
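A sketch of a 95% confidence interval for a proportion, with hypothetical counts:

```python
import math
from statistics import NormalDist

successes, n = 60, 100  # hypothetical counts
p_hat = successes / n

# Success-failure condition: at least 10 successes and 10 failures
assert successes >= 10 and n - successes >= 10

se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE of the sample proportion
z = NormalDist().inv_cdf(0.975)          # critical Z for 95% confidence
ci = (p_hat - z * se, p_hat + z * se)
print(ci)  # ≈ (0.504, 0.696)
```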

**Conditions**

- Linearity - relationship between explanatory and response
- Nearly normal residuals
- Constant variability

*Fan-shaped residuals plot* - as the value of the explanatory variable increases, the variability of the response variable increases. This fails the conditions for linear regression

**Strength of fit**

`R^2`

- Square of the correlation coefficient
- The percentage of variability in the response variable explained by the model
- Always between 0 and 1
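R² as the squared correlation coefficient, computed from scratch on made-up points:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical explanatory values
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical response values

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Correlation coefficient r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² * Σ(y - ȳ)²)
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
r = num / den

r2 = r ** 2  # fraction of variability in ys explained by a linear model on xs
print(r2)    # ≈ 0.6
```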

© Will Robertson - wjsrobertson@gmail.com