Hypothesis testing

Null hypothesis (H0):
1. H0: m = μ
2. H0: m \(\leq\) μ
3. H0: m \(\geq\) μ

Alternative hypotheses (Ha): 1. Ha:m ≠ μ (different)
2. Ha:m > μ (greater)
3. Ha:m < μ (less)

Note: Hypothesis 1. are called two-tailed tests and hypotheses 2. & 3. are called one-tailed tests.

The p-value is the probability that the observed data could happen, under the condition that the null hypothesis is true.

Note: p-value is not the probability that the null hypothesis is true.
Note: Absence of evidence ⧧ evidence of absence.

Cutoffs for hypothesis testing *p < 0.05, **p < 0.01, ***p < 0.001. If p value is less than significance level alpha (0.05), the hull hypothesies is rejected.

not rejected (‘negative’) rejected (‘positive’)
H0 true True negative (specificity) False Positive (Type I error)
H0 false False Negative (Type II error) True positive (sensitivity)

Type II errors are usually more dangerous.

t-test

t-test and normal distribution

t-distribution assumes that the observations are independent and that they follow a normal distribution. If the data are dependent, then p-values will likely be totally wrong (e.g., for positive correlation, too optimistic). Type II errors?
It is good to test if observations are normally distributed. Otherwise we assume that data is normally distributed.
Independence of observations is usually not testable, but can be reasonably assumed if the data collection process was random without replacement.

FIXME: I do not understand this. Deviation data from normalyty will lead to type-I errors. I data is deviated from normal distribution, use Wilcoxon test or permutation tests.

One-sample t-test

One-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ).
t-statistics: \(t = \frac{m - \mu}{s/\sqrt{n}}\), where
m is the sample mean
n is the sample size
s is the sample standard deviation with n−1 degrees of freedom
μ is the theoretical value
Q: And what should I do with this t-statistics?
Q: What is the difference between t-test and ANOVA?
Q: What is the smallest sample size which can be tested by t-test?
Q: Show diagrams explaining why p-value of one-sided is smaller than two-sided tests.

R example:
We want to test if N is different from given mean μ=0:

N = c(-0.01, 0.65, -0.17, 1.77, 0.76, -0.16, 0.88, 1.09, 0.96, 0.25)
t.test(N, mu = 0, alternative = "less")
## 
##  One Sample t-test
## 
## data:  N
## t = 3.0483, df = 9, p-value = 0.9931
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##      -Inf 0.964019
## sample estimates:
## mean of x 
##     0.602
t.test(N, mu = 0, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  N
## t = 3.0483, df = 9, p-value = 0.01383
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.1552496 1.0487504
## sample estimates:
## mean of x 
##     0.602
t.test(N, mu = 0, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  N
## t = 3.0483, df = 9, p-value = 0.006916
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  0.239981      Inf
## sample estimates:
## mean of x 
##     0.602

FIXME: why it accepts all alternatives at the same time (less and greater?)

Two samples t-test

Do two different samples have the same mean?
H0:
1. H0: m1 - m2 = 0
2. H0: m1 - m2 \(\leq\) 0
3. H0: m1 - m2 \(\geq\) 0

Ha:
1. Ha: m1 - m2 ≠ 0 (different)
2. Ha: m1 - m2 > 0 (greater)
3. Ha: m1 - m2 < 0 (less)

The paired sample t-test has four main assumptions:

  1. The dependent variable must be continuous (interval/ratio).
  2. The observations are independent of one another.
  3. The dependent variable should be approximately normally distributed.
  4. The dependent variable should not contain any outliers.

Continuous data can take on any value within a range (income, height, weight, etc.). The opposite of continuous data is discrete data, which can only take on a few values (Low, Medium, High, etc.). Occasionally, discrete data can be used to approximate a continuous scale, such as with Likert-type scales.

t-statistics: \(t=\frac{y - x}{SE}\), where y and x are the samples means. SE is the standard error for the difference. If H0 is correct, test statistic follows a t-distribution with n+m-2 degrees of freedom (n, m the number of observations in each sample).

To apply t-test samples must be tested if they have equal variance:
equal variance (homoscedastic). Type 3 means two samples, unequal variance (heteroscedastic).

### t-test
a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

# test homogeneity of variances using Fisher’s F-test
var.test(a,b)
## 
##  F test to compare two variances
## 
## data:  a and b
## F = 2.1028, num df = 9, denom df = 9, p-value = 0.2834
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5223017 8.4657950
## sample estimates:
## ratio of variances 
##           2.102784
# variance is homogene (can use var.equal=T in t.test)

# t-test
t.test(a,b, 
       var.equal=TRUE,   # variance is homogene (tested by var.test(a,b)) 
       paired=FALSE)     # samples are independent
## 
##  Two Sample t-test
## 
## data:  a and b
## t = -0.94737, df = 18, p-value = 0.356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.93994   4.13994
## sample estimates:
## mean of x mean of y 
##     174.8     178.2

Summary of R functions for t-tests

One-sample t-test

t.test(x, mu = 0, alternative = c("two.sided", "less", "greater"), paired = FALSE, var.equal = FALSE, conf.level = 0.95)

What is that?

one-way ANOVA or 2-way ANOVA with Bonferroni multiple comparison or Dunnett’s post-test

Non-parametric tests

Mann-Whitney U Rank Sum Test

  1. The dependent variable is ordinal or continuous.
  2. The data consist of a randomly selected sample of independent observations from two independent groups.
  3. The dependent variables for the two independent groups share a similar shape.

Wilcoxon test

The Wilcoxon is a non-parametric test which works on normal and non-normal data. However, we usually prefer not to use it if we can assume that the data is normally distributed. The non-parametric test comes with less statistical power, this is a price that one has to pay for more flexible assumptions.

Tests for categorical variables

Categorical variable can take fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

Chi-squared tests

The chi-squared test is most suited to large datasets. As a general rule, the chi-squared test is appropriate if at least 80% of the cells have an expected frequency of 5 or greater. In addition, none of the cells should have an expected frequency less than 1. If the expected values are very small, categories may be combined (if it makes sense to do so) to create fewer larger categories. Alternatively, Fisher’s exact test can be used.

data = rbind(c(83,35), c(92,43))
data
##      [,1] [,2]
## [1,]   83   35
## [2,]   92   43
chisq.test(data, correct=F)
## 
##  Pearson's Chi-squared test
## 
## data:  data
## X-squared = 0.14172, df = 1, p-value = 0.7066

chisq.test(testor,correct=F) ## Fisher’s Exact test R Example:

Group TumourShrinkage-No TumourShrinkage-Yes Total
1 Treatment 8 3 11
2 Placebo 9 4 13
3 Total 17 7 24

The null hypothesis is that there is no association between treatment and tumour shrinkage.
The alternative hypothesis is that there is some association between treatment group and tumour shrinkage.

data = rbind(c(8,3), c(9,4))
data
##      [,1] [,2]
## [1,]    8    3
## [2,]    9    4
fisher.test(data)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  data
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.1456912 10.6433317
## sample estimates:
## odds ratio 
##   1.176844

The output Fisher’s exact test tells us that the probability of observing such an extreme combination of frequencies is high, our p-value is 1.000 which is clearly greater than 0.05. In this case, there is no evidence of an association between treatment group and tumour shrinkage.

Multiple testing

When performing a large number of tests, the type I error is inflated: for α=0.05 and performing n tests, the probability of no false positive result is: 0.095 x 0.95 x … (n-times) <<< 0.095
The larger the number of tests performed, the higher the probability of a false rejection!
Many data analysis approaches in genomics rely on itemby-item (i.e. multiple) testing:
Microarray or RNA-Seq expression profiles of “normal” vs “perturbed” samples: gene-by-gene
ChIP-chip: locus-by-locus
RNAi and chemical compound screens
Genome-wide association studies: marker-by-marker
QTL analysis: marker-by-marker and trait-by-trait

False positive rate (FPR) - the proportion of false positives among all resulst.

False discovery rate (FDR) - the proportion of false positives among all significant results.

Example: 20,000 genes, 100 hits, 10 of them wrong.
FPR: 0.05%
FDR: 10%

The Bonferroni correction

The Bonferroni correction sets the significance cut-off at α/n.

Avatar
Mark Goldberg
Researcher

My research interests include epigenetics and computational biology.

Related

comments powered by Disqus