Assess genotoxicity of 1,2-dimethylhydrazine dihydrochloride (DMH) (EU directive)
<- read_delim("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/dna.txt", delim = " ")
dna $dose <- as.factor(dna$dose)
dna dna
%>%
dna ggplot(aes(x = dose, y = length, fill = dose)) +
geom_boxplot() +
geom_point(position = "jitter")
%>%
dna ggplot(aes(sample = length)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~dose)
plot(lm(length ~ dose, data = dna))
The Kruskal-Wallis Rank Test (KW-test) is a non-parameteric alternative for ANOVA F-test.
Classical \(F\)-teststatistic can be written as \[ F = \frac{\text{SST}/(g-1)}{\text{SSE}/(n-g)} = \frac{\text{SST}/(g-1)}{(\text{SSTot}-\text{SST})/(n-g)} , \]
with \(g\) the number of groups.
SSTot depends only on outcomes \(\mathbf{y}\) and will not vary in permutation test.
SST can be used as statistic \[\text{SST}=\sum_{j=1}^t n_j(\bar{Y}_j-\bar{Y})^2\]
The KW test statistic is based on SST on rank-transformed outcomes1, \[ \text{SST} = \sum_{j=1}^g n_j \left(\bar{R}_j - \bar{R}\right)^2 = \sum_{j=1}^t n_j \left(\bar{R}_j - \frac{n+1}{2}\right)^2 , \]
with \(\bar{R}_j\) the mean of the ranks in group \(j\), and \(\bar{R}\) the mean of all ranks, \[ \bar{R} = \frac{1}{n}(1+2+\cdots + n) = \frac{1}{n}\frac{1}{2}n(n+1) = \frac{n+1}{2}. \]
The KW teststatistic is given by \[ KW = \frac{12}{n(n+1)} \sum_{j=1}^g n_j \left(\bar{R}_j - \frac{n+1}{2}\right)^2. \]
The factor \(\frac{12}{n(n+1)}\) is used so that \(KW\) has a simple asymptotic null distribution. In particular under \(H_0\), given thart \(\min(n_1,\ldots, n_g)\rightarrow \infty\), \[ KW \rightarrow \chi^2_{t-1}. \]
The exact KW-test can be executed by calculating the permutation null distribution (that only depends on \(n_1, \ldots, n_g\)) to test \[H_0: f_1=\ldots=f_g \text{ vs } H_1: \text{ at least two means are different}.\]
In order to allow \(H_1\) to be formulated in terms of means, the assumption of locations shift should be valid.
For DMH example this is not the case.
If location-shift is invalid, we have to formulate \(H_1\) in terms of probabilistic indices: \[H_0: f_1=\ldots=f_g \text{ vs } H_1: \exists\ j,k \in \{1,\ldots,g\} : \text{P}\left[Y_j\geq Y_k\right]\neq 0.5\]
kruskal.test(length ~ dose, data = dna)
Kruskal-Wallis rank sum test
data: length by dose
Kruskal-Wallis chi-squared = 14, df = 3, p-value = 0.002905
On the \(5\%\) level of significance we can reject the null hypothesis.
R-functie kruskal.test
only returns the asymptotic approximation for \(p\)-values.
With only 6 observaties per groep, this is not a good approximation of the \(p\)-value
With the coin
R package we can calculate the exacte \(p\)-value
library(coin)
<- kruskal_test(length ~ dose,
kwPerm data = dna,
distribution = approximate(B = 100000)
) kwPerm
Approximative Kruskal-Wallis Test
data: length by dose (0, 1.25, 2.5, 5)
chi-squared = 14, p-value = 0.00043
We conclude that the difference in the distribution of the DNA damages due to the DMH dose is extremely significantly different.
Posthoc analysis with WMW tests.
pairwise.wilcox.test(dna$length, dna$dose)
Pairwise comparisons using Wilcoxon rank sum exact test
data: dna$length and dna$dose
0 1.25 2.5
1.25 0.013 - -
2.5 0.013 0.818 -
5 0.013 0.721 0.788
P value adjustment method: holm
pairwise.wilcox.test
output. Point estimate on probability on higher DNA-damage?<- table(dna$dose)
nGroup <- combn(levels(dna$dose), 2, function(x) {
probInd <- wilcox.test(length ~ dose, subset(dna, dose %in% x))
test return(test$statistic / prod(nGroup[x]))
})names(probInd) <- combn(levels(dna$dose), 2, paste, collapse = "vs")
probInd
0vs1.25 0vs2.5 0vs5 1.25vs2.5 1.25vs5 2.5vs5
0.0000000 0.0000000 0.0000000 0.4444444 0.2777778 0.3333333
Because there are doubts on the location-shift model we draw our conclusions in terms of the probabilistic index.
we assume that no ties are available↩︎