Until now we used models for a continuous response in function of categorical or continuous predictor. Here we will focus on a categorical response.
boys <- 3175
n <- 6155
The data are derived from a binary random variable \(X\)
\(X=0\) for a girl.
Note: count problem: the outcome is a count (number of boys)
Formally we have considered a population of unborn children where each individual is characterized by a 0 or 1.
It has one model parameter \(\pi\)
The variance of Bernoulli data is also related to \(\pi\). \[\text{Var}[X_i]=\pi (1-\pi).\]
Some Bernoulli probability mass functions
pi=boys/n
pi
[1] 0.5158408
In our example \(\bar x =\) 3175 / 6155 = 51.6%.
Is the observed probability 51.6% that an unborn child is male, sufficient evidence to conclude that there is a higher chance on a boy than on a girl?
Statistical test for \[H_0: \pi=1/2 \text{ versus } H_1: \pi\neq 1/2,\]
We need known the distribution of
A random individual from the population has a probability of \[P(X=1)=\pi=1/2.\] to be a boy
Two children that are drawn independently :
Random variable \(S\) is the sum of the outcomes :
\((x_1,x_2)\) | \(s\) | \(P(S = s)\) |
---|---|---|
(0,0) | 0 | 1/4 |
(0,1), (1,0) | 1 | 1/2 |
(1,1) | 2 | 1/4 |
dbinom(k,n,p)
S is a:
Binomial distributed random variable with Binomial probability mass function
Parameters
success
for each draw).Use: Compare proportions or risks on a particular event between groups.
Test statistic \[H_0:\pi=1/2\text{ vs }H_1:\pi\neq 1/2\]
When boys and girls are equally likely, i.e. under \(H_0:\pi=1/2\), we will get following two-sided p-value: \[p=\text{P}_0\left[S-s_0\geq \vert \delta\vert \right] + \text{P}_0\left[S-s_0\leq - \vert \delta\vert \right].\]
Note, that we can rewrite it in terms of S. \[p=\text{P}_0\left[S\geq s_0+ \vert \delta\vert \right] + \text{P}_0\left[S \leq s_0 - \vert \delta\vert \right].\]
pi0 <- 0.5; s0 <- pi0 *n
delta <- abs(boys- s0)
delta
[1] 97.5
sUp <- s0 + delta
sDown <- s0 -delta
c(sDown,sUp)
[1] 2980 3175
pUp <- 1-pbinom(sUp-1,n,pi0)
pDown <- pbinom(sDown,n,pi0)
p <- pUp+pDown
c(pUp,pDown, p)
[1] 0.006699883 0.006699883 0.013399766
The test can immediately be conducted using the binomial.test
function in R.
binom.test(x=boys,n=n,p=pi0)
Exact binomial test
data: boys and n
number of successes = 3175, number of trials = 6155, p-value =
0.0134
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5032696 0.5283969
sample estimates:
probability of success
0.5158408
On the 5% significance-level we conclude that there is on average a higher probability on a unborn male child than an unborn female child.
se=sqrt(pi*(1-pi)/n)
pi+c(-1,1)*qnorm(0.975)*se
[1] 0.5033559 0.5283257
CI <- binom.test(x=boys,n=n,p=pi0)$conf.int
CI
[1] 0.5032696 0.5283969
attr(,"conf.level")
[1] 0.95
Note that the test for a proportion is equivalent to a one-sample t-test for binary data.
For the Saksen population we conclude at the 5% significance level that the gender of unborn children is more likely to be male than female (\(p=\) 0.013). The probability that an unborn child is male equals 51.6% (95% CI [50.3,52.8]%).
Each female undergoes the test twice:
and once upon stay in gentle evironment.
friendly | _agressive friendly | _non-agressive total | |
---|---|---|---|
hostile_agressive | 3 (e) | 17 (f) | 20 |
hostile_non-agressive | 1 (g) | 13 (h) | 14 |
total | 4 | 30 | 34 |
hamster <- matrix(c(3,17,1,13),ncol=2,byrow=TRUE)
rownames(hamster) <- c("hostile-agressive", "hostile-non-agressive")
colnames(hamster) <- c("friendly-agressive","friendly-non-agressive")
f=hamster[1,2]; g=hamster[2,1] ;n=sum(hamster)
riskdiff=(f-g)/n
riskdiff
[1] 0.4705882
se=sqrt(f+g-(f-g)^2/n)/n
se
[1] 0.09517144
ci<-riskdiff+c(-1,1)*qnorm(0.975)*se
ci
[1] 0.2840556 0.6571208
friendly_agressive | friendly_non-agressive | total | |
---|---|---|---|
hostile_agressive | 3 (e) | 17 (f) | 20 |
hostile_non-agressive | 1 (g) | 13 (h) | 14 |
total | 4 | 30 | 34 |
The Mc Nemar test is the analogon of the paired t-test for binary qualitative variables.
In R we can conduct the analysis using the mcnemar.test
function
mcnemar.test(hamster)
McNemar's Chi-squared test with continuity correction
data: hamster
McNemar's chi-squared = 12.5, df = 1, p-value = 0.000407
binom.test(x=f,n=f+g,p=0.5)
Exact binomial test
data: f and f + g
number of successes = 17, number of trials = 18, p-value =
0.000145
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.7270564 0.9985944
sample estimates:
probability of success
0.9444444
brca<-read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/brca.csv")
head(brca)
# A tibble: 6 x 3
cancer variant variant2
<chr> <chr> <chr>
1 control pro/pro other
2 control pro/pro other
3 control pro/pro other
4 control pro/pro other
5 control pro/pro other
6 control pro/pro other
summary(brca)
cancer variant variant2
Length:1372 Length:1372 Length:1372
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Genotype | Control | Cases | Total |
---|---|---|---|
Pro/Pro | 266 (a) | 342 (d) | 608 (a+d) |
Pro/Leu | 250 (b) | 369 (e) | 619 (b+e) |
Leu/Leu | 56 (c) | 89 (f) | 145 (c+f) |
Totaal | 572 (a+b+c) | 800 (d+e+f) | 1372 (n) |
Genotype | Controls | Cases | Total |
---|---|---|---|
Pro/Pro | 266 (a) | 342 (d) | 608 (a+d) |
Pro/Leu | 250 (b) | 369 (e) | 619 (b+e) |
Leu/Leu | 56 (c) | 89 (f) | 145 (c+f) |
Totaal | 572 (a+b+c) | 800 (d+e+f) | 1372 (n) |
with \(p\) the probability on the event.
Transformation of risk with following properties:
odds has value between 1 and \(\infty\).
odds only equals 1 for probability \(p=1/2\).
de odds increases if probability increases.
popular in gambling: how much more likely is it to win than to loose
Genotype | Controls | Cases | Total |
---|---|---|---|
Pro/Pro | 266 (a) | 342 (d) | 608 (a+d) |
Pro/Leu | 250 (b) | 369 (e) | 619 (b+e) |
Leu/Leu | 56 (c) | 89 (f) | 145 (c+f) |
Totaal | 572 (a+b+c) | 800 (d+e+f) | 1372 (n) |
Odds on allel Leu/Leu
Controls: \(\mbox{odds}_2=c/(a+b)=56/516=0.109\).
Association between exposure and outcome: \[ OR_{Leu/Leu}=\frac{\mbox{odds}_T}{\mbox{odds}_C}= \frac{f/(d+e)}{c/(a+b)}=\frac{f/(d+e)}{c/(a+b)}=1.15 \]
Genotype | Controls | Cases | Total |
---|---|---|---|
Pro/Pro | 266 (a) | 342 (d) | 608 (a+d) |
Pro/Leu | 250 (b) | 369 (e) | 619 (b+e) |
Leu/Leu | 56 (c) | 89 (f) | 145 (c+f) |
Totaal | 572 (a+b+c) | 800 (d+e+f) | 1372 (n) |
Is the difference large enough to generalize the effect in the sample towards the population?
We will first rewrite the data in a 2x2 table
Genotype | Controls | Cases | Totaal |
---|---|---|---|
other | 516 (a) | 711 (c) | 1227 (a+c) |
Leu/Leu | 56 (b) | 89 (d) | 145 (b+d) |
Total | 572 (a+b) | 800 (c+d) | 1372 (n) |
Test association between categorical exposure(e.g. variant, X) en categorical response (e.g. disease, Y). \[H_0: \text{There is no association between } X \text{ and } Y \text{ vs } H_1: X \text{ and } Y \text{ are associated}\]
We can calculate this expected number \(E_{ij}\) under the null hypothesis for each cell of the \(2 \times 2\) table.
\(E_{11}\) = Expected number under \(H_0\) in (1,1)-cell = 1227 \(\times\) 800/1372 = 715.5 ;
\(E_{12}\) = Expected number under \(H_0\) in (1,2)-cell = 1227 \(\times\) 572/1372 = 511.5 ;
\(E_{21}\) = Expected number under \(H_0\) in (2,1)-cell = 145 \(\times\) 800/1372 = 84.55 ;
\(E_{22}\) = Expected number under \(H_0\) in (2,2)-cell = 145 \(\times\) 572/1372 = 60.45 ;
expected <- matrix(0,nrow=2,ncol=2)
for (i in 1:2)
for (j in 1:2)
expected[i,j] <-
sum(brcaTab2[i,])*sum(brcaTab2[,j])/sum(brcaTab2)
expected
[,1] [,2]
[1,] 715.4519 511.5481
[2,] 84.5481 60.4519
x2 <- sum((abs(brcaTab2-expected) - .5)^2/expected)
1-pchisq(x2,1)
[1] 0.481519
In R you can switch the correction on or off by setting the argument correct
on TRUE or FALSE, respectively:
chisq.test(brcaTab2)
Pearson's Chi-squared test with Yates' continuity correction
data: brcaTab2
X-squared = 0.49542, df = 1, p-value = 0.4815
chisq.test(brcaTab2,correct=FALSE)
Pearson's Chi-squared test
data: brcaTab2
X-squared = 0.62871, df = 1, p-value = 0.4278
fisher.test(brcaTab2)
Fisher's Exact Test for Count Data
data: brcaTab2
p-value = 0.4764
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5974269 1.2501878
sample estimates:
odds ratio
0.8670925
The \(\chi^2\)-test can also be used if one of the categorical variables \(X\) and \(Y\) has more than 2 levels
Again: the null hypothesis \(H_0\): \(X\) and \(Y\) are independent (not associated), against the alternative \(H_A: X\) and \(Y\) are associated.
Let the variable in the rows have \(r\) different possible outcomes and the one in the columns \(c\) possible outcomes, then we obtain an \(r \times c\) table.
Again \(E_{ij}\) is the product of the \(i^\text{th}\) row-total and the \(j^\text{th}\) column total devided by the overall total.
brcaTab <- table(brca$variant,brca$cancer)
chisq.test(brcaTab)
Pearson's Chi-squared test
data: brcaTab
X-squared = 2.0551, df = 2, p-value = 0.3579
Model binary data with continuous and/or dummy variables.
breast cancer example: is BRCA 1 variant associated with breast cancer.
1 dummy variable less than the number of groups.
For BRCA 1 we need two dummy variables and obtain the following linear predictor:
with : \[x_{i1} = \left\{ \begin{array}{ll} 1 & \text{ if subject $i$ is heterozygous, Pro/Leu variant} \\ 0 & \text{if subject $i$ is homozygous, (Pro/Pro or Leu/Leu variant)} \end{array}\right. .\] \[x_{i2} = \left\{ \begin{array}{ll} 1 & \text{ if subject $i$ is homozygous is the Leucine mutation: Leu/Leu } \\ 0 & \text{ of subject $i$ is not homozygous in the Leu/Leu variant} \end{array}\right. .\]
Homozygosity in the wild type allele Pro/Pro is the reference group.
We fit the model in R.
as.factor
to convert the cancer variable to a factor variable. - We further use the function relevel
to specify the control treatment as the reference group.brca$cancer<-brca$cancer %>% as.factor %>% relevel("control")
brca$variant<-brca$variant %>% as.factor %>% relevel("pro/pro")
brcaLogit <- glm(cancer~variant,data=brca,family=binomial)
summary(brcaLogit)
Call:
glm(formula = cancer ~ variant, family = binomial, data = brca)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.379 -1.286 1.017 1.017 1.073
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.25131 0.08175 3.074 0.00211 **
variantleu/leu 0.21197 0.18915 1.121 0.26243
variantpro/leu 0.13802 0.11573 1.193 0.23302
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1863.9 on 1371 degrees of freedom
Residual deviance: 1861.9 on 1369 degrees of freedom
AIC: 1867.9
Number of Fisher Scoring iterations: 4
The intercept is the log-odds on cancer in the reference class (Pro/Pro) and the slope terms are log odds ratios between treatment and reference class:
\[\begin{eqnarray*}
\log \text{ODDS}_\text{Pro/Pro}&=&\beta_0\\\\
\log \text{ODDS}_\text{Pro/Leu}&=&\beta_0+\beta_1\\\\
\log \text{ODDS}_\text{Leu/Leu}&=&\beta_0+\beta_2\\\\
\log \frac{\text{ODDS}_\text{Pro/Leu}}{\text{ODDS}_\text{Pro/Pro}}&=&\log \text{ODDS}_\text{Pro/Leu}-\log ODDS_{Pro/Pro}\\
&=&\beta_0+\beta_1-\beta_0=\beta_1\\\\
\log \frac{\text{ODDS}_\text{Leu/Leu}}{\text{ODDS}_\text{Pro/Pro}}&=&\beta_2
\end{eqnarray*}\]
anova(brcaLogit,test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: cancer
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1371 1863.9
variant 2 2.0562 1369 1861.9 0.3577
The \(\chi^2\)-test on the logsitic regression model also indicates that there is no significant association between cancer status and the genetic variant of the BRCA gene (\(p=\) 0.358). De p-value is almost equivalent to the one of the \(\chi^2\)-test (see previous section).
Significant association?
suppressPackageStartupMessages({library(multcomp)})
posthoc <- glht(brcaLogit,linfct=mcp(variant = "Tukey"))
posthocTests <- summary(posthoc)
posthocTests
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = cancer ~ variant, family = binomial, data = brca)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
leu/leu - pro/pro == 0 0.21197 0.18915 1.121 0.493
pro/leu - pro/pro == 0 0.13802 0.11573 1.193 0.449
pro/leu - leu/leu == 0 -0.07395 0.18922 -0.391 0.917
(Adjusted p values reported -- single-step method)
posthocCI <- confint(posthoc)
posthocCI
Simultaneous Confidence Intervals
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = cancer ~ variant, family = binomial, data = brca)
Quantile = 2.3254
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
leu/leu - pro/pro == 0 0.21197 -0.22787 0.65181
pro/leu - pro/pro == 0 0.13802 -0.13110 0.40714
pro/leu - leu/leu == 0 -0.07395 -0.51396 0.36606
confint
function CI are obtained on the log-odds ratios corrected for multiple testing.OR <- exp(posthocCI$confint)
OR
Estimate lwr upr
leu/leu - pro/pro 1.2361111 0.7962251 1.919018
pro/leu - pro/pro 1.1480000 0.8771317 1.502515
pro/leu - leu/leu 0.9287191 0.5981245 1.442040
attr(,"conf.level")
[1] 0.95
attr(,"calpha")
[1] 2.32541
e.g \(\text{OR}_\text{Leu/Leu-Pro/Pro}=89\times 266/(56\times 342)=\) 1.236.
Note, that statistical inference for logistic regression relies on asymptotic theory.
beetles<-read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/beetles.csv")
head(beetles)
# A tibble: 6 x 2
dose status
<dbl> <dbl>
1 169. 1
2 169. 0
3 169. 0
4 169. 0
5 172. 1
6 172. 0
table(beetles$dose,beetles$status)
0 1
169.07 3 1
172.42 3 1
175.52 3 1
178.42 2 2
181.13 1 3
183.69 0 4
186.1 0 4
188.39 0 4
We build a model for the log odds in function of the dose \(x_i\): \[\log \frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 \times x_i.\]
beatleModel<-glm(status~dose,data=beetles,family=binomial)
summary(beatleModel)
Call:
glm(formula = status ~ dose, family = binomial, data = beetles)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7943 -0.7136 0.2825 0.5177 2.1670
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -53.1928 18.0046 -2.954 0.00313 **
dose 0.3013 0.1014 2.972 0.00296 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 42.340 on 31 degrees of freedom
Residual deviance: 26.796 on 30 degrees of freedom
AIC: 30.796
Number of Fisher Scoring iterations: 5
Note that this is a very large extrapolation: minimum dose in dataset is 169.07 mg/l.
So a beatle exposed to a CS\(_2\) concentration that is 1 mg/l larger than another beatle, will on average have an odds ratio on mortality of \(1.35\).
beetlesTab<-table(beetles) %>% data.frame
data.frame(grid=seq(min(beetles$dose),max(beetles$dose),.1)) %>%
mutate(piHat=predict(beatleModel,
newdata=data.frame(dose=grid),
type="response")) %>%
ggplot(aes(grid,piHat))+
geom_line() +
xlab("dose") +
ylab("probability (dead)") +
geom_text(aes(x=dose%>%as.character%>%as.double,y=status%>%as.character%>%as.double,label=Freq),beetlesTab%>%filter(status==0)) +
geom_text(aes(x=dose%>%as.character%>%as.double,y=status%>%as.character%>%as.double,label=Freq),beetlesTab%>%filter(status==1))