Subset of study https://doi.org/10.1093/jnci/djj052
32 breast cancer patients with estrogen receptor positive tumour that had tamoxifen chemotherapy. Variables:
ESR1 in active in \(\pm\) 75% of breast cancer tumours.
Proteins of S100 family often dysregulated in cancer
brca <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/breastcancer.csv")
brca
# A tibble: 32 x 10
sample_name filename treatment er grade node size age ESR1 S100A8
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 OXFT_209 gsm65344.ce… tamoxifen 1 3 1 2.5 66 1939. 207.
2 OXFT_1769 gsm65345.ce… tamoxifen 1 1 1 3.5 86 2752. 37.0
3 OXFT_2093 gsm65347.ce… tamoxifen 1 1 1 2.2 74 379. 2364.
4 OXFT_1770 gsm65348.ce… tamoxifen 1 1 1 1.7 69 2532. 23.6
5 OXFT_1342 gsm65350.ce… tamoxifen 1 3 0 2.5 62 141. 3219.
6 OXFT_2338 gsm65352.ce… tamoxifen 1 3 1 1.4 63 1495. 108.
7 OXFT_2341 gsm65353.ce… tamoxifen 1 1 1 3.3 76 3406. 14.0
8 OXFT_1902 gsm65354.ce… tamoxifen 1 3 0 2.4 61 2813. 68.4
9 OXFT_1982 gsm65355.ce… tamoxifen 1 1 0 1.7 62 950. 74.2
10 OXFT_5210 gsm65356.ce… tamoxifen 1 3 0 3.5 65 1053. 182.
# … with 22 more rows
Is the expression of the S100A8 gene associated with that of that of the ESR1 gene?
There are many variables in the data, we will plot all variables in a scatterplot matrix using the GGally package.
#install.packages("GGally")
library(GGally)
brca[,-(1:4)] %>% ggpairs()
We now focus on the association between S100A8 expression and the ESR1 expression.
brca %>%
ggplot(aes(x=ESR1,y=S100A8)) +
geom_point() +
geom_smooth(se=FALSE,col="grey") +
geom_smooth(method="lm",se=FALSE)
The association of S100A and ESR1 does not appear to be linear.
brca %>%
ggplot(aes(x=ESR1 %>% log2,y=S100A8 %>% log2)) +
geom_point() +
geom_smooth(se=FALSE,col="grey") +
geom_smooth(method="lm",se=FALSE)
Upon log transformation the data are showing a linear association. Is this association strong enough to be able to conclude that the S100A8 gene expression is associated to the ESR1 gene expression?
We first calculate the Pearson and Spearman correlation based on the original data and the log2 transformed data
Pearson correlation
brca %>%
select(S100A8,ESR1) %>%
mutate(S100A8Log2=S100A8%>%log2,ESR1Log2=ESR1%>%log2) %>%
cor
S100A8 ESR1 S100A8Log2 ESR1Log2
S100A8 1.0000000 -0.5376479 0.8080424 -0.7717210
ESR1 -0.5376479 1.0000000 -0.7762524 0.9151322
S100A8Log2 0.8080424 -0.7762524 1.0000000 -0.8911990
ESR1Log2 -0.7717210 0.9151322 -0.8911990 1.0000000
Spearman correlation
brca %>%
select(S100A8,ESR1) %>%
mutate(S100A8Log2=S100A8%>%log2,ESR1Log2=ESR1%>%log2) %>%
cor(,method = "spearman")
S100A8 ESR1 S100A8Log2 ESR1Log2
S100A8 1.000000 -0.733871 1.000000 -0.733871
ESR1 -0.733871 1.000000 -0.733871 1.000000
S100A8Log2 1.000000 -0.733871 1.000000 -0.733871
ESR1Log2 -0.733871 1.000000 -0.733871 1.000000
Both the Pearson and Spearman correlation are negative. Note, that the Spearman correlation is much larger in absolute value than the Pearson correlation on the original expression measurements. Indeed, the Pearson correlation is affected by the non linear association at the original scale.
On the log scale the Pearson correlation is much higher. The spearman correlation, however, remains because it is based on rank transformed data.
We will model the data on the log2 scale and we assume the following statistical model:
\[Y_i\vert X_i\sim N(\beta_0+\beta_1X_i,\sigma^2)\]
with \(Y_i\) the log2 transformed S100A8 gene expression and \(X_i\) the log2 transformed ESR1 gene expression.
We fit the model to the data using the lm function and we first assess the assumptions. Because the subjects were selected at random from the population they are independent. We still have to check the following assumptions.
lm2<-lm(S100A8%>%log2 ~ ESR1 %>% log2, brca)
plot(lm2)
summary(lm2)
Call:
lm(formula = S100A8 %>% log2 ~ ESR1 %>% log2, data = brca)
Residuals:
Min 1Q Median 3Q Max
-1.94279 -0.66537 0.08124 0.68468 1.92714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.401 1.603 14.60 3.57e-15 ***
ESR1 %>% log2 -1.615 0.150 -10.76 8.07e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared: 0.7942, Adjusted R-squared: 0.7874
F-statistic: 115.8 on 1 and 30 DF, p-value: 8.07e-12
In the residuals vs fitted values plot we observe that the residuals are nicely spread around zero with more or less the same variance and we do not observed a trend in the residuals so there is no indication on deviations from linearity and homoscedasticity.
The QQ-plot further shows no substantial deviations from normality. So the assumptions hold.
We can now assess a formal hypothesis test to for the linear association between the S100A8 gene and the ESR1 gene expression at the log2 scale.
We can translate the hypothesis that there is an association between the S100A8 and ESR1 gene expression in terms of the slope of the model, so under the alternative hypothesis \(\beta_1\) is different from zero. \[H_1: \beta_1 \neq 0\]
With data we can never prove a hypothesis, so we therefore falsify the opposite: the null hypothesis the there is no association between the expression of both genes:
\[H_0: \beta_0 = 0\] We can do this by adopting a t-test on the slope and by using confidence intervals on the slope.
summary(lm2)
Call:
lm(formula = S100A8 %>% log2 ~ ESR1 %>% log2, data = brca)
Residuals:
Min 1Q Median 3Q Max
-1.94279 -0.66537 0.08124 0.68468 1.92714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.401 1.603 14.60 3.57e-15 ***
ESR1 %>% log2 -1.615 0.150 -10.76 8.07e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared: 0.7942, Adjusted R-squared: 0.7874
F-statistic: 115.8 on 1 and 30 DF, p-value: 8.07e-12
confint(lm2)
2.5 % 97.5 %
(Intercept) 20.128645 26.674023
ESR1 %>% log2 -1.921047 -1.308185
Both the t-test and the confidence interval indicate that association is extremely significant. The interval moreover shows that the association is biologically relevant.
We will transform slope and the confidence interval back to the original scale to interpret the results in terms of fold changes.
2^(lm2$coef)
(Intercept) ESR1 %>% log2
1.107908e+07 3.265519e-01
2^confint(lm2)
2.5 % 97.5 %
(Intercept) 1.146373e+06 1.070733e+08
ESR1 %>% log2 2.640628e-01 4.038287e-01
There is an extremely significant negative association between the S100A8 gene expression and that of ESR1 (\(p<<0.001\)).
A patient with an ESR1 expression that is 2 times the expression of that of another patient will on average have an S100A8 expression that is 3.06 times lower (95% CI [2.48,3.79]).