Extend simple linear regression to multiple predictors.
The prostate dataset contains prostate specific antigen (PSA) measurements and a number of clinical variables for 97 males who underwent a radical prostatectomy.
Goal: study the association of PSA with these clinical variables.
prostate <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/prostate.csv")
prostate
lcavol <dbl> | lweight <dbl> | age <dbl> | lbph <dbl> | svi <chr> | lcp <dbl> | gleason <dbl> | pgg45 <chr> | lpsa <dbl> |
---|---|---|---|---|---|---|---|---|
-0.579818495 | 2.769459 | 50 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | -0.4307829 |
-0.994252273 | 3.319626 | 58 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | -0.1625189 |
-0.510825624 | 2.691243 | 74 | -1.38629436 | healthy | -1.38629436 | 7 | 20 | -0.1625189 |
-1.203972804 | 3.282789 | 58 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | -0.1625189 |
0.751416089 | 3.432373 | 62 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | 0.3715636 |
-1.049822124 | 3.228826 | 50 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | 0.7654678 |
0.737164066 | 3.473518 | 64 | 0.61518564 | healthy | -1.38629436 | 6 | healthy | 0.7654678 |
0.693147181 | 3.539509 | 58 | 1.53686722 | healthy | -1.38629436 | 6 | healthy | 0.8544153 |
-0.776528789 | 3.539509 | 47 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | 1.0473190 |
0.223143551 | 3.244544 | 63 | -1.38629436 | healthy | -1.38629436 | 6 | healthy | 1.0473190 |
prostate$svi <- as.factor(prostate$svi)
Separate simple linear models, like $E(Y|X_v)=\alpha+\beta_v X_v$, cannot adjust for the other variables. We therefore extend to the multiple linear regression model
$$Y_i=\beta_0+\beta_1 X_{i1}+\ldots+\beta_{p-1}X_{i,p-1}+\epsilon_i$$
The model allows us to assess the association between the outcome and a predictor while adjusting for the other predictors.
Interpretation of $\beta_j$: the average difference in outcome for subjects that differ one unit in predictor $x_j$ but have the same value for all other predictors.
lmV <- lm(lpsa ~ lcavol, data = prostate)
summary(lmV)
Call:
lm(formula = lpsa ~ lcavol, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.67624 -0.41648 0.09859 0.50709 1.89672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.50730 0.12194 12.36 <2e-16 ***
lcavol 0.71932 0.06819 10.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7875 on 95 degrees of freedom
Multiple R-squared: 0.5394, Adjusted R-squared: 0.5346
F-statistic: 111.3 on 1 and 95 DF, p-value: < 2.2e-16
lmVWS <- lm(lpsa ~ lcavol + lweight + svi, data = prostate)
summary(lmVWS)
Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.72966 -0.45767 0.02814 0.46404 1.57012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.26807 0.54350 -0.493 0.62301
lcavol 0.55164 0.07467 7.388 6.3e-11 ***
lweight 0.50854 0.15017 3.386 0.00104 **
sviinvasion 0.66616 0.20978 3.176 0.00203 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
If the data are representative, then the least squares estimators for the intercept and slopes are unbiased: $E[\hat{\beta}_j]=\beta_j,\ j=0,\ldots,p-1$.
Gain insight in the distribution of the parameter estimators so as to generalize the effect in the sample to the population.
Additional assumptions are needed for inference.
Linearity
Independence
Homoscedasticity or equal variance: the residuals have constant variance.
Normality: the residuals $\epsilon_i$ are normally distributed.
Under these assumptions $\epsilon_i\sim N(0,\sigma^2)$ and $Y_i\sim N(\beta_0+\beta_1 X_{i1}+\ldots+\beta_{p-1}X_{i,p-1},\sigma^2)$.
Slopes are again more precise if the predictor values have a larger range.
Conditional variance (σ2) can again be estimated based on the mean squared error (MSE):
$$\hat{\sigma}^2=MSE=\frac{\sum_{i=1}^n\left(y_i-\hat{\beta}_0-\hat{\beta}_1 X_{i1}-\ldots-\hat{\beta}_{p-1}X_{i,p-1}\right)^2}{n-p}=\frac{\sum_{i=1}^n e_i^2}{n-p}.$$
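To make the estimator concrete, here is a small illustrative sketch in Python (an assumption-laden simulation, not part of the original R analysis) that fits a least-squares plane to simulated data and estimates $\sigma^2$ by dividing the residual sum of squares by $n-p$:

```python
# Illustrative sketch: estimate sigma^2 via MSE = SSE / (n - p)
# on simulated data (true sigma = 0.8, so true sigma^2 = 0.64).
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3                      # n observations, p parameters (intercept + 2 slopes)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.3])
y = X @ beta + rng.normal(scale=0.8, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
mse = np.sum(residuals**2) / (n - p)   # unbiased estimate of sigma^2
print(round(mse, 3))
```

The MSE should be close to the true $\sigma^2=0.64$, up to sampling variability.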
Hypothesis tests and confidence intervals are again based on $$T_k=\frac{\hat{\beta}_k-\beta_k}{SE(\hat{\beta}_k)}\quad\text{with } k=0,\ldots,p-1.$$
If all assumptions are satisfied, the statistics $T_k$ follow a t-distribution with $n-p$ degrees of freedom.
When normality does not hold, but linearity, independence and homoscedasticity are valid, we can again rely on the central limit theorem, which states that the statistic $T_k$ is approximately normally distributed in large samples.
We can build confidence intervals on the slopes by: $$\left[\hat{\beta}_j-t_{n-p,\alpha/2}SE_{\hat{\beta}_j},\ \hat{\beta}_j+t_{n-p,\alpha/2}SE_{\hat{\beta}_j}\right].$$
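As a sanity check, the interval reported by `confint` below can be recomputed by hand from the `summary(lmVWS)` coefficient table. A short Python sketch for the lcavol slope (the quantile 1.9858 is an approximation of $t_{93,0.975}$):

```python
# Recompute the 95% CI for lcavol: beta_hat +/- t_{n-p, 0.975} * SE.
beta_hat, se = 0.55164, 0.07467   # estimate and SE from summary(lmVWS)
t_quantile = 1.9858               # approx. 97.5% quantile of t with 93 df
lower = beta_hat - t_quantile * se
upper = beta_hat + t_quantile * se
print(round(lower, 4), round(upper, 4))   # -> 0.4034 0.6999, matching confint()
```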
confint(lmVWS)
2.5 % 97.5 %
(Intercept) -1.3473509 0.8112061
lcavol 0.4033628 0.6999144
lweight 0.2103288 0.8067430
sviinvasion 0.2495824 1.0827342
Formal hypothesis tests: $H_0:\beta_j=0$ versus $H_1:\beta_j\neq 0$,
with test statistic $$T=\frac{\hat{\beta}_j-0}{SE(\hat{\beta}_j)},$$ which follows a t-distribution with $n-p$ degrees of freedom under $H_0$.
summary(lmVWS)
Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.72966 -0.45767 0.02814 0.46404 1.57012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.26807 0.54350 -0.493 0.62301
lcavol 0.55164 0.07467 7.388 6.3e-11 ***
lweight 0.50854 0.15017 3.386 0.00104 **
sviinvasion 0.66616 0.20978 3.176 0.00203 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
plot(lmVWS)
The previous model is additive because the contribution of the cancer volume to lpsa does not depend on the log prostate weight or the svi status.
The slope for lcavol does not depend on log prostate weight and svi.
$$\beta_0+\beta_v(x_v+\delta_v)+\beta_w x_w+\beta_s x_s-\beta_0-\beta_v x_v-\beta_w x_w-\beta_s x_s=\beta_v\delta_v$$
The svi status and the log prostate weight ($x_w$) do not influence the contribution of the log tumour volume ($x_v$) to the average log PSA, and vice versa.
To model this interaction or effect modification we can add a product term of both variables to the model:
$$Y_i=\beta_0+\beta_v x_{iv}+\beta_w x_{iw}+\beta_s x_{is}+\beta_{vw} x_{iv}x_{iw}+\epsilon_i$$
This term quantifies the interaction effect of predictors $x_v$ and $x_w$ on the mean outcome.
The terms $\beta_v x_{iv}$ and $\beta_w x_{iw}$ are referred to as the main effects of predictors $x_v$ and $x_w$.
The difference in lpsa for patients that differ one unit in $X_v$ but have the same log prostate weight and the same svi status now becomes:
$$\begin{aligned}
&E(Y|X_v=x_v+1,X_w=x_w,X_s=x_s)-E(Y|X_v=x_v,X_w=x_w,X_s=x_s)\\
&=\beta_0+\beta_v(x_v+1)+\beta_w x_w+\beta_s x_s+\beta_{vw}(x_v+1)x_w\\
&\quad-\beta_0-\beta_v x_v-\beta_w x_w-\beta_s x_s-\beta_{vw}x_v x_w\\
&=\beta_v+\beta_{vw}x_w
\end{aligned}$$
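Plugging in the coefficients from the interaction model fitted below (`summary(lmVWS_IntVW)`), a small Python sketch shows how the lcavol slope $\beta_v+\beta_{vw}x_w$ changes with log prostate weight:

```python
# The lcavol effect under the interaction model depends on lweight:
# slope(lweight) = beta_v + beta_vw * lweight.
beta_v, beta_vw = 1.0046, -0.1246   # lcavol and lcavol:lweight estimates
for lweight in (3.0, 4.0):
    print(lweight, round(beta_v + beta_vw * lweight, 4))
# -> 3.0 0.6308 and 4.0 0.5062: the slope shrinks as lweight grows
```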
lmVWS_IntVW <- lm(lpsa ~ lcavol + lweight + svi + lcavol:lweight, data = prostate)
summary(lmVWS_IntVW)
Call:
lm(formula = lpsa ~ lcavol + lweight + svi + lcavol:lweight,
data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.65886 -0.44673 0.02082 0.50244 1.57457
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6430 0.7030 -0.915 0.36278
lcavol 1.0046 0.5427 1.851 0.06734 .
lweight 0.6146 0.1961 3.134 0.00232 **
sviinvasion 0.6859 0.2114 3.244 0.00164 **
lcavol:lweight -0.1246 0.1478 -0.843 0.40156
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7179 on 92 degrees of freedom
Multiple R-squared: 0.6293, Adjusted R-squared: 0.6132
F-statistic: 39.05 on 4 and 92 DF, p-value: < 2.2e-16
Interaction between lcavol ↔ svi and lweight ↔ svi.
The model becomes
$$Y=\beta_0+\beta_v X_v+\beta_w X_w+\beta_s X_s+\beta_{vs}X_v X_s+\beta_{ws}X_w X_s+\epsilon$$
lmVWS_IntVS_WS <- lm(lpsa ~ lcavol + lweight + svi + svi:lcavol + svi:lweight, data = prostate)
summary(lmVWS_IntVS_WS)
Call:
lm(formula = lpsa ~ lcavol + lweight + svi + svi:lcavol + svi:lweight,
data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.50902 -0.44807 0.06455 0.45657 1.54354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.52642 0.56793 -0.927 0.356422
lcavol 0.54060 0.07821 6.912 6.38e-10 ***
lweight 0.58292 0.15699 3.713 0.000353 ***
sviinvasion 3.43653 1.93954 1.772 0.079771 .
lcavol:sviinvasion 0.13467 0.25550 0.527 0.599410
lweight:sviinvasion -0.82740 0.52224 -1.584 0.116592
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7147 on 91 degrees of freedom
Multiple R-squared: 0.6367, Adjusted R-squared: 0.6167
F-statistic: 31.89 on 5 and 91 DF, p-value: < 2.2e-16
Because $X_s$ is a dummy variable we obtain two distinct regression planes:
The total sum of squares is again $$SSTot=\sum_{i=1}^n(Y_i-\bar{Y})^2.$$
The residual sum of squares also keeps the same form: $SSE=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2$.
Again the total sum of squares can be decomposed as $SSTot=SSR+SSE$, with $SSR=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2$.
We have the following degrees of freedom and mean sums of squares:
The coefficient of determination remains as before, i.e. $R^2=1-\frac{SSE}{SSTot}=\frac{SSR}{SSTot}$, the fraction of the total variability that can be explained by the regression model.
The test statistic $F=MSR/MSE$ follows, under $H_0:\beta_1=\ldots=\beta_{p-1}=0$, an F distribution $F_{p-1;n-p}$.
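The overall F statistic in the output below can be reconstructed from the reported sums of squares. A Python sketch (SSE and $R^2$ taken from the `summary`/`Anova` output of `lmVWS` shown in this section):

```python
# Reconstruct F = MSR/MSE from SSE, R^2 and the degrees of freedom.
# SSTot follows from R^2 = 1 - SSE/SSTot.
sse, r2 = 47.785, 0.6264
n_minus_p, p_minus_1 = 93, 3
sstot = sse / (1 - r2)
ssr = sstot - sse
f_stat = (ssr / p_minus_1) / (sse / n_minus_p)
print(round(f_stat, 1))   # close to the reported F = 51.99
```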
Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.72966 -0.45767 0.02814 0.46404 1.57012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.26807 0.54350 -0.493 0.62301
lcavol 0.55164 0.07467 7.388 6.3e-11 ***
lweight 0.50854 0.15017 3.386 0.00104 **
sviinvasion 0.66616 0.20978 3.176 0.00203 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
Consider two models for the predictors $x_1$ and $x_2$: $Y_i=\beta_0+\beta_1 x_{i1}+\epsilon_i$, with $\epsilon_i$ iid $N(0,\sigma_1^2)$, and $Y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\epsilon_i$, with $\epsilon_i$ iid $N(0,\sigma_2^2)$.
For the first (reduced) model we have the decomposition $SSTot=SSR_1+SSE_1$ and for the second (non-reduced) model $SSTot=SSR_2+SSE_2$ ($SSTot$ is of course the same because it only depends on the response and not on the models).
Definition of the additional sum of squares: the additional sum of squares of predictor $x_2$ as compared to the model with only $x_1$ as predictor is given by $$SSR_{2|1}=SSE_1-SSE_2=SSR_2-SSR_1.$$
Note that $SSE_1-SSE_2=SSR_2-SSR_1$ holds trivially because of the decomposition of the total sum of squares.
The additional sum of squares $SSR_{2|1}$ can simply be interpreted as the additional variability that is explained by adding predictor $x_2$ to the model with predictor $x_1$.
With this sum of squares we can further decompose the total sum of squares: $SSTot=SSR_1+SSR_{2|1}+SSE$, with $SSE$ the residual sum of squares of the full model; this follows directly from the definition of $SSR_{2|1}$.
Extension: compare ($s<p-1$) $Y_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_s x_{is}+\epsilon_i$ with $\epsilon_i$ iid $N(0,\sigma_1^2)$, to ($s<q\leq p-1$) $Y_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_s x_{is}+\beta_{s+1}x_{i,s+1}+\cdots+\beta_q x_{iq}+\epsilon_i$ with $\epsilon_i$ iid $N(0,\sigma_2^2)$.
The additional sum of squares of predictors $x_{s+1},\ldots,x_q$ compared to a model with only predictors $x_1,\ldots,x_s$ is given by
$$SSR_{s+1,\ldots,q|1,\ldots,s}=SSE_1-SSE_2=SSR_2-SSR_1.$$
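The identity $SSE_1-SSE_2=SSR_2-SSR_1$ can be checked numerically on any dataset. A small Python sketch fitting a reduced and a full least-squares model to simulated data:

```python
# Verify that SSE1 - SSE2 = SSR2 - SSR1 for nested least-squares models.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.8 * x1 + 0.4 * x2 + rng.normal(size=n)

def sums_of_squares(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return np.sum((y - fitted) ** 2), np.sum((fitted - y.mean()) ** 2)

X1 = np.column_stack([np.ones(n), x1])       # reduced model
X2 = np.column_stack([np.ones(n), x1, x2])   # full model
sse1, ssr1 = sums_of_squares(X1, y)
sse2, ssr2 = sums_of_squares(X2, y)
print(abs((sse1 - sse2) - (ssr2 - ssr1)) < 1e-8)   # -> True
```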
Suppose that $p-1$ predictors are considered, and consider the following sequence of models ($s=2,\ldots,p-1$): $Y_i=\beta_0+\sum_{j=1}^s\beta_j x_{ij}+\epsilon_i$ with $\epsilon_i$ iid $N(0,\sigma^2)$.
For the model with $s=p-1$ we can show that $$SSTot=SSR_1+SSR_{2|1}+SSR_{3|1,2}+\cdots+SSR_{p-1|1,\ldots,p-2}+SSE,$$ with $SSE$ the residual sum of squares of the model with all $p-1$ predictors, and $$SSR_1+SSR_{2|1}+SSR_{3|1,2}+\cdots+SSR_{p-1|1,\ldots,p-2}=SSR,$$ with $SSR$ the regression sum of squares of the model with all $p-1$ predictors.
The type III sum of squares for predictor $x_j$ is given by the additional sum of squares $$SSR_{j|1,\ldots,j-1,j+1,\ldots,p-1}=SSE_1-SSE_2.$$
The type III sum of squares $SSR_{j|1,\ldots,j-1,j+1,\ldots,p-1}$ quantifies the part of the total variability of the outcome explained by $x_j$ that cannot be explained by the remaining $p-2$ predictors.
The type III sum of squares has 1 degree of freedom because it involves a single $\beta$-parameter.
For each type III SSR term the mean sum of squares is defined by $MSR_{j|1,\ldots,j-1,j+1,\ldots,p-1}=SSR_{j|1,\ldots,j-1,j+1,\ldots,p-1}/1$.
The test statistic $F=MSR_{j|1,\ldots,j-1,j+1,\ldots,p-1}/MSE$ is $F_{1;n-p}$ distributed under $H_0:\beta_j=0$.
In R we can calculate the type III sums of squares with the Anova function from the car package.
library(car)
Anova(lmVWS, type = 3)
Sum Sq <dbl> | Df <dbl> | F value <dbl> | Pr(>F) <dbl> | |
---|---|---|---|---|
(Intercept) | 0.1250021 | 1 | 0.2432815 | 6.230088e-01 |
lcavol | 28.0445759 | 1 | 54.5808864 | 6.303883e-11 |
lweight | 5.8923334 | 1 | 11.4677711 | 1.039115e-03 |
svi | 5.1813959 | 1 | 10.0841312 | 2.029035e-03 |
Residuals | 47.7849616 | 93 | NA | NA |
The p-values are identical to those of the two-sided t-tests.
Note, however, that all dummies for factors with multiple levels will be taken out of the model at once. So then the type III sum of squares will have as many degrees of freedom as the number of dummies and an omnibus test is performed for the effect of the factor.
Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.72966 -0.45767 0.02814 0.46404 1.57012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.26807 0.54350 -0.493 0.62301
lcavol 0.55164 0.07467 7.388 6.3e-11 ***
lweight 0.50854 0.15017 3.386 0.00104 **
sviinvasion 0.66616 0.20978 3.176 0.00203 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
Call:
lm(formula = lpsa ~ lcavol + lweight + svi + lcavol:lweight,
data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.65886 -0.44673 0.02082 0.50244 1.57457
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6430 0.7030 -0.915 0.36278
lcavol 1.0046 0.5427 1.851 0.06734 .
lweight 0.6146 0.1961 3.134 0.00232 **
sviinvasion 0.6859 0.2114 3.244 0.00164 **
lcavol:lweight -0.1246 0.1478 -0.843 0.40156
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7179 on 92 degrees of freedom
Multiple R-squared: 0.6293, Adjusted R-squared: 0.6132
F-statistic: 39.05 on 4 and 92 DF, p-value: < 2.2e-16
Estimates are different from those in the additive model and the standard errors are much higher!
This is caused by the multicollinearity problem.
If two predictors are strongly correlated, then they share a lot of information.
It is therefore difficult to estimate the individual contribution of each predictor on the outcome.
The least squares estimators become unstable.
Standard errors become inflated.
As long as we only make predictions based on the regression model, without extrapolating beyond the range of the predictors observed in the sample, multicollinearity is not problematic.
But for inference it is problematic.
cor(cbind(prostate$lcavol,prostate$lweight,prostate$lcavol*prostate$lweight))
[,1] [,2] [,3]
[1,] 1.0000000 0.1941283 0.9893127
[2,] 0.1941283 1.0000000 0.2835608
[3,] 0.9893127 0.2835608 1.0000000
For parameter $j$ in the regression model $VIF_j=(1-R_j^2)^{-1}$, where $R_j^2$ is the coefficient of determination of the regression of predictor $x_j$ on the remaining predictors.
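The formula can be made concrete with a small Python sketch (an illustrative simulation, not the bodyfat data): regress one predictor on the others, take the $R^2$ of that fit, and invert $1-R^2$:

```python
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the other predictors; x2 is built to be strongly correlated with x1.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=n)

def vif(target, others):
    X = np.column_stack([np.ones(len(target))] + others)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

vif_x2 = vif(x2, [x1])
print(round(vif_x2, 1))   # well above 1 because of the strong correlation
```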
Call:
lm(formula = Body_fat ~ Triceps + Thigh + Midarm, data = bodyfat)
Residuals:
Min 1Q Median 3Q Max
-3.7263 -1.6111 0.3923 1.4656 4.1277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 117.085 99.782 1.173 0.258
Triceps 4.334 3.016 1.437 0.170
Thigh -2.857 2.582 -1.106 0.285
Midarm -2.186 1.595 -1.370 0.190
Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
vif(lmFat)
Triceps Thigh Midarm
708.8429 564.3434 104.6060
Call:
lm(formula = Midarm ~ Triceps + Thigh, data = bodyfat)
Residuals:
Min 1Q Median 3Q Max
-0.58200 -0.30625 0.02592 0.29526 0.56102
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.33083 1.23934 50.29 <2e-16 ***
Triceps 1.88089 0.04498 41.82 <2e-16 ***
Thigh -1.60850 0.04316 -37.26 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.377 on 17 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9893
F-statistic: 880.7 on 2 and 17 DF, p-value: < 2.2e-16
We evaluate the VIF in the prostate cancer example for the additive model and for the model with the interaction.
vif(lmVWS)
lcavol lweight svi
1.447048 1.039188 1.409189
vif(lmVWS_IntVW)
lcavol lweight svi lcavol:lweight
76.193815 1.767121 1.426646 80.611657
set.seed(112358)
nobs <- 20
sdy <- 1
x <- seq(0, 1, length = nobs)
y <- 10 + 5*x + rnorm(nobs, sd = sdy)
x1 <- c(x, 0.5)
y1 <- c(y, 10 + 5*1.5 + rnorm(1, sd = sdy))
x2 <- c(x, 1.5)
y2 <- c(y, y1[21])
x3 <- c(x, 1.5)
y3 <- c(y, 11)
plot(x, y, xlim = range(c(x1, x2, x3)), ylim = range(c(y1, y2, y3)))
points(c(x1[21], x2[21], x3[21]), c(y1[21], y2[21], y3[21]), pch = as.character(1:3), col = 2:4)
abline(lm(y ~ x), lwd = 2)
abline(lm(y1 ~ x1), col = 2, lty = 2, lwd = 2)
abline(lm(y2 ~ x2), col = 3, lty = 3, lwd = 2)
abline(lm(y3 ~ x3), col = 4, lty = 4, lwd = 2)
legend("topleft", col = 1:4, lty = 1:4, legend = paste("lm", c("", as.character(1:3))), text.col = 1:4)
It is not desirable that a single observation largely influences the result of a linear regression analysis.
Diagnostics allow us to detect extreme observations:
Studentized residuals to spot outliers.
Leverage to spot observations with an extreme covariate pattern.
A statistic to assess the influence of a single observation on the regression analysis:
Cook's distance for observation $i$ is a diagnostic measure for the influence of that particular observation on all predictions, or equivalently on all estimated parameters: $$D_i=\frac{\sum_{j=1}^n(\hat{Y}_j-\hat{Y}_{j(i)})^2}{p\,MSE}$$
Observation i has a large influence on the regression parameters and predictions if the Cook’s distance Di is large.
The Cook's distance is considered extreme if it is larger than the 50% quantile of an $F_{p+1,n-(p+1)}$ distribution.
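The definition can be evaluated directly with a leave-one-out refit. A Python sketch on simulated data (illustrative only; in R one would simply read Cook's distances off `plot(lm(...))`):

```python
# Direct leave-one-out computation of Cook's distance for one observation:
# D_i = sum_j (yhat_j - yhat_j(i))^2 / (p * MSE).
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2                     # simple linear regression: p = 2 parameters
x = rng.uniform(size=n)
y = 2 + 3 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
mse = np.sum((y - fitted) ** 2) / (n - p)

i = 0                            # Cook's distance for observation i
keep = np.arange(n) != i
beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
fitted_i = X @ beta_i            # predictions from the fit without observation i
cooks_d = np.sum((fitted - fitted_i) ** 2) / (p * mse)
print(cooks_d >= 0)              # -> True; large values flag influential points
```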
Suppose that researchers want to assess the association between age and blood pressure.
Possibly this association will differ for females and males.
They want to assess following hypotheses:
We fit a model for the average systolic blood pressure (BPSysAve) using age, gender and the age-by-gender interaction, for adult Caucasians from the NHANES study.
library(NHANES)
bpData <- NHANES %>%
  filter(
    Race1 == "White" &
    Age >= 18 &
    !is.na(BPSysAve)
  )
mBp1 <- lm(BPSysAve ~ Age*Gender, bpData)
par(mfrow = c(2,2))
plot(mBp1)
The assumptions are not fulfilled!
We fit a model using the log2 transformed average systolic blood pressure (BPSysAve).
mBp2 <- lm(BPSysAve %>% log2 ~ Age*Gender, bpData)
par(mfrow = c(2,2))
plot(mBp2)
If the residual plot shows a cone shape, we can still obtain valid inference for large samples by modelling the variance explicitly as a function of the fitted response.
The inverse variance for each observation can then be used as a weight in the lm function.
mSd <- lm(mBp1$res %>% abs ~ mBp2$fitted)
We estimate the model again:
mBp3 <- lm(BPSysAve ~ Age*Gender, bpData, w = 1/mSd$fitted^2)
Residuals are still heteroscedastic.
data.frame(residuals = mBp3$residuals, fit = mBp3$fitted) %>%
ggplot(aes(fit,residuals)) +
geom_point()
But we accounted for that using weights! Note that if we rescale the residuals using the standard deviation (multiplying them with the square root of the weight) we obtain rescaled residuals that are homoscedastic.
The model parameters are estimated using weighted least squares:
$$SSE=\sum_{i=1}^n w_i e_i^2$$
with $w_i=1/\hat{\sigma}_i^2$.
Weighted regression will correct for heteroscedasticity.
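Minimising $\sum_i w_i e_i^2$ is equivalent to ordinary least squares after rescaling each row by $\sqrt{w_i}$. A Python sketch on simulated heteroscedastic data (illustrative only, not the NHANES analysis):

```python
# Weighted least squares via row rescaling: multiply X and y by sqrt(w_i),
# then solve ordinary least squares on the rescaled data.
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, size=n)
sigma_i = 0.2 * x                         # standard deviation grows with x
y = 1 + 2 * x + rng.normal(scale=sigma_i)
w = 1 / sigma_i ** 2                      # inverse-variance weights

X = np.column_stack([np.ones(n), x])
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(np.round(beta_wls, 1))              # close to the true (1, 2)
```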
data.frame(scaled_residuals = mBp3$residuals/mSd$fitted, fit = mBp3$fitted) %>%
ggplot(aes(fit,scaled_residuals)) +
geom_point()
summary(mBp3)
Call:
lm(formula = BPSysAve ~ Age * Gender, data = bpData, weights = 1/mSd$fitted^2)
Weighted Residuals:
Min 1Q Median 3Q Max
-4.3642 -0.8494 -0.0940 0.7605 6.5701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.59709 0.63501 153.693 < 2e-16 ***
Age 0.44082 0.01505 29.294 < 2e-16 ***
Gendermale 13.36724 1.09017 12.262 < 2e-16 ***
Age:Gendermale -0.19115 0.02420 -7.899 3.45e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.319 on 4828 degrees of freedom
Multiple R-squared: 0.2182, Adjusted R-squared: 0.2178
F-statistic: 449.3 on 3 and 4828 DF, p-value: < 2.2e-16
The research questions translate into the following null hypotheses:
Association between blood pressure and age for females? $H_0:\beta_{Age}=0$ vs $H_1:\beta_{Age}\neq 0$
Association between blood pressure and age for males? $H_0:\beta_{Age}+\beta_{Age:Gendermale}=0$ vs $H_1:\beta_{Age}+\beta_{Age:Gendermale}\neq 0$
Is the association between blood pressure and age different for females and males? $H_0:\beta_{Age:Gendermale}=0$ vs $H_1:\beta_{Age:Gendermale}\neq 0$
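The male slope is the linear combination $\beta_{Age}+\beta_{Age:Gendermale}$, so its estimate follows directly from the coefficient table of `summary(mBp3)`. A one-line Python check:

```python
# The male age slope is the contrast beta_Age + beta_Age:Gendermale,
# with coefficient values taken from the summary(mBp3) output.
b_age = 0.44082
b_interaction = -0.19115
male_slope = b_age + b_interaction
print(round(male_slope, 5))   # -> 0.24967, matching the glht estimate below
```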
We can again use an Anova approach.
$H_0:\beta_{Age}=\beta_{Age:Gendermale}=0$
mBp0 <- lm(BPSysAve ~ Gender, bpData, w = 1/mSd$fitted^2)
anova(mBp0, mBp3)
Res.Df <dbl> | RSS <dbl> | Df <dbl> | Sum of Sq <dbl> | F <dbl> | Pr(>F) <dbl> | |
---|---|---|---|---|---|---|
1 | 4830 | 10200.529 | NA | NA | NA | NA |
2 | 4828 | 8404.529 | 2 | 1796 | 515.858 | 9.123215e-204 |
For the post-hoc tests we will again build on the multcomp package.
library(multcomp)
bpPosthoc <- glht(mBp3, linfct = c(
  "Age = 0",
  "Age + Age:Gendermale = 0",
  "Age:Gendermale = 0"))
bpPosthoc %>% summary
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = BPSysAve ~ Age * Gender, data = bpData, weights = 1/mSd$fitted^2)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Age == 0 0.44082 0.01505 29.294 <1e-10 ***
Age + Age:Gendermale == 0 0.24967 0.01895 13.175 <1e-10 ***
Age:Gendermale == 0 -0.19115 0.02420 -7.899 <1e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
bpPosthocCI <- bpPosthoc %>% confint
bpPosthocCI
Simultaneous Confidence Intervals
Fit: lm(formula = BPSysAve ~ Age * Gender, data = bpData, weights = 1/mSd$fitted^2)
Quantile = 2.3154
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
Age == 0 0.4408 0.4060 0.4757
Age + Age:Gendermale == 0 0.2497 0.2058 0.2936
Age:Gendermale == 0 -0.1911 -0.2472 -0.1351
Note that the glht function allows us to define the contrasts by explicitly writing out the null hypotheses using the names of the model parameters.
We can conclude that the association between age and blood pressure is extremely significant (p << 0.001).
The blood pressure of females that differ in age is on average 0.44 mm Hg per year of age difference higher for the older female; this association is extremely significant (p << 0.001, 95% CI [0.41, 0.48]).
The blood pressure of males that differ in age is on average 0.25 mm Hg per year of age difference higher for the older male; this association is extremely significant (p << 0.001, 95% CI [0.21, 0.29]).
The average blood pressure difference between subjects that differ in age is 0.19 mm Hg per year larger for females than for males (p << 0.001, 95% CI [0.14, 0.25]).