Smelly armpits are not caused by sweat itself. The smell is caused by specific micro-organisms belonging to the group of Corynebacterium spp. that metabolise sweat. Another group of abundant bacteria are the Staphylococcus spp., these bacteria do not metabolise sweat in smelly compounds.
The CMET-groep at Ghent University does research on transplanting the armpit microbiome to save people with smelly armpits.
Experiment:
read_lines("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/armpit.csv")
[1] "trt,rel" "placebo,54.99207606973059"
[3] "placebo,31.84466019417476" "placebo,41.09947643979057"
[5] "placebo,59.52063914780293" "placebo,63.573407202216075"
[7] "placebo,41.48648648648649" "placebo,30.44041450777202"
[9] "placebo,42.95676429567643" "placebo,41.7391304347826"
[11] "placebo,33.896515311510036" "transplant,57.218124341412015"
[13] "transplant,72.50900360144058" "transplant,61.89258312020461"
[15] "transplant,56.690140845070424" "transplant,76"
[17] "transplant,71.7357910906298" "transplant,57.757296466973884"
[19] "transplant,65.1219512195122" "transplant,67.53424657534246"
[21] "transplant,77.55359394703657"
The file is comma separated and in tidy format
ap<-read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/armpit.csv")
ap
# A tibble: 20 x 2
trt rel
<chr> <dbl>
1 placebo 55.0
2 placebo 31.8
3 placebo 41.1
4 placebo 59.5
5 placebo 63.6
6 placebo 41.5
7 placebo 30.4
8 placebo 43.0
9 placebo 41.7
10 placebo 33.9
11 transplant 57.2
12 transplant 72.5
13 transplant 61.9
14 transplant 56.7
15 transplant 76
16 transplant 71.7
17 transplant 57.8
18 transplant 65.1
19 transplant 67.5
20 transplant 77.6
We first summarize the data and calculate the mean, standard deviation, number of observations and standard error and store the result in an object apRelSum via ’apRelSum<-`
ap
dataframe to the group_by function to group the data by treatment trt group_by(trt)
summarize_at
function to summarize the “rel” variable and calculate the mean, standard deviation and the number of observationsmutate
function to make a new variable in the data frame se
for which we calculate the standard errorapRelSum<-ap%>%
group_by(trt)%>%
summarize_at("rel",
list(mean=~mean(.,na.rm=TRUE),
sd=~sd(.,na.rm=TRUE),
n=function(x) x%>%is.na%>%`!`%>%sum)) %>%
mutate(se=sd/sqrt(n))
apRelSum
# A tibble: 2 x 5
trt mean sd n se
<chr> <dbl> <dbl> <int> <dbl>
1 placebo 44.2 11.5 10 3.65
2 transplant 66.4 7.88 10 2.49
We will use ggplot2 to make our plots. With the ggplot2 library we can easily build plots by adding layers.
We pipe our summarized data to the ggplot
function and we select the treatment variable trt and the variable mean for plotting aes(x=trt,y=mean)
We make a barplot based on this data using the geom_bar
function. The statistic is stat="identity"
because the bar height should be equal the value for the mean of the relative abundance.
apRelSum%>%
ggplot(aes(x=trt,y=mean)) +
geom_bar(stat="identity")
We will now add standard errors to the plot using geom_errorbar
function and specify the minimum and maximum value for of the error bar, the width command is used to set the width of the error bar smaller than the width of the bar.
apRelSum%>%
ggplot(aes(x=trt,y=mean)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=mean-se,ymax=mean+se),width=.2)
I consider barplots to be bad plots
It is better to get a view on the distribution of the data. We can use a boxplot for this purpose. We first explain what a boxplot.
We will now make a boxplot for the ap data
ap
dataframe to the ggplot commandggplot(aes(x=trt,y=rel))
geom_boxplot()
ap %>%
ggplot(aes(x=trt,y=rel)) +
geom_boxplot()
Note, that we do not have so many observations.
It is always better to show the data as raw as possible!
We will now add the raw data to the plot.
geom_point(position="jitter")
, with the argument position=‘jitter’ we will add some random noise to the x coordinate so that we can see all data.ap %>%
ggplot(aes(x=trt,y=rel)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter")
This is an informative plot!
Why do we need more than one student per group?
What are the meaning and the consequences of the term “randomly”?
Consider two designs:
A research assistant selects 20 male students with smelly armpits from his faculty.
A research assistant selects 20 random people with smelly armpits from the Belgian population.
ap %>%
ggplot(aes(x=trt,y=rel)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter") +
ylim(0,100)
ap2 %>%
ggplot(aes(x=trt,y=rel)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter") +
ylim(0,100)
Design 1: smaller variability
Design 2: larger variability and lower relative abundance of Staphylococcus
What is the best design?
Random sampling is closely related to the concept of the population or the scope of the study.
Based on a sample of subjects, the researchers want to come to conclusions that hold for
Scope of the study should be well specified before the start of the study.
For the statistical analysis to be valid, it is required that the subjects are selected completely at random from the population to which we want to generalize our conclusions.
The sample is thus supposed to be representative for the population, but still it is random.
What does this imply?
National Health NHanes study
library(NHANES)
head(NHANES)
# A tibble: 6 x 76
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
<int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
1 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch~
2 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch~
3 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch~
4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some Col~
6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
# ... with 67 more variables: MaritalStatus <fct>, HHIncome <fct>,
# HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
# Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>,
# BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
# BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
# BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
# Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>,
# UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>,
# DiabetesAge <int>, HealthGen <fct>, DaysPhysHlthBad <int>,
# DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>,
# nPregnancies <int>, nBabies <int>, Age1stBaby <int>,
# SleepHrsNight <int>, SleepTrouble <fct>, PhysActive <fct>,
# PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
# TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
# AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
# Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
# RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
# SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>,
# SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>
glimpse(NHANES)
Observations: 10,000
Variables: 76
$ ID <int> 51624, 51624, 51624, 51625, 51630, 51638, 516...
$ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, ...
$ Gender <fct> male, male, male, male, female, male, male, f...
$ Age <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, ...
$ AgeDecade <fct> 30-39, 30-39, 30-39, 0-9, 40-49, 0-9, ...
$ AgeMonths <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 5...
$ Race1 <fct> White, White, White, Other, White, White, Whi...
$ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Education <fct> High School, High School, High School, NA, So...
$ MaritalStatus <fct> Married, Married, Married, NA, LivePartner, N...
$ HHIncome <fct> 25000-34999, 25000-34999, 25000-34999, 20000-...
$ HHIncomeMid <int> 30000, 30000, 30000, 22500, 40000, 87500, 600...
$ Poverty <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.0...
$ HomeRooms <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 1...
$ HomeOwn <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own...
$ Work <fct> NotWorking, NotWorking, NotWorking, NA, NotWo...
$ Weight <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75....
$ Length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Height <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130...
$ BMI <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20....
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ BMI_WHO <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 3...
$ Pulse <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 6...
$ BPSysAve <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 11...
$ BPDiaAve <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 7...
$ BPSys1 <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 10...
$ BPDia1 <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 7...
$ BPSys2 <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 11...
$ BPDia2 <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 7...
$ BPSys3 <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 11...
$ BPDia3 <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 7...
$ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ DirectChol <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12,...
$ TotChol <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82,...
$ UrineVol1 <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 10...
$ UrineFlow1 <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1...
$ UrineVol2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ UrineFlow2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, N...
$ DiabetesAge <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ HealthGen <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vg...
$ DaysPhysHlthBad <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA...
$ DaysMentHlthBad <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0,...
$ LittleInterest <fct> Most, Most, Most, NA, Several, NA, NA, None, ...
$ Depressed <fct> Several, Several, Several, NA, Several, NA, N...
$ nPregnancies <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, N...
$ nBabies <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA...
$ Age1stBaby <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, N...
$ SleepHrsNight <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA,...
$ SleepTrouble <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, N...
$ PhysActive <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Ye...
$ PhysActiveDays <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1,...
$ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ TVHrsDayChild <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, ...
$ CompHrsDayChild <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, ...
$ Alcohol12PlusYr <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
$ AlcoholDay <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, ...
$ AlcoholYear <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104...
$ SmokeNow <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, ...
$ Smoke100 <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Y...
$ Smoke100n <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, N...
$ SmokeAge <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, N...
$ Marijuana <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
$ AgeFirstMarij <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 1...
$ RegularMarij <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Y...
$ AgeRegMarij <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2...
$ HardDrugs <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, N...
$ SexEver <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
$ SexAge <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 2...
$ SexNumPartnLife <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 1...
$ SexNumPartYear <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA...
$ SameSex <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, N...
$ SexOrientation <fct> Heterosexual, Heterosexual, Heterosexual, NA,...
$ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
Suppose that we are interested in assessing the difference in direct cholesterol levels between males and females older than 25 years.
filter
to filter the data according to age.ggplot(aes(x=DirectChol))
geom_histogram()
facet_grid(Gender~.)
xlab
command.NHANES%>%filter(Age>25)%>%
ggplot(aes(x=DirectChol))+
geom_histogram() +
facet_grid(Gender~.) +
xlab("Direct cholesterol (mg/dl)")
NHANES%>%
filter(Age>25)%>%
ggplot(aes(x=DirectChol%>%log2))+
geom_histogram() +
facet_grid(Gender~.) +
xlab("Direct cholesterol (log2)")
We see that the data are more or less bell shaped upon log transformation.
We will now create a subset of the data that we will use to sample from in the next sections.
nhanesSub<- NHANES%>%
filter(Age>25&!is.na(DirectChol)) %>%
select(c("Gender","DirectChol")) %>%
mutate(cholLog=log2(DirectChol))
We will calculate the summary statistics for the cholLog variable for males and females in the large dataset. So we group by Gender
cholLogSum<- nhanesSub %>%
group_by(Gender) %>%
summarize_at("cholLog",
list(mean=~mean(.,na.rm=TRUE),
sd=~sd(.,na.rm=TRUE),
n=function(x) x%>%is.na%>%`!`%>%sum)) %>%
mutate(se=sd/sqrt(n))
cholLogSum
# A tibble: 2 x 5
Gender mean sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 female 0.553 0.408 3099 0.00733
2 male 0.238 0.400 3025 0.00727
fem<-nhanesSub%>%filter(Gender=="female")%>%sample_n(size=10)
mal<-nhanesSub%>%filter(Gender=="male")%>%sample_n(size=10)
samp<-rbind(fem,mal)
samp
# A tibble: 20 x 3
Gender DirectChol cholLog
<fct> <dbl> <dbl>
1 female 1.6 0.678
2 female 1.4 0.485
3 female 1.68 0.748
4 female 1.4 0.485
5 female 1.32 0.401
6 female 1.16 0.214
7 female 1.94 0.956
8 female 1.55 0.632
9 female 1.58 0.660
10 female 1.29 0.367
11 male 0.72 -0.474
12 male 1.89 0.918
13 male 2.48 1.31
14 male 2.38 1.25
15 male 0.98 -0.0291
16 male 0.88 -0.184
17 male 0.91 -0.136
18 male 1.73 0.791
19 male 1.22 0.287
20 male 1.22 0.287
We will now plot the data with a histogram and boxplots
samp %>%
ggplot(aes(x=cholLog))+
geom_histogram(binwidth = .1) +
facet_grid(Gender~.) +
xlab("Direct cholesterol (log2)")
samp%>%
ggplot(aes(x=Gender,y=cholLog)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter")
We summarize the data
samp %>%
group_by(Gender) %>%
summarize_at("cholLog",
list(mean=~mean(.,na.rm=TRUE),
sd=~sd(.,na.rm=TRUE),
n=function(x) x%>%is.na%>%`!`%>%sum)) %>%
mutate(se=sd/sqrt(n))
# A tibble: 2 x 5
Gender mean sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 female 0.563 0.215 10 0.0679
2 male 0.402 0.630 10 0.199
Note that the sample mean is different from that of the large experiment (“population”) we sampled from.
We test for the difference between Males and females
t.test(cholLog~Gender,samp,var.equal=TRUE)
Two Sample t-test
data: cholLog by Gender
t = 0.76309, df = 18, p-value = 0.4553
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2817359 0.6031382
sample estimates:
mean in group female mean in group male
0.5627670 0.4020659
If we do the experiment again we select other people and we obtain different results.
fem<-nhanesSub%>%filter(Gender=="female")%>%sample_n(size=10)
mal<-nhanesSub%>%filter(Gender=="male")%>%sample_n(size=10)
samp2<-rbind(fem,mal)
samp2%>%
ggplot(aes(x=DirectChol%>%log))+
geom_histogram(binwidth = .1) +
facet_grid(Gender~.) +
xlab("Direct cholesterol (log)")
samp2%>%
ggplot(aes(x=Gender,y=cholLog)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter")
samp2 %>%
group_by(Gender) %>%
summarize_at("cholLog",
list(mean=~mean(.,na.rm=TRUE),
sd=~sd(.,na.rm=TRUE),
n=function(x) x%>%is.na%>%`!`%>%sum)) %>%
mutate(se=sd/sqrt(n))
# A tibble: 2 x 5
Gender mean sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 female 0.457 0.584 10 0.185
2 male 0.359 0.502 10 0.159
t.test(cholLog~Gender,samp2,var.equal=TRUE)
Two Sample t-test
data: cholLog by Gender
t = 0.40094, df = 18, p-value = 0.6932
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4139006 0.6091374
sample estimates:
mean in group female mean in group male
0.4566768 0.3590584
set.seed(12857)
fem<-nhanesSub%>%filter(Gender=="female")%>%sample_n(size=10)
mal<-nhanesSub%>%filter(Gender=="male")%>%sample_n(size=10)
samp3<-rbind(fem,mal)
samp3%>%
ggplot(aes(x=DirectChol%>%log))+
geom_histogram(binwidth = .1) +
facet_grid(Gender~.) +
xlab("Direct cholesterol (log)")
samp3%>%
ggplot(aes(x=Gender,y=cholLog)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter")
samp3 %>%
group_by(Gender) %>%
summarize_at("cholLog",
list(mean=~mean(.,na.rm=TRUE),
sd=~sd(.,na.rm=TRUE),
n=function(x) x%>%is.na%>%`!`%>%sum)) %>%
mutate(se=sd/sqrt(n))
# A tibble: 2 x 5
Gender mean sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 female 0.259 0.328 10 0.104
2 male 0.638 0.365 10 0.115
t.test(cholLog~Gender,samp3,var.equal=TRUE)
Two Sample t-test
data: cholLog by Gender
t = -2.4449, df = 18, p-value = 0.02501
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7049891 -0.0533427
sample estimates:
mean in group female mean in group male
0.2585913 0.6377572
Note, that in the last sample the log cholesterol levels are on average lower for females than for males; based on this sample we even would wrongly conclude that the cholesterol levels for females are on average larger than those of males.
This implies that our conclusions are also subjected to uncertainty and might change from sample to sample.
This is illustrated with the code below, where we will draw 20000 repeated samples with sample size 10 for females and males from the NHanes study.
nsim<-20000
nSamp<-10
res<-matrix(0,nrow=nsim,ncol=2)
fem<-nhanesSub%>%filter(Gender=="female")
mal<-nhanesSub%>%filter(Gender=="male")
for (i in 1:nsim)
{
femSamp<-sample(fem$cholLog,nSamp)
malSamp<-sample(mal$cholLog,nSamp)
meanFem<-mean(femSamp)
meanMal<-mean(malSamp)
delta<-meanFem-meanMal
sdFem<-sd(femSamp)
sdMal<-sd(malSamp)
seFem<-sdFem/sqrt(nSamp)
seFem<-sdFem/sqrt(nSamp)
sdPool<-sqrt((sdFem^2*(nSamp-1) + sdMal^2*(nSamp-1))/(2*nSamp-2))
tvalue<-(delta)/(sdPool*sqrt(1/nSamp+1/nSamp))
pvalue<-pt(abs(tvalue),lower.tail = FALSE,df=2*nSamp-2)*2
res[i,]<-c(delta,pvalue)
}
sum(res[,2]<0.05&res[,1]>0)
[1] 7783
sum(res[,2]>0.05)
[1] 12214
sum(res[,2]<0.05&res[,1]<0)
[1] 3
res<-res %>% as.data.frame
names(res) <- c("delta","pvalue")
res %>%
ggplot(aes(x=delta,y=-log10(pvalue),color=pvalue<0.05)) +
geom_point() +
xlab("Average cholesterol difference") +
ylab("- log10(pvalue)") +
scale_color_manual(values=c("black","red"))
res %>%
ggplot(aes(y=delta)) +
geom_boxplot() +
geom_point(aes(x=0,y=c(mean(fem$cholLog)-mean(mal$cholLog)),color="pop. diff")) +
xlab("")
Only in 3 out of 20000 samples we conclude that the mean cholesterol level of males is significantly lower than for females. For the remaining samples the cholesterol levels for males were on average significantly lower than for females (7783 samples) or the average difference in cholesterol levels were not statistically significant (12214 samples). The latter is because the power is rather low to detect the difference with 10 samples in each group.
nfip<-data.frame(group=c("cases","control","noConcent"),grade=c("g2","g1g3","g2"),vaccin=c("yes","no","no"),total=c(221998,725173,123605),polio=c(54,391,56))
nfip$noPolio<-nfip$total-nfip$polio
knitr::kable(nfip)
group | grade | vaccin | total | polio | noPolio |
---|---|---|---|---|---|
cases | g2 | yes | 221998 | 54 | 221944 |
control | g1g3 | no | 725173 | 391 | 724782 |
noConcent | g2 | no | 123605 | 56 | 123549 |
Compare polio incidence?
What can we conclude?
We observe a lower polio (P) incidence for children for who no consent was given than for the children in the control group.
Consent for vaccination (V) was associated with the socio-economic status (S).
Children of lower socio-economic status were more resistant to the disease.
A new study was conducted: Randomized double blind study
salk<-data.frame(group=c("cases","control","noConcent"),treatment=c("vaccine","placebo","none"),total=c(200745,
201229, 338778),polio=c(57,142,157))
salk$noPolio<-salk$total-salk$polio
salk$incidencePM<-round(salk$polio/salk$total*1e6,0)
knitr::kable(salk)
group | treatment | total | polio | noPolio | incidencePM |
---|---|---|---|---|---|
cases | vaccine | 200745 | 57 | 200688 | 284 |
control | placebo | 201229 | 142 | 201087 | 706 |
noConcent | none | 338778 | 157 | 338621 | 463 |
We observe a much larger effect now that the cases and the controls are comparable, incidence of 284 and 706 per million, respectively.
The polio incidence for children with no consent remains similar, and 463 per million in the NFIP and Salk study, respectively.
Nature: biological process which we study
Deduction: deduce logical consequences of the theory/hypothesis that can be experimentally validated.
Experiment: collect data on the process. The data are a manifestation of the real process. The experiment has to be representative and reproducible and should challenge the theory. Experimental Design!
Statistical inference: Bridge to confront the model/hypothesis to the data \(\rightarrow\) Cornerstone of the scientific method.
Falsification principle: Data cannot be used to prove a model/hypothesis, only to reject it.
Data-exploration and analysis is typically used to refine the theory and to generate new hypotheses.
\(\rightarrow\) Good experimental design is crucial!
We also observed that there is variability in the population and because we can only sample a small part of the population our results and conclusions are subjected to uncertainty.
Therefore, statistics plays an important role in almost all sciences (e.g. column “points of significance” in Nature Methods. http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html)