In this tutorial, you will learn how to import, tidy, wrangle and visualize data yourself! You will work with one specific dataset;
The FEV dataset
The forced expiratory volume, FEV, is a measure of how much air a person can exhale (in liter) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, were measured. The dataset also provides additional information on these children: their age
, their height
, their gender
and, most importantly, whether the child is a smoker or a non-smoker.
The goal of this experiment was to find out whether or not smoking has an effect on the FEV of children.
Note: to analyse this dataset properly, we will need some relatively advanced modelling techniques. At the end of this week, you will have seen all three required steps to analyse such a dataset! For now, we will limit ourselves to exploring the data.
Load the required libraries
Import the data
fev <- read_tsv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/fev.txt")
## Parsed with column specification:
## cols(
## age = col_double(),
## fev = col_double(),
## height = col_double(),
## gender = col_character(),
## smoking = col_double()
## )
## # A tibble: 6 x 5
## age fev height gender smoking
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 9 1.71 57 f 0
## 2 8 1.72 67.5 f 0
## 3 7 1.72 54.5 f 0
## 4 9 1.56 53 m 0
## 5 9 1.90 57 m 0
## 6 8 2.34 61 f 0
There are a few things in the formatting of the data that can be improved upon:
- Both the
gender
and smoking
can be transformed to factors.
- The
height
variable is written in inches. Assuming that this audience is mainly Portuguese/Belgian, inches are hard to interpret. Let’s add a new column, height_cm
, with the values converted to centimeter using the mutate
function.
fev <- fev %>%
mutate(gender = as.factor(gender)) %>%
mutate(smoking = as.factor(smoking)) %>%
mutate(height_cm = height*2.54)
head(fev)
## # A tibble: 6 x 6
## age fev height gender smoking height_cm
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 9 1.71 57 f 0 145.
## 2 8 1.72 67.5 f 0 171.
## 3 7 1.72 54.5 f 0 138.
## 4 9 1.56 53 m 0 135.
## 5 9 1.90 57 m 0 145.
## 6 8 2.34 61 f 0 155.
That’s better!
Data visualization
Now, let’s make a first explorative plot, showing only the FEV for both smoking categories.
Which type of plot do you suggest?
fev %>%
ggplot(aes(x=smoking,y=fev,fill=smoking)) +
scale_fill_manual(values=c("dimgrey","firebrick")) +
theme_bw() +
geom_boxplot(outlier.shape=NA) +
geom_jitter(width = 0.2, size=0.1) +
ggtitle("Boxplot of FEV versus smoking") +
ylab("fev (l)") +
xlab("smoking status")
Did you expect these results?
It appears that children that smoke have a higher median FEV than children that do not smoke. Should we change legislations worldwide and make smoking obligatory for children?
Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age") +
ylab("fev (l)") +
xlab("smoking status")
## Warning: Ignoring unknown parameters: width
This plot seems to already give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age “category”, it seems no longer the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.
This shows that taking into account important confounders (in this case) is crucial! If we simply analysed based on the smoking status and FEV values only, we probably would’ve obtained completely incorrect results.
Can provide an even better visualization of the data, taking into account more useful explanatory variables with respect to the FEV?
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
ylab("fev (l)") +
xlab("smoking status") +
facet_grid(rows = vars(gender))
## Warning: Ignoring unknown parameters: width
This plot holds one extra level of information, the gender of the child. Especially for higher ages, the median FEV is higher for males as compared to females.
The only source of information that is lacking is height
. To look at the effect of height, we could simply make a scatterplot displaying the FEV in function of a child’s height (in cm). Additionally, we could color the dots based on gender.
fev %>%
ggplot(aes(x=height_cm,y=fev,color=gender)) +
geom_point() +
scale_color_manual(values=c("darkorchid","olivedrab4")) +
theme_bw() +
ggtitle("Boxplot of FEV versus height") +
ylab("fev (l)") +
xlab("height (cm)")
There is a clear relationship between height and FEV. In addition, we see that for the large height values (>175cm), we mainly find male subjects.
LS0tCnRpdGxlOiAiVHV0b3JpYWwgNC40OiBFeHBsb3JpbmcgdGhlIEZFViBkYXRhc2V0IiAgIApvdXRwdXQ6CiAgICBodG1sX2RvY3VtZW50OgogICAgICBjb2RlX2Rvd25sb2FkOiB0cnVlICAgIAogICAgICB0aGVtZTogY29zbW8KICAgICAgdG9jOiB0cnVlCiAgICAgIHRvY19mbG9hdDogdHJ1ZQogICAgICBoaWdobGlnaHQ6IHRhbmdvCiAgICAgIG51bWJlcl9zZWN0aW9uczogdHJ1ZQotLS0KCkluIHRoaXMgdHV0b3JpYWwsIHlvdSB3aWxsIGxlYXJuIGhvdyB0byBpbXBvcnQsIHRpZHksIHdyYW5nbGUgYW5kIAp2aXN1YWxpemUgZGF0YSB5b3Vyc2VsZiEgWW91IHdpbGwgd29yayB3aXRoIG9uZSBzcGVjaWZpYyBkYXRhc2V0OwoKIyBUaGUgRkVWIGRhdGFzZXQKClRoZSBmb3JjZWQgZXhwaXJhdG9yeSB2b2x1bWUsIEZFViwKaXMgYSBtZWFzdXJlIG9mIGhvdyBtdWNoIGFpciBhIHBlcnNvbiBjYW4gZXhoYWxlIChpbiBsaXRlcikgCmR1cmluZyAgYSBmb3JjZWQgYnJlYXRoLiBJbiB0aGlzIGRhdGFzZXQsIHRoZSBGRVYgb2YgNjA2IGNoaWxkcmVuLApiZXR3ZWVuIHRoZSBhZ2VzIG9mIDYgYW5kIDE3LCB3ZXJlIG1lYXN1cmVkLiBUaGUgZGF0YXNldAphbHNvIHByb3ZpZGVzIGFkZGl0aW9uYWwgaW5mb3JtYXRpb24gb24gdGhlc2UgY2hpbGRyZW46CnRoZWlyIGBhZ2VgLCB0aGVpciBgaGVpZ2h0YCwgdGhlaXIgYGdlbmRlcmAgYW5kLCBtb3N0CmltcG9ydGFudGx5LCB3aGV0aGVyIHRoZSBjaGlsZCBpcyBhIHNtb2tlciBvciBhIG5vbi1zbW9rZXIuCgpUaGUgZ29hbCBvZiB0aGlzIGV4cGVyaW1lbnQgd2FzIHRvIGZpbmQgb3V0IHdoZXRoZXIgb3Igbm90CnNtb2tpbmcgaGFzIGFuIGVmZmVjdCBvbiB0aGUgRkVWIG9mIGNoaWxkcmVuLgoKTm90ZTogdG8gYW5hbHlzZSB0aGlzIGRhdGFzZXQgcHJvcGVybHksIHdlIHdpbGwgbmVlZCBzb21lCnJlbGF0aXZlbHkgYWR2YW5jZWQgbW9kZWxsaW5nIHRlY2huaXF1ZXMuIEF0IHRoZSBlbmQgb2YgdGhpcyAKd2VlaywgeW91IHdpbGwgaGF2ZSBzZWVuIGFsbCB0aHJlZSByZXF1aXJlZCBzdGVwcyB0byBhbmFseXNlCnN1Y2ggYSBkYXRhc2V0ISBGb3Igbm93LCB3ZSB3aWxsIGxpbWl0IG91cnNlbHZlcyB0byBleHBsb3JpbmcKdGhlIGRhdGEuCgojIExvYWQgdGhlIHJlcXVpcmVkIGxpYnJhcmllcwoKYGBge3IsIG1lc3NhZ2UgPSBGQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmBgYAoKIyBJbXBvcnQgdGhlIGRhdGEKCmBgYHtyfQpmZXYgPC0gcmVhZF90c3YoImh0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9HVFBCL1BTTFMyMC9tYXN0ZXIvZGF0YS9mZXYudHh0IikKaGVhZChmZXYpCmBgYAoKVGhlcmUgYXJlIGEgZmV3IHRoaW5ncyBpbiB0aGUgZm9ybWF0dGluZyBvZiB0aGUKZGF0YSB0aGF0IGNhbiBiZSBpbXByb3ZlZCB1cG9uOgoKMS4gQm90aCB0aGUgYGdlbmRlcmAgYW5kIGBzbW9raW5nYCBjYW4gYmUgdHJhbnNmb3JtZWQgdG8KZmFjdG9ycy4KMi4gVGhlIGBoZWlnaHRgIHZhcmlhYmxlIGlzIHdyaXR0ZW4gaW4gaW5jaGVzLiBBc3N1bWluZyB0aGF0CnRoaXMgYXVkaWVuY2UgaXMgbWFpbmx5IFBvcnR1Z3Vlc2UvQmVsZ2lhbiwgaW5jaGVzIGFyZSBoYXJkIHRvCmludGVycHJldC4gTGV0J3MgYWRkIGEgbmV3IGNvbHVtbiwgYGhlaWdodF9jbWAsIHdpdGggdGhlIHZhbHVlcwpjb252ZXJ0ZWQgdG8gY2VudGltZXRlciB1c2luZyB0aGUgYG11dGF0ZWAgZnVuY3Rpb24uIAoKYGBge3J9CmZldiA8LSBmZXYgJT4lCiAgbXV0YXRlKGdlbmRlciA9IGFzLmZhY3RvcihnZW5kZXIpKSAlPiUKICBtdXRhdGUoc21va2luZyA9IGFzLmZhY3RvcihzbW9raW5nKSkgJT4lCiAgbXV0YXRlKGhlaWdodF9jbSA9IGhlaWdodCoyLjU0KQoKaGVhZChmZXYpCmBgYAoKVGhhdCdzIGJldHRlciEKCiMgRGF0YSB2aXN1YWxpemF0aW9uCgpOb3csIGxldCdzIG1ha2UgYSBmaXJzdCBleHBsb3JhdGl2ZSBwbG90LCBzaG93aW5nCm9ubHkgdGhlIEZFViBmb3IgYm90aCBzbW9raW5nIGNhdGVnb3JpZXMuCgpXaGljaCB0eXBlIG9mIHBsb3QgZG8geW91IHN1Z2dlc3Q/CgpgYGB7cn0KZmV2ICU+JQogIGdncGxvdChhZXMoeD1zbW9raW5nLHk9ZmV2LGZpbGw9c21va2luZykpICsKICBzY2FsZV9maWxsX21hbnVhbCh2YWx1ZXM9YygiZGltZ3JleSIsImZpcmVicmljayIpKSArCiAgdGhlbWVfYncoKSArCiAgZ2VvbV9ib3hwbG90KG91dGxpZXIuc2hhcGU9TkEpICsgCiAgZ2VvbV9qaXR0ZXIod2lkdGggPSAwLjIsIHNpemU9MC4xKSArCiAgZ2d0aXRsZSgiQm94cGxvdCBvZiBGRVYgdmVyc3VzIHNtb2tpbmciKSArCiAgeWxhYigiZmV2IChsKSIpICsKICB4bGFiKCJzbW9raW5nIHN0YXR1cyIpCmBgYAoKRGlkIHlvdSBleHBlY3QgdGhlc2UgcmVzdWx0cz8KCkl0IGFwcGVhcnMgdGhhdCBjaGlsZHJlbiB0aGF0IHNtb2tlIGhhdmUgYSBoaWdoZXIKbWVkaWFuIEZFViB0aGFuIGNoaWxkcmVuIHRoYXQgZG8gbm90IHNtb2tlLiAKU2hvdWxkIHdlIGNoYW5nZSBsZWdpc2xhdGlvbnMgd29ybGR3aWRlIGFuZCBtYWtlIApzbW9raW5nIG9ibGlnYXRvcnkgZm9yIGNoaWxkcmVuPwoKTWF5YmUgdGhlcmUgaXMgc29tZXRoaW5nIGVsc2UgZ29pbmcgb24gaW4gdGhlIGRhdGEuCk5vdywgd2Ugd2lsbCBnZW5lcmF0ZSBhIHNpbWlsYXIgcGxvdCwgYnV0IHdlIHdpbGwKc3RyYXRpZnkgdGhlIGRhdGEgYmFzZWQgb24gYWdlIChhZ2UgYXMgZmFjdG9yKS4KCmBgYHtyfQpmZXYgJT4lCiAgZ2dwbG90KGFlcyh4PWFzLmZhY3RvcihhZ2UpLHk9ZmV2LGZpbGw9c21va2luZykpICsKICBnZW9tX2JveHBsb3Qob3V0bGllci5zaGFwZT1OQSkgKwogIGdlb21fcG9pbnQod2lkdGggPSAwLjIsIHNpemUgPSAwLjEsIHBvc2l0aW9uID0gcG9zaXRpb25faml0dGVyZG9kZ2UoKSkgKwogIHRoZW1lX2J3KCkgKwogIHNjYWxlX2ZpbGxfbWFudWFsKHZhbHVlcz1jKCJkaW1ncmV5IiwiZmlyZWJyaWNrIikpICsKICBnZ3RpdGxlKCJCb3hwbG90IG9mIEZFViB2ZXJzdXMgc21va2luZywgc3RyYXRpZmllZCBvbiBhZ2UiKSArCiAgeWxhYigiZmV2IChsKSIpICsKICB4bGFiKCJzbW9raW5nIHN0YXR1cyIpCmBgYAoKVGhpcyBwbG90IHNlZW1zIHRvIGFscmVhZHkgZ2l2ZSB1cyBhIG1vcmUKcGxhdXNpYmxlIHBpY3R1cmUuIEZpcnN0LCBpdCBzZWVtcyB0aGF0IHdlIGRvIG5vdCBoYXZlCmFueSBzbW9raW5nIGNoaWxkcmVuIG9mIGFnZXMgNiwgNyBvciA4LiBTZWNvbmQsIHdoZW4KbG9va2luZyBhdCB0aGUgcmVzdWx0cyBwZXIgYWdlICJjYXRlZ29yeSIsIGl0IHNlZW1zCm5vIGxvbmdlciB0aGUgY2FzZSB0aGF0IHNtb2tlcnMgaGF2ZSBhIG11Y2ggaGlnaGVyIEZFVgp0aGFuIG5vbi1zbW9rZXJzOyBmb3IgdGhlIGhpZ2hlciBhZ2VzLCB0aGUgY29udHJhcnkKc2VlbXMgdHJ1ZS4KClRoaXMgc2hvd3MgdGhhdCB0YWtpbmcgaW50byBhY2NvdW50IGltcG9ydGFudCBjb25mb3VuZGVycwooaW4gdGhpcyBjYXNlKSBpcyBjcnVjaWFsISBJZiB3ZSBzaW1wbHkgYW5hbHlzZWQgYmFzZWQgb24KdGhlIHNtb2tpbmcgc3RhdHVzIGFuZCBGRVYgdmFsdWVzIG9ubHksIHdlIHByb2JhYmx5IHdvdWxkJ3ZlCm9idGFpbmVkIGNvbXBsZXRlbHkgaW5jb3JyZWN0IHJlc3VsdHMuCgpDYW4gcHJvdmlkZSBhbiBldmVuIGJldHRlciB2aXN1YWxpemF0aW9uIG9mIHRoZSBkYXRhLCB0YWtpbmcKaW50byBhY2NvdW50IG1vcmUgdXNlZnVsIGV4cGxhbmF0b3J5IHZhcmlhYmxlcyB3aXRoIHJlc3BlY3QKdG8gdGhlIEZFVj8KCmBgYHtyfQpmZXYgJT4lCiAgZ2dwbG90KGFlcyh4PWFzLmZhY3RvcihhZ2UpLHk9ZmV2LGZpbGw9c21va2luZykpICsKICBnZW9tX2JveHBsb3Qob3V0bGllci5zaGFwZT1OQSkgKwogIGdlb21fcG9pbnQod2lkdGggPSAwLjIsIHNpemUgPSAwLjEsIHBvc2l0aW9uID0gcG9zaXRpb25faml0dGVyZG9kZ2UoKSkgKwogIHRoZW1lX2J3KCkgKwogIHNjYWxlX2ZpbGxfbWFudWFsKHZhbHVlcz1jKCJkaW1ncmV5IiwiZmlyZWJyaWNrIikpICsKICBnZ3RpdGxlKCJCb3hwbG90IG9mIEZFViB2ZXJzdXMgc21va2luZywgc3RyYXRpZmllZCBvbiBhZ2UgYW5kIGdlbmRlciIpICsKICB5bGFiKCJmZXYgKGwpIikgKwogIHhsYWIoInNtb2tpbmcgc3RhdHVzIikgKyAKICBmYWNldF9ncmlkKHJvd3MgPSB2YXJzKGdlbmRlcikpCmBgYAoKVGhpcyBwbG90IGhvbGRzIG9uZSBleHRyYSBsZXZlbCBvZiBpbmZvcm1hdGlvbiwgdGhlIGdlbmRlciAKb2YgdGhlIGNoaWxkLiBFc3BlY2lhbGx5IGZvciBoaWdoZXIgYWdlcywgdGhlIG1lZGlhbiBGRVYKaXMgaGlnaGVyIGZvciBtYWxlcyBhcyBjb21wYXJlZCB0byBmZW1hbGVzLgoKVGhlIG9ubHkgc291cmNlIG9mIGluZm9ybWF0aW9uIHRoYXQgaXMgbGFja2luZyBpcyBgaGVpZ2h0YC4KVG8gbG9vayBhdCB0aGUgZWZmZWN0IG9mIGhlaWdodCwgd2UgY291bGQgc2ltcGx5IG1ha2UgYQpzY2F0dGVycGxvdCBkaXNwbGF5aW5nIHRoZSBGRVYgaW4gZnVuY3Rpb24gb2YgYSBjaGlsZCdzCmhlaWdodCAoaW4gY20pLiBBZGRpdGlvbmFsbHksIHdlIGNvdWxkIGNvbG9yIHRoZSBkb3RzIGJhc2VkCm9uIGdlbmRlci4KCmBgYHtyfQpmZXYgJT4lCiAgZ2dwbG90KGFlcyh4PWhlaWdodF9jbSx5PWZldixjb2xvcj1nZW5kZXIpKSArCiAgZ2VvbV9wb2ludCgpICsKICBzY2FsZV9jb2xvcl9tYW51YWwodmFsdWVzPWMoImRhcmtvcmNoaWQiLCJvbGl2ZWRyYWI0IikpICsKICB0aGVtZV9idygpICsKICBnZ3RpdGxlKCJCb3hwbG90IG9mIEZFViB2ZXJzdXMgaGVpZ2h0IikgKwogIHlsYWIoImZldiAobCkiKSArCiAgeGxhYigiaGVpZ2h0IChjbSkiKQpgYGAKClRoZXJlIGlzIGEgY2xlYXIgcmVsYXRpb25zaGlwIGJldHdlZW4gaGVpZ2h0IGFuZCBGRVYuCkluIGFkZGl0aW9uLCB3ZSBzZWUgdGhhdCBmb3IgdGhlIGxhcmdlIGhlaWdodCB2YWx1ZXMKKD4xNzVjbSksIHdlIG1haW5seSBmaW5kIG1hbGUgc3ViamVjdHMuCgoKLS0tCgojIFtIb21lXShodHRwczovL2d0cGIuZ2l0aHViLmlvL1BTTFMyMC8pIHstfQoKCgo=