2  Exploratory Data Analysis

2.1 R packages needed for this chapter

Make sure to run the following code chunks in order to start the neccessary R packages.

library(easystats)  # make stats easy again
library(tidyverse)  # data wrangling
library(tableone)  # tables, optional
library(rio)  # import/export data, eg., to excel
library(ggpubr)  # simple data visualization
library(ggstatsplot)  # data visualization ornamented with statistics
Caution

You can only start a package if you have it installed upfront.

Check out the course Statistics1, chapter on introducing R for an accessible introduction to getting started with R and RStudio and on how to install R packages.

Exercise 2.1 (Install the missing packages) Take a minute to install any of the packages above which are not yet installed on your machine. Hint: Click on the pane Packages and then hit the button Install. \(\square\)

2.2 What’s EDA?

Exploratory Data Analysis (EDA) is a procedure to scrutinize a dataset at hand in order learn about it. EDA comprises descriptive statistics, data visualization and data transformation techniques (such as dimension reduction).

It’s not so mathematical deep as modelling, but in practice it’s really important.

There’s this famous saying:

In Data Science, 80% of time spent prepare data, 20% of time spent complain about the need to prepare data.

EDA can roughly be said to comprise the following parts:

  • Importing (and exporting) data
  • Data cleansing (such as deal with missing data etc)
  • Data transformation or “wrangling” (such as long to wide format)
  • Computing desriptive statistics (such as the notorious mean)
  • Analyzing distributions (is it normal?)
  • Finding patterns in data (aka data mining)
  • More complex data transformation techniques (such as factor analysis)

2.3 Data journey

Wickham & Grolemund (2016) present a visual sketch of what could be called the “data journey”, i.e., the steps we are taking in order to learn from data, seen from an hands-on angle, see Figure 2.1.

Figure 2.1: The data journey

2.4 Blitz data

See Section 1.6 for some data sets suitable to get going.

2.5 Data cleansing

The R package {janitor} provides some nice stuff for data cleansing. Check out this case study.

2.6 Convenience functions

There a quite a few functions (residing in some packages) that help you doing EDA from a helicoptor point of view. In other words, you do not have to pay attention to nitty-gritty details, the function will do that for you. This is approach is, well, convenient, but of course comes at a price. You will not have a great amount of choice and influence on the way the data is analyzed and presented.

We’ll use the penguins data set for a demonstration.

penguins <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")

2.6.1 Data Explorer

There are many systems and approaches to explore data. One particular interesting system is the R-package DataExplorer.

R-package DataExplorer

Check it out on its Githup page.

library(DataExplorer)

For example, we can easily get an overview of the distribution of all (quantitative) variables in the data set. What do you notice on inspecting the distributions, see Figure 2.2?

plot_histogram(penguins)
Figure 2.2: An overview of the distribution can easily attained using the function plot_histogram

A quick “introduction” to the data set is provided by the function plot_intro, see Figure 2.3.

plot_intro(penguins)
Figure 2.3: An introduction view on a data set

We can also get a glimpse on all the qualitative data columns in our datra set, see Figure 2.4.

plot_bar(penguins)
Figure 2.4: Bar plots visualizing the distribution of qualitative variables using the function plot_bar

2.6.2 visdat

The r package visdat provides a “fingerprint” of a data set.

Let’s show-case it using the penguins data set.

library(visdat)  # must be installed
vis_dat(penguins)

A fingerprint of the penguins data set; for each variable the data type and the number of missings is shown.

2.6.3 vtree

A bit similar to {DataExplorer}, the R package {vtree} helps to explore visually datasets, see Figure 2.5.

library(vtree)
vtree(penguins, c("sex", "island"))
Figure 2.5: Breaking down the counts in a data set using vtree

2.6.4 TableOne

The R package {tableOne} provides something like the typical “Table 1” in many papers.

From the homepage:

The tableone package is an R package that eases the construction of “Table 1”, i.e., patient baseline characteristics table commonly found in biomedical research papers. The packages can summarize both continuous and categorical variables mixed within one table. Categorical variables can be summarized as counts and/or percentages. Continuous variables can be summarized in the “normal” way (means and standard deviations) or “nonnormal” way (medians and interquartile ranges).

penguins <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")

It’s quite simple to use:

library(tableone)
CreateTableOne(data = penguins)
                               
                                Overall         
  n                                 344         
  rownames (mean (SD))           172.50 (99.45) 
  species (%)                                   
     Adelie                         152 (44.2)  
     Chinstrap                       68 (19.8)  
     Gentoo                         124 (36.0)  
  island (%)                                    
     Biscoe                         168 (48.8)  
     Dream                          124 (36.0)  
     Torgersen                       52 (15.1)  
  bill_length_mm (mean (SD))      43.92 (5.46)  
  bill_depth_mm (mean (SD))       17.15 (1.97)  
  flipper_length_mm (mean (SD))  200.92 (14.06) 
  body_mass_g (mean (SD))       4201.75 (801.95)
  sex (%)                                       
                                     11 ( 3.2)  
     female                         165 (48.0)  
     male                           168 (48.8)  
  year (mean (SD))              2008.03 (0.82)  

To get more detailled results, use the summary method:

tab1 <- CreateTableOne(data = penguins)
summary(tab1)

     ### Summary of continuous variables ###

strata: Overall
                    n miss p.miss mean    sd median  p25  p75  min  max  skew
rownames          344    0    0.0  172  99.4    172   87  258    1  344  0.00
bill_length_mm    344    2    0.6   44   5.5     44   39   48   32   60  0.05
bill_depth_mm     344    2    0.6   17   2.0     17   16   19   13   22 -0.14
flipper_length_mm 344    2    0.6  201  14.1    197  190  213  172  231  0.35
body_mass_g       344    2    0.6 4202 802.0   4050 3550 4750 2700 6300  0.47
year              344    0    0.0 2008   0.8   2008 2007 2009 2007 2009 -0.05
                  kurt
rownames          -1.2
bill_length_mm    -0.9
bill_depth_mm     -0.9
flipper_length_mm -1.0
body_mass_g       -0.7
year              -1.5

=======================================================================================

     ### Summary of categorical variables ### 

strata: Overall
     var   n miss p.miss     level freq percent cum.percent
 species 344    0    0.0    Adelie  152    44.2        44.2
                         Chinstrap   68    19.8        64.0
                            Gentoo  124    36.0       100.0
                                                           
  island 344    0    0.0    Biscoe  168    48.8        48.8
                             Dream  124    36.0        84.9
                         Torgersen   52    15.1       100.0
                                                           
     sex 344    0    0.0             11     3.2         3.2
                            female  165    48.0        51.2
                              male  168    48.8       100.0
                                                           

How to export to MS Office? Well, one simple approach is obviously to copy-paste. Checkout Section 2.9 for more advanced options.

2.6.5 The easystats way

There are some packages, such as {easystats}, which provide comfortable access to basic statistics:

library(easystats)  # once per session
describe_distribution(penguins)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
rownames 172.50000 99.4484791 172.500 1.0 344.0 0.0000000 -1.2000000 344 0
bill_length_mm 43.92193 5.4595837 9.300 32.1 59.6 0.0531181 -0.8760270 342 2
bill_depth_mm 17.15117 1.9747932 3.125 13.1 21.5 -0.1434646 -0.9068661 342 2
flipper_length_mm 200.91520 14.0617137 23.250 172.0 231.0 0.3456818 -0.9842729 342 2
body_mass_g 4201.75439 801.9545357 1206.250 2700.0 6300.0 0.4703293 -0.7192219 342 2
year 2008.02907 0.8183559 2.000 2007.0 2009.0 -0.0537278 -1.5049366 344 0

describe_distribution provides us with an overview on typical descriptive summaries.

How to export to MS Office? Well, one simple approach is obviously to copy-paste. Checkout Section 2.9 for more advanced options.

For nominal variables, consider data_tabulate:

data_tabulate(penguins, select = c("sex", "island")) |> 
  print_md() 
Frequency Table
Variable Value N Raw % Valid % Cumulative %
sex (NA) 11 3.20 3.20 3.20
female 165 47.97 47.97 51.16
male 168 48.84 48.84 100.00
(NA) 0 0.00 (NA) (NA)
island Biscoe 168 48.84 48.84 48.84
Dream 124 36.05 36.05 84.88
Torgersen 52 15.12 15.12 100.00
(NA) 0 0.00 (NA) (NA)

Note that print_md helps to get a more visually pleasing (HTML) output, and not only raw command line style output.

We can also get grouped tabulations, which amounts to something similar to a contingency table:

penguins %>% 
  group_by(sex) %>% 
  data_tabulate(select = "island", collapse = TRUE)
# Frequency Table

Variable |        Group |     Value |  N | Raw % | Valid % | Cumulative %
---------+--------------+-----------+----+-------+---------+-------------
island   |       sex () |    Biscoe |  5 | 45.45 |   45.45 |        45.45
         |              |     Dream |  1 |  9.09 |    9.09 |        54.55
         |              | Torgersen |  5 | 45.45 |   45.45 |       100.00
         |              |      <NA> |  0 |  0.00 |    <NA> |         <NA>
---------+--------------+-----------+----+-------+---------+-------------
island   | sex (female) |    Biscoe | 80 | 48.48 |   48.48 |        48.48
         |              |     Dream | 61 | 36.97 |   36.97 |        85.45
         |              | Torgersen | 24 | 14.55 |   14.55 |       100.00
         |              |      <NA> |  0 |  0.00 |    <NA> |         <NA>
---------+--------------+-----------+----+-------+---------+-------------
island   |   sex (male) |    Biscoe | 83 | 49.40 |   49.40 |        49.40
         |              |     Dream | 62 | 36.90 |   36.90 |        86.31
         |              | Torgersen | 23 | 13.69 |   13.69 |       100.00
         |              |      <NA> |  0 |  0.00 |    <NA> |         <NA>
-------------------------------------------------------------------------

Checkout the function reference of your favorite package in order to learn what’s on the shelf. For example, here’s the function reference site of datawizard, one of the packages in the easystats ecosystem.

2.6.6 The easystats way

There are some packages, such as {easystats}, which provide comfortable access to basic statistics:

library(easystats)  # once per session
describe_distribution(penguins)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
rownames 172.50000 99.4484791 172.500 1.0 344.0 0.0000000 -1.2000000 344 0
bill_length_mm 43.92193 5.4595837 9.300 32.1 59.6 0.0531181 -0.8760270 342 2
bill_depth_mm 17.15117 1.9747932 3.125 13.1 21.5 -0.1434646 -0.9068661 342 2
flipper_length_mm 200.91520 14.0617137 23.250 172.0 231.0 0.3456818 -0.9842729 342 2
body_mass_g 4201.75439 801.9545357 1206.250 2700.0 6300.0 0.4703293 -0.7192219 342 2
year 2008.02907 0.8183559 2.000 2007.0 2009.0 -0.0537278 -1.5049366 344 0

describe_distribution provides us with an overview on typical descriptive summaries.

How to export to MS Office? Well, one simple approach is obviously to copy-paste. Checkout Section 2.9 for more advanced options.

For nominal variables, consider data_tabulate:

data_tabulate(penguins, select = c("sex", "island")) |> 
  print_md() 
Frequency Table
Variable Value N Raw % Valid % Cumulative %
sex (NA) 11 3.20 3.20 3.20
female 165 47.97 47.97 51.16
male 168 48.84 48.84 100.00
(NA) 0 0.00 (NA) (NA)
island Biscoe 168 48.84 48.84 48.84
Dream 124 36.05 36.05 84.88
Torgersen 52 15.12 15.12 100.00
(NA) 0 0.00 (NA) (NA)

Note that print_md helps to get a more visually pleasing (HTML) output, and not only raw command line style output.

We can also get grouped tabulations, which amounts to something similar to a contingency table:

penguins %>% 
  group_by(sex) %>% 
  data_tabulate(select = "island", collapse = TRUE)
# Frequency Table

Variable |        Group |     Value |  N | Raw % | Valid % | Cumulative %
---------+--------------+-----------+----+-------+---------+-------------
island   |       sex () |    Biscoe |  5 | 45.45 |   45.45 |        45.45
         |              |     Dream |  1 |  9.09 |    9.09 |        54.55
         |              | Torgersen |  5 | 45.45 |   45.45 |       100.00
         |              |      <NA> |  0 |  0.00 |    <NA> |         <NA>
---------+--------------+-----------+----+-------+---------+-------------
island   | sex (female) |    Biscoe | 80 | 48.48 |   48.48 |        48.48
         |              |     Dream | 61 | 36.97 |   36.97 |        85.45
         |              | Torgersen | 24 | 14.55 |   14.55 |       100.00
         |              |      <NA> |  0 |  0.00 |    <NA> |         <NA>
---------+--------------+-----------+----+-------+---------+-------------
island   |   sex (male) |    Biscoe | 83 | 49.40 |   49.40 |        49.40
         |              |     Dream | 62 | 36.90 |   36.90 |        86.31
         |              | Torgersen | 23 | 13.69 |   13.69 |       100.00
         |              |      <NA> |  0 |  0.00 |    <NA> |         <NA>
-------------------------------------------------------------------------

Checkout the function reference of your favorite package in order to learn what’s on the shelf. For example, here’s the function reference site of datawizard, one of the packages in the easystats ecosystem.

2.6.7 Lab

Take your research data and prepare it using (at least) one of the “convenience” functions for data cleansing presented above.

2.7 Tidyverse

Entering the tidyverse

2.7.1 Intro to the tidyverse

The Tidyverse is probably the R thing with the most publicity. And it’s great. It’s a philosophy baked into an array of R packages. Perhaps central is the idea that a lot of little Lego pieces, if fitting nicely together, provides a simple yet flexible and thus powerful machinery.

There’s a lot of introductory material to the tidyverse around for instance, so I’m not repeating that here.

2.7.2 More advanced tidyverse

2.7.2.1 Repeat a function over many columns

At times, we would like to compute the same functions for many variables, ie columns for tidyverse applications.

Let’s load the penguins data for illustration.

penguins <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")

head(penguins)
rownames species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA NA 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007

Say, we would like to compute the mean value for each numeric variable in the data set:

penguins %>% 
  summarise(across(bill_length_mm:body_mass_g, mean, na.rm = TRUE))
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
43.92193 17.15117 200.9152 4201.754

Synonymously, we could write:

penguins %>% 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
rownames bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
172.5 43.92193 17.15117 200.9152 4201.754 2008.029

Say, we would like to compute the z-value of each numeric variable.

Addmittedly, easystats makes it quite simple:

penguins %>% 
  standardise(select = is.numeric) %>% 
  head()
rownames species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
-1.724511 Adelie Torgersen -0.8832047 0.7843001 -1.4162715 -0.5633167 male -1.257484
-1.714456 Adelie Torgersen -0.8099390 0.1260033 -1.0606961 -0.5009690 female -1.257484
-1.704400 Adelie Torgersen -0.6634077 0.4298326 -0.4206603 -1.1867934 female -1.257484
-1.694345 Adelie Torgersen NA NA NA NA NA -1.257484
-1.684289 Adelie Torgersen -1.3227986 1.0881294 -0.5628905 -0.9374027 female -1.257484
-1.674234 Adelie Torgersen -0.8465718 1.7464261 -0.7762357 -0.6880121 male -1.257484

See the help page of standardise for mor details on how to select variables and on more options.

But for the purpose of illustration, let’s do it with more simple means, i.e. tidyverse only.

penguins %>% 
  transmute(across(bill_length_mm:body_mass_g, 
                .fns = ~ {(.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE)},
                .names = "{.col}_z")) %>% 
  head()
bill_length_mm_z bill_depth_mm_z flipper_length_mm_z body_mass_g_z
-0.8832047 0.7843001 -1.4162715 -0.5633167
-0.8099390 0.1260033 -1.0606961 -0.5009690
-0.6634077 0.4298326 -0.4206603 -1.1867934
NA NA NA NA
-1.3227986 1.0881294 -0.5628905 -0.9374027
-0.8465718 1.7464261 -0.7762357 -0.6880121

It’s maybe more succint to put the z-value computation in its function, and then just apply this function:

z_stand <- function(x){
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
d2 <-
penguins %>% 
  mutate(across(bill_length_mm:body_mass_g, 
                .fns = z_stand))
  
d2 %>% 
  glimpse()
Rows: 344
Columns: 9
$ rownames          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986, …
$ bill_depth_mm     <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936, …
$ flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905, …
$ body_mass_g       <dbl> -0.563316704, -0.500969030, -1.186793445, NA, -0.937…
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

2.7.2.2 Rowwise operations

For technical reasons, it’s a bit cumbersome in (base) R to compute rowwise operations. The thing is, R’s dataframes are organized as vectors of columns so it’s much easier to do stuff columnwise.

However, since recently, computing rowwise operations with the tidyverse has become simpler. Consider the following example. Say we would like to know the highest z-value for each variable we just computed, that is the highest values per individual, ie., by row in the data frame.

d2 %>% 
  drop_na() %>% 
  rowwise() %>% 
  mutate(max_z = max(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))) %>% 
  head()
rownames species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year max_z
1 Adelie Torgersen -0.8832047 0.7843001 -1.4162715 -0.5633167 male 2007 0.7843001
2 Adelie Torgersen -0.8099390 0.1260033 -1.0606961 -0.5009690 female 2007 0.1260033
3 Adelie Torgersen -0.6634077 0.4298326 -0.4206603 -1.1867934 female 2007 0.4298326
5 Adelie Torgersen -1.3227986 1.0881294 -0.5628905 -0.9374027 female 2007 1.0881294
6 Adelie Torgersen -0.8465718 1.7464261 -0.7762357 -0.6880121 male 2007 1.7464261
7 Adelie Torgersen -0.9198375 0.3285561 -1.4162715 -0.7191859 female 2007 0.3285561

2.7.3 Lab

Take your research data and play around using the tidyverse functions shown above.

2.8 Data visualization

Visualize your data – plots can provide insights which can’t (easily) delivered by numbers alone. See Anscombe’s Quartet.

The star on the R visualization sky is called ggplot. It probably is one of the most advanced statistical visualization package on the planet.

However, for a quick start, there are some nice wrappers on ggplot, levering the beauty of ggplot but making the making simple(r).

2.8.1 ggpubr

ggpubr is a simple yet powerful way to visualize data.

library(ggpubr)  # must be installed

Let’s visualize the association of bill length (see Figure 2.6) with body mass, see Figure 2.7.

Figure 2.6: In case you did not know, that’s what is meant by bill length and depth of our penguins
ggscatter(penguins,
          x = "bill_length_mm",
          y = "body_mass_g",
          color = "sex")
Figure 2.7: A simple scatterplot of the penguins data set using ggscatter

Let’s improve a bit on the plot, see Figure 2.8.

ggscatter(penguins,
          x = "bill_length_mm",
          y = "body_mass_g",
          color = "species",
          shape = "species",
          add = "reg.line",
          xlab = "Bill length (mm)",
          ylab = "Body maxx (g)",
          conf.int = TRUE,
          ellipse = TRUE,
          cor.coef = TRUE)
Figure 2.8: A more polished scatterplot of the penguins data set using ggscatter

Check out the manual of the function for further notice on the options.

2.8.2 ggstatsplot

In addtion, consider ggstatsplot)(https://indrajeetpatil.github.io/ggstatsplot/). Let’s use the mtcars dataset for a quick demonstration. For example, assume you would like to compare two group of cars, automatic vs. manual shifting, in terms of fuel economy. Here we go, Figure 2.9.

library(ggstatsplot)
ggbetweenstats(
  data = penguins,
  x = sex,
  y = body_mass_g)
Figure 2.9: ggstatsplot builds on ggplot, but is simpler to use

Checkout the manual of the function to get a handle on all the options and details.

Wait, let’s drop all missing values (NA) before we plot using drop_na(), Figure 2.10.

penguins_no_na <-
penguins |> 
  drop_na()  # from package "tidyverse"

ggbetweenstats(
  data = penguins_no_na,
  x = sex,
  y = body_mass_g)
Figure 2.10: Comparing groups on a quantitative output variable, but dropping missing values before hand

2.8.3 Lab

Take your research data and play around using the visualization functions shown above.

2.9 Exporting to Office

2.9.1 Word

2.9.1.1 Basic approach

It’s no big deal to get it into Word processors as well. Here’s one simple way:

  1. Open a Quarto-Document in RStudio
  2. produce some nice table
  3. copy and paste it to Word

Here’s an example of producing a nice table:

library(easystats)  # once per session only
describe_distribution(penguins) |> print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
rownames 172.50 99.45 172.50 (1.00, 344.00) 0.00 -1.20 344 0
bill_length_mm 43.92 5.46 9.30 (32.10, 59.60) 0.05 -0.88 342 2
bill_depth_mm 17.15 1.97 3.12 (13.10, 21.50) -0.14 -0.91 342 2
flipper_length_mm 200.92 14.06 23.25 (172.00, 231.00) 0.35 -0.98 342 2
body_mass_g 4201.75 801.95 1206.25 (2700.00, 6300.00) 0.47 -0.72 342 2
year 2008.03 0.82 2.00 (2007.00, 2009.00) -0.05 -1.50 344 0

Note that print_md() will “prettify” your output. Without this line, your output will look more command-line-nerdy style.

2.9.1.2 Additional ways

In addition, one user gave the following recommendation on StackOverflow:

Another possible solution: The above strategy did not work for me when I had a similar issue, but it was resolved once I knitted the table1 object and opened the html in browser to copy the html table and successfully paste into word. Doing it within RStudio viewer would not work for me for some reason.

Lastly, there are options to export directly to Word or Powerpoint. The R package {flextable} provides functions for that purpose, see Table 2.11:

library(flextable)
my_flex_tab <- flextable(head(penguins))
my_flex_tab
Table 2.1: A table using the R package flextable

rownames

species

island

bill_length_mm

bill_depth_mm

flipper_length_mm

body_mass_g

sex

year

1

Adelie

Torgersen

39.1

18.7

181

3,750

male

2,007

2

Adelie

Torgersen

39.5

17.4

186

3,800

female

2,007

3

Adelie

Torgersen

40.3

18.0

195

3,250

female

2,007

4

Adelie

Torgersen

2,007

5

Adelie

Torgersen

36.7

19.3

193

3,450

female

2,007

6

Adelie

Torgersen

39.3

20.6

190

3,650

male

2,007

Just save it to the Office format of your choice:

save_as_docx("Table 1 " = my_flex_tab, path = "my_tab.docx")
save_as_pptx("Table 1 " = my_flex_tab, path = "my_tab.pptx")

2.9.2 Excel

2.9.2.1 Just save your data to disk

The most straightforward approach is to convince your EDA function to produce a data frame. Data frames can be written as CSV or XLSX to disk, and then easily imported to office packages.

easystats and tidyverse are two examples where this happens.

library(easystats)   # must be installed
df1 <- describe_distribution(mtcars)

Now, df1 is a data frame:

str(df1)
Classes 'parameters_distribution', 'see_parameters_distribution' and 'data.frame':  11 obs. of  10 variables:
 $ Variable : chr  "mpg" "cyl" "disp" "hp" ...
 $ Mean     : num  20.09 6.19 230.72 146.69 3.6 ...
 $ SD       : num  6.027 1.786 123.939 68.563 0.535 ...
 $ IQR      : num  7.53 4 221.53 84.5 0.84 ...
 $ Min      : num  10.4 4 71.1 52 2.76 ...
 $ Max      : num  33.9 8 472 335 4.93 ...
 $ Skewness : num  0.672 -0.192 0.42 0.799 0.293 ...
 $ Kurtosis : num  -0.022 -1.763 -1.068 0.275 -0.45 ...
 $ n        : int  32 32 32 32 32 32 32 32 32 32 ...
 $ n_Missing: int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "object_name")= chr "mtcars"
 - attr(*, "threshold")= num 0.1

Let’s export as XLSX (via the R package rio) and as CSV:

library(rio)  # must be installed
export(df1, file = "df1.xlsx")
export(df1, file = "df1.csv")

For exporting csv files we could alternatively use write_csv() from the tidyverse or write.csv from base R. Then, checkout the file in Excel. Note that it is possible to import CSV files to Excel.

df2 <- data_tabulate(mtcars$am)
str(df2)
Classes 'dw_data_tabulate' and 'data.frame':    3 obs. of  6 variables:
 $ Variable    : chr  "mtcars$am" "mtcars$am" "mtcars$am"
 $ Value       : Factor w/ 2 levels "0","1": 1 2 NA
 $ N           : int  19 13 0
 $ Raw %       : num  59.4 40.6 0
 $ Valid %     : num  59.4 40.6 NA
 $ Cumulative %: num  59.4 100 NA
 - attr(*, "type")= chr "numeric"
 - attr(*, "object")= chr "mtcars$am"
 - attr(*, "duplicate_varnames")= logi [1:3] FALSE TRUE TRUE
 - attr(*, "total_n")= int 32
 - attr(*, "valid_n")= int 32
export(df2, file = "df2.csv")

2.9.2.2 More advanced

Note that if we use data_tabulate like this:

df3 <- data_tabulate(mtcars, select = c("am", "vs"))

df3 %>% str()
List of 2
 $ :Classes 'dw_data_tabulate' and 'data.frame':    3 obs. of  6 variables:
  ..$ Variable    : chr [1:3] "am" "am" "am"
  ..$ Value       : Factor w/ 2 levels "0","1": 1 2 NA
  ..$ N           : int [1:3] 19 13 0
  ..$ Raw %       : num [1:3] 59.4 40.6 0
  ..$ Valid %     : num [1:3] 59.4 40.6 NA
  ..$ Cumulative %: num [1:3] 59.4 100 NA
  ..- attr(*, "type")= chr "numeric"
  ..- attr(*, "varname")= chr "am"
  ..- attr(*, "object")= chr "am"
  ..- attr(*, "duplicate_varnames")= logi [1:3] FALSE TRUE TRUE
  ..- attr(*, "total_n")= int 32
  ..- attr(*, "valid_n")= int 32
 $ :Classes 'dw_data_tabulate' and 'data.frame':    3 obs. of  6 variables:
  ..$ Variable    : chr [1:3] "vs" "vs" "vs"
  ..$ Value       : Factor w/ 2 levels "0","1": 1 2 NA
  ..$ N           : int [1:3] 18 14 0
  ..$ Raw %       : num [1:3] 56.2 43.8 0
  ..$ Valid %     : num [1:3] 56.2 43.8 NA
  ..$ Cumulative %: num [1:3] 56.2 100 NA
  ..- attr(*, "type")= chr "numeric"
  ..- attr(*, "varname")= chr "vs"
  ..- attr(*, "object")= chr "vs"
  ..- attr(*, "duplicate_varnames")= logi [1:3] FALSE TRUE TRUE
  ..- attr(*, "total_n")= int 32
  ..- attr(*, "valid_n")= int 32
 - attr(*, "class")= chr [1:2] "dw_data_tabulates" "list"
 - attr(*, "collapse")= logi FALSE

We’ll get a list of two data frames.

To export either, we need to access each list list element:

export(df3[[1]], file = "df3.csv")

2.9.3 Lab

Take your research data, produce some nice tables and figures and import them into a word file.

2.10 Case Studies

2.10.1 Data Wrangling

2.10.1.1 Penguins

R package/dataset palmerpenguins

Explore the palmerpenguins dataset, it’s a famous dataset made for learning data analysis.

There’s a great interactive course on EDA based on the penguins. Have a look, it’s great!

Go penguins! Allez!

2.10.2 More Case studies

2.11 Exercises

🧑‍🎓 I want more!

👨‍🏫 Check-out all exercises from Datenwerk with the tag datawrangling.

Tip

Use the translation function of your browers to translate the webpage into your favorite language.

2.12 Cheatsheets

There are a number of nice cheat sheets available on an array of topics related to EDA, made available by the folks at RStudio.

Consider this collection:

So much great stuff! A bit too much to digest in one go, but definitely worthwhile if you plan to dig deeper in data analysis.

2.13 Going further

Wickham & Grolemund (2016) is an highly recommendable resource in order to get a thorough understanding of data analysis using R. Note that this source is focusing on the “how to”, not so much to theoretical foundations. Ismay & Kim (2020) is a gently introduction into many steps on the data journey, including EDA. For a gentle introduction to Data visualization, see ModernDive Chap. 2 (Ismay & Kim, 2020), and get the R code here. Similarly, for data wrangling, see ModernDive Chap. 3, and get the R code here.


  1. Don’t forget to install it before you use it.↩︎