stats-nutshell

A crash course on applied statistics with a focus on statistical modelling
Author

Sebastian Sauer

Published

March 11, 2024

Doi

Preface

A nutshell of little (statistics) stars

Welcome!

This is an introductory course on statistical modelling. Welcome!

The focus of this course is on how to specify a theoretical idea (possibly vague) in a testable statistical model.

Please read me

In order to benefit as much as possible from this course, it is necessary for you to read this preface information. Yoda agrees (s. Figure Yoda finds you should read the manual).

Figure 1: Yoda finds you should read the manual

PDF-Version

Use the print button of your browser to print the html page into a PDF page.

Course description

Analyzing research data can broadly be classified in three parts: explorative data analysis, modeling (including inference), and visualization. Either part is pivotal in its own right, but it can be argued that modeling is at the core of the scientific endeavor. However, in practice, modeling, visualization, and data exploration is heavily intertwined, so that three parts may be recognized (as individual entities) but not usefully separated from each other. This idea provides the rationale of this course: Data exploration, data visualization and data modeling is discussed as an integrated framework.

The focus is on practical data analysis; theoretical concepts are, where mentioned, second class citizens due to time constraints and the didactic aims of the course.

For example, statistical inference – such as p-values and confidence intervals – are not more than touched briefly, as the instructor believes that modeling, not inference, is of prime importance for the auditorium.

We will use the R environment for all computations (freely available). Please bring your own Laptop with R and RStudio installed (installation guides are provided). Data and R code will be provided.

We’re on a crash course

The course is set-up as a “crash course” which indicates that we’ll rather try to cover a breadth of steps rather than digging deep at certain particular points. The rationale of this approach is that before digging deep, it is necessary to gain an overview of the territory. In addition, if one particular topic is not of interest to a given student (perhaps to difficult/simple), not much time is lost.

Be warned! Compare this crash course to a dancing crash course right before your wedding: A lot can be achieved by such a course in some instances, or rather, the worst consequences (of not knowing how to dance) may be fenced off, but one should not expect to be a dancing queen (king) thereafter.

More on modelling

Models and modeling are of pivotal importance in many sciences, not only for providing an explanation of nature en miniature (theoretical models), but also for gauging how closely the empirical data at hand match the theoretical model. Translating a theoretical model into statistical language is called statistical modeling and provides the guiding principle in this introductory course. Regression models will be presented as a lingua franca of statistical modeling, and we will learn that many empirical questions can (comfortably) be analyzed using a regression framework. Depending on the background and aims of the participants (and time permitting), we will shed light on some standard topics such as model comparison, classification models, and typical pitfalls. Given a more advanced auditorium, we may want to explore how causal and non-causal associations can be translated and tested using simple linear statistical models. Foundational ideas of statistical modeling will be accompanied by short examples and case studies to facilitate transfer and practical application after the course.

Course prerequisites

Basic computer usage knowledge is needed (downloading materials from the internet, operating a PC, etc). Basic R knowledge is needed. Basic knowledge of statistical concepts (such as descriptive statistics) is needed. Willingness to learn is essential.

Learning objectives

Upon successful completion of this course, students should be able to:

  • select the right statistical visualization for a variety of data contexts
  • “crunch” or “wrangle” data
  • explain what statistical modeling means
  • formulate basic statistical models
  • differentiate between predictive and explanatory modeling
  • apply the methods to own datasets

Course Literature

This course builds on the freely available e-book ModernDive. Each topic is paralleled by an accompanying chapter from ModernDive. A hard copy can be purchased here. The book is for sale in print here.

Course logistics

This course can be presented as a one-day seminar or split-up in two or more blocks.

The course can be held in English or German.

Please bring your own computer and read the notes regarding course logistics in advance. Note that some upfront preparation is needed from the learners.

R and RStudio1 will be needed throughout the course. Please make sure that the IT is running. In case of technical difficulties with R feel free to use RStudio Cloud; free plans are available.

All learning materials (such as literature, code, data) will be provided in electronic format.

UPFRONT student preparation

  • Install R and RStudio, see ModernDive Chap. 1.1. In case you have your R running on your system, please make sure that you’re uptodate. If outdated, download and install the most recent versions of the software. Similarly, hit the “Update” button in RStudio’s “Packages” tab to update your packages if you have not done so for a couple of months.

  • Sign-in at RStudio Cloud. It’s super helpful because I as the techer can provide you with an environment where all R stuff is ready to use (packages installed etc).

  • Install the necessary R packages as used in the book chapters covered in this course (see the sections on “Needed packages” in each chapter). If in doubt, see here the instructions on how to install R packages. Here’s the actual list on the R packages we’ll need.

  • Students new to R are advised to learn the basics, see ModernDive, Chap 1.2 - 1.5

  • Bring your own laptop

  • Make sure your internet connection is stable and your loudspeaker/headset is working; a webcam is helpful.

  • Students are advised to review the course materials after each session.

  • I recommend that you carefully check the course description to make sure the course fits your needs (not too advanced/basic).

Didactic outline

This course can rather be considered a workshop in the sense that the instructor uses a dialogue-based approach to teaching and that there are numerous exercises during the course. Instead of providing long talks to the students, the instructor feels obligated to engage students in back-and-forth conversations. Similarly, the presentation of a large number of Powerpoint slide is avoided. Instead, a thorough course literature is available (free online), so that students will have no barrier in diving deeply into the materials and ideas presented. However, during class it is more important to transmit the pivotal ideas; details need to be read and worked by the students individually after (and before) the course. As an alternative to presenting a lot of text on slides, in this course there will be a (electronic) whiteboard where concepts are developed dynamically and in pace of the teaching conversation thereby adjusting the “dose” of new thoughts to the actual pace of the instruction.

Schedule

Note

Please not that the focus and the amount covered in a course strongly depends on the background, aims and prior knowledge of audience.

Overview on topics covered

  • Data visualization building the grammar of graphics and ggplot2
  • Data wrangling based on the tidyverse in R
  • Basic concepts of statistical modelling
  • Primer on causal inference
  • Introduction to regression analysis
  • Quick refresher of statistical inference

Instructor

Bio

Sebastian Sauer works as a professor at Ansbach university, teaching statistics and related stuff. Analyzing data to answer questions related to social phenomena is one of his major interests. He is trying to help raising the methological (and particularly statistical) skills in the sciences (ie., scientists). The programming language “R” is one of his favorite tools. He sees himself as a learner, and is particularly interested learning more on quantitative approaches to understand nature. Open Science is a hot topic to him. He hopes to contribute to pressing social problems such as populism by bringing in his statistical and psychological know-how. He writes a blog which serves as a sketchpad for stuff in his mind (not immune to thought updates) at https://data-se.netlify.app/. Sebastian is the author of “Moderne Datenanalyse mit R” (Sauer, 2019). His publication list is available on Google Scholar.

Contact

Check-out my personal homepage here.

Feel free to contact me via email at any time at profsebastiansauerATgmail.com.

Assessment and grades

There is no assessment, there are no grades!

Talk to me

It’s my goal to make this an excellent course and a stimulating and enjoyable experience for all of us. So that I can find out if this is happening, I encourage feedback—be it positive or negative—on all aspects of the course at any time. For example, if something I’m doing is making it difficult for you to learn, then let me know before it’s too late; if you particularly enjoyed something we did in class, say so so that we can do it again.

Course materials

Most of the materials as presented below is made available through the course book ModernDive. Please check the relevant chapters of the book before the course to make sure you have all materials available.

Licence

This is permissive work, see the licence here.

The author is Sebastian Sauer.

Check out the Github repo of this project.

Resources

Recommendations

For students willing to learn more and go deeper (than the concepts explored in the present course), this book on regression modelling, and this book on statistical learning are recommended. For German folks, check out my book on modern data analysis.

Suggested literature for deepening the analytic skills include Statistical Rethinking. For an introduction to graphical causal models, check out Julia Rohrer’s paper. For a more in-depth journey, consider reading this book. While I wholeheartedly recommend such books, we will not be able to discuss many of the ideas presented therein in class (in this course) due to time constraints.

R Packages

All R packages are accessible through the course book; please consult the relevant chapters. Please install all R packages used before the course. Here’s a tutorial on how to install R packages.

The most important R packages for this course are:

  • tidyverse
  • easystats

The following packages are useful for data access (but not strictly mandatory):

  • palmerpenguins
  • ggpubr
  • rio
  • vtree
  • visdat
  • dataexplorer
  • tableOne
  • flextable
  • gapminder
  • nycflights13
  • fivethirtyeight
  • skimr
  • ISLR

For the Bayes models you’ll need some extra software (free, save and stable), but somewhat more hassle to install. Using Bayes in this course is optional. You don’t miss a lot if you don’t use it.

  • rstanarm

For the R package {rstanarm} to run, you’ll need to install RStan. On Windows, this amounts to installing RTools. On Mac, you’ll need to install the XCode CLI2.

In sum, follow the instructions on the RStan website. It’s unfortunately a bit complicated.

Data

All data are accessible through the course book; please consult the relevant chapters.

Labs (case studies)

Practical data analysis skills can be practiced using these labs; in addition Chapter 11 provides two cases studies. Note that such content may be used as homework.

There are a lot of case studies scattered on the internet.

Sketching causal models

Dagitty is great tool for sketching causal graphs (DAGs), it can be usd in your browser or as R package. Here’s an example of a collider bias. Check out this post for an intuitive explanation.

German introductary course

Readers who speak German may check out this Blitzkurs into data analysis using R.

Where are the slides?

There are none. I feel that slides are not optimal for learning. In class, slides can be detrimental if they are too wordy because that distracts from that the dialogue with the instructor, and I hold this very dialogue as essential. Outside of class, slides are neither helpful. Instead, a good book is much more beneficial, because in a book, there’s enough room to patiently explain in sufficient details, an endeavor which is impossible for a slide deck.

To underline my messages to you, dear learners, I will use some sketches on a virtual whiteboard, some interactive apps, live coding, and some (pre-prepared) diagrams. That’s a bit similar to what happens at Khan Academy. You might have noticed that many courses at Coursera follow a similar approach.

I readily confess that this approach is novel to many learners in these days, learners who are accustomed to hundreds of Powerpoint slides. Please be open and I think you will appreciate this didactic style.

Technical Details

Last update: 2024-03-11 17:07:30

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       macOS Big Sur ... 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Berlin
 date     2024-03-07
 pandoc   3.1.12.2 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.2.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.2.0)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.2.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.2.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.2.0)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.2.0)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.2.1)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.2.1)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.2.0)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.2.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 xfun          0.41    2023-11-01 [1] CRAN (R 4.2.0)

 [1] /Users/sebastiansaueruser/Rlibs
 [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library

──────────────────────────────────────────────────────────────────────────────

  1. Desktop version, not the server↩︎

  2. possibly you need also a Fortran compiler, but maybe that’s optional↩︎