Here’s the software stack we’ll be using.

You should work your way through this stack in the order suggested.

This is by no means the only way to do open science or data science with open source software, and recommended packages are likely to change over time. The below R-based toolchain should be considered as merely one (out of several) consistent implementations of some best practices. However, once participants have mastered this toolchain, they should find it relatively easy to adapt to other ecosystems.

All of the software we install will be free and open source, but we will also be using some proprietary Software-as-a-Service (SaaS) offerings. For each of the proprietary services, there are open-source and/or self-hosted alternatives, but these are often much less convenient (e.g. self-hosted Jenkins vs GitHub Actions), or much less popular in the community and therefore less useful (e.g. GitLab vs GitHub). Relying on or pushing proprietary services, especially in an education context, is always awkward, but the disadvantages can sometimes be outweighed by convenience and network effects. For some aspects of open source software development and open science, proprietary services (especially GitHub and the StackExchange network) are, for better or for worse, the only game in town. In any event, most of what students will learn in this class is in free and open source software, and the remaining proprietary usage should easily translate to other, competing or open services.

Introduction

For the introductory session, participants should watch this video:

Here’s the corresponding slide deck.

Installation

Participants should install all of the below software and sign up for all the below services before the first class.

If you are using Docker, some of the software is already contained in the image, though “local” installation may still be advisable.

The steps required for installation will depend on your platform and system setup.

Basic Computing Literacy

You should know, or easily find, the answer to questions such as:

  • In what directory (absolute path) are the programs I use daily stored on my computer?
  • What is the operating system (OS) and version on my computer?
  • In what directory (absolute path) do I store my files?
  • Do I have sufficient privileges to install software? If not, how can I get them?
  • Which file format is better suited to editing and why: A *.docx or a *.pdf?
  • Why do the search queries jaguar car and jaguar -car give different results on Google?
  • Name at least 10 file types.
  • What is the username I usually use on public-facing platforms?
  • What is two-factor authentication (2FA)?
  • How is my hard disk formatted?
  • How can I upgrade my OS and frequently used software?
  • How is the data on my computer protected from unauthorised access?
  • What is my backup plan?
  • What is a VPN client, and what do I need it for?

If you feel like you need to brush up on some basic computing skills, these resources might be helpful:

Software Carpentry

Project Management

Sign up to GitHub.com

GitHub is a collaboration platform, code repository and git host (more on all of these below), and it also offers some helpful project management tools.

Please flesh out your profile on GitHub.com and all of the accounts below a bit by adding a picture, a name and a short description of yourself.

Tasks

Community & Help

  • Sign up to StackOverflow
  • Sign up to RStudio community

Aside from Google, these are two great places to get help, and to get involved in the community.

A lot of volunteers spend a lot of time on these sites, so it is very important not to waste their efforts, and to only add quality content, as defined by these sites.

Tasks

  • Sign up to StackOverflow.
  • Find an interesting question on StackOverflow and post it to the chat.
  • Sign up to RStudio community and post your profile to the chat so we can follow each other.
  • Find an interesting discussion on RStudio community and post it to the chat.

Additional Resources

In addition to StackExchange and RStudio Community, there are a couple of other platforms where the (very friendly) R community hangs out:

Markup Language

GitHub Flavored Markdown Spec

The full GFM spec is just FYI; there’s nothing to install here.

Plain text has many advantages (more on that later), but one glaring disadvantage: it does not look very nice, and does not implement many of the typesetting conventions that have evolved since Gutenberg (say, bold face).

Markup syntaxes solve this problem. Markup syntaxes are sets of conventions (as in *something* for highlighting) to structure human-generated text in a way that computers can operate on it, for example to format a piece of text.

There are many, many such markup languages out there, including HTML but also Markdown and LaTeX.

We will be focusing on Markdown as a source language, and then use open source tools (especially Pandoc) to render our source documents to all sorts of other formats, including PDF (via LaTeX), HTML (such as this website), but also Microsoft Word documents.

Markdown is a very lightweight markup language that was designed to be maximally human-readable, that is, to look meaningful without being compiled by a computer. Most of the syntax takes its cues from how people have already formatted plain text, such as enclosing a *word* with * for highlighting.

Technically, Markdown is a convention for writing such files, as well as a program to convert such files into HTML, as with, for example, this website (which is written in a flavor of Markdown).

By convention, Markdown files use the .md file extension. It’s important to recognize that an .md file is still a plain text file. You could open it with any text editor, or even change the extension to *.txt and nothing would change. The extension .md serves merely to tell computers that the following plain text is marked up in Markdown.
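
To see that an .md file really is just plain text, here is a minimal shell sketch. The file name notes.md and its contents are made up for illustration.

```shell
# Write a tiny Markdown document (file name and contents are illustrative).
printf '%s\n' '# Shopping list' '' 'Buy *fresh* bread.' > notes.md

# Copy it under a different extension; the bytes stay exactly the same.
cp notes.md notes.txt

# cmp exits with status 0 (and prints nothing) when two files are identical.
cmp notes.md notes.txt && echo "identical"
```

Either file opens in any text editor and shows the same three lines; only the extension differs.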

Markdown was (originally) quite a minimal standard, and has since branched out into a few specialised “flavors”, offering additional features.

We will be using only two of these flavors: GitHub Flavored Markdown (GFM) and Pandoc’s Markdown (more on that below).

StackOverflow, RStudio Community (Discourse), Gitter chat and many other services also support GFM or a close variant of it.

GitHub, a leading code-hosting service, has extended the original Markdown spec with a couple of additional features. In addition to these formatting niceties, GitHub also implements some clever cross-referencing and autocompletion magic. When using GitHub for source control and collaboration, you really must use these in issues, comments, commit messages etc. (they work everywhere).

Resources

Additional Resources

If you like, you can also install a program on your computer to render Markdown to HTML. There are plenty of choices, including the free MarkdownPad for Windows, and Lightpaper for OS X. If you don’t want to install something, GitHub (see below) also offers a Markdown preview in its browser-based editor. We will be using different programs going forward.

Shell

A shell is a command-line interface (CLI) to your computer (as opposed to a point-and-click graphical user interface, or GUI). You may also know this as “the console”, or “the terminal”.

There are technically different kinds of shells, though the bash shell is the most widespread, and “bash” is often used interchangeably with “the shell”.

A lot of programs that we’ll be using only run at the CLI, so it’s important to know how to use it.

You can use the (Linux) shell that ships with our Docker container, but you should also know your way around the shell that ships with your OS (to, among other things, spin up a Docker image).

On macOS, Linux: Nothing to install, ships with a bash shell or something similar.

On Windows:

  • Install Git for Windows because that comes with at least a git shell. Choose git bash emulation on install.
  • If your version is >= Windows 10 Anniversary Update, you can also install the Windows Subsystem for Linux (WSL) and use the Windows 10 Bash Shell. However, this is a separate system inside your Windows installation, and the programs installed inside it may (as of 2019-01) not be usable from “normal” Windows GUI programs. If you don’t know what this means, do not install the WSL; it can be very confusing.

If you like a fancier shell, you might want to look at the oh-my-zsh project, which has some pretty cool features. However, this is strictly optional and will not be supported in class.

Additional Resources

Bash (Optional)

It turns out that bash, the default shell on many UNIX-type computers, is also a scripting language in itself. Scripting languages are programming languages which facilitate automated execution of tasks, such as, say, running a bunch of updates and then power cycling your computer.

bash isn’t necessarily the greatest scripting language; especially for more complicated projects, “proper” scripting languages such as Python, Ruby or R might serve you better.

But bash has the advantage that it is available in (almost all) UNIX-type computing environments, so it’s often the easiest way to automate steps. It looks a bit arcane (because it is), but you don’t need much to build powerful scripts that can save you a lot of time.
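
As a rough illustration of how little is needed, here is a short script that copies all Markdown files in the current directory into a dated backup folder. The folder naming scheme and the *.md pattern are assumptions made up for this sketch, not something used in class.

```shell
#!/usr/bin/env bash
# Back up every Markdown file in the current directory into a dated folder.
# (Folder naming and the *.md pattern are illustrative assumptions.)
set -eu  # stop on errors and on use of unset variables

backup_dir="backup-$(date +%Y-%m-%d)"
mkdir -p "$backup_dir"

count=0
for file in *.md; do
  # If no *.md files exist, the pattern stays literal; skip it.
  [ -e "$file" ] || continue
  cp "$file" "$backup_dir/"
  count=$((count + 1))
done

echo "Copied $count file(s) to $backup_dir"
```

Saved as, say, backup.sh, this could be run with bash backup.sh from any directory.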

This entire topic and the below additional resources are recommended for advanced readers. You won’t need this starting out.

Containerisation (optional)

Install Docker Desktop

Docker is an open-source industry standard to define, provision and share computing environments, known as containers. Containers allow you to run computing environments on other computers. Containers are similar to virtual machines (a computer inside a computer), but slimmer and generally neater.

You may want to use Docker in class to quickly get a development environment, but it is also generally a helpful tool.

Resources

Tasks

  • Install and run a docker image:

    1. Download and install Docker Desktop Community. You need to set up a docker ID as part of the download, and you should remember your account credentials.
    2. Launch Docker Desktop. A small whale symbol will show up in your task bar, but not much else.
    3. Launch a system shell and type in
    docker run --env=PASSWORD=yourpassword --rm --publish=8787:8787 rocker/verse

    On macOS and Linux, your default system shell is an application called “Terminal”. On Windows, you can use the “command prompt” or “PowerShell” applications. If you already have git installed on Windows, you can also use the git bash emulator.

    Depending on how fast your internet connection is, this process will take a while to complete.

    4. Open a web browser and point it to http://localhost:8787. You should see the login window for the RStudio IDE. You are in the browser, but are running all computations on your machine through Docker. This will also work if you are offline.
    5. Type in rstudio as a username and the PASSWORD you set in the above command as a password.
    6. You are now in the RStudio IDE.
    7. To shut down, close the browser window and type ctrl + c in your shell.
  • Learn the difference between a Dockerfile, an image and a container in the context of Docker.
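
As a starting point for the last task, here is a hedged sketch of how the three concepts relate: a Dockerfile is a plain-text recipe, an image is what gets built from it, and a container is a running (or stopped) instance of an image. The directory docker-demo, the image tag my-verse and the extra R package are hypothetical choices for illustration only.

```shell
# A Dockerfile is a plain-text recipe for an image.
# Here we extend rocker/verse with one extra R package (illustrative choice).
mkdir -p docker-demo
cat > docker-demo/Dockerfile <<'EOF'
FROM rocker/verse
RUN Rscript -e 'install.packages("janitor")'
EOF

if command -v docker >/dev/null 2>&1; then
  # Build an image from the recipe; "my-verse" is a made-up tag.
  docker build --tag my-verse docker-demo
  # List the images stored on this machine.
  docker image ls
  # List containers (instances of images), including stopped ones.
  docker ps --all
else
  echo "docker not found; install Docker Desktop first"
fi
```

The docker run command from the steps above does the last part in one go: it creates a container from the rocker/verse image and starts it.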

Source Control Management (SCM)

Install Git

At its core, Git is just a CLI program. The CLI offers all of Git’s functionality, but you may also install a graphical user interface (GUI) for Git.

There are plenty of those out there, but one of the easiest is the GitHub Desktop app from GitHub (available only for Windows and macOS).

Install GitHub Desktop

You should install Git and GitHub Desktop even if you are using Docker.

You also need to configure git, so that git knows who you are and so that you can authenticate against git hosts (GitHub.com in our case). You need to do this on your machine and wherever else you are using Git (such as a SaaS):

  • Your local machine
  • RStudio Cloud
  • Configuring git inside Docker isn’t recommended; it can get difficult, and you must be careful not to disclose your credentials.
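
A minimal sketch of the identity part of this configuration follows. The name and email are placeholders; use your own, ideally the email attached to your GitHub account. Authentication (for example via SSH keys or access tokens) is a separate step and is not shown here.

```shell
# Tell git who you are; this is recorded in every commit you make.
# Name and email are placeholders, replace them with your own.
git config --global user.name "Jane Doe"
git config --global user.email "jane.doe@example.com"

# Check what git has stored.
git config --global --get user.name
git config --global --get user.email
```

The --global flag stores these settings for your user account, so they apply to every repository on that machine.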

Additional Resources

Branching Model (Optional)

GitHub Flow

There is a varied set of practices and tools that have evolved on top of Git. Together with the powerful git SCM, it is these practices and tools that make massively collaborative software development possible.

One of the simpler practices is GitHub Flow. We will use it to learn the branch and pull-request model.
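
The local half of that model can be sketched in a throwaway repository. The directory name, branch name, file contents and identity below are all made up for illustration; the push and the pull request itself need a remote on GitHub.com and are therefore only indicated in a comment.

```shell
# Start from a throwaway repository (directory name is illustrative).
git init --quiet flow-demo
cd flow-demo
git config user.name "Jane Doe"               # placeholder identity, this repo only
git config user.email "jane.doe@example.com"

# An initial commit on the default branch.
echo "# Flow demo" > README.md
git add README.md
git commit --quiet -m "Add README"

# 1. Create a descriptively named branch for your change.
git checkout -b add-installation-notes

# 2. Commit your work on that branch.
echo "Install R and RStudio first." >> README.md
git commit --quiet -am "Add installation notes"

# 3. Publish the branch, then open a pull request on GitHub.com.
#    This needs a remote called "origin", so it is left commented out here.
# git push --set-upstream origin add-installation-notes

# Show which branch we are on, then leave the demo directory.
git branch --show-current
cd ..
```

The default branch stays untouched until the pull request is reviewed and merged, which is the main point of the model.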

Package Management (optional)

  • Linux: already ships with apt
  • macOS: install homebrew
  • Windows: install chocolatey

Installing and upgrading a lot of command line tools and their dependencies gets old quickly. Package managers solve this problem; they provide a clean and elegant way to install (CLI) programs, and even allow you to quickly upgrade everything.
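
To give a feel for what this looks like in practice, the sketch below detects which of the three package managers is available and prints (rather than runs) the matching install command. The package pandoc is just an example, and the detection logic is an illustrative assumption, not an official recipe.

```shell
# Print (rather than run) the install command for the package manager found
# on this machine. "pandoc" is just an example package.
pkg="pandoc"
if command -v brew >/dev/null 2>&1; then
  suggestion="brew install $pkg"            # macOS: Homebrew
elif command -v apt-get >/dev/null 2>&1; then
  suggestion="sudo apt-get install $pkg"    # Debian/Ubuntu: apt
elif command -v choco >/dev/null 2>&1; then
  suggestion="choco install $pkg"           # Windows: Chocolatey
else
  suggestion="no known package manager found"
fi
echo "$suggestion"
```

Each of these tools also has a single command to upgrade everything it has installed, which is where most of the time saving comes from.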

You will need to install a package manager regardless of whether you use the Docker image.

Notice that LaTeX, Atom and R (all below) each have their own internal package managers (as do many other software ecosystems). If you’re installing a package for any of those, use the corresponding ecosystem package manager, not your system-wide package manager (= brew, apt-get, chocolatey).

Text Editor (Optional)

Install Atom

Whenever we write something in this class, it will be in plain text. Plain text, roughly speaking, consists directly and only of letters, encoded in an open standard.

This may seem antiquated, but has several advantages:

  • Plain text can easily be versioned by computer software such as git.
  • Plain text is transparent to the user: it is human-readable. For comparison, try opening a *.doc in a text editor, and see whether you can make out any meaning.
  • Plain text is lightweight and robust. File sizes and memory footprint are tiny.
  • Plain text files future-proof your work and data. *.txt, or, equivalently for data, *.csv can be opened and edited on pretty much any computer today, could be 30 years ago, and will probably still be widely accessible in 30 years’ time.

Most operating systems ship with a text editor, but they are quite basic and can be cumbersome to use. Specialized text editors (or just editors) offer more functionality geared towards technical writing or software development.

There are many editors out there, and people have strong views on which is best. In some ways, this is surprising, because of all the software used in collaborative writing or development, editors are the tool that needs the least standardisation. Playing off the advantages of plain text files, everyone can use what works best for them, because they all output the exact same thing: a *.txt.

You are therefore free to choose your own text editor.

You can use the RStudio IDE that comes with our Docker image, but other editors are more fully featured.

Atom has the advantage of being relatively easy to use, free and open source and relatively widely supported. It also comes with some nice Git(Hub) integration.

Atom, like most editors, has a modular design. Many of its features are factored out to separate packages, some of which are contributed by external volunteers.

Here’s a list of packages you might also want to install:

  • atom-beautify
  • atom-html-preview
  • document-outline
  • git-plus
  • language-knitr
  • language-latex
  • latex
  • language-markdown
  • merge-conflicts
  • minimap-split-diff

R Integrated Development Environment (IDE)

Install RStudio

If you are using the Docker container image, RStudio and R are already included.

Aside from text editors, there are also integrated development environments (IDE) (though this distinction has recently been blurring with the arrival of Atom-IDE and others). IDEs are a little like text editors, in that they mostly let you edit plain text files, but they offer a lot of “training wheels” for programming and are often geared towards particular programming languages.

The leading IDE for R is called RStudio, by, confusingly, a company called RStudio. We will be using the open source variant of RStudio (the IDE), but RStudio (the company) also sells commercial licenses to the IDE and other products.

If you are already deeply invested in an IDE or editor (especially vim or emacs), you may also trick out that program to support R. The Emacs Speaks Statistics project has great support for R, but Emacs has a steep learning curve.

For most everyone, RStudio will therefore be the strongly recommended choice.

Technical & Scientific Authoring (optional)

All software in this section is included in the Docker image.

Document Conversion (optional)

Install Pandoc

We’ll often want to convert documents from and to different markup formats. For that purpose, we’ll use pandoc.

Pandoc is, originally, a kind of Swiss army knife for converting between text document formats, such as, say, between Microsoft Word and HTML.

But as part of this work, Pandoc has also defined its own extension (flavor) to Markdown (largely compatible with GFM), including such features as footnotes, captions, references, and other aspects important for technical and scientific writing.

You should learn both to use Pandoc at the CLI and to write in the corresponding Pandoc’s Markdown style.
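
A minimal sketch of both halves follows: a small source file in Pandoc’s Markdown with a footnote, then two conversions at the CLI. The file names and the text are made up for illustration.

```shell
# A small source document in Pandoc's Markdown, including a footnote
# (file names and text are illustrative).
cat > report.md <<'EOF'
# Results

Plain text is robust.[^1]

[^1]: It can be versioned with git.
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Convert Markdown to a standalone HTML page ...
  pandoc report.md --standalone --output report.html
  # ... and to a Microsoft Word document.
  pandoc report.md --output report.docx
else
  echo "pandoc not found; install it first"
fi
```

Pandoc infers the target format from the extension given to --output, so the same source file can feed several output formats.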

Typesetting (optional)

Install LaTeX

(La)TeX is, strictly speaking, a typesetting program, which can create beautiful documents. It has extensive support for all sorts of domain-specific typographic niceties, and is used a lot by academics, especially in math and the sciences.

However, because LaTeX is quite cumbersome to compose and tends to distract writing with a lot of bells and whistles, we will not learn to write LaTeX directly “by hand”. Instead, we will be using Pandoc to compile our Pandoc Markdown source to PDF (via LaTeX), and, because LaTeX can be slow to compile, we will only do so rarely and towards the end of any given project.

Still, it is important to learn some of the basics of LaTeX to use it programmatically.

Bibliography Management (optional)

  • Install Pandoc Citeproc
  • Install Zotero

Bibliography management is not the focus of this class, but you can learn more about it here.

It is also one of those tools where there is no strong reason to standardize on any one program, as long as the bibliography manager exports to one of the formats that pandoc can ingest.

Check if your bibliography manager can export to at least one of these formats.

If you have a choice, a BibTeX or BibLaTeX file (confusingly, both use the *.bib extension) is preferable.

Introductory R

All software in this section is included in the Docker image.

“Base” R

Install R

Resources

Additional Resources

Literate Programming

  • Install knitr
  • Install RMarkdown

Resources

Intermediate R

All software in this section is included in the Docker image.

Tidyverse R

Install dplyr, tidyr, readr, tibble, purrr and stringr

Resources

Plots

Install ggplot2

Resources

Interactive R

All software in this section is included in the Docker image.

HTML, JS & CSS (optional)

The below packages for (web) interactivity in R try to abstract away as much as possible the underlying web technologies (HTML, JavaScript and CSS). You can use them without knowing anything about this stack, but you can accomplish more and understand them in a deeper way if you have at least a cursory understanding of how these technologies work.

Covering them in any depth, or even listing good resources (of which there are gazillions) is beyond the scope of this class, so these should be considered mere starting points.

Interactive Webapps

Install shiny

Resources

Advanced R

Cloud Computing with R

Continuous Integration & Delivery (CI/CD)

Sign up to Travis CI

Resources

The Cloudyr Project

tba.

Reproducible Research

Defensive Computing

tba.

Storing Datasets

tba.

Publishing results

tba.

Dependency Management

Install packrat

Package development