Here’s the software stack we’ll be using.
You should work your way through this stack in the order suggested.
This is by no means the only way to do open science or data science with open source software, and recommended packages are likely to change over time. The below R-based toolchain should be considered as merely one (out of several) consistent implementations of some best practices. However, once participants have mastered this toolchain, they should find it relatively easy to adapt to other ecosystems.
All of the software will be free and open source software, but we will also be using some proprietary Software-as-a-Service (SaaS) offerings. For each of the proprietary services, there are open-source and/or self-hosted alternatives, but these are often much less convenient (e.g. self-hosted Jenkins vs GitHub Actions), or they are much less popular in the community, and therefore less useful (e.g. GitLab vs GitHub). Relying on, or pushing, proprietary services, especially in an education context, is always awkward, but the disadvantages can sometimes be outweighed by convenience and network effect advantages. For some aspects of open source software development and open science, proprietary services – especially GitHub and the StackExchange network – for better or for worse just are the only game in town. In any event, most of what students will learn in this class is in free and open source software, and the remaining proprietary usage should easily translate to other, competing or open services.
The stack is listed below in the approximate order in which software is covered in this class, and together with mandatory study material and tasks.
The first couple of topics and tools should be covered by everyone. Later, more advanced topics can be chosen by students according to their interest, as illustrated in the below flowchart.
Flow Chart of Covered Topics
For the introductory session, participants should watch this video:
Here’s the corresponding slide deck.
Participants should install all of the below software and sign up for all the below services before the first class.
If you are using Docker, some of the software is already contained in the image, though “local” installation may still be advisable.
The steps required for installation will depend on your platform and system setup.
… is expected. To start this class you need to know your way around your own computer and commonly used technologies.
You should know, or easily find, the answer to questions such as:
… a `*.docx` or a `*.pdf`?
… `jaguar car` and `jaguar -car` give different results on Google?
If you feel like you need to brush up on some basic computing skills, these resources might be helpful:
GitHub is a collaboration platform, code repository and Git host (more on all of these below), along with some helpful project management tools.
Choose your username carefully: it should be easy to type, and clearly identify you. If you already have such a public username on other platforms (say, Twitter), consider reusing that. Your username need not be your real name, but it makes things easier if it includes (parts of) your name.
Please flesh out your profile on GitHub.com and all of the accounts below a bit by adding a picture, a name and a short description of yourself.
Also be careful in choosing your commit email address. This should be an email account you have access to forever.
… and one last thing: By default, GitHub will notify you via email about pretty much every repository activity, which will result in a lot of email. Here’s how you can customize or disable these notifications. Make sure that you’re not missing anything on GitHub.com either.
Here’s a sensible set of defaults:
github.com/settings/notifications
Sign up to StackOverflow
Sign up to RStudio community
This includes setting up a decent profile with picture, links etc. Same username suggestions apply as above.
Aside from Google, these are two great places to get help, and to get involved in the community.
A lot of volunteers spend a lot of time on these sites, so it is very important not to waste their efforts, and to only add quality content, as defined by these sites.
In addition to StackExchange and RStudio Community, there are a couple of other platforms where the (very friendly) R community hangs out:
`#rstats` hashtag. Consider joining!
The full GFM spec is just FYI; there’s nothing to install here.
Plain text has many advantages (more on that later), but one glaring disadvantage: it does not look very nice, and does not implement many of the typesetting conventions that have evolved since Gutenberg (say, bold face).
Markup syntaxes solve this problem. Markup syntaxes are sets of conventions (as in `*something*` for highlighting) to structure human-generated text in a way that computers can operate on it, such as formatting a piece of text.
There are many, many such markup languages out there, including HTML, but also Markdown and LaTeX.
We will be focusing on Markdown as a source language, and then use open source tools (especially Pandoc) to render our source documents to all sorts of other formats, including PDF (via LaTeX), HTML (such as this website), but also Microsoft Word documents.
Markdown is a very lightweight markup language that was designed to be maximally human readable, that is, to look meaningful without being compiled by a computer. Most of the syntax takes its cues from how people have already formatted plain text, such as enclosing a `*word*` with `*` for highlighting.
Technically, Markdown is a convention for writing such files, as well as a program to convert such files into HTML, as, for example, on this website (which is written in a flavor of Markdown).
By convention, Markdown files use the `.md` file extension. It’s important to recognize that an `.md` file is still a plain text file. You could open it with any text editor, or even change the extension to `*.txt`, and nothing would change. The extension `.md` serves merely to tell computers that the plain text inside is marked up in Markdown.
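Here is a minimal shell sketch of that point. The file name `notes.md` and its contents are made up for illustration; the only claim being tested is that copying the file under another extension leaves the bytes unchanged.

```shell
# Work in a throwaway directory so nothing on your machine is touched.
workdir="$(mktemp -d)"

# Write a small Markdown file; the content is only an illustration.
cat > "$workdir/notes.md" <<'EOF'
# Shopping list

Buy *fresh* bread and **two** apples.
EOF

# Copy it under a different extension.
cp "$workdir/notes.md" "$workdir/notes.txt"

# cmp -s compares the files byte by byte and reports only via its exit status.
if cmp -s "$workdir/notes.md" "$workdir/notes.txt"; then
  echo "Same bytes: the extension is only a label."
fi

# Any plain-text tool can read the file, whatever its extension.
cat "$workdir/notes.md"
```

The markup characters (`#`, `*`, `**`) are ordinary text in the file; they only turn into headings and emphasis once a Markdown renderer processes them.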
Markdown was (originally) quite a minimal standard, and has since branched out into a few specialised “flavors”, offering additional features.
We will be using only two of these flavors: GitHub Flavored Markdown (GFM) and Pandoc’s Markdown (more on that below).
StackOverflow, RStudio Community (Discourse), Gitter chat and many other services also support GFM.
GitHub, a leading code-hosting service, has extended the above original Markdown spec by a couple of additional features. In addition to these formatting niceties, GitHub also implements some clever cross-referencing and autocompletion magic. When using GitHub for source control and collaboration, you really must use these in issues, comments, commit messages etc. (they work everywhere).
If you like, you can also install a program on your computer to render Markdown to HTML. There are plenty of choices, including the free MarkdownPad for Windows, and Lightpaper for OS X. If you don’t want to install something, GitHub (see below) also offers a Markdown preview in its browser-based editor. We will be using different programs going forward.
A shell is a command-line interface (CLI) to your computer (as opposed to a point-and-click graphical user interface, or GUI). You may also know this as “the console”, or “the terminal”.
There are technically different kinds of shells, though the `bash` shell is the most widespread, and is often used interchangeably with the shell.
A lot of programs that we’ll be using only run at the CLI, so it’s important to know how to use it.
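If you have never typed at a shell before, the following sketch shows a handful of everyday commands. The directory and file names are made up, and everything happens in a temporary folder, so it should be safe to paste into a bash-like shell.

```shell
# Create a scratch directory and move into it.
workdir="$(mktemp -d)"
cd "$workdir"

pwd                      # print the directory you are currently in
mkdir project            # make a new directory
cd project               # change into it
echo "hello" > note.txt  # write a line of text into a new file
ls -l                    # list the directory contents
cat note.txt             # print the file's contents
cd ..                    # go back up one level
```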
You can use the (Linux) shell that ships with our Docker container, but you should also know your way around the shell that ships with your OS (to, among other things, spin up a Docker image).
On macOS, Linux: Nothing to install, ships with a bash shell or something similar.
On Windows:
If you would like a fancier shell, you might want to look at the oh-my-zsh project, which has some pretty cool features. However, this is strictly optional and will not be supported in class.
It turns out that `bash`, the default shell on UNIX-type computers, is also a scripting language in itself. Scripting languages are programming languages which facilitate automated execution of tasks, such as, say, running a bunch of updates and then power cycling your computer.
`bash` isn’t necessarily the greatest scripting language; especially for more complicated projects, “proper” scripting languages such as Python, Ruby or R might serve you better. But `bash` has the advantage that it is available in (almost all) UNIX-type computing environments, so it’s often the easiest way to automate steps. It looks a bit arcane (because it is), but you don’t need much to build powerful scripts that can save you a lot of time.
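To give a flavour of what such a script looks like, here is a small sketch that makes dated backup copies of every Markdown file in a folder. The file names and the backup naming scheme are invented for this example; treat it as one possible way to automate a repetitive step, not as a recommended backup strategy.

```shell
#!/usr/bin/env bash
# Stop at the first failing command or unset variable.
set -eu

# Scratch directory with a few made-up files to operate on.
workdir="$(mktemp -d)"
printf 'first draft\n'  > "$workdir/chapter1.md"
printf 'second draft\n' > "$workdir/chapter2.md"
printf 'x,y\n1,2\n'     > "$workdir/data.csv"

# Dated copies of the Markdown files go into a backup folder.
backup_dir="$workdir/backup"
mkdir -p "$backup_dir"
stamp="$(date +%Y-%m-%d)"

# Loop over the Markdown files only; the CSV file is left alone.
count=0
for file in "$workdir"/*.md; do
  name="$(basename "$file" .md)"
  cp "$file" "$backup_dir/${name}_${stamp}.md"
  count=$((count + 1))
done

echo "Backed up $count Markdown file(s) to $backup_dir"
```

Variables, a loop and a few standard programs (`cp`, `date`, `basename`) are already enough for a useful script.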
This entire topic and the below additional resources are recommended for advanced readers. You won’t need this starting out.
Docker is an open-source industry standard to define, provision and share computing environments, known as containers. Containers allow you to run computing environments on other computers. Containers are similar to virtual machines (a computer inside a computer), but slimmer and generally neater.
You may want to use Docker in class to quickly get a development environment, but it is also generally a helpful tool.
Install and run a docker image:
docker run --env=PASSWORD=yourpassword --rm --publish=8787:8787 rocker/verse
On macOS and Linux, your default system shell is an application called “Terminal”. On Windows, you can use the “command prompt” or “PowerShell” applications. If you already have git installed on Windows, you can also use the git bash emulator.
Depending on how fast your internet connection is, this process will take a while to complete.
Open http://localhost:8787 in your browser. You should see the login window for the RStudio IDE. You are in the browser, but are running all computations on your machine through Docker. This will also work if you are offline.
Log in with `rstudio` as a username and the `PASSWORD` you gave in the above command as a password.
To stop the container, press `ctrl + c` in your shell.
Learn the difference between a `Dockerfile`, an image and a container in the context of Docker.
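The three terms can be sketched in a few lines of shell. The Dockerfile below is hypothetical (it adds one R package on top of `rocker/verse`), and `my-verse` is a made-up tag. The Docker commands are only printed, not executed, because building would download a large image.

```shell
# Scratch directory for the example.
workdir="$(mktemp -d)"

# 1. A Dockerfile is a plain-text recipe. This hypothetical one starts from
#    the rocker/verse image and installs one extra R package.
cat > "$workdir/Dockerfile" <<'EOF'
FROM rocker/verse
RUN R -e "install.packages('plotly', repos = 'https://cloud.r-project.org')"
EOF

# 2. Building the recipe produces an image: a stored, read-only snapshot.
build_cmd="docker build --tag my-verse $workdir"

# 3. Running an image starts a container: a live instance of that snapshot.
run_cmd="docker run --env=PASSWORD=yourpassword --rm --publish=8787:8787 my-verse"

# Print the commands so you can copy them into your shell when ready.
echo "$build_cmd"
echo "$run_cmd"
echo "docker image ls   # list the images stored on this machine"
echo "docker ps         # list the containers that are currently running"
```

One Dockerfile yields one image per build, and one image can be started as many containers as you like.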
Git itself is just a CLI program. The CLI offers all the functionality of Git, but you may also install a Git graphical user interface (GUI).
There are plenty of those out there, but one of the easiest is the GitHub Desktop app from GitHub (available only for Windows and macOS).
You should install Git and GitHub Desktop even if you are using Docker.
You also need to configure Git on your machine, so that Git knows who you are and can authenticate you against Git hosts (GitHub.com in our case) and wherever else you are using Git (such as a SaaS):
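As a sketch of the "who you are" part, the commands below set a name and commit email. The name and address are placeholders. To keep the example harmless, it writes the settings into a scratch repository only; on your own machine you would normally add `--global` so that they apply to every repository. Authentication against GitHub is a separate step and is not shown here.

```shell
if command -v git >/dev/null 2>&1; then
  # A scratch repository, so that your real settings stay untouched.
  repo="$(mktemp -d)"
  git init --quiet "$repo"

  # Placeholder identity; use your own name and your commit email address.
  # With --global added, these would be stored once for all repositories.
  git -C "$repo" config user.name "Jane Doe"
  git -C "$repo" config user.email "jane.doe@example.com"

  # Read the values back to check what Git will record in each commit.
  git -C "$repo" config user.name
  git -C "$repo" config user.email
else
  echo "git not found; install it first."
fi
```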
There is a varied set of practices and tools that have evolved on top of Git. Together with the powerful Git SCM, it is these practices and tools that make massively collaborative software development possible.
One of the simpler practices is GitHub Flow. We will use it to learn the branch and pull-request model.
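The branch part of that model can be tried out locally. The sketch below uses a scratch repository with made-up file, branch and identity names. The push and the pull request happen on GitHub, so they appear only as comments, and the final merge stands in for clicking "Merge" on the pull request.

```shell
if command -v git >/dev/null 2>&1; then
  # Scratch repository with a placeholder identity.
  repo="$(mktemp -d)"
  cd "$repo"
  git init --quiet
  git symbolic-ref HEAD refs/heads/main   # name the first branch "main"
  git config user.name "Jane Doe"
  git config user.email "jane.doe@example.com"

  # main always holds the agreed-upon state of the project.
  echo "# Analysis" > README.md
  git add README.md
  git commit --quiet -m "Add README"

  # 1. Create a descriptively named branch for one piece of work.
  git checkout --quiet -b add-methods-section

  # 2. Commit your changes on that branch.
  echo "## Methods" >> README.md
  git commit --quiet -am "Add methods section"

  # 3. On a real project you would now push the branch and open a pull request:
  #      git push --set-upstream origin add-methods-section
  #    Discussion and review then happen on GitHub.

  # 4. Once approved, the branch is merged into main (done locally here).
  git checkout --quiet main
  git merge --quiet --no-ff -m "Merge add-methods-section" add-methods-section

  git log --oneline --graph
else
  echo "git not found; install it first."
fi
```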
Linux: already ships with apt
macOS: install homebrew
Windows: install chocolatey
Installing and upgrading a lot of command line tools and their dependencies gets old quickly. Package managers solve this problem; they provide a clean and elegant way to install (CLI) programs, and even allow you to quickly upgrade everything.
You will need to install a package manager independent of using the Docker image.
Consider installing the below software via your package manager. Whether this is advisable or even possible for any given piece of software depends on the software and your operating system. Google around for advice. As a rule of thumb, “heavy” packages (such as LaTeX, Atom or R) are sometimes best installed “by hand”.
Notice that LaTeX, Atom and R (all below) each have their own internal package managers (as do many other software ecosystems). If you’re installing a package for any of those, use the corresponding ecosystem package manager, not your system-wide one (`brew`, `apt-get`, `chocolatey`).
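The sketch below shows what installing one program looks like with each of the three package managers. It only detects which one is available and prints the matching command, rather than installing anything. `pandoc` is used purely as an example package, and the check runs in a bash-like shell (on Windows, for instance, the git bash emulator).

```shell
# Name of the program to install; pandoc is just an example.
package="pandoc"

# Pick the install command for whichever package manager is on the PATH.
if command -v brew >/dev/null 2>&1; then
  install_cmd="brew install $package"
elif command -v apt-get >/dev/null 2>&1; then
  install_cmd="sudo apt-get install $package"
elif command -v choco >/dev/null 2>&1; then
  install_cmd="choco install $package"
else
  install_cmd=""
fi

# Print the command instead of running it.
if [ -n "$install_cmd" ]; then
  echo "To install $package, run: $install_cmd"
else
  echo "No supported package manager found."
fi
```

Upgrading everything at once is similarly short: `brew upgrade`, `sudo apt-get update && sudo apt-get upgrade`, or `choco upgrade all`.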
Whenever we write something in this class, it will be in plain text. Plain text, roughly speaking, consists directly and only of letters, encoded in an open standard.
This may seem antiquated, but has several advantages:
… `*.doc` in a text editor, and see whether you can make out any meaning.
`*.txt`, or, equivalently for data, `*.csv` can be opened and edited on pretty much any computer today, could be 30 years ago, and probably will still be widely accessible in 30 years’ time.
Most operating systems ship with a text editor, but they are quite basic and can be cumbersome to use. Specialized text editors (or just editors) offer more functionality geared towards technical writing or software development.
There are many editors out there, and people have strong views on which is best. In some ways, this is surprising, because of all the software used in collaborative writing or development, editors are the tool that needs the least standardisation. Playing off the advantages of plain text files, everyone can use what works best for them, because they all output the exact same thing: a `*.txt`.
You are therefore free to choose your own text editor.
You can use the RStudio IDE that comes with our Docker image, but other editors are more fully featured.
Atom has the advantage of being relatively easy to use, free and open source and relatively widely supported. It also comes with some nice Git(Hub) integration.
Atom, as most editors, has a modular design. Many of its features are factored out to separate packages, some of which are contributed by external volunteers.
Here’s a list of packages you might also want to install:
atom-beautify
atom-html-preview
document-outline
git-plus
language-knitr
language-latex
latex
language-markdown
merge-conflicts
minimap-split-diff
Before installing RStudio, you must first install R (see below). We may, for now, not use R much by itself, but it can be easier to use all the other tools inside of RStudio, rather than separately.
If you are using the Docker container image, RStudio and R are already included.
Aside from text editors, there are also integrated development environments (IDE) (though this distinction has recently been blurring with the arrival of Atom-IDE and others). IDEs are a little like text editors, in that they mostly let you edit plain text files, but they offer a lot of “training wheels” for programming and are often geared towards particular programming languages.
The leading IDE for R is called RStudio, by, confusingly, a company called RStudio. We will be using the open source variant of RStudio (the IDE), but RStudio (the company) also sells commercial licenses to the IDE and other products.
If you are already deeply invested in an IDE or editor (especially Vim or Emacs), you may also trick out that program to support R. The Emacs Speaks Statistics project has great support for R, but Emacs has a steep learning curve.
For most everyone, RStudio will therefore be the strongly recommended choice.
All software in this section is included in the Docker image.
We’ll often want to convert documents from and to different markup formats. For that purpose, we’ll use pandoc.
Pandoc is, originally, a kind of Swiss army knife for converting between text document formats, such as, say, between Microsoft Word and HTML.
But as part of this work, Pandoc has also defined its own extension (flavor) to Markdown (largely compatible with GFM), including such features as footnotes, captions, references, and other aspects important for technical and scientific writing.
You should learn both to use Pandoc at the CLI and to write in the corresponding Pandoc’s Markdown style.
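A first contact with Pandoc at the CLI might look like the sketch below. The document `report.md` is made up; it uses a title block and a footnote, two of the Pandoc Markdown features mentioned above. The conversion only runs if `pandoc` is installed.

```shell
workdir="$(mktemp -d)"

# A small document in Pandoc's Markdown, with a title block and a footnote.
cat > "$workdir/report.md" <<'EOF'
---
title: A Tiny Report
---

Plain text is *portable*.[^1]

[^1]: It opens in any text editor.
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Convert the Markdown source to a complete, standalone HTML page.
  pandoc --from=markdown --to=html --standalone \
    --output="$workdir/report.html" "$workdir/report.md"
  echo "Wrote $workdir/report.html"
  # The same source can target other formats by changing the output name:
  #   pandoc "$workdir/report.md" --output="$workdir/report.docx"
  #   pandoc "$workdir/report.md" --output="$workdir/report.pdf"   # needs LaTeX
else
  echo "pandoc not found; install it to run the conversion."
fi
```

Pandoc guesses the target format from the extension of the output file, which is why the commented variants need no `--to` option.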
(La)TeX is, strictly speaking, a typesetting program which can create beautiful documents. It has extensive support for all sorts of domain-specific typographic niceties, and is used a lot by academics, especially in math and the sciences.
However, because LaTeX is quite cumbersome to compose and tends to distract writing with a lot of bells and whistles, we will not learn to write LaTeX directly “by hand”. Instead, we will be using Pandoc to compile our Pandoc Markdown source to PDF (via LaTeX), and, because LaTeX can be slow to compile, we will only do so rarely and towards the end of any given project.
Still, it is important to learn some of the basics of LaTeX to use it programmatically.
Install Pandoc Citeproc
Install Zotero
Bibliography management is not the focus of this class, but you can learn more about it here.
It is also one of those tools where there is no strong reason to standardize on any one program, as long as the bibliography manager exports to one of the formats that pandoc can ingest.
Check if your bibliography manager can export to at least one of these formats.
If you have a choice, a BibTeX or BibLaTeX file (confusingly both named `*.bib`) is preferable.
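To see how a `*.bib` file and a Markdown source fit together, here is a minimal sketch. The bibliography entry is an invented placeholder, not a real reference, and the file names are made up. The exact command line depends on your Pandoc version, so treat the invocation as an assumption to check against your installation.

```shell
workdir="$(mktemp -d)"

# A one-entry BibTeX file. The entry is a made-up placeholder.
cat > "$workdir/refs.bib" <<'EOF'
@article{doe2020,
  author  = {Doe, Jane},
  title   = {A Placeholder Article},
  journal = {Journal of Examples},
  year    = {2020}
}
EOF

# A Markdown source citing the entry by its key.
cat > "$workdir/paper.md" <<'EOF'
---
title: Citing From a Bib File
---

Earlier work supports this claim [@doe2020].
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Recent Pandoc releases resolve citations with --citeproc; older ones
  # used the separate filter instead (--filter pandoc-citeproc).
  pandoc --citeproc --bibliography="$workdir/refs.bib" \
    --output="$workdir/paper.html" "$workdir/paper.md" \
    || echo "Conversion failed; this pandoc may predate --citeproc."
else
  echo "pandoc not found; install it to run the conversion."
fi
```

The citation key (`doe2020`) is the only link between the two files, which is why any bibliography manager that exports a `*.bib` file will do.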
All software in this section is included in the Docker image.
Install knitr
Install RMarkdown
All software in this section is included in the Docker image.
Install dplyr, tidyr, readr, tibble, purrr and stringr
All software in this section is included in the Docker image.
The below packages for (web) interactivity in R try to abstract away as much as possible the underlying web technologies (HTML, JavaScript and CSS). You can use them without knowing anything about this stack, but you can accomplish more and understand them in a deeper way if you have at least a cursory understanding of how these technologies work.
Covering them in any depth, or even listing good resources (of which there are gazillions) is beyond the scope of this class, so these should be considered mere starting points.
Install plotly for R
Install htmlwidgets
Install flexdashboard
tba.
tba.
tba.
tba.