Free and Open Source Software for Open Science (“FOSSOS”) is an ongoing series of seminars taught by Max Held at Friedrich-Alexander Universität Erlangen-Nürnberg (FAU) at the Department of Sociology, introducing students to R, as well as the broader open science ecosystem and free and open source best practices and tools.
This repository houses the resources for these classes, and all class-related activities are tracked in the above issues.
… because learning from hackers is learning to win?
#DataScience #rstats Git(Hub) #ReproducibleResearch
[Coding – ] it’s the next best thing we have to a superpower. – Drew Houston via code.org
So we were very worried that what if the astronaut, during mid-course, would select pre-launch, for example? Never would happen, they said. Never would happen. It happened. – Margaret Hamilton
Computers … a bicycle for the mind – Steven Jobs
To me programming is more than an important practical art. It is also a gigantic undertaking in the foundations of knowledge. – Grace Murray Hopper
Think of free speech, not free beer. – Richard Stallman
Open source isn’t like free sunshine; it’s like a free puppy. – Sarah Novotny
Most learning is not the result of instruction. It is rather the result of unhampered participation in a meaningful setting. – Ivan Illich (1971)
Everyone is welcome to this seminar. This is not a “proper” computer science class, and participants do not need any background in CS, statistics or math.
You should just be curious and ready to:
No worries, we’ll bring everyone up to speed in very little time.
You do not need to have completed a prior version of this class, or any other class. If you have some prior training, you will start the class at a different level.
This will be an all-remote, largely asynchronous seminar, held via Gitter chat, GitHub and occasional Zoom video conferences.
Thursday, April 30th, 2020 15:00-17:00 (in a video conference, see below).
Throughout the semester, students can work through the material at their own pace and schedule. Support from the instructor and fellow students is available on the Gitter chat.
Depending on needs, short video conferences will be held for selected topics.
We’re going to use a few digital tools to work asynchronously.
These venues are also linked from the top bar of the class website, so you can always easily find them.
Depending on who attends the class, instruction may be in English or German. In any event, all of the readings and other course material are in English, and participants are expected to be proficient in reading and writing English technical documents.
It is obviously impossible (for most students) to cover all of the material in this course in one semester.
This course (with a slightly different name) will therefore be taught every semester, in a non-consecutive series. Students can join the class every semester, and take the class for however many semesters they wish (if they still have new things to learn). Do not be confused by the name this class takes in some semester (say, “Advanced R …”) – you can still join as a beginner. Depending on the listing (see below) students can also take this class for credit multiple times.
By implication, the group of students in the class in any given semester will be heterogeneous, working at different levels. For example, some students may already have taken a course in the series previously, while others are just starting out. Because the previous experiences and learning speed of students vary greatly anyway, this is not a significant (additional) hindrance. Tasks, expectations and material covered will accordingly differ for each student, depending on the background.
You can generally take this class as an undergraduate (Bachelor) lower-division seminar (Proseminar) worth 5 ECTS points, or an upper-division seminar (Hauptseminar) worth 7.5 ECTS points. The workload will be adjusted accordingly.
Depending on your major, you may also take the class to fulfill requirements for a Master’s program. Please be in touch to discuss the details.
This class was/is listed as:
SOZ M, Soziologische Methodenlehre
Soz Qf4, Arbeit und Organisation
SOZ M1, SOZ M2, Soziologische Methodenlehre
SOZ M, Soziologische Methodenlehre
Digitisation has created both new challenges and as yet unrealised potential for the empirical social sciences. Larger, and often streamed, datasets require more programmatic and dynamic statistical analyses. Existing commercial programs with graphical user interfaces (GUIs) are expensive, and analyses can easily become opaque, sometimes contributing to a crisis of reproducibility in the social sciences and beyond (e.g., Mair 2016) or even propagating outright bugs (e.g., Reinhart and Rogoff 2010).
Happily, the open source community has already pioneered a set of technologies and conventions for their software development efforts that have proven useful in solving these problems in many academic fields. Additionally, open source software offers new ways to analyse and visualize data, as well as to present interactive results.
Together, these tools promise a radically open and participatory approach to science, and productive yet skeptical use of emerging data streams.
Unfortunately, learning these tools takes more time than is usually available until any given project deadline.
The goal of this series of seminars is therefore to train participants in a coherent set of leading tools and best practices, including:
Towards the end of each of the seminars, participants will be able to use (parts of) this toolchain to work on their own projects, or to contribute to existing free and open source software.
The Data Science Venn Diagram by Drew Conway (2010)
This course will not focus on math and statistics knowledge or substantive domain expertise, though both are essential for solid data science work. Rather, the emphasis is on what Drew Conway loosely called hacking skills in his Data Science Venn Diagram, that is, simply getting these tools to work together, to learn how to troubleshoot them, and – aspirationally – to absorb some best practices of open source development.
While the course is not a proper computer science class, it should also be valuable to students with coding experience or a CS background who may be interested in the tooling and practices covered.
We will not cover the scaling and efficiency issues of proper “Big Data”, but confine ourselves to in-memory problems. We also limit ourselves to the R ecosystem, though some tools and problems will be similar for other scripting languages such as Python.
An introduction to data science and open source may well open up new job opportunities, or serve as a first stepping stone to a career in tech, but that is arguably not the only reason why social scientists should be excited about it. Instead, to learn the way of open source is perhaps to update the ideals of the scientific process for the modern day: radical openness and rigorous reproducibility, maximal inclusivity and promised meritocracy, generous sharing and personal attribution. Open source may also be a worthwhile exercise in participant observation for social scientists: here is a real, if surely flawed utopia, massively coordinating individuals that is neither market nor state.
Less loftily, but not least, the seminar also promises a starter dose of gratification from having built something that actually works, and is of some immediate use to our fellow humans – a good feeling sometimes hard to come by in the social sciences.
This course is a little different from most seminars.
Teaching R (and the broader ecosystem) at FAU sociology (as at most other smaller, non-tech-focused institutions) faces a couple of important constraints:
To meet these constraints, this course will be held as a non-consecutive multi-semester series of seminars, and will, for the most part, operate on a flipped classroom model.
Because students will learn at different speeds, and from different starting points – among other reasons – teacher-centered teaching will be minimal in this class.
Instead, students will study the assigned material outside of class, including online documents, videos and interactive learning applications.
As they encounter problems, or develop their own (small) projects, students will track such work on the issue tracker used in class. In class, students will work on their own problems or projects, in small groups and assisted by the instructor as necessary.
This class does not offer a one-size-fits-all set of pre-defined materials and assignments necessary for successful participation. What the class offers is:
Happily, there are a lot of great resources for learning data science tools out there, many of them free, some of them even open source themselves. We will be reusing a lot of these resources, and I (the instructor) do not have to reinvent an (inferior) wheel. There is no one curriculum that’s quite right for us, so I have cobbled together material from different sources.
All resources are listed, in roughly advisable chronological order, along with the stack.
The good news is that there are no academic papers or books for this class and everything students need is available online. There is, however, still a lot of material to work through (to the tune of hours per week), though it is written in a hopefully more accessible style than many academic documents. The listed resources are guaranteed to cover everything you need to use the software, often including tutorials, videos and exercises. Students are not limited to the listed resources; they can also choose their own material, as long as it covers roughly the same ground. In fact, students are encouraged to share good additional resources with the rest of the class.
There is a lot of duplicate content between the alternative resources listed. Students should browse each of the resources, and then work in-depth through whichever they find most suitable.
Whenever you run into a problem, or have a question, raise an issue on our issue tracker at https://github.com/soztag/fossos/issues. Please also make sure that:
Because students will learn at different speeds, and from different starting points, there is no fixed schedule for the class. The stack lists the tools (and resources) in the rough order in which they should be studied.
Students can work through this material at their own pace. Likewise, some students may wish to cover a lot of breadth (at shallow depth), while others want to dig in on a particular topic. This is all fine, but students should ensure that they learn something at a useful level to solve real-world problems, as will also be required for the assessment. If in doubt, ask the instructor for guidance.
Every student should first become competent in the practices and tools covered in “Software Carpentry”; these are required for all later topics.
As a loose guide, every student should cover at least one top-level heading (“Interactive R”, “Intermediate R”, etc.) per semester.
There are often several heavily overlapping resources recommended for a tool; students should study whichever best suits their taste. It’s a good idea to browse through all of the resources to make sure you don’t miss anything.
Assessments are an unfortunate, tedious and arguably needless part of teaching – but here we are, so we are going to make the best of it.
Instead of some make-believe work or hobby project, assignments in this class are, for the most part, designed to be actually useful to other people. This can be motivating, but it also means that other people are relying upon our work: it has to be delivered on time and at the quality expected.
You can work on pretty much anything you like – improving this very class (and its repo), some existing project that you like or even your own new (or existing) project. The only conditions are:
We will begin with relatively easy, small tasks to serve other students in class, then address smaller issues with resources for the broader community, and eventually fix “real” bugs or enhance the functionality of open source data science software.
All tasks, big and small, are listed and tracked on the class GitHub repository issue tracker. Students should assign themselves to tasks they will be working on, and report / link to any progress on these tasks in the issue thread.
All students, including those who just want a “Sitzschein” (pass/fail option), must contribute to a number of issues labelled as pass/fail.
These are issues that are smaller in scale and scope.
There is no straightforward minimum metric (say, number of closed issues) to pass the class. Instead, students should display substantial contributions across a range of helpful activities, as recorded in the issue tracker.
Before working on these issues, students should assign themselves, to avoid us doing duplicate work.
Students who want to receive a grade for the class also have to complete a couple of issues tagged with graded-x.
The numbers next to the labels roughly indicate the estimated workload and difficulty of a task (also known as “story points” in agile development). Estimates are frequently wrong, and these points can be adjusted in consultation with the instructor, if some task turns out to be much harder or easier than expected. These story points correspond to ECTS credit points; if you are taking this as a “Proseminar”, you will need to have owned and closed issues worth 5 story points. If you are taking this as a “Hauptseminar”, you will need to have owned and closed 7.5 story points worth of issues.
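As a rough illustration of this arithmetic, the sketch below sums the story points of a student's closed issues and compares the total to the two thresholds. The issue numbers and the exact label format ("graded-" followed by a number) are assumptions invented for this example, not an actual record from the issue tracker.

```r
# Points required per seminar type, as stated above.
required_points <- c(Proseminar = 5, Hauptseminar = 7.5)

# Hypothetical closed issues owned by one student
# (issue numbers and label format are invented for illustration).
closed_issues <- data.frame(
  issue = c(101, 102, 103),
  label = c("graded-2", "graded-3", "graded-2.5")
)

# Strip the assumed "graded-" prefix to recover the story points.
story_points <- as.numeric(sub("^graded-", "", closed_issues$label))
total_points <- sum(story_points)

# Compare the total against each threshold.
requirement_met <- total_points >= required_points
print(total_points)
print(requirement_met)
```

Because estimates can be adjusted in consultation with the instructor, a tally like this is only a planning aid.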
You will be graded based on how well you have adhered to the best practices and tooling covered in class, as well as (if applicable) the guidelines and standards of the external project (some other repo) or platform (Stack Overflow).
There are different kinds of graded issues:
Labels: community.rstudio, stack-overflow or bug report, reprex, and question, respectively.
Though it may also benefit yourself, a well-formulated question or bug report with a reproducible example can also serve the community. This is what we’re aiming for here.
A well-formulated question, in the context of open source development is often a reproducible example, or reprex, for short. This means that you should provide a code snippet (or, if not applicable, a very precise description of steps) that will allow any other user to reproduce the behavior in question, with no additional resources. Producing this can be harder than it sounds, and just narrowing down a problem like that may often help you solve it.
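To make this concrete, here is a minimal sketch of what such a snippet might look like for a question about an unexpected missing value. The data and the question are invented for illustration; the point is only that everything needed to reproduce the behavior is contained in the snippet itself.

```r
# The data is created inline, so anyone can run this
# without access to files on my machine.
scores <- data.frame(
  student = c("a", "b", "c"),
  points = c(1, NA, 3)
)

# Observed behavior: the result is NA, although I expected a number.
observed <- mean(scores$points)
print(observed)

# What I get once missing values are dropped explicitly.
expected <- mean(scores$points, na.rm = TRUE)
print(expected)

# Version information helps others match my setup.
print(R.version.string)
```

The reprex package can render a snippet like this together with its output, ready for pasting into an issue or forum post.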
Make sure to read and adhere to all the resources listed under community and help.
The three target platforms can be listed roughly in ascending order of precision of the question:
Here, as with all things open source, we must ensure that other people’s time is well-spent engaging our question (or bug report). To ensure that, please follow this procedure:
Sequence Chart for a Reprex
Labels: community.rstudio, stack-overflow, reprex, and answer, respectively.
Same process as for the above.
Labels: external documentation, external software.
These are improvements to external repos (typically also on GitHub), either other software (typically R repositories) or documentation and learning resources (typically those covered in class). The actual work (forking, raising a pull request, etc.) consequently occurs in the external target repository, and this activity is merely tracked in a placeholder issue in the class repository. Simply link to any relevant issues, commits or pull requests on the target repo in a placeholder issue.
This sounds challenging, but it can be quite doable, especially if you start by improving the documentation.
To start contributing to open source, you might also find these resources helpful:
For contributions to external documentation or software, it is very important that we do not burden the respective maintainers with sub-par work. To ensure that we deliver high-quality work, you must follow this procedure:
Sequence Chart for an External Contribution
Grading criteria are listed for each of the issues. Generally, a good grade will require following the practices and standards appropriate for the type of contribution in question, and students will need to demonstrate adequate command of the toolchain covered in class. For an excellent grade, students will need to go (a bit) beyond the covered material, and work on an especially pressing or complicated problem.
As an alternative to this (graded) assessment, if students already have some prior knowledge and a ready project they wish to work on, this can also be arranged. Students should contact the instructor, and also track their progress on their own project in a placeholder issue on the fossos issue tracker.
The graded tasks (see above) will be graded using the rubric below. The grading rubric is taken from the University of British Columbia Master of Data Science program (CC BY-SA 3.0).
Dimension | Poor | Unsatisfactory | Satisfactory | Good | Excellent |
---|---|---|---|---|---|
Accuracy 25% | Code fails to run, doesn’t have clear output, or performs the wrong task. | Code performs only some of the correct tasks, the output is not easily understandable and the methods used to achieve the result are inefficient if performance is a concern. | Code performs most of the correct tasks, the output is understandable, however the methods used to achieve the result are inefficient if performance is a concern. | Code performs the correct tasks, the output is reasonably easy to understand, however the methods used to achieve the result are not the most efficient if performance is a concern. | Code runs correctly without crashing, the output is very clear, and the intended or suitably correct methods are employed to achieve the correct result. Student has chosen the most efficient algorithm reasonable if performance is a concern. |
Code Quality 25% | Code is difficult to read and understand due to many issues that affect readability. Code is also poorly organized. | Code is generally easy to read and understand with few non-recurring issues and at most two recurring issues that affect readability. | Code is generally easy to read and understand with few non-recurring issues and at most one recurring issue that affects readability. | Code is easy to read and understand with only 1-2 minor and non-recurring issues that affect readability. | Code is exceptionally easy to read and understand. For example, variable names are clear, an appropriate amount of whitespace is used to maximize visibility, tabs and spaces are not mixed for indentation, sufficient comments are given. Any coding sections of the assignment that were not completed have documentation explaining what a coded solution would look like. Overall, the code is extremely well organized and documented. |
Mechanics 25% | Evaluator was unable to run/open/read assignment submission despite best efforts. This may be because the student forgot to include certain files in the submission or tailored the software to only work on their local machine (e.g. the code only works when run from a certain directory on the student’s machine, contains paths to files only on the student’s machine, etc.), or they did not submit their assignment correctly or completely, or it was unclear where the relevant parts of the assignment are included in the submission. | Evaluator had to spend some time to get the raw submission to work correctly. | Evaluator had to make an obvious, small, quick fix to get things working or the wrong file format was submitted. | The submission is self-contained and works flawlessly; it just works in anybody’s hands. | The student did not forget to include all the files in the submission. Any necessary libraries to install are either included or are installed by a script, or it is made obvious that the evaluator must install them. Student used the requested file format. All assignment instructions were followed. All files were put in a repository, in a reasonable place, with reasonable names; any source files (e.g. .tex, .Rmd) are rendered to a readable output format (e.g. .pdf), all figures are included, there is a README file indicating where to find the different aspects of the assignment, etc. |
Robustness 25% | Multiple issues with code repetition exist, and several tests are absent and/or are of poor efficacy. | Some form of recurring code repetition exists, or test efficacy is poor. | Some form of recurring code repetition exists, or test efficacy is poor. | Code repetition is mostly minimized and effective tests are present for most functions. | Code repetition is minimized via the use of loops/mapping functions, functions or classes or scripts/files as needed without becoming overly complicated. Functions are short, concise, and cohesive without losing clarity; code can be easily modified. Tests are present to ensure functions work as expected. Exceptions are caught and thrown if necessary, once students have learned about exceptions. |
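
As a rough illustration of what the Robustness row asks for, the sketch below replaces copy-pasted per-column code with one small function built on a mapping function, and adds a few simple tests. The function name column_means and the example data are hypothetical, chosen only for this illustration; they are not part of the rubric.

```r
# Compute the mean of every numeric column, instead of repeating
# mean(data$a), mean(data$b), ... by hand for each column.
column_means <- function(data) {
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame.", call. = FALSE)
  }
  is_numeric <- vapply(data, is.numeric, logical(1))
  vapply(data[is_numeric], mean, numeric(1), na.rm = TRUE)
}

# Invented example data with one non-numeric column.
example <- data.frame(
  a = c(1, 2, 3),
  b = c(10, NA, 30),
  id = c("x", "y", "z")
)
result <- column_means(example)
print(result)

# Simple tests: the function returns the expected means ...
stopifnot(isTRUE(all.equal(result, c(a = 2, b = 20))))

# ... and fails with an informative error on unsupported input.
bad_input <- tryCatch(
  column_means("not a data frame"),
  error = function(e) conditionMessage(e)
)
stopifnot(identical(bad_input, "`data` must be a data frame."))
```

In a real contribution, such tests would more typically live in a separate test file following the conventions of the target project.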
Unfortunately, FAU has no computer lab facilities suitable for teaching this class and participants will have to bring their own computers. This has the advantage that students will learn to set up their own development environments, but adds some unwelcome complexity (different OSes, etc.).
The class will assist students in installing software on their devices, but students are responsible for maintaining their computers. In particular, student laptops must:
FAU-STUD, eduroam or FAU.fm. (If you need help setting up your WiFi, consult the RRZE Website.)
Emphatically, none of this requires a new, powerful or expensive device, let alone software. You can get a used laptop with / ready for Ubuntu Linux on eBay for well under €100 (if you buy a used computer, make sure that the hardware has good Linux support). With some tweaking, you can even use an inexpensive (x86) Google Chromebook (which runs on Linux). For more information, see stack.
If you are facing financial difficulties in obtaining a laptop for the class, please contact the instructor. We’ll figure something out for you.
It is your responsibility to maintain your own computer and operating system (OS), as well as to figure out how to install the below software on your machine (though we will all help one another within reason).
For a ready-made development environment, you can use the RStudio IDE (integrated development environment) inside your web browser. RStudio is best for R development, but has decent support for other languages and includes access to a terminal and version control.
Using RStudio in the browser means that all the software you’re using won’t ever really be installed on your system, but only exist in a virtual image or online service. If you want to do serious development work or are facing edge cases, you may require a “real” installation on your client (see instructions in stack). However, in-browser development is a great way to have a standardized environment ready quickly.
You can run the RStudio IDE in your web browser in two ways:
rocker/verse Docker Image (Recommended)
Docker is an open-source industry standard to define, provision and share computing environments, known as containers. Containers allow you to run computing environments on other computers. Containers are similar to virtual machines (a computer inside a computer), but slimmer and generally neater.
A lot of the software you need to run in this class is included in the rocker/verse image published by the Rocker Project. For a list of things you still need to install “locally”, consult the stack.
For installation instructions, see here.
Unfortunately, Docker has some system requirements that many Windows versions do not meet.
As a backup plan to using Docker on your own operating system, you may use RStudio Cloud, a data science Software-as-a-Service (SaaS). RStudio Cloud furnishes you with a ready RStudio session in a Docker image similar to rocker/verse with all necessary system dependencies.
RStudio Cloud is still in alpha and may not be always reliable. Once out of alpha, it may also be a paid service, for which you may have to pay yourself.
Full disclosure: the instructor has worked for RStudio PBC.
You are strongly encouraged to invest the time and effort to set up and maintain a development environment on your own computer.
Otherwise:
You should also study the RStudio Cloud guide.
If you want to install the programs used in this class on your system, rather than use them through a (Docker) container, you may find it easier to do that on Unix-compatible operating systems, including macOS and Linux. Getting Windows to play nicely with open source software can be harder, and some convenient system utilities (such as a package manager) are often missing. It is technically possible to use most, if not all, of the tools above on Windows, but they may behave slightly differently, and supporting them may be more involved.
If you are using a Windows machine, you may consider the following alternatives to get a more Unix-compatible operating system, roughly ranked from easiest to most involved:
There is no guarantee that any of these alternatives or links will work for you; you will have to research them on your own.
A big Thank You to all contributors (in alphabetical order by username):
aDeLe24, adrianapinto19, AndreM92, AndrewR25, banjanama, Ch1pzZz, Char-XWW, Crisphi, DOC-fau, DomBert, Elen93, gvwilson, juliatell, KonsWeb, maxheld83, mkkkt, Mococoa22, MoPau, nke07, PaddyGlass, PaulFalcke, samshaffer97, spaniel01, SvenNekula, t-rop, VerenaHeld,