Git, GitHub and Reproducible Research

Disclaimer

These materials make use of proprietary products and a simplified workflow in
order to emphasise the main concepts and to save on installation and
configuration time. Some references are given at the end to direct you to
free and open source solutions and more sustainable workflows. The core ideas
are the same, so learning the material here should make it easier to
migrate to more sensible methods in the future should the need arise.

Background material

This section provides some context for why you might care about git, GitHub and
reproducible research. This document only provides the absolute basics of these
topics but should get you started with them and help direct further learning.

Version control

Version control refers to systems which help to manage the writing and
maintenance of things such as software, documents, and websites. These systems
were developed to manage large software projects but are useful at many levels.
For example, version control can help you avoid the following situation:
my-analysis.R, my-analysis-final.R, my-analysis-final-again.R, and so on. Try
coming back to that 3 months later when a reviewer asks you to re-run something
with a slight modification!

Reproducible research and Open research

In addition to helping you with organising your files, a version control system
and its associated tooling can also help the scientific community by helping to
make your research reproducible and open.

For your computational research to be considered reproducible, it needs to be
described in such a way that others can replicate your results. For it to be
open, the materials (code, data and sufficient documentation) need to be
available for others. Simply dumping all of your code into something like GitHub
is not sufficient for your research to be considered reproducible.

Git

Created in 2005 by Linus Torvalds to help with the development of the Linux
kernel, git has become a fundamental tool in software development. In the 2021
Stack Overflow Developer Survey over 90% of respondents used git; it is
nearly synonymous with version control. If you intend to collaborate in the
writing of substantial amounts of code, taking the time to learn how to use git
is a good idea.

Working with git will be much easier if you get familiar with some of the
terminology first. Unless you are familiar with git already, you should at least
skim these before continuing.

Repository

A repository is a directory containing your files and the history of all the
edits (see commit) made to these files. A repository can live only on your
machine, but repositories are often shared on a platform such as GitHub.

Commit

An edit to a file that you have recorded as part of the history of edits is
called a commit. It is both a noun and a verb: you commit an edit and the
repository contains all of your commits. This can be thought of as a stronger
version of saving a file. Each commit gets a unique identifier (called a hash).
Sometimes we use “commit” to refer to the state of all the code after an edit.
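
As a rough command-line illustration of what a commit is (this tutorial itself
works through the GitHub website), the sketch below records two commits and then
lists them with their hashes. It assumes git is installed; the temporary
directory, the file contents and the "Demo User" identity are placeholders
chosen for this example only.

```shell
set -eu

# Work in a throwaway directory so nothing else on your machine is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Turn the directory into a repository (this creates the hidden .git folder).
git init --quiet

# Identity recorded with each commit; placeholder values for this demo only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create a file, stage it, and record it as the first commit.
echo 'print("hello")' > my-analysis.R
git add my-analysis.R
git commit --quiet -m "Add first version of the analysis"

# Edit the file and record the edit as a second commit.
echo 'print("hello again")' >> my-analysis.R
git add my-analysis.R
git commit --quiet -m "Print a second message"

# Show the history: one line per commit, each starting with its short hash.
git log --oneline

# Count the commits that lead up to the current state of the files.
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"
```

Instead of keeping my-analysis.R and my-analysis-final.R side by side, there is
one file and the repository remembers each recorded version of it.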

Clone

When you make a copy of a repository you are cloning that repository. The
resulting copy is referred to as a clone. Typically this will mean you have
downloaded a copy from a platform such as GitHub.

Pull

Suppose you cloned a repository a while ago and you want to get a copy of all
the commits that have been made to the original repository since then. To get
these commits you pull them, which is a fancy way of saying updating your files.
A closely related operation is fetching, which downloads the new commits without
applying them to your files; a pull is a fetch followed by a merge.

Push

If you have committed some changes to your clone of a repository and want the
original repository to have these changes made, you push these changes. This is
a fancy way of saying use your edits to update the original files.
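
To make clone, pull and push concrete, here is a minimal sketch that can be run
without a GitHub account. It assumes git is installed. A local "bare" repository
called hosted.git stands in for the copy hosted on GitHub, and the two
collaborators (alice and bob) and the file data.csv are hypothetical names used
only for illustration.

```shell
set -eu

work_dir=$(mktemp -d)
cd "$work_dir"

# A bare repository stands in for the copy that would live on GitHub.
git init --quiet --bare hosted.git

# Clone: Alice makes her own copy of the hosted repository.
git clone --quiet hosted.git alice
git -C alice config user.name "Alice"
git -C alice config user.email "alice@example.com"

# Alice commits a file in her clone and pushes it to the hosted copy.
echo "year,percentage" > alice/data.csv
git -C alice add data.csv
git -C alice commit --quiet -m "Add header row"
git -C alice push --quiet origin HEAD

# Clone: Bob makes his copy, which already contains Alice's first commit.
git clone --quiet hosted.git bob

# Push: Alice records another commit and sends it to the hosted copy.
echo "2015,69.3" >> alice/data.csv
git -C alice commit --quiet -am "Add 2015 value"
git -C alice push --quiet origin HEAD

# Pull: Bob updates his files with the commits made since he cloned.
git -C bob pull --quiet --ff-only

# Bob's copy of the file now matches Alice's.
cat bob/data.csv
```

With a real GitHub repository the only difference is that the address given to
git clone is the repository's URL rather than a local path.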

Branch

A branch is similar to a clone in that it gives you a separate version of the
files to work on, but it lives inside the same repository rather than being a
second copy of it. This provides a more sophisticated way for people to work on
their own version of code, without messing up the main copy. This is not
particularly important unless you are collaborating with others on a project.

Merge

If someone has made some useful changes on their branch, the owner of the
repository may decide to include their commits in the main copy. This process of
including the changes on someone’s branch is called merging the changes.
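
The following sketch shows branching and merging on the command line, under the
assumption that git is installed. The branch name experiment and the file
notes.txt are made up for this example. Note how the file on main is untouched
until the merge happens.

```shell
set -eu

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# Record a first commit and make sure the branch is called main.
echo "main version" > notes.txt
git add notes.txt
git commit --quiet -m "Start the project"
git branch -M main

# Create a branch, switch to it, and commit a change there.
git checkout --quiet -b experiment
echo "an idea to try" >> notes.txt
git commit --quiet -am "Try an idea"

# Back on main, the file does not contain the experimental line.
git checkout --quiet main
lines_before=$(grep -c "" notes.txt)

# Merge: bring the commits from the branch into main.
git merge --quiet experiment
lines_after=$(grep -c "" notes.txt)

# The branch is no longer needed once it has been merged.
git branch --quiet -d experiment

echo "lines on main before merge: $lines_before, after merge: $lines_after"
```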

Fork

A fork is a copy of a repository that sits on your GitHub account. This is
similar to, but distinct from, cloning and making a branch.

GitHub

What is GitHub?

GitHub, Inc. is a subsidiary of Microsoft. Their website provides freemium
hosting of git repositories. In addition to hosting the repositories, it offers
additional tools to assist with software development. We will make extensive use
of GitHub in this tutorial to avoid you needing to install anything on your
machine. If you are going to use git extensively, it would be wise to learn how
to do this from the command line or some other program.

Setting up a GitHub account

To register an account you will need an email address that can be used for
verification.

  1. Visit https://github.com/ and click Sign Up.
  2. Fill in the forms to create an account.
  3. Verify that account by entering the access code GitHub sends to the email
    address you registered with.
  4. Verify that you can summon the Command Palette with Ctrl+K on Windows and
    Linux and Command+K on a Mac.
  5. The appearance and accessibility settings can be reached by searching for
    them in the command palette.

Zenodo

Zenodo is an open access archive operated by CERN which allows researchers to
archive research materials with a DOI which makes them easier to cite. This is a
more permanent form of storage than GitHub. It is easy to archive a particular
commit of a repository which is good practice if you want to refer to a
particular version of some code in a paper.

Worked example

Now that we have an understanding of version control and its associated tooling,
we can see an example of how this enables us to do more reproducible research.
Suppose that you wanted to include Figure fig:demo-result-1 in a manuscript and
you wanted to ensure your analysis is reproducible.

./git-usage-1.png

Code and data

The data and the code that generated this figure are included below. The data is saved in a file stackoverflow-git-data.csv.

year,percentage
2015,69.3
2017,69.2
2018,87.2
2020,82.8
2021,93.43

The code is saved in a file make-plot.R.

library(ggplot2)

sods_data <- read.csv("stackoverflow-git-data.csv")

g <- ggplot(
  data = sods_data,
  mapping = aes(x = year, y = percentage)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_text(
    aes(x = 2020, y = 82.8, label = "only GitHub"),
    nudge_x = 0.2,
    nudge_y = -4) +
  labs(
    x = "Year",
    y = "Percentage who used git",
    title = "Git usage has increased",
    subtitle = "Data from Stackoverflow Developer Survey")

ggsave(filename = "git-usage.png",
       plot = g,
       height = 7.4,
       width = 10.5,
       units = "cm")

sink(file = "regression-summary.txt")
summary(lm(percentage ~ year, data = sods_data))
sink()

If we put these into a directory called git-usage we end up with the following.

git-usage
├── git-usage.png
├── make-plot.R
├── regression-summary.txt
└── stackoverflow-git-data.csv

Copy the code and data into a suitable place on your machine and run the R
script to ensure that it works. In this worked example we will go through
cleaning this up so it is easier for people (including ourselves) to make sense
of this.

Organising the data and code

As a first step we will use directories to impose a bit of structure. Organising
our files in this way is useful as it makes it far easier for someone to
understand the purpose of each of the files. Use the following steps to
organise your code more appropriately.

  1. Make a directory called src and move make-plot.R there.
  2. Make a directory called data and move stackoverflow-git-data.csv there.
  3. Make a directory called out which we will write results to.
  4. Fix the call to read.csv so it can find the CSV.
  5. Fix the calls to ggsave and sink so they write the output into out.

Once you have done this, the R script should look like the following.

sods_data <- read.csv("data/stackoverflow-git-data.csv")

...

ggsave(filename = "out/git-usage.png",
       plot = g,
       height = 7.4,
       width = 10.5,
       units = "cm")

sink(file = "out/regression-summary.txt")
summary(lm(percentage ~ year, data = sods_data))
sink()

After you have run the code, the directory structure should look like the
following.

git-usage
├── data
│   └── stackoverflow-git-data.csv
├── out
│   ├── git-usage.png
│   └── regression-summary.txt
└── src
    └── make-plot.R
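
If you prefer the command line to a file browser, the directory creation and
file moves (steps 1 to 3) can be sketched as below. To keep the example
self-contained it first creates empty stand-in files in a temporary directory;
on your own machine you would skip that part and run the mkdir and mv commands
inside your existing git-usage directory. The edits to read.csv, ggsave and sink
(steps 4 and 5) still need to be made in the R script itself.

```shell
set -eu

# Stand-in for the starting point: empty files with the tutorial's names.
work_dir=$(mktemp -d)
cd "$work_dir"
mkdir git-usage
touch git-usage/make-plot.R git-usage/stackoverflow-git-data.csv

cd git-usage

# Steps 1 to 3: create the three directories.
mkdir src data out

# Move the script and the data into place.
mv make-plot.R src/
mv stackoverflow-git-data.csv data/

# List the files to check the new layout.
find . -type f | sort
```

Because the paths in the script are now relative to the top of git-usage, the
script should be run from that directory, for example with
Rscript src/make-plot.R.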

Uploading to GitHub

Now that our code is in a reasonable state, we can upload it to GitHub. If you
do not already have a GitHub account, please follow the instructions above,
which describe how to make one. Once you have done this, work through these
steps:

  1. Visit https://github.com/ and create a new repository by clicking New. You
    will need to pick a name for the repository (I called mine git-usage). The
    default settings are fine. Click Create repository.
  2. Click creating a new file to start the process of adding src/make-plot.R.
    1. Type src/make-plot.R as the name of the file. Typing the / creates the src
      directory, so the full path is shown as git-usage/src/make-plot.R.
    2. Copy-and-paste the code in make-plot.R into the editor.
    3. Click Commit new file.
  3. Repeat this process with data/stackoverflow-git-data.csv and the output files
    by clicking on Add file and selecting Create new file. Note that for
    git-usage.png you will need to use Upload file instead of Create new file.

Adding a license

A license specifies what people can do with your code. If you aren’t sure what
license suits your needs, you might find https://choosealicense.com/ has some
helpful information. Most of the time, I will opt for the MIT license.

There are two ways you might add a license. The manual method is to copy and
paste the license text into a file called LICENSE in your repository, filling in
[year] and [fullname] as appropriate. Alternatively, you can click Add file and
Create new file and specify that the file will be called “LICENSE”; GitHub will
then offer you some templates to choose from. It will auto-fill the details of
your name and the year.

Adding a README

When you encounter a repository online it can be difficult to understand what
its purpose is and how to use it. “README” is the name given to a file that
contains this sort of information. Typically these will be written in Markdown
(similar to R Markdown). Add a file called README.md to your repository with text
similar to the following.

This repository contains an analysis of git usage through time.

To run this analysis use the following command:

```
Rscript src/make-plot.R
```

The input data is in `data` and the results are in `out`.

Recording the session information

Software gets updated, and sometimes these updates cause things to break. Where
possible, it is very good practice to include details of the versions of
software you have used. When working with R the sessionInfo command makes this
simple. Try adding the following to the end of the make-plot.R script.

sink(file = "out/package-versions.txt")
sessionInfo()
sink()

The next time that you run this script, it will write a description of the
version of R you used and the versions of all the loaded packages to the file
out/package-versions.txt. Try running the script again to make sure this
additional file was generated and contains something similar to the following.

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1   splines_4.1.2    tidyselect_1.1.1 munsell_0.5.0
 [5] colorspace_2.0-2 lattice_0.20-45  R6_2.5.1         rlang_0.4.12
 [9] fansi_0.5.0      dplyr_1.0.7      tools_4.1.2      grid_4.1.2
[13] gtable_0.3.0     nlme_3.1-153     mgcv_1.8-38      utf8_1.2.2
[17] withr_2.4.3      ellipsis_0.3.2   digest_0.6.29    tibble_3.1.6
[21] lifecycle_1.0.1  crayon_1.4.2     Matrix_1.3-4     farver_2.1.0
[25] purrr_0.3.4      vctrs_0.3.8      glue_1.6.0       labeling_0.4.2
[29] compiler_4.1.2   pillar_1.6.4     generics_0.1.1   scales_1.1.1
[33] pkgconfig_2.0.3

Once you are happy that this has worked, we need to commit these changes. First
edit the script on GitHub so that it includes the new lines, then add the
out/package-versions.txt file.

Branching and merging

Suppose that after doing all of this one of your collaborators wants to adjust
the figure. We will now go through the steps involved with doing this using
branches.

Branching to make changes

Figure fig:demo-result-2 is a modification of Figure fig:demo-result-1 with the
desired changes.

./git-usage-2.png

To avoid making changes to the main copy of the code we will work on a branch,
and then when we are happy with the changes we will merge them. To start with,
create a new branch by clicking on the drop-down menu labelled “main” as shown
in Figure fig:create-new-branch. I called it “edit-plot”, but you can use
anything other than “main” (because that is the default branch name used by
GitHub).

./create-new-branch.png

Make desired edits to the code and output

Making sure that you are on your branch — if you’re not sure, click on the
branch button to double check — edit the make-plot.R script so that it has the
following

g <- ggplot(
  data = sods_data,
  mapping = aes(x = year, y = percentage)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "darkgrey") +
  geom_text(
    aes(x = 2020, y = 82.8, label = "only GitHub"),
    size = 3,
    nudge_x = 0.2,
    nudge_y = -6) +
  labs(
    x = "Year",
    y = "Percentage who used git") +
  ylim(c(0,100)) +
  theme_bw()

Once you have made the changes and re-run that script the figure in
git-usage.png will have changed — it should look like Figure fig:demo-result-2
now. Ordinarily, you would update the figure in the same way that you update
code, by committing the changes. However, this is tricky to do via the GitHub
website for image files, so instead, delete the file and upload the modified
one. At this point it might be interesting to move between the main branch and
your new branch to see how the files change between the two.

One motivation for branches is that you can make exploratory changes without
risking messing up your code on the main branch. If you have a collaborator that
wanted to try something, they could do so on a separate branch and then, if you
like their edits, you can merge them into main as we are about to do now.

Merge the changes

To merge your changes via the website, go back to the main page of the
repository and you should see a new button, like the one shown in Figure
fig:pull-request, inviting you to compare the changes on this branch, i.e., to
inspect if you consider this work worthy of inclusion.

./pull-request.png

Inspect the differences between the branches and if you are happy with them
create a pull request by clicking the button as shown in Figure
fig:create-pull-request.

./create-pull-request.png

Once you have created the pull request, the next step is to merge that branch
into the main branch. To do this you just need to click the button shown in
Figure fig:merge-pull-request.

./merge-pull-request.png

Once a branch has been merged it will hang around until you delete it. Since
having old branches around can lead to confusion, it is sensible to delete them
afterwards. As shown in Figure fig:delete-branch there is a button to achieve
this.

./delete-branch.png

At this point you should only have a single branch left and it should have the
modifications to the figure. Congratulations on a reproducible analysis!

Next steps and alternative solutions

Upload to Zenodo

The Zenodo FAQs contain information about how to archive a GitHub repository if
you want a more permanent form of storage. Ideally, one would archive the commit
used to generate the contents of a manuscript so that it has a DOI, and then
reference both the archive and the live version of the code on GitHub in the
manuscript.
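
One way to pin down "the commit used for the manuscript" is to give it a name
with a git tag, so that it can still be found after later commits. The sketch
below assumes git is installed; the tag name v1.0.0 and the file contents are
placeholders. How a tagged commit is then archived depends on Zenodo's own
instructions, which are not reproduced here.

```shell
set -eu

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# The state of the project that produced the manuscript's results.
echo "Analysis of git usage through time." > README.md
git add README.md
git commit --quiet -m "Version used for the manuscript"

# Attach a permanent, human-readable name to that commit.
git tag -a v1.0.0 -m "Version used for the manuscript"

# Work continues after submission.
echo "Extra notes added later." >> README.md
git commit --quiet -am "Add notes after submission"

# The tag still points at the earlier commit, not at the latest one.
tagged_commit=$(git rev-parse "v1.0.0^{commit}")
latest_commit=$(git rev-parse HEAD)
echo "tagged: $tagged_commit"
echo "latest: $latest_commit"
```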

Learn more about git

Learn more about GitHub

There are lots of features in GitHub that haven’t been covered but may be worth
looking into: the issue tracker, the wiki, VSCode integration, GitHub Pages,
and GitHub Actions.

Alternative solutions

Git

Git has the greatest market share but there are alternatives such as Subversion,
Mercurial, CVS and Darcs. Given that the vast majority of people use git, your
time is probably best spent learning git.

GitHub

While git dominates the market as the choice of version control system, there
are many viable alternative platforms to GitHub which may be more suitable for
your needs:

Zenodo

There are good general purpose alternatives to Zenodo such as figshare and
Dryad. There are also numerous alternatives that are more field specific, such
as GISAID.

Homework

Question 1

Explain (in 100–200 words) the purpose of git, GitHub, and Zenodo and the
relationship between these things. Describe the value of one feature of GitHub
not covered in this tutorial (in 100–150 words).

Question 2

Explain (in 100–200 words) the role of version control in reproducible
research. Give an example (in 100–150 words) of a situation in which version
control does not suffice to make a piece of work reproducible.

Question 3

Download the following script and data and organise this material in a
repository in a suitable way. Give a brief overview of the decisions you made
along the way (100–200 words).

Question 4

Fork the repository at https://github.com/aezarebski/conflict-resolution-example
and merge the pull request. Note that this will require resolving the conflict
in a sensible way. Explain what you did (in <100 words). If you have done this
well, the commit log should look like Figure fig:merge-task.

./merge-task.png

Question 5

Read the editorial Ten Simple Rules for Reproducible Computational Research and
(in 200–300 words) give a brief explanation of how git and GitHub would or
would not be relevant to each rule.
