Git, GitHub and Reproducible Research
These materials make use of proprietary products and a simplified workflow in
order to emphasise the main concepts and to save on installation and
configuration time. Some references are given at the end to direct you to
free and open source solutions and more sustainable workflows. The core ideas
are the same, so learning the material here should make it easier to
migrate to more sensible methods in the future should the need arise.
This section provides some context for why you might care about git, GitHub and
reproducible research. This document only provides the absolute basics of these
topics but should get you started with them and help direct further learning.
Version control refers to systems which help to manage the writing and
maintenance of things such as software, documents, and websites. These systems
were developed to manage large software projects but are useful at many levels.
For example, version control can help you avoid accumulating files with names
like my-analysis-final-again.R, and so on. Try
coming back to that 3 months later when a reviewer asks you to re-run something
with a slight modification!
Reproducible research and Open research
In addition to helping you organise your files, a version control system
and its associated tooling can also benefit the scientific community by
making your research reproducible and open.
For your computational research to be considered reproducible, it needs to be
described in such a way that others can replicate your results. For it to be
open, the materials (code, data and sufficient documentation) need to be
available for others. Simply dumping all of your code into something like GitHub
is not sufficient for your research to be considered reproducible.
Created in 2005 by Linus Torvalds to help with the development of the Linux
kernel, git has become a fundamental tool in software development. In the 2021
Stackoverflow Developer Survey over 90% of respondents used git; it is
nearly synonymous with version control. If you intend to collaborate in the
writing of substantial amounts of code, taking the time to learn how to use git
is a good idea.
Working with git will be much easier if you get familiar with some of the
terminology first. Unless you are familiar with git already, you should at least
skim these before continuing.
A repository is a directory containing your files and the history of all the
edits (see commit) made to these files. You can have a repository that only
lives on your machine, but they are often shared on a platform such as GitHub.
An edit to a file that you have recorded as part of the history of edits is
called a commit. It is both a noun and a verb: you commit an edit and the
repository contains all of your commits. This can be thought of as a stronger
version of saving a file. Each commit gets a unique identifier (called a hash).
Sometimes we use “commit” to refer to the state of all the code after an edit.
When you make a copy of a repository you are cloning that repository. The
resulting copy is referred to as a clone. Typically this will mean you have
downloaded a copy from a platform such as GitHub.
Suppose you cloned a repository a while ago and you want to get a copy of all
the commits that have been made to the original repository since then. To get
these commits you pull them, which is a fancy way of saying updating your files.
Strictly speaking, a pull does two things: it fetches (downloads) the new
commits and then merges them into your files.
If you have committed some changes to your clone of a repository and want the
original repository to have these changes made, you push these changes. This is
a fancy way of saying use your edits to update the original files.
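The terms so far can be tried out from the command line. The following is a minimal sketch rather than part of this tutorial's workflow (which uses the GitHub website). It assumes git is installed, and it uses a local "bare" repository as a stand-in for the original copy on GitHub so that it runs offline. The directory names, the file `notes.txt` and the author identities are placeholders invented for the illustration.

```shell
set -e

# Work in a throwaway directory so nothing else on your machine is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# A bare repository stands in for the "original" that would normally live on
# a platform such as GitHub (an assumption so the sketch runs offline).
git init --quiet --bare original.git

# Clone: make a first copy of the (still empty) original.
git clone --quiet original.git first-clone

# Commit: record an edit in the history of the first clone.
cd "$workdir/first-clone"
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
echo "first note" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes file"
git branch -M main                           # name the branch "main"

# Push: send the new commit to the original.
git push --quiet -u origin main

# A second clone receives everything pushed so far.
cd "$workdir"
git clone --quiet --branch main original.git second-clone
cd "$workdir/second-clone"
git config user.name "Example Collaborator"        # placeholder identity
git config user.email "collaborator@example.com"   # placeholder identity
echo "second note" >> notes.txt
git commit --quiet -am "Extend notes file"
git push --quiet origin main

# Pull: update the first clone with the commit made in the second clone.
cd "$workdir/first-clone"
git pull --quiet --ff-only

# Each commit is listed with the start of its hash.
git log --oneline
```

After the final pull both clones contain the same two commits, which is the situation the pull and push descriptions above are aiming for.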
A branch is similar to a clone in that it acts as a separate copy of the code,
although it lives inside the same repository rather than in a new one. This
provides a more sophisticated way for people to work on their own version of
code, without messing up the main copy. This is not particularly important
unless you are collaborating with others on a project.
If someone has made some useful changes on their branch, the owner of the
repository may decide to include their commits in the main copy. This process of
including the changes on someone’s branch is called merging the changes.
What is GitHub?
GitHub, Inc. is a subsidiary of Microsoft. Their website provides freemium
hosting of git repositories. In addition to hosting the repositories, it offers
additional tools to assist with software development. We will make extensive use
of GitHub in this tutorial to avoid you needing to install anything on your
machine. If you are going to use git extensively, it would be wise to learn how
to do this from the command line or some other program.
Setting up a GitHub account
To register an account you will need an email address that can be used to
verify the account.
- Visit https://github.com/ and click Sign Up.
- Fill in the forms to create an account.
- Verify that account by entering the access code GitHub sends to the email
address you registered with.
- Verify that you can summon the Command Palette with Ctrl+K on Windows and
Command+K on a Mac.
- The appearance and accessibility settings can be reached by searching for
them in the command palette.
Zenodo is an open access archive operated by CERN which allows researchers to
archive research materials with a DOI which makes them easier to cite. This is a
more permanent form of storage than GitHub. It is easy to archive a particular
commit of a repository which is good practice if you want to refer to a
particular version of some code in a paper.
Now that we have an understanding of version control and its associated tooling,
we can see an example of how this enables us to do more reproducible research.
Suppose that you wanted to include Figure fig:demo-result-1 in a manuscript and
you wanted to ensure your analysis is reproducible.
Code and data
The data and the code that generated this figure are included below. The data
is saved in a file called stackoverflow-git-data.csv.

year,percentage
2015,69.3
2017,69.2
2018,87.2
2020,82.8
2021,93.43

The code is saved in a file called make-plot.R.

library(ggplot2)

sods_data <- read.csv("stackoverflow-git-data.csv")

g <- ggplot(
  data = sods_data,
  mapping = aes(x = year, y = percentage)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_text(
    aes(x = 2020, y = 82.8, label = "only GitHub"),
    nudge_x = 0.2,
    nudge_y = -4) +
  labs(
    x = "Year",
    y = "Percentage who used git",
    title = "Git usage has increased",
    subtitle = "Data from Stackoverflow Developer Survey")

ggsave(
  filename = "git-usage.png",
  plot = g,
  height = 7.4,
  width = 10.5,
  units = "cm")

sink(file = "regression-summary.txt")
summary(lm(percentage ~ year, data = sods_data))
sink()

If we put these into a directory called
git-usage we end up with the following:

git-usage
├── git-usage.png
├── make-plot.R
├── regression-summary.txt
└── stackoverflow-git-data.csv

Copy the code and data into a suitable place on your machine and run the R
script to ensure that it works. In this worked example we will go through
cleaning this up so it is easier for people (including ourselves) to make sense
of.
Organising the data and code
As a first step we will use directories to impose a bit of structure. Organising
our files in this way is useful as it makes it far easier for someone to
understand the purpose of each of the files. Follow these steps to
organise your code more appropriately.
- Make a directory called data and move stackoverflow-git-data.csv into it.
- Make a directory called src and move make-plot.R into it.
- Make a directory called out which we will write results to.
- Fix the call to read.csv so it can find the CSV.
- Fix the calls to ggsave and sink so they write their output into out.
Once you have done this, the R script should look like the following.

sods_data <- read.csv("data/stackoverflow-git-data.csv")

...

ggsave(
  filename = "out/git-usage.png",
  plot = g,
  height = 7.4,
  width = 10.5,
  units = "cm")

sink(file = "out/regression-summary.txt")
summary(lm(percentage ~ year, data = sods_data))
sink()

After you have run the code, the directory structure should look like the
following.

git-usage
├── data
│   └── stackoverflow-git-data.csv
├── out
│   ├── git-usage.png
│   └── regression-summary.txt
└── src
    └── make-plot.R

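If you are comfortable with a terminal, the reorganisation described above can be sketched with a few shell commands. This is only an illustration: the first few lines create empty placeholder files in a temporary directory so that the sketch can be run safely on its own, and they are not part of the real analysis.

```shell
set -e

# Recreate the flat starting layout in a throwaway directory. The empty
# placeholder files stand in for the real script, data and outputs.
workdir="$(mktemp -d)"
mkdir "$workdir/git-usage"
cd "$workdir/git-usage"
touch make-plot.R stackoverflow-git-data.csv git-usage.png regression-summary.txt

# Impose the data/src/out structure.
mkdir data src out
mv stackoverflow-git-data.csv data/
mv make-plot.R src/
mv git-usage.png regression-summary.txt out/

# List the files to check where everything ended up.
find . -type f | sort
```

In your own copy of git-usage you would skip the setup lines and run only the `mkdir` and `mv` commands. Moving the files does not update the paths inside make-plot.R, so the edits to `read.csv`, `ggsave` and `sink` are still needed.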
Uploading to GitHub
Now that our code is in a reasonable state, we can upload it to GitHub. If you
do not already have a GitHub account, please follow the instructions above,
which describe how to make one. Once you have done this, follow these steps.
- Visit https://github.com/ and create a new repository by clicking New. You
will need to pick a name for the repository; the default settings are fine.
Click Create repository.
- Click creating a new file to start the process of adding src/make-plot.R.
  - Ensure the name of the file is src/make-plot.R.
  - Copy-and-paste the code in make-plot.R into the editor.
  - Click Commit new file.
- Repeat this process with data/stackoverflow-git-data.csv and the output files
by clicking on Add file and selecting Create new file. Note that for
git-usage.png you will need to use Upload file instead of Create new file.
Adding a license
A license specifies what people can do with your code. If you aren’t sure what
license suits your needs, you might find https://choosealicense.com/ has some
helpful information. Most of the time, I will opt for the MIT license.
There are two ways you might add a license. The manual method is to add a file
called LICENSE to your repository, paste the license text into it, and fill in
[fullname] as appropriate. Alternatively, click Add file, then Create new file,
and name the file “LICENSE”; GitHub will then offer you some templates to
choose from and auto-fill your name and the year.
Adding a README
When you encounter a repository online it can be difficult to understand what
its purpose is and how to use it. “README” is the name given to a file that
contains this sort of information. Typically these will be written in markdown
(similar to RMarkdown). Add a file called
README.md to your repository with text
similar to the following.

This repository contains an analysis of git usage through time. To run this
analysis use the following command:

```
Rscript src/make-plot.R
```

The input data is in `data` and the results are in `out`.

Recording the session information
Software gets updated, and sometimes these updates cause things to break. Where
possible, it is very good practice to include details of the versions of
software you have used. When working with R the
sessionInfo command makes this
simple. Try adding the following to the end of the script.

sink(file = "out/package-versions.txt")
sessionInfo()
sink()

The next time that you run this script, it will write a description of the
version of R you used and the versions of all the loaded packages to the file
out/package-versions.txt. Try running the script again to make sure this
additional file was generated and contains something similar to the following.

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
 LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
 LC_PAPER=en_GB.UTF-8 LC_NAME=C
 LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 stats graphics grDevices utils datasets methods base

other attached packages:
 ggplot2_3.3.5

loaded via a namespace (and not attached):
 magrittr_2.0.1 splines_4.1.2 tidyselect_1.1.1 munsell_0.5.0
 colorspace_2.0-2 lattice_0.20-45 R6_2.5.1 rlang_0.4.12
 fansi_0.5.0 dplyr_1.0.7 tools_4.1.2 grid_4.1.2
 gtable_0.3.0 nlme_3.1-153 mgcv_1.8-38 utf8_1.2.2
 withr_2.4.3 ellipsis_0.3.2 digest_0.6.29 tibble_3.1.6
 lifecycle_1.0.1 crayon_1.4.2 Matrix_1.3-4 farver_2.1.0
 purrr_0.3.4 vctrs_0.3.8 glue_1.6.0 labeling_0.4.2
 compiler_4.1.2 pillar_1.6.4 generics_0.1.1 scales_1.1.1
 pkgconfig_2.0.3

Once you are happy that this has worked, we need to commit these changes: first
by editing the script on GitHub, and second by adding the new output file.
Branching and merging
Suppose that after doing all of this one of your collaborators wants to adjust
the figure. We will now go through the steps involved in doing this.
Branching to make changes
Figure fig:demo-result-2 is a modification of Figure fig:demo-result-1 with the
adjustments your collaborator wants.
To avoid making changes to the main copy of the code we will work on a branch,
and then when we are happy with the changes we will merge them. To start with,
create a new branch by clicking on the drop-down menu labelled “main” as shown
in Figure fig:create-new-branch. I called it “edit-plot”, but you can use
anything other than “main” (because that is the default branch name used by
GitHub).
Make desired edits to the code and output
Making sure that you are on your branch (if you’re not sure, click on the
branch button to double check), edit the make-plot.R script so that it has the
following plotting code.

g <- ggplot(
  data = sods_data,
  mapping = aes(x = year, y = percentage)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "darkgrey") +
  geom_text(
    aes(x = 2020, y = 82.8, label = "only GitHub"),
    size = 3,
    nudge_x = 0.2,
    nudge_y = -6) +
  labs(
    x = "Year",
    y = "Percentage who used git") +
  ylim(c(0,100)) +
  theme_bw()

Once you have made the changes and re-run that script the figure in
git-usage.png will have changed — it should look like Figure fig:demo-result-2
now. Ordinarily, you would update the figure in the same way that you update
code, by committing the changes. However, this is tricky to do via the GitHub
website for image files, so instead, delete the file and upload the modified
one. At this point it might be interesting to move between the
main branch and
your new branch to see how the files change between the two.
One motivation for branches is that you can make exploratory changes without
risking messing up your code on the main branch. If you have a collaborator who
wants to try something, they can do so on a separate branch and then, if you
like their edits, you can merge them into
main as we are about to do now.
Merge the changes
To merge your changes via the website, go back to the main page of the
repository and you should see a new button, like the one shown in Figure
fig:pull-request, inviting you to compare the changes on this branch, i.e., to
inspect if you consider this work worthy of inclusion.
Inspect the differences between the branches and if you are happy with them
create a pull request by clicking the button as shown in the accompanying
figure.
Once you have created the pull request, the next step is to merge that branch
into the main branch. To do this you just need to click the button shown in
the accompanying figure.
Once a branch has been merged it will hang around until you delete it. Since
having old branches around can lead to confusion, it is sensible to delete them
afterwards. As shown in Figure fig:delete-branch there is a button to achieve
this.
At this point you should only have a single branch left and it should have the
modifications to the figure. Congratulations on a reproducible analysis!
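For comparison, the same branch, edit, merge and delete cycle can be sketched on the command line. This is a hedged illustration, not the workflow used in this tutorial: it assumes git is installed, works in a throwaway local repository, and uses one-line placeholder content instead of the real plotting script. Pull requests are a GitHub feature, so the sketch merges the branch directly.

```shell
set -e

# Set up a throwaway repository with one commit on the main branch.
workdir="$(mktemp -d)"
cd "$workdir"
git init --quiet git-usage
cd git-usage
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
mkdir src
echo "# original plotting code" > src/make-plot.R   # placeholder content
git add src/make-plot.R
git commit --quiet -m "Add plotting script"
git branch -M main

# Create the edit-plot branch and switch to it.
git checkout --quiet -b edit-plot

# Commit a change on the branch; the main branch is untouched.
echo "# restyled plotting code" > src/make-plot.R   # placeholder content
git commit --quiet -am "Restyle the figure"

# Switch back to main and merge the branch, recording a merge commit.
git checkout --quiet main
git merge --quiet --no-ff -m "Merge branch edit-plot" edit-plot

# Delete the branch now that its commits are part of main.
git branch -d edit-plot

# Only main remains, and its history includes the change from the branch.
git branch --list
git log --oneline
```

The `--no-ff` option asks git to record a merge commit even when it could simply move main forward, which keeps a visible record in the history that a branch was merged.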
Next steps and alternative solutions
Upload to Zenodo
The Zenodo FAQs contain information about how to archive a GitHub repository if
you want a more permanent form of storage. Ideally, one would archive the commit
used to generate the contents of a manuscript so it has a DOI and reference both
the archive and the live version of the code on GitHub in the manuscript.
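Whichever archive you use, it helps to know exactly which commit produced your results. The sketch below is an illustration under stated assumptions, not Zenodo's documented procedure: it records the hash of the current commit and gives that commit a memorable name with an annotated tag. The tag name `v1.0.0`, the placeholder file and the author identity are invented for the example. GitHub releases are built on tags like this one, and Zenodo's GitHub integration works from releases, but check the Zenodo FAQs for the current steps.

```shell
set -e

# Throwaway repository with a single placeholder commit.
workdir="$(mktemp -d)"
cd "$workdir"
git init --quiet git-usage
cd git-usage
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
echo "placeholder" > README.md
git add README.md
git commit --quiet -m "Version used for the manuscript"

# Record the full hash of the commit that generated the results.
commit_hash="$(git rev-parse HEAD)"
echo "Results were generated from commit $commit_hash"

# Give that commit a permanent, human-readable name (hypothetical tag name).
git tag -a v1.0.0 -m "Version used for the manuscript"

# The tag resolves back to exactly the same commit.
tagged_hash="$(git rev-parse "v1.0.0^{commit}")"
echo "Tag v1.0.0 points at commit $tagged_hash"
```

Quoting the hash or the tag in a manuscript lets a reader find the exact version of the code, even if the repository keeps changing afterwards.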
Learn more about git
- Pro Git by Scott Chacon and Ben Straub is a free book that is the ultimate
guide but is a bit technical at times.
- Atlassian/Bitbucket has excellent tutorials.
- Learn Git Branching is a game revolving around explaining git.
- GitHub Learning Lab has some introductory material on the use of git and
GitHub.
- Stackoverflow questions will often have answers to your questions.
- Inside the Hidden Git Folder – Computerphile gives a bit of a behind the
scenes tour of how git works.
Learn more about GitHub
There are lots of features in GitHub that haven’t been covered but may be worth
looking into: the issue tracker, the wiki, VSCode integration, GitHub Pages,
and GitHub Actions.
Git has the greatest market share but there are alternatives such as Subversion,
Mercurial, CVS and Darcs. Given that the vast majority of people use git, your
time is probably best spent learning git.
While git dominates the market as the choice of version control system, there
are many viable alternative platforms to GitHub which may be more suitable for
some projects.
Explain (in 100–200 words) the purpose of git, GitHub, and Zenodo and the
relationship between these things. Describe the value of one feature of GitHub
not covered in this tutorial (in 100–150 words).
Explain (in 100–200 words) the role of version control in reproducible
research. Give an example (in 100–150 words) of a situation in which version
control does not suffice to make a piece of work reproducible.
Download the following script and data and organise this material in a
repository in a suitable way. Give a brief overview of the decisions you made
along the way (100–200 words).
Fork the repository at https://github.com/aezarebski/conflict-resolution-example
and merge the pull request. Note that this will require resolving the conflict
in a sensible way. Explain what you did (in <100 words). If you have done this
well, the commit log should look like Figure fig:merge-task.
Read the editorial Ten Simple Rules for Reproducible Computational Research and
(in 200–300 words) give a brief explanation of how git and GitHub would or
would not be relevant to each rule.