CodeRefinery Workshop, Spring 2022

Main webpage for the workshop is here. Videos are available on YouTube, and questions asked online via HackMD are gathered here.

Day 1: Git

(The teaching format is a two-person conversation, similar to commentators at a sports event. HackMD is used as a place to put links and for people to ask questions. Polls are conducted by posting options and asking people to put “o” next to them. Before starting every day, some ice-breaker questions are posted on HackMD.)

Version control records snapshots of a project. It implements the concept of branches which help when more than one person works on the same thing at the same time, and implements merging, so that different people’s work can be easily combined.

In this workshop we will use Git and GitHub. GitHub has a nice interface for making use of features from Git, for example visualizing branches. GitHub also allows users to annotate files and to share and comment on specific portions of code.

Git records/saves snapshots, tracking the content of a folder as it changes over time. Every time we commit, Git records a snapshot of the entire project, saves it, and assigns it a version. These snapshots are stored in the .git directory. .git uses relative paths, meaning you can move the whole project somewhere else and it will still work.

To add a file somefile.txt to the staging area, use git add somefile.txt. This change is now staged, and ready to be committed.

Suppose you changed somefile.txt; you can use git diff somefile.txt to see the changes. By default git diff compares the working directory with the staging area, so using it without a filename shows all changes that have not been staged yet; git diff --staged shows what is staged but not yet committed.
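
A minimal sketch of this cycle (the file name is just an example):

  nano somefile.txt          # edit a tracked file
  git diff somefile.txt      # unstaged changes: working directory vs. staging area

  git add somefile.txt       # stage the change
  git diff --staged          # staged changes: staging area vs. last commit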

git log shows the history of development. Using the --oneline flag shows a more compact version, and the --stat flag shows exactly which files were changed. Both of these flags together are a nice combination.

git diff <hash1> <hash2> will show the differences between the two commits hash1 and hash2.

The convention for git commit messages is to have one line summarizing the commit; if more details are needed, add an empty line and then one or more paragraphs explaining the extra details.
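
As a hypothetical example of this convention, passing the subject and the body as separate -m arguments inserts the empty line automatically:

  git commit -m "Fix off-by-one error in histogram binning" \
    -m "The last bin was dropped because the upper edge was excluded. This change makes the final bin inclusive."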

When renaming a file, using git mv will rename and stage the file at the same time.

If we added a file to the staging area but it was not committed yet, we can unstage it with git rm --cached somefile.txt (this keeps the file on disk).

We can make aliases in git so we do not have to remember long commands. For example, git config --global alias.graph "log --all --graph --decorate --oneline" makes the long command available as git graph.

Branching is one of the key features of git. It allows us to test new features in an easy manner. The current position in the history is called HEAD. To check which branch you are currently on, use git branch. To create a new branch, say named experiment, use git branch experiment master, assuming master is the name of the current branch (it might be main instead). To move onto this new branch, use git checkout experiment. To create a branch and switch to it in one step, use git checkout -b experiment. Running git branch again will show that we are now on this new branch.

After making new changes and committing them on the experiment branch, the git graph alias shows that the current HEAD is two commits ahead of the master branch.

To merge the experiment branch into the master branch, go to the master branch and use git merge experiment. When merging two branches, git creates a new commit. Regular commits have 1 parent, and merge commits have 2 (or more) parents.

If you create a new branch from master, make some commits, and then merge it back while master itself has not gained any new commits in the meantime, no merge commit is created; instead the master branch label is simply moved forward to the new commits. This is called a fast-forward merge.
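
Putting the branching and merging commands together, a typical sketch (branch names are just examples) looks like:

  git checkout -b experiment      # create the branch and switch to it
  # ... edit files, then commit on the branch
  git add somefile.txt
  git commit -m "Try out a new idea"

  git checkout master             # go back to the target branch (may be 'main')
  git merge experiment            # fast-forward if master has not moved, merge commit otherwise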

One can rebase branches instead of merging them: the new commits are replayed on top of another branch instead of creating an explicit merge commit. This, however, rewrites history and should not be done on commits that have already been shared (pushed).

One can use tags to record milestones of a project. Unlike branches, tags never move; they are often combined with semantic versioning and can carry identifying information, for example they can be cryptographically signed. Tags can be used to identify research code at certain milestones, e.g. for a paper submitted to a certain conference: since you may later edit things like figure formats, it is good to tag the exact code used for the conference version of the paper.

A typical workflow is to create new branches for features and, if they work, merge them into the master branch and delete the old branch via git branch -d feature-1. If the new changes were not merged into master for whatever reason, use git branch -D feature-1 to force the deletion. Deleting the branch only deletes the reference to it; the commits still exist. Deleting branches is mostly done for decluttering.

To add and commit at the same time, we can use git commit FILENAME [...] for whole files, or git commit -p to interactively select what to include.

Conflicts are a good thing. Suppose we have a file filename.txt which is modified differently in two different branches: when we try to merge, Git stops, marks the conflicting lines in the file, and asks us to resolve the conflict before committing the merge. If the merge conflict needs extra care, e.g. you first want to ask a colleague, use git merge --abort to cancel the merge and return to the state before it.
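
A sketch of resolving such a conflict by hand:

  git merge experiment              # stops and reports a conflict in filename.txt

  # the file now contains markers like:
  #   <<<<<<< HEAD
  #   version of the line from the current branch
  #   =======
  #   version of the line from experiment
  #   >>>>>>> experiment

  # edit filename.txt, keep the wanted version, remove the markers, then:
  git add filename.txt
  git commit                        # completes the merge

  # or, to back out entirely and decide later:
  git merge --abort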

Day 2: Git

Using git branch -M main will change the name of the current branch to main.

To check if you have ssh keys configured with GitHub, use ssh -T git@github.com.

To search over all lines in a project, use git grep -i text (the -i flag makes it case-insensitive). For example, some repositories have “FIXME” next to lines of code, so one can do git grep -i fixme. To search over all changes in the history, use git log -S fixme.

To inspect old code, use git checkout -b temporary-branch hash; if you know an old branch name or tag, you can use it instead of the hash. To go to the state of the code just before a certain hash, tag, or branch, append ~1 to it (its parent commit). After being done with this, simply go back to the main branch via git checkout main, then delete the branch created temporarily for this inspection with git branch -d temporary-branch.

To see who last changed each line of a file and in which commit, use git annotate filename. The output may be large; since it is shown in a pager, you can search for some text, e.g. function1, by typing /function1.

Sometimes the code stops working and we know a commit where it used to work. In this case, git bisect is very helpful. First, start with git bisect start. Then set the points where it used to work and where it does not work, via git bisect good hash/branchname/tag and git bisect bad hash/branchname/tag. Once this is done, git checks out a commit roughly halfway between the good and bad ones so we can test whether it works. If it works, use git bisect good, else use git bisect bad. Repeat until the specific commit that broke things is narrowed down.
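
A sketch of a bisect session (the hash and the test command are hypothetical):

  git bisect start
  git bisect good f0ea950           # a commit where the code worked
  git bisect bad main               # a commit where it is broken

  # git checks out a commit roughly halfway in between; test it, e.g.:
  python test_analysis.py
  git bisect good                   # or: git bisect bad

  # repeat until git prints the first bad commit, then clean up:
  git bisect reset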

Undoing changes is an important concept in any git workflow. The types of changes you can make, or rather should make, depend on whether the commits were pushed to a remote or not, and on whether you want to preserve the commit history or not.

Suppose the commit hash has problems and you want to undo it: git revert hash creates a new commit that reverses its changes, so the history is preserved.

If you made a small mistake in your last commit, you can fix it, run git add filename and then git commit --amend. This will change the commit hash, and you can also modify the commit message.

We can reset the branch history and move to some point in the past via git reset --hard hash; everything after that commit is discarded, so only do this on commits that have not been shared.

git merge, git rebase and git pull modify the current branch, and it's very easy to run them on the wrong branch. To undo their changes, use git log to find the hash of the commit the branch pointed to before the operation, and then run git reset --hard hash.
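
A sketch of undoing such a mistake, assuming the commits were not pushed yet (the hash is hypothetical):

  git log --oneline                 # find where the branch pointed before the merge/rebase/pull
  git reset --hard a1b2c3d          # move the branch back to that commit

  # many of these commands also record the previous position as ORIG_HEAD, so this often works too:
  git reset --hard ORIG_HEAD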

Day 3: Git

Today is about collaborating using git.

Review of some terms and concepts from before: repositories, commits, branches, tags (e.g. phd-printed, paper-submitted), cloning, forking.

origin is just a default name for the remote we cloned from. When we clone a remote repo, its branches show up locally as remote-tracking references, e.g. the remote branch main becomes origin/main. origin/main is just a read-only pointer that only moves when we fetch or pull.

Forking and cloning are similar; on GitHub a fork is a personal copy of another repository, hosted under your own account.

git pull is actually git fetch + git merge. git clone copies everything, meaning all commits and branches.

To keep the repository synchronized, we need to pull and push changes frequently.

First we discuss the centralized workflow (link here). Here we assume that some remote repo acts as the central repository. This workflow is often suitable for small research groups.

A pull request often makes sense between forks, since you could otherwise just do a merge. Pull requests can be created within a project as well, e.g. between a development and main branch. There is a git request-pull command which can issue pull requests to a remote, but you need a remote repo for it to make sense.

For the exercise today, first we clone this remote repo. Then we create a new branch, make a new file, add and commit it, and push the branch to the remote repository. When pushing this branch we need to use -u, because we are pushing a branch that does not exist on the remote yet. -u stands for --set-upstream, which connects the local branch to the newly created remote branch. It can be thought of as telling git to remember this remote branch as the default place to push to and pull from.
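
For example (the branch name is hypothetical):

  git checkout -b new-feature
  # ... add and commit the new file ...
  git push -u origin new-feature    # -u / --set-upstream: remember origin/new-feature as the default

  # later pushes and pulls on this branch no longer need the remote and branch name:
  git push
  git pull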

After you push the branch, GitHub offers to create a pull request. In a pull request, a contributor requests that their changes are merged into the repository, usually into the main or development branch. In bigger projects this avoids code coming in without being checked by someone else, and in large projects it is often considered bad practice to push directly to common branches like dev or main. In case we do not know where origin points, use git remote -v. If the “Compare & pull request” button does not appear on GitHub, you can create a pull request manually.

Now we need to merge the pull request. When doing this, it is very important to check which branches are being merged. Each pull request has an accompanying number which can be used as a reference in conversations; for example, typing #9 will make a cross-reference to pull request #9 on GitHub. The maintainers can request changes, and this will block merging the pull request until the requested change is implemented.

To comment on a specific line, press the + button to the left of the line; you can also drag to select and comment on several lines. To make a suggestion, press the ± file button (leftmost above the text box). To approve the changes, press the approve/merge button. When merging a pull request there might be, for example, 50 commits, so you can press the down arrow next to the “Merge pull request” button, which allows you to squash and merge in case you want all 50 commits to count as one.

If a branch has conflicts, Github will show that in its corresponding pull request.

To get feedback on a branch that you do not want to merge to the main one yet, you can use git push origin branchname then open a draft pull request.

It is a good idea to have the master or main branch write-protected.

Pull requests can be better thought of as change proposals.

In Github you can create issues for ideas as well. It is good practice to create an issue with your ideas before you create a new branch.

In general it's a good idea not to merge your own changes; instead let someone else do it.

When cloning a repo, you can change the name of its directory by putting the new name at the end of the git clone command. Remember it is good practice to create a new branch as the first thing you do if you are modifying code.

git commit -v shows the diff of what is being committed in the editor, which is helpful.

If someone, say person X, makes a push to main, other people cannot push to main since they are missing the commit from X. Hence everyone else needs to pull the remote changes before pushing their own.

To make pull requests from the terminal, see the CLI tools gh, hub, and the git-pr project from Nordic RSE. (Need to check how emacs does this).

From here on we discuss the distributed workflow (link here), which involves forking and is often used in larger projects with contributors in various places. In this workflow anyone can contribute without asking for permission, but the maintainer(s) decide what is merged. There is more than one remote repo involved in this scenario.

Here, we first fork the repo and make pull requests to the main repo. There are two repos we have to keep track of: the fork, which is hosted on e.g. GitHub, and the local clone. Hence from the perspective of the individual contributor there are three repositories: the “central” repository (here also called the upstream repo) to which they want to contribute, their fork, and their local clone. The fork and the upstream repo are in principle independent.

When using the terms close, fix, resolve and their variants (all are listed here) in a commit message or pull request description, GitHub will know what they mean and will link (and close) the referenced issues or pull requests.

Suppose we finish the changes in our local clone of our fork and commit them locally. We can now use git push origin branchname to push them to our fork, and then create a pull request to the upstream repo; a “Compare & pull request” button will appear on the GitHub page of the fork. These pull requests are still from branch to branch, but the branches come from different repos.

Other people now need to fetch from the upstream to incorporate your merge. They can open their fork on Github and there will be a “Fetch upstream” button.
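
Alternatively, from the command line, one can add the upstream repository as a second remote (the URL below is a placeholder) and merge its main branch into the fork:

  git remote add upstream https://github.com/central-organization/project.git
  git remote -v                     # now shows both origin (the fork) and upstream

  git checkout main
  git fetch upstream
  git merge upstream/main           # bring the fork's local main up to date
  git push origin main              # update the fork on GitHub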

Remember the big picture here is collaborative learning.

By default, issue tracking is disabled in forks, so you will not see an “Issues” button on your fork of the project.

Part 3 of the distributed workflow is skipped, also check the section about hooks and bare/non-bare repos.

Day 4: Reproducible Research

Today's materials start here. Research is reproducible when the results can be duplicated using the same materials as were used by the original investigator.

Various factors lead to irreproducible research, such as insufficient documentation, unavailable data or software, and the difficulty of rerunning the right steps in the right order.

The pyramid is Environment -> (Code, Data) -> Documentation -> Article.

In research we would in general want:

  • Reproducibility, where same data and same analysis yield same results.
  • Replicability, where different data and same analysis yield same results.
  • Robustness, where same data different analysis yield same results.
  • Generalizability, where different data and different analysis give same results.

(Bit unsure of what generalizability means above.)

Sample project directory structure: it is a good idea to have a README for the data directory as well, and separate directories for the manuscript, results, src, etc. For collaborative writing, manuscripts.io looks very nice.

Add git tags to mark the points at which the project was submitted, accepted, rebutted, etc.

This presentation has tips on making data analysis more reproducible.

Snakemake, which is similar to Make, can be used to automate the analysis steps.

It is good practice to have isolated environments for each project. In such environments, install only dependencies that you need with well defined versions.

When using pip, we can install specific versions via pip install somepackage==1.1.2. To freeze the current environment into a file named requirements.txt, use pip freeze > requirements.txt, and another person can install the dependencies via pip install -r requirements.txt. You could also install directly from an online Git repo using pip install git+https://github.com/anotheruser/anotherproject.git@sometag.
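
A sketch of the round trip between two machines, with hypothetical package names and versions:

  # on machine A: pin what the project needs
  pip install pandas==1.4.2 scipy==1.8.0
  pip freeze > requirements.txt

  # on machine B: recreate the same set of packages
  pip install -r requirements.txt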

conda is a package and environment manager, and it technically works for any language, not just Python. A graphic that illustrates a high-level overview is shown here.

To install specific versions via conda, use conda install somepackage=1.1.2. To create an environment, use conda create --name myenv. To create a new environment from a requirements.txt, use conda create --name myenv --file requirements.txt, or conda env create -f environment.yml to create it from an environment.yml file (this names the environment as defined in the .yml file). If you do not have access to a central installation directory (like on HPC systems), use conda create --prefix /path/to/env. List all environments using conda info -e. Freeze the current environment using conda list --export > requirements.txt or conda env export > environment.yml. To clean unnecessary cached files, use conda clean (e.g. conda clean --all). conda info gives details of the conda installation. To delete an environment use conda env remove -n myenv, or remove the directory of the environment, which can be found using conda info.
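
The environment.yml format itself is not shown above; a minimal hypothetical example (file contents shown as comments), plus the commands to use it:

  # environment.yml might look like:
  #   name: myenv
  #   channels:
  #     - conda-forge
  #   dependencies:
  #     - python=3.10
  #     - numpy=1.22
  #     - pip

  conda env create -f environment.yml   # create the environment from the file
  conda activate myenv                   # start using it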

Note that in a conda environment we can install pip, and then further packages can be installed with pip inside the conda environment.

Some other dependency management tools for Python include virtualenv, pipenv, poetry, pyenv, and mamba. Other examples are included here.

Now suppose we want to make a workflow. For example, we might want to run a script on some data files, which may generate figures, tables, or even more data, then run more scripts on the outputs of the previous scripts, and so on. With many data files we may want to automate the process or make it more approachable. Some methods to approach this are:

  • Use a GUI
  • Type the commands in manually as described above
  • Use a script (e.g. a Bash script)
  • Use Snakemake (or similar tools such as GNU Make)

Snakemake is inspired by GNU Make, and is based on Python. Compared to the other approaches which are imperative, this is declarative.

Snakemake relies on building blocks called “rules”. Rules relate targets (“output” in the Snakefile) to dependencies (“input”) and commands (“shell”). A sample Snakefile is shown here.

To run Snakemake from scratch, first delete the output from previous runs (the outputs may already be present in the example code repo) via snakemake --delete-all-output -j 1. Now we can run it using snakemake -j 1, where the -j flag stands for the number of cores/jobs and has to be specified. Snakemake will rerun computations if, for example, a data file or code file used to produce the results was changed, and it will only rerun what is needed. Snakemake also integrates with software environments. To get a summary, use snakemake -S; this does not run any jobs but gives an overview of each step and what will be run next.
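
As a rough sketch of a typical round of running the workflow (the input file path is hypothetical):

  snakemake --delete-all-output -j 1   # clean results from earlier runs
  snakemake -j 1                       # run everything that is out of date
  snakemake -S                         # summary: status of each output, nothing is executed

  # touch an input file and only the affected steps are rerun:
  touch data/some_input.txt
  snakemake -j 1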

On the other hand, changing an output file does not trigger any new computations when we run snakemake -j 1. You can also archive workflows; more information about this is here.

In some cases we may want to share the software environment used in some computational experiment. Popular containerization tools are Docker and Singularity; we will look at Docker.

Docker can be thought of as a way to send the computer to the data, when the data is too large or too sensitive to travel over a network. A Docker image is an immutable blueprint, and a container is a running instance of an image. A Dockerfile is a set of instructions for building an image, usually starting from a base image and applying changes to it. Docker images can be converted into Singularity images. More about Docker is here.
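
A sketch of the typical cycle, with a hypothetical image name; a Dockerfile is assumed to exist in the current directory:

  docker build -t myproject:1.0 .      # build an image from ./Dockerfile
  docker images                        # list local images
  docker run --rm -it myproject:1.0    # start a container from the image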

Containers go a long way towards eliminating the “works on my machine” problem. For software with many dependencies, containers offer a method to preserve a computational experiment for future reproducibility. However, exercise caution when using Docker images from online sources. (A short optional Docker exercise is detailed here.)

All of the above is related to the Open Science movement, which encourages researchers to share output beyond the manuscript. This also includes data, namely, data management should be FAIR:

  • Findable (e.g. via a DOI)
  • Accessible (e.g. online in a repo)
  • Interoperable (e.g. in .csv format and not like .pdf)
  • Reusable (e.g. proper licensing)

One can use Zenodo to generate a permanent DOI for things like a GitHub repo. Other services which could be used for this are listed here.

After this, there was a segment on social coding.

Day 5: Jupyter and Documentation

Link to the material. Introductory lessons on JupyterLab and Jupyter notebooks. These are great when you have a story to tell, when you want to show a quick demo of how your code works, or in other scenarios where sharing a linear workflow makes sense. They are also used for interactive work on HPC clusters, or as a teaching tool.

The big picture of Jupyter notebooks is as follows: A notebook server interacts with a kernel which defines the language being used (there are kernels for many programming languages) and serves the notebook locally via HTTP and websockets which can then be accessed using any web browser. The notebook is divided into chunks called cells which can host either code or Markdown text.

One important tool is nbdime, which provides diffing and merging for Jupyter notebooks. This is extremely helpful, for example, when an image changes: git diff on its own would only show unintelligible changes in the raw JSON. We can use Binder to host Jupyter notebooks online in a dynamic manner, i.e. in a way that lets other users change the contents and run them for themselves.
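
A sketch of using nbdime from the command line (notebook names are hypothetical); it can also be wired into git so that notebook diffs become readable:

  pip install nbdime
  nbdiff analysis_old.ipynb analysis_new.ipynb      # content-aware diff in the terminal
  nbdiff-web analysis_old.ipynb analysis_new.ipynb  # rich diff in the browser

  nbdime config-git --enable --global               # make git use nbdime for .ipynb files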

Documentation is important, relevant lesson link is here. Remember your most common collaborator is yourself in the future, so a lot of the time you are writing documentation for yourself. Documentation can come in different forms:

  • Tutorials, often oriented for newcomers.
  • How-to-guides, which are goal oriented and describe how to solve a specific problem.
  • Explanations, which are understanding-oriented and aim to explain a concept.
  • References, which are information-oriented and describe the machinery.

This project is a good example of a project with good documentation.

In-code documentation is a good starting point: assuming version control is used (which it should be), the documentation is always available together with the code, and it can also be used to auto-generate documentation for functions and classes in HTML/PDF etc. later on.

Having README files is also good practice, often written in Markdown or RST.

Static site generators can convert Markdown or RST files into nicer HTML webpages. Some examples of static site generators include Sphinx, Jekyll, pkgdown, MkDocs, GitBook, Hugo, Hexo, and Zola.

In this workshop we will make use of Sphinx, which here is used with Sphinx-flavoured (MyST) Markdown. First install sphinx into the coderefinery conda environment, then start the documentation skeleton by typing sphinx-quickstart. In the livestream they choose the root path as the current directory, answer n to separating the source and build directories, name the project “documentation example”, the author “me”, and the project release “0.1”. This creates the skeleton files used by Sphinx.

We can modify index.rst to list new features etc.; note that the indentation in this file is strict. After adding a line feature-a.md (indented by three spaces) to index.rst, we have to modify the configuration file conf.py: add 'myst_parser' to the extensions list and create a new list source_suffix = ['.rst', '.md']. To create the documentation, first create the file feature-a.md and write about the feature in Markdown format. After that we can build the documentation site via sphinx-build . _build. To view the documentation, use xdg-open _build/index.html on Linux. The theme can be changed in conf.py.
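
Collected into one sequence, the steps above look roughly like this (installing myst-parser alongside sphinx is assumed, since the Markdown support comes from it):

  pip install sphinx myst-parser      # into the active (coderefinery) conda environment
  sphinx-quickstart                   # answer the questions as described above
  # edit index.rst and conf.py, write feature-a.md, then build and view:
  sphinx-build . _build
  xdg-open _build/index.html          # Linux; use open on macOS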

Day 6: Testing and modular code development

Link to software testing materials here, and link to modular code development here.

Untested software is similar to an uncalibrated detector. A good experimental scientist always establishes the accuracy of whichever experimental device they use, and programmers should work in a similar fashion.

Tests are simply units of code which check whether the observed results match with expected results, with the aim of establishing accuracy.

This is very important since, as a project grows, things can break silently without being noticed immediately. Testing helps detect errors early on, and is important for reproducibility since the accuracy of derivative works relies on the accuracy of the current work. Tests also help when refactoring: if the refactored code passes the tests, it most probably means the refactoring preserved the behaviour.

Well structured code is easy to test, ideally functions should be pure with no side-effects and no global variables.

In this workshop we will briefly discuss 3 types of tests:

  • Unit tests: Functions which test one unit at a time, used to test small components that make up the overall system.
  • Integration tests: Tests which verify whether multiple modules are working well together.
  • Regression tests: Similar to integration tests, they operate on the whole codebase. Instead of assuming the test author knows the expected results, they use past code versions to check whether the new version of the code shows the same behaviour.

Test-driven development is an approach to programming where one writes the tests before the code: first write the test, then an empty function template, verify that the test fails, then complete the function until the test passes, and refactor if needed.

Continuous Integration (CI) means automatically testing every single commit/push, so that the tests are run before the change is merged.

Code coverage measures and documents which lines of code have been traversed in a test run. This matters because if you break the code and all tests still pass, checking the code coverage helps narrow down where the untested, possibly buggy, code is.

It is good practice to test before committing, fix broken tests immediately, never deactivate tests “temporarily”, test controlled errors (i.e. things that are expected to fail), create a good and easy testing environment (e.g. using make), and test with numerical tolerance.

In Python, you can create tests by writing functions that use assert statements, then run them via pytest tests1.py. The approx function in the pytest library is helpful when checking numerical results, taking into account the nature of floating-point arithmetic.
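
A minimal hypothetical example of such a test file and how it is run (the Python contents are shown as comments):

  # tests1.py might contain:
  #   import pytest
  #
  #   def add(a, b):
  #       return a + b
  #
  #   def test_add():
  #       assert add(1, 2) == 3
  #       assert add(0.1, 0.2) == pytest.approx(0.3)

  pytest tests1.py        # runs every test_* function in the file
  pytest -v tests1.py     # verbose: list each test and its result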

In the type-along exercise session, they showed how GitHub Actions can be used to run tests automatically whenever code is pushed. Assuming that the tests are in example.py in the root directory of the git repo, on GitHub click the “Actions” button and create a workflow by pressing the “Python Application” button. This shows a template; at the last line simply add example.py after pytest so that GitHub Actions knows which file contains the tests. Now commit the change with this new file to the repo, and see the results.

The next part of the type-along exercise showed how tests are done in Julia.

When running tests on code that uses random numbers, you need to set the random seed, pre-calculate the results that follow from the numbers generated with this seed, and use them as the expected results in the tests. When this becomes laborious, integration tests are used instead; for more information see the relevant section of the tutorial page.

The next section is on modular code development and is meant to be followed along with the live coding. One definition of modular code is code that can be copy-pasted into another project and just work. The steps performed are similar to those in the instructor guide on the corresponding tutorial web page.

Thoughts

Author: Nazaal

Created: 2022-04-04 Mon 23:39
