Get the top-level directory of a Git repository

(Note: the Python code in this post is based on code from a comment by Ryne Everett on this Stack Overflow page.)

A computational project is ideally organised in one project folder containing several subfolders corresponding for example to data, scripts, results and documentation.

# Example file tree of a dummy project
# The project path is: /home/matthieu/myProject

myProject/
├── data
│   ├── samples
│   │   └── samples-info.tsv
│   └── sequences
│       ├── genome-xxx.fa
│       └── genome-yyy.fa
├── doc
│   ├── LICENSE
│   └── README
├── results
└── scripts
    ├── 01-genome-comparison
    │   ├── 010-extract-proteins.py
    │   └── 020-reciprocal-blast.sh
    └── 02-statistics
        └── 010-plots.R

When scripts need to read data files and write output files, it is often useful to determine the absolute path of the project's top folder and then build the paths to the files of interest from it. If the project folder is version controlled with Git, this can be done easily with a Git command:

git rev-parse --show-toplevel

Building paths from the top project folder enables one to move around the scripts within the project file hierarchy without having to worry about manually updating path variables in the scripts, as long as the data and results folders are stable. In addition, since the absolute path to the top project folder is determined at runtime, one can move around the whole project folder (or share it with collaborators) without breaking any hard-coded path in the scripts.
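To make the second point concrete, here is a minimal sketch in which the top folder is passed in as a plain string rather than obtained from Git. The helper name `build_paths` and the second project location are hypothetical, used only for illustration. Only the prefix changes when the project moves; the layout relative to the top folder stays the same.

```python
from pathlib import PurePosixPath


def build_paths(top_dir):
    """Build project file paths relative to a given top folder.

    `build_paths` is a hypothetical helper for illustration; in a real
    script `top_dir` would come from `git rev-parse --show-toplevel`.
    """
    top = PurePosixPath(top_dir)
    return {
        "genome": top / "data" / "sequences" / "genome-xxx.fa",
        "results": top / "results",
    }


# The same code works wherever the project folder lives.
for location in ("/home/matthieu/myProject", "/srv/shared/myProject"):
    paths = build_paths(location)
    print(paths["genome"])
```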

Here are examples showing how to do it in Python, R and bash scripts. The key part of the Python code was taken directly from a comment by Ryne Everett on Stack Overflow, and the R and bash versions were translated from the Python code.

1 Python

Let's assume that the script 010-extract-proteins.py needs to access the file genome-xxx.fa located in data/sequences:

# Import
import subprocess
import os

# Get the project top folder
TOP_DIR = subprocess.Popen(['git', 'rev-parse', '--show-toplevel'],
                           stdout=subprocess.PIPE).communicate()[0].rstrip().decode('utf-8')

# Build the path to the data file
DATA_DIR = os.path.join(TOP_DIR, "data/sequences")
DATA_FILE = os.path.join(DATA_DIR, "genome-xxx.fa")

# Build the path to the output file (data/sequences/proteome-xxx.fa)
OUT_FILE = os.path.join(DATA_DIR, "proteome-xxx.fa")

The Python script can now be moved anywhere within the project: it will still be able to read the data file and write the output file to the correct location.
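One caveat is that `git rev-parse` resolves the repository from the current working directory, not from the location of the script, so the snippet above assumes the script is launched from somewhere inside the project. A possible way to relax this is sketched below: the Git command is run from a chosen start directory (typically the script's own directory), and if Git is unavailable or reports no repository, the code falls back to walking up the tree looking for a `.git` entry. The function name `find_top_dir` and the fallback are assumptions added for illustration, not part of the original code.

```python
import subprocess
from pathlib import Path


def find_top_dir(start):
    """Return the top-level folder of the Git repository containing `start`.

    Hypothetical helper: asks Git first, then falls back to a manual search
    for a `.git` entry (only an approximation of Git's own lookup).
    """
    start = Path(start).resolve()
    try:
        # Run Git from `start` so the result does not depend on the
        # directory the script was launched from.
        out = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=start,
            check=True,
            capture_output=True,
            text=True,
        ).stdout.strip()
        return Path(out).resolve()
    except (OSError, subprocess.CalledProcessError):
        # Git is missing, or it does not recognise a repository here:
        # look for a `.git` entry in `start` and each of its parents.
        for folder in (start, *start.parents):
            if (folder / ".git").exists():
                return folder
    raise RuntimeError(f"No Git repository found above {start}")
```

In a script, this could be called as `TOP_DIR = find_top_dir(Path(__file__).parent)`, after which the paths are built exactly as before.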

2 R

If the R script 010-plots.R needs to access the file genome-yyy.fa located in data/sequences:

# Get the project top folder
TOP_DIR = system2("git", args = c("rev-parse", "--show-toplevel"), stdout = TRUE)

# Build the path to the data file
DATA_DIR = file.path(TOP_DIR, "data/sequences")
DATA_FILE = file.path(DATA_DIR, "genome-yyy.fa")

# Build the path to the output file (results/plot.png)
PLOT_FILE = file.path(TOP_DIR, "results", "plot.png")

Again, the R script can now be moved anywhere within the project: it will still be able to read the data file and write the plot file to the correct location.

3 bash

Finally, how can the bash script 020-reciprocal-blast.sh access the files genome-xxx.fa and genome-yyy.fa located in data/sequences?

# Get the project top folder
TOP_DIR=$(git rev-parse --show-toplevel)

# Build the path to the data files
DATA_DIR="${TOP_DIR}/data/sequences"
DATA_FILE_XXX="${DATA_DIR}/genome-xxx.fa"
DATA_FILE_YYY="${DATA_DIR}/genome-yyy.fa"

# Build the path to the output file (results/reciprocal-blast.tsv)
OUT_FILE="${TOP_DIR}/results/reciprocal-blast.tsv"

And that's it!