Working with git

April 2, 2020

Git is a distributed version control system (a program often used independently of python). A version control system tracks the history of changes in projects with many files, including data files and code, which many people access simultaneously. Git makes it easy to identify the changes made, fetch revisions from a cloud repository in git format, and push revisions to the cloud.

GitHub is a cloud server that specializes in serving data in the form of git repositories. Many other such cloud services exist, such as Atlassian's BitBucket.

The notebooks that form these lectures are in a git repository served from GitHub. In this notebook, we describe how to access materials from this remote git repository. We will also use this opportunity to introduce some object-oriented terminology like classes, objects, constructor, data members, and methods, which are pervasive in python. Those already familiar with this terminology and GitHub may skip to the next activity.

Our materials in GitHub

Lecture notes, exercises, code, and all accompanying materials can be found in the GitHub repository at https://github.com/jayggg/mth271content

One of the reasons we use git is that many continuously updated datasets, like the COVID-19 dataset, are served in git format. Another reason is that we may want to use current news and fresh data in our activities. Such activities may be prepared with very little lead time, so cloud git repositories are ideal for pushing in new materials as they get developed: once they are in the cloud, you have immediate access to them. After a lecture, the materials may be revised and updated per your feedback and these revisions will also be available for you from GitHub. Therefore, it is useful to be conversant with GitHub.

Let us spend a few minutes today on how to fetch materials from the git repository. In particular, executing this notebook will pull the updated data from GitHub and place it in a location you specify (below).

If you want to know more about git, there are many resources online, such as the Git Handbook. The most common way to fetch materials from a remote repository is using git's command line tools, but for our purposes, the python code in this notebook will suffice.

Git Repo class in python

We shall use the python module gitpython to work with git. (We already used this module in the first overview lecture.) The documentation of gitpython contains a lot of information on how to use its facilities. The main facility is the class called Repo, which it uses to represent git repositories.

In [1]:
from git import Repo

Python is an object-oriented language. Everything in the workspace is an object. An object is an instance of a class. The definition and features of the class Repo were imported into this workspace by the above line of code. A class has members, which could be data members or attributes (which themselves are objects residing in the class' memory layout), or function members, called methods, which provide functionalities of the class.
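As a small illustration of these terms, the sketch below inspects two built-in objects. The particular values are chosen only for illustration and have nothing to do with git.

```python
# Every value in python is an object, and type() reports its class.
z = 3 + 4j
s = "git"

print(type(z))         # <class 'complex'>
print(type(s))         # <class 'str'>

# Data members (attributes) are accessed with a dot.
print(z.real, z.imag)  # 3.0 4.0

# Methods are function members: access them with a dot, then call them.
print(s.upper())       # GIT
print(z.conjugate())   # (3-4j)

# isinstance checks whether an object is an instance of a class.
print(isinstance(s, str))  # True
```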

You can query the functionalities of Repo using help. Open a cell and type in

help(Repo)

You will see that the output contains the extensive documentation for objects of class Repo, including all its available methods.

Below, we will use the method called clone_from. Here is the class documentation for that method:

In [2]:
help(Repo.clone_from)
Help on method clone_from in module git.repo.base:

clone_from(url, to_path, progress=None, env=None, multi_options=None, **kwargs) method of builtins.type instance
    Create a clone from the given URL
    
    :param url: valid git url, see http://www.kernel.org/pub/software/scm/git/docs/git-clone.html#URLS
    :param to_path: Path to which the repository should be cloned to
    :param progress: See 'git.remote.Remote.push'.
    :param env: Optional dictionary containing the desired environment variables.
        Note: Provided variables will be used to update the execution
        environment for `git`. If some variable is not specified in `env`
        and is defined in `os.environ`, value from `os.environ` will be used.
        If you want to unset some variable, consider providing empty string
        as its value.
    :param multi_options: See ``clone`` method
    :param kwargs: see the ``clone`` method
    :return: Repo instance pointing to the cloned directory

Classes have a special method called the constructor, which you will find listed among their methods as __init__.

In [3]:
help(Repo.__init__)
Help on function __init__ in module git.repo.base:

__init__(self, path=None, odbt=<class 'git.db.GitCmdObjectDB'>, search_parent_directories=False, expand_vars=True)
    Create a new Repo instance
    
    :param path:
        the path to either the root git directory or the bare git repo::
    
            repo = Repo("/Users/mtrier/Development/git-python")
            repo = Repo("/Users/mtrier/Development/git-python.git")
            repo = Repo("~/Development/git-python.git")
            repo = Repo("$REPOSITORIES/Development/git-python.git")
            repo = Repo("C:\Users\mtrier\Development\git-python\.git")
    
        - In *Cygwin*, path may be a `'cygdrive/...'` prefixed path.
        - If it evaluates to false, :envvar:`GIT_DIR` is used, and if this also evals to false,
          the current-directory is used.
    :param odbt:
        Object DataBase type - a type which is constructed by providing
        the directory containing the database objects, i.e. .git/objects. It will
        be used to access all object data
    :param search_parent_directories:
        if True, all parent directories will be searched for a valid repo as well.
    
        Please note that this was the default behaviour in older versions of GitPython,
        which is considered a bug though.
    :raise InvalidGitRepositoryError:
    :raise NoSuchPathError:
    :return: git.Repo

The __init__ method is called when you type in Repo(...) with the arguments allowed in __init__. Below, we will see how to initialize a Repo object using our github repository.
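To see the constructor mechanism in isolation, here is a minimal sketch with a toy class. The class Folder and its members are hypothetical and are not part of gitpython; they only mimic the two ways a Repo object is obtained, namely through the constructor Repo(...) and through a method called on the class, like Repo.clone_from(...).

```python
class Folder:
    """A toy class, loosely mirroring how Repo is used (not gitpython code)."""

    def __init__(self, path):
        # The constructor stores its argument in a data member.
        self.path = path

    def describe(self):
        # An ordinary method: it works with the data of one object.
        return "Folder at " + self.path

    @classmethod
    def from_parts(cls, parent, name):
        # A class method is called on the class itself. Here it builds
        # and returns a new object, much like clone_from returns a Repo.
        return cls(parent + "/" + name)


a = Folder("/tmp/demo")                # calls Folder.__init__
b = Folder.from_parts("/tmp", "demo")  # calls the class method

print(a.path)        # /tmp/demo
print(b.describe())  # Folder at /tmp/demo
```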

Your local copy of the repository

Next, each of you needs to specify a location on your computer where you want the course materials to reside. This location can be specified as a string, where subfolders are delineated by forward slashes. Please revise the string below to suit your needs.

In [4]:
coursefolder = '/Users/Jay/tmpdir/'

Python provides a module os to perform operating system dependent tasks in a portable (platform-independent) way. If you did not give the full name of the folder, os can attempt to produce it as follows:

In [5]:
import os
os.path.abspath(coursefolder)
Out[5]:
'/Users/Jay/tmpdir'

Please double-check that the output is what you expected on your operating system: if not, please go back and revise coursefolder before proceeding. (Windows users should see forward slashes converted to double backslashes, while mac and linux users will usually retain the forward slashes.)
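If you are unsure what os.path.abspath does with a name that is not a full path, the following sketch may help. The folder name tmpdir is only an example and need not exist on your computer, since abspath works on the string alone.

```python
import os

# A relative name is completed using the current working directory.
relative = "tmpdir"
full = os.path.abspath(relative)
print(full)
print(full == os.path.join(os.getcwd(), "tmpdir"))  # True

# abspath also tidies the path, e.g. it resolves ".." components.
tidy = os.path.abspath(os.path.join("tmpdir", "sub", ".."))
print(tidy == full)  # True

# abspath does not expand "~"; os.path.expanduser is meant for that.
home_based = os.path.expanduser(os.path.join("~", "tmpdir"))
print(home_based)
```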

We proceed to download the course materials from GitHub. These materials will be stored in a subfolder of coursefolder called mth271content, which is the name of the git repository.

In [6]:
repodir = os.path.join(os.path.abspath(coursefolder), 'mth271content')
repodir   # full path name of the subfolder
Out[6]:
'/Users/Jay/tmpdir/mth271content'

Again, the value of the string variable repodir output above describes the location on your computer where your copy of the course materials from GitHub will reside.

Two cases

Now there are two cases to consider:

  1. Are you downloading the remote git repository for the first time?
  2. Or, are you returning to the remote repository to update the materials?

In Case 1, you want to clone the repository. This will create a local copy (on your computer) of the remote cloud repository.

In Case 2, you want to pull updates (only) from the repository, i.e., only changes in the remote cloud that you don't have in your existing local copy.

To decide which case you are in, I will assume the following. If the folder whose name is the value of the string repodir already exists, then I will assume you are in Case 2. Otherwise, you are in Case 1. To find out if a folder exists, we can use another facility from os:

In [7]:
os.path.isdir(repodir)
Out[7]:
True

The output above should be False if you are running this notebook for the first time, per my assumption above. When you run it after you have executed this notebook successfully at least once, you would already have cloned the repository, so the folder will exist.
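The change from False to True can be reproduced safely in a scratch location. The sketch below uses python's tempfile module, which is an addition for illustration only; creating an empty folder here merely stands in for a first clone.

```python
import os
import tempfile

# Work inside a temporary folder so nothing on your computer is changed.
with tempfile.TemporaryDirectory() as scratch:
    demo = os.path.join(scratch, "mth271content")

    before = os.path.isdir(demo)  # folder not created yet
    os.makedirs(demo)             # stands in for a first clone
    after = os.path.isdir(demo)   # folder now exists

print(before, after)  # False True
```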

Clone or pull

The code below uses the conditionals if and else (included in the prerequisite reading for this lecture) to check if the folder exists. If it does not exist, a new local copy of the GitHub repository is cloned onto your local hard drive. If it exists, then only the differences (or updates) between your local copy and the remote repository are fetched, so that your local copy is up to date with the remote.

In [8]:
if os.path.isdir(repodir):      # if repo exists, pull newest data 
    repo = Repo(repodir) 
    repo.remotes.origin.pull()
else:                           # otherwise, clone from remote
    repo = Repo.clone_from('https://github.com/jayggg/mth271content', 
                           repodir)

  • Here repo is an object of class Repo.
  • Repo(repodir) invokes the constructor, namely the __init__ method.
  • Repo.clone_from(...) calls the method clone_from, which returns a new Repo object.

Now you have the updated course materials on your computer in a local folder. The object repo stores information about this folder, which you gave to the constructor in the string variable repodir, in a data member called working_dir. You can access any data member of an object in memory, and you do so just as you access a method, using a dot . followed by the member name. Here is an example:

In [9]:
repo.working_dir
Out[9]:
'/Users/Jay/tmpdir/mth271content'

Note how the Repo object was either initialized with repodir (if that folder exists) or set to clone a remote repository at a URL.
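For readers who prefer git's command line tools, the same two-case decision can be sketched as follows. The helper name git_command is hypothetical. It only builds the command as a list of strings and does not run it, so it works even where git is not installed.

```python
import os

URL = "https://github.com/jayggg/mth271content"


def git_command(repodir, url=URL):
    """Return the git command (as a list) matching the two cases."""
    if os.path.isdir(repodir):
        # Case 2: the folder exists, so pull updates inside it.
        return ["git", "-C", repodir, "pull"]
    # Case 1: no local copy yet, so clone the remote into repodir.
    return ["git", "clone", url, repodir]


# Example call with a folder name that is assumed not to exist.
print(git_command(os.path.join(os.getcwd(), "no_such_folder_yet")))
```

To actually execute the command, the returned list could be passed to subprocess.run(..., check=True), which assumes that git is installed and available on your system path.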

Updated and future materials

The following instructions are for those of you who want to keep tracking the git repository closely in the future. Suppose you want to update your local folder with new materials from GitHub. But at the same time, you want to experiment and modify the notebooks as you like. This can create conflicting versions, which we should know how to handle.

Consider the situation where I have pushed changes to a file into the remote git repository that you want your local folder to reflect. But you have been working with the same file locally and have made changes to it - perhaps you have put a note to yourself to look something up, or perhaps you have found a better explanation, or better code, than what I gave. You want to keep your changes.

You should know that once you modify a file that is tracked by git as a local copy of a remote file, and you ask git to update, git will refuse to overwrite your changes. Because the remote version of the file and the local version of the file are now in conflict, a simple git pull command will fail. Git provides constructs to help resolve such conflicts, but let's try to keep things simple today. The following method is a solution that doubles the number of files, but has the advantage of simplicity:

Go to the repodir location on your computer. Copy the jupyter subfolder as, say, jupyterCopy. Overwrite the copy of this notebook (called 03_Working_with_git.html) in the jupyterCopy folder with this file, which you saved after making your changes to variables like coursefolder above. Note that jupyterCopy is untracked by git: there is no remote folder in the cloud repository with that name. So any changes you make in jupyterCopy will be left untouched by git, and you can freely change any jupyter notebooks within this folder. The next time you run this file from jupyterCopy, it will pull updates from the remote repository into the original jupyter folder. This way you get your updates from the cloud in jupyter and at the same time retain your modifications in jupyterCopy.
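The copying step can also be done from python. The sketch below assumes the folder layout described above and uses a hypothetical helper name, copy_jupyter_folder. It leaves an existing copy untouched, so that modifications you made earlier are not lost.

```python
import os
import shutil


def copy_jupyter_folder(repodir, copyname="jupyterCopy"):
    """Copy repodir/jupyter to repodir/copyname unless the copy exists."""
    source = os.path.join(repodir, "jupyter")
    target = os.path.join(repodir, copyname)
    if os.path.isdir(target):
        # Keep an existing copy untouched: it may hold your changes.
        return target
    shutil.copytree(source, target)  # copies the folder and its contents
    return target
```

With repodir defined as earlier in this notebook, calling copy_jupyter_folder(repodir) would return the full path of the jupyterCopy folder.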

Alternatively, if you like working on the command line, instead of running this notebook, you can run the python file update_course.py on the command line. You should move this file outside of the repository and save it after changing the value of the string coursefolder to your specific local folder name.