Git a distributed version control system (and is a program often used independently of python). A version control system tracks the history of changes in projects with many files, including data files, and codes, which many people access simultaneously. Git facilitates identification of changes made, fetching revisions from a cloud repository in git format, and pushing revisions to the cloud.
The notebooks that form these lectures are in a git repository served from GitHub. In this notebook, we describe how to access materials from this remote git repository. We will also use this opportunity to introduce some object-oriented terminology like classes, objects, constructor, data members, and methods, which are pervasive in python. Those already familiar with this terminology and GitHub may skip to the next activity.
Lecture notes, exercises, codes, and all accompanying materials can be found in the GitHub repository at https://github.com/jayggg/mth271content
One of the reasons we use git is that many continuously updated datasets, like the COVID-19 dataset, are served in git format. Another reason is that we may want to use current news and fresh data in our activities. Such activities may be prepared with very little lead time, so cloud git repositories are ideal for pushing in new materials as they get developed: once they are in the cloud, you have immediate access to them. After a lecture, the materials may be revised and updated per your feedback and these revisions will also be available for you from GitHub. Therefore, it is useful to be conversant with GitHub.
Let us spend a few minutes today on how to fetch materials from the git repository. In particular, executing this notebook will pull the updated data from GitHub and place it in a location you specify (below).
If you want to know more about git, there are many resources online, such as the Git Handbook. The most common way to fetch materials from a remote repository is using
git's command line tools, but for our purposes, the python code in this notebook will suffice.
Repoclass in python¶
We shall use the python module
gitpython to work with
git. (We already used this module in the first overview lecture. The documentation of
gitpython contains a lot of information on how to use its facilities. The main facility is the class called
Repo which it uses
to represent git repositories.
from git import Repo
Python is an object-oriented language. Everything in the workspace is an object. An object is an instance of a class. The definition and features of the class
Repo were imported into this workspace by the above line of code. A class has members, which could be data members or attributes (which themselves are objects residing in the class' memory layout), or function members, called methods, which provide functionalities of the class.
You can query the functionalities of
Open a cell and type in
You will see that the ouput contains the extensive documentation for objects of class
Repo, including all its available methods.
Below, we will use the method called
clone_from. Here is the class documentation for that method:
Help on method clone_from in module git.repo.base: clone_from(url, to_path, progress=None, env=None, multi_options=None, **kwargs) method of builtins.type instance Create a clone from the given URL :param url: valid git url, see http://www.kernel.org/pub/software/scm/git/docs/git-clone.html#URLS :param to_path: Path to which the repository should be cloned to :param progress: See 'git.remote.Remote.push'. :param env: Optional dictionary containing the desired environment variables. Note: Provided variables will be used to update the execution environment for `git`. If some variable is not specified in `env` and is defined in `os.environ`, value from `os.environ` will be used. If you want to unset some variable, consider providing empty string as its value. :param multi_options: See ``clone`` method :param kwargs: see the ``clone`` method :return: Repo instance pointing to the cloned directory
Classes have a special method called constructor, which you would find listed among its methods as
Help on function __init__ in module git.repo.base: __init__(self, path=None, odbt=<class 'git.db.GitCmdObjectDB'>, search_parent_directories=False, expand_vars=True) Create a new Repo instance :param path: the path to either the root git directory or the bare git repo:: repo = Repo("/Users/mtrier/Development/git-python") repo = Repo("/Users/mtrier/Development/git-python.git") repo = Repo("~/Development/git-python.git") repo = Repo("$REPOSITORIES/Development/git-python.git") repo = Repo("C:\Users\mtrier\Development\git-python\.git") - In *Cygwin*, path may be a `'cygdrive/...'` prefixed path. - If it evaluates to false, :envvar:`GIT_DIR` is used, and if this also evals to false, the current-directory is used. :param odbt: Object DataBase type - a type which is constructed by providing the directory containing the database objects, i.e. .git/objects. It will be used to access all object data :param search_parent_directories: if True, all parent directories will be searched for a valid repo as well. Please note that this was the default behaviour in older versions of GitPython, which is considered a bug though. :raise InvalidGitRepositoryError: :raise NoSuchPathError: :return: git.Repo
__init__ method is called when you type in
Repo(...) with the arguments allowed in
__init__. Below, we will see how to initialize a
Repo object using our github repository.
Next, each of you need to specify a location on your computer where you want the course materials to reside. This location can be specified as a string, where subfolders are delineated by forward slash. Please revise the string below to suit your needs.
coursefolder = '/Users/Jay/tmpdir/'
Python provides a module
os to perform operating system dependent tasks in a portable (platform-independent) way. If you did not give the full name of the folder,
os can attempt to produce it as follows:
import os os.path.abspath(coursefolder)
Please double-check that the output is what you expected on your operating system: if not, please go back and revise
coursefolder before proceeding. (Windows users should see forward slashes converted to double backslashes, while mac and linux users will usually retain the forward slashes.)
We proceed to download the course materials from GitHub. These materials will be stored in a subfolder of
mth271content, which is the name of the git repository.
repodir = os.path.join(os.path.abspath(coursefolder), 'mth271content') repodir # full path name of the subfolder
Again, the value of the string variable
repodir output above describes the location on your computer where your copy of the course materials from GitHub will reside.
Now there are two cases to consider:
In Case 1, you want to clone the repository. This will create a local copy (on your computer) of the remote cloud repository.
In Case 2, you want to pull updates (only) from the repository, i.e., only changes in the remote cloud that you don't have in your existing local copy.
To decide which case you are in, I will assume the following. If the folder whose name is the value of the string
repodir already exists, then I will assume you are in Case 2. Otherwise, you are in Case 1. To find out if a folder exists, we can use another facility from
The output above should be
False if you are running this notebook for the first time, per my assumption above. When you run it after you have executed this notebook successfully at least once, you would already have cloned the repository, so the folder will exist.
The code below uses the conditionals
else (included in the prerequisite reading for this lecture) to check if the folder exists: If it does not exist, a new local copy of the GitHub repository is cloned into your local hard drive. If it exists, then only the differences (or updates) between your local copy and the remote repository are fetched, so that your local copy is up to date with the remote.
if os.path.isdir(repodir): # if repo exists, pull newest data repo = Repo(repodir) repo.remotes.origin.pull() else: # otherwise, clone from remote repo = Repo.clone_from('https://github.com/jayggg/mth271content', repodir)
repois an object of class
Repo(repodir)invokes the constructor, namely the
Now you have the updated course materials in your computer in a local folder. The object
repo stores information about this folder, which you gave to the constructor in the string variable
repodir, in a data member called
working_dir. You can access any data members of an object in memory, and you do so just like you access a method, using a dot
. followed by the member name. Here is an example:
Note how the
Repo object was either initialized with
repodir (if that folder exists) or set to clone a remote repository at a URL.
The following instructions are for those of you who want to keep tracking the git repository closely in the future. Suppose you want to update your local folder with new materials from GitHub. But at the same time, you want to experiment and modify the notebooks as you like. This can create conflicting versions, which we should know how to handle.
Consider the situation where I have pushed changes to a file into the remote git repository that you want your local folder to reflect. But you have been working with the same file locally and have made changes to it - perhaps you have put a note to yourself to look something up, or perhaps you have found a better explanation, or better code, than what I gave. You want to keep your changes.
You should know that once you modify a file that is tracked by git as a local copy of a remote file, and you ask git to update, git will refuse to overwrite your changes. Because the remote version of the file and the local version of the file are now in conflict, a simple git pull command will fail. Git provides constructs to help resolve such conflicts, but let's try to keep things simple today. The following method is a solution that doubles the number of files, but has the advantage of simplicity:
Go to the
repodir location in your computer. Copy the
jupyter subfolder as, say
jupyterCopy. Overwrite the copy of this notebook (called
03_Working_with_git.html) in the
jupyterCopy folder with this file, which you saved after making your changes to variables like
coursefolder above. Note that
jupyerCopy is untracked by git: there is no remote folder in the cloud repository with that name. So any changes you make in
jupyterCopy will be left untouched by git. So you can freely change any jupyter notebooks within this folder. The next time you run this file from
jupyterCopy it will pull updates from the remote repository into the original
jupyter folder. This way you get your updates from the cloud in
jupyter and at the same time get to retain your modifications in
Alternately, if you like working on the command line, instead of running this notebook, you can run the python file update_course.py on the command line. You should move this file outside of the repository and save it after changing the value of the string
coursefolder to your specific local folder name.