Git is a distributed version control system (and is a program often used independently of python). A version control system tracks the history of changes in projects with many files, including data files and codes, which many people access simultaneously. Git facilitates identification of changes made, fetching revisions from a cloud repository in git format, and pushing revisions to the cloud.
GitHub is a cloud server that specializes in serving data in the form of git repositories. Many other such cloud services exist, such as Atlassian's BitBucket.
The notebooks that form these lectures are in a git repository served from GitHub. In this notebook, we describe how to access materials from this remote git repository. We will also use this opportunity to introduce some object-oriented terminology like classes, objects, constructor, data members, and methods, which are pervasive in python. Those already familiar with this terminology and GitHub may skip to the next activity.
Lecture notes, exercises, codes, and all accompanying materials can be found in the GitHub repository at https://github.com/jayggg/mth271content
One of the reasons we use git is that many continuously updated datasets, like the COVID-19 dataset, are served in git format. Another reason is that we may want to use current news and fresh data in our activities. Such activities may be prepared with very little lead time, so cloud git repositories are ideal for pushing in new materials as they get developed: once they are in the cloud, you have immediate access to them. After a lecture, the materials may be revised and updated per your feedback and these revisions will also be available for you from GitHub. Therefore, it is useful to be conversant with GitHub.
Let us spend a few minutes today on how to fetch materials from the git repository. In particular, executing this notebook will pull the updated data from GitHub and place it in a location you specify (below).
If you want to know more about git, there are many resources online, such as the Git Handbook. The most common way to fetch materials from a remote repository is using git's command line tools, but for our purposes, the python code in this notebook will suffice.
Repo class in python
We shall use the python module gitpython to work with git. (We already used this module in the first overview lecture.) The documentation of gitpython contains a lot of information on how to use its facilities. The main facility is the class called Repo, which it uses to represent git repositories.
from git import Repo
Python is an object-oriented language. Everything in the workspace is an object. An object is an instance of a class. The definition and features of the class Repo were imported into this workspace by the above line of code. A class has members, which could be data members or attributes (which themselves are objects residing in the class's memory layout), or function members, called methods, which provide functionalities of the class.
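To make this terminology concrete, here is a minimal sketch using a made-up class. The class Folder, its data member path, and its method describe are hypothetical names invented for this illustration; they are not part of gitpython.

```python
class Folder:
    """A toy class (hypothetical, for illustration only)."""

    def __init__(self, path):
        # The constructor: runs when Folder(...) is called.
        self.path = path              # a data member (attribute)

    def describe(self):
        # A method: a function member that can use the object's data.
        return "Folder at " + self.path


f = Folder("/tmp/example")            # f is an object, an instance of Folder
print(type(f).__name__)               # the name of the class of f
print(f.path)                         # access a data member with a dot
print(f.describe())                   # call a method with a dot and parentheses
```

The Repo class follows the same pattern, only with many more data members and methods.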
You can query the functionalities of Repo using help.
Open a cell and type in
help(Repo)
You will see that the output contains the extensive documentation for objects of class Repo, including all its available methods.
Below, we will use the method called clone_from. Here is the class documentation for that method:
help(Repo.clone_from)
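When the full help output is too long to scan, a shorter overview can be handy. The sketch below uses the built-in class str as a stand-in so that it runs without gitpython; the variable names public_names and first_line are made up for this example. The same two ideas should carry over to Repo once gitpython is installed.

```python
# Names of the public members of a class (those not starting with an underscore).
public_names = [name for name in dir(str) if not name.startswith('_')]
print(public_names[:5])

# help() displays a member's documentation string; its first line is a summary.
first_line = str.upper.__doc__.strip().splitlines()[0]
print(first_line)
```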
A class has a special method called the constructor, which you will find listed among its methods as __init__.
help(Repo.__init__)
The __init__ method is called when you type in Repo(...) with the arguments allowed in __init__. Below, we will see how to initialize a Repo object using our GitHub repository.
Next, each of you needs to specify a location on your computer where you want the course materials to reside. This location can be specified as a string, where subfolders are delineated by forward slashes. Please revise the string below to suit your needs.
coursefolder = '/Users/Jay/tmpdir/'
Python provides a module os to perform operating system dependent tasks in a portable (platform-independent) way. If you did not give the full name of the folder, os can attempt to produce it as follows:
import os
os.path.abspath(coursefolder)
Please double-check that the output is what you expected on your operating system: if not, please go back and revise coursefolder before proceeding. (Windows users should see forward slashes converted to double backslashes, while mac and linux users will usually retain the forward slashes.)
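As a small illustration of what "full name" means here, the sketch below starts from a relative folder name and asks os for the corresponding absolute path. The folder name 'tmpdir' and the variable names relative and full are placeholders chosen for this example; the folder need not exist for the calls to work.

```python
import os

relative = 'tmpdir'                   # hypothetical relative folder name
full = os.path.abspath(relative)      # resolved against the current working directory

print(full)
print(os.path.isabs(relative))        # False: not a full path
print(os.path.isabs(full))            # True: now a full path
```

The printed full path depends on the folder from which python was started, which is one more reason to check it before proceeding.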
We proceed to download the course materials from GitHub. These materials will be stored in a subfolder of coursefolder called mth271content, which is the name of the git repository.
repodir = os.path.join(os.path.abspath(coursefolder), 'mth271content')
repodir # full path name of the subfolder
Again, the value of the string variable repodir output above describes the location on your computer where your copy of the course materials from GitHub will reside.
Now there are two cases to consider:
In Case 1, you want to clone the repository. This will create a local copy (on your computer) of the remote cloud repository.
In Case 2, you want to pull updates (only) from the repository, i.e., only changes in the remote cloud that you don't have in your existing local copy.
To decide which case you are in, I will assume the following. If the folder whose name is the value of the string repodir already exists, then I will assume you are in Case 2. Otherwise, you are in Case 1. To find out if a folder exists, we can use another facility from os:
os.path.isdir(repodir)
The output above should be False if you are running this notebook for the first time, per my assumption above. When you run it after you have executed this notebook successfully at least once, you would already have cloned the repository, so the folder will exist.
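To see how os.path.isdir behaves without touching your course folder, here is a minimal sketch that works inside a throwaway temporary folder. All names in it (scratch, newfolder, notafolder, and the file name notes.txt) are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:        # throwaway folder
    newfolder = os.path.join(scratch, 'mth271content')
    before = os.path.isdir(newfolder)                 # folder not yet created
    os.mkdir(newfolder)
    after = os.path.isdir(newfolder)                  # folder now exists

    notafolder = os.path.join(scratch, 'notes.txt')
    with open(notafolder, 'w') as fh:
        fh.write('a file, not a folder')
    file_is_dir = os.path.isdir(notafolder)           # exists, but is a file

print(before, after, file_is_dir)                     # prints: False True False
```

Note that os.path.isdir reports False both when nothing exists at the given path and when the path names a file rather than a folder.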
The code below uses the conditionals if and else (included in the prerequisite reading for this lecture) to check if the folder exists: If it does not exist, a new local copy of the GitHub repository is cloned into your local hard drive. If it exists, then only the differences (or updates) between your local copy and the remote repository are fetched, so that your local copy is up to date with the remote.
if os.path.isdir(repodir):  # if repo exists, pull newest data
    repo = Repo(repodir)
    repo.remotes.origin.pull()
else:                       # otherwise, clone from remote
    repo = Repo.clone_from('https://github.com/jayggg/mth271content',
                           repodir)
repo is an object of class Repo.
Repo(repodir) invokes the constructor, namely the __init__ method.
Repo.clone_from(...) calls the clone_from(...) method.
Now you have the updated course materials on your computer in a local folder. The object repo stores information about this folder, which you gave to the constructor in the string variable repodir, in a data member called working_dir. You can access any data member of an object in memory, and you do so just like you access a method, using a dot . followed by the member name. Here is an example:
repo.working_dir
Note how the Repo object was either initialized with repodir (if that folder exists) or set to clone a remote repository at a URL.
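The choice between the two cases rests on the assumption that an existing folder is a previously made clone. A slightly more cautious variant, which is not what this notebook does, could also look for the top-level entry named .git that a regular (non-bare) clone contains. The helper choose_action and its return values below are hypothetical, and the sketch only makes the decision; it does not call gitpython.

```python
import os
import tempfile


def choose_action(folder):
    """Return 'pull', 'clone', or 'inspect' for the given folder.

    Hypothetical helper: assumes a regular (non-bare) clone keeps its
    git data in a top-level entry named '.git'.
    """
    if os.path.exists(os.path.join(folder, '.git')):
        return 'pull'        # looks like an existing clone: update it
    if not os.path.isdir(folder):
        return 'clone'       # nothing there yet: make a fresh clone
    return 'inspect'         # a folder exists, but it is not a clone


with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, 'mth271content')
    first = choose_action(target)              # folder missing
    os.mkdir(target)
    second = choose_action(target)             # plain folder, no .git entry
    os.mkdir(os.path.join(target, '.git'))     # stand-in for git's data folder
    third = choose_action(target)

print(first, second, third)                    # prints: clone inspect pull
```

In the setting of this notebook, 'pull' would correspond to Repo(repodir) followed by repo.remotes.origin.pull(), and 'clone' to Repo.clone_from(...), exactly as in the cell above.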
The following instructions are for those of you who want to keep tracking the git repository closely in the future. Suppose you want to update your local folder with new materials from GitHub. But at the same time, you want to experiment and modify the notebooks as you like. This can create conflicting versions, which we should know how to handle.
Consider the situation where I have pushed changes to a file into the remote git repository that you want your local folder to reflect. But you have been working with the same file locally and have made changes to it - perhaps you have put a note to yourself to look something up, or perhaps you have found a better explanation, or better code, than what I gave. You want to keep your changes.
You should know that once you modify a file that is tracked by git as a local copy of a remote file, and you ask git to update, git will refuse to overwrite your changes. Because the remote version of the file and the local version of the file are now in conflict, a simple git pull command will fail. Git provides constructs to help resolve such conflicts, but let's try to keep things simple today. The following method is a solution that doubles the number of files, but has the advantage of simplicity:
Go to the repodir location on your computer. Copy the jupyter subfolder as, say, jupyterCopy. Overwrite the copy of this notebook (called 03_Working_with_git.html) in the jupyterCopy folder with this file, which you saved after making your changes to variables like coursefolder above. Note that jupyterCopy is untracked by git: there is no remote folder in the cloud repository with that name. So any changes you make in jupyterCopy will be left untouched by git, and you can freely change any jupyter notebooks within this folder. The next time you run this file from jupyterCopy, it will pull updates from the remote repository into the original jupyter folder. This way you get your updates from the cloud in jupyter and at the same time get to retain your modifications in jupyterCopy.
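The copying step can be done in a file browser, but it can also be scripted. The following is a minimal sketch using the standard module shutil, run on a throwaway stand-in folder rather than on your real course folder. The names demo_repodir, src, dst, and example.txt are invented for the illustration; to apply the idea for real, you would point src and dst at the jupyter and jupyterCopy subfolders of your own repodir.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # Stand-in for repodir, with a jupyter subfolder holding one file.
    demo_repodir = os.path.join(scratch, 'mth271content')
    src = os.path.join(demo_repodir, 'jupyter')
    dst = os.path.join(demo_repodir, 'jupyterCopy')
    os.makedirs(src)
    with open(os.path.join(src, 'example.txt'), 'w') as fh:
        fh.write('original')

    if not os.path.isdir(dst):        # by default copytree fails if dst exists
        shutil.copytree(src, dst)

    # Edit the copy; the original is unaffected.
    with open(os.path.join(dst, 'example.txt'), 'w') as fh:
        fh.write('my notes')
    with open(os.path.join(src, 'example.txt')) as fh:
        original_text = fh.read()
    with open(os.path.join(dst, 'example.txt')) as fh:
        copy_text = fh.read()

print(original_text, '|', copy_text)  # prints: original | my notes
```

The check before copytree matters on later runs: skipping the copy when jupyterCopy already exists avoids overwriting the modifications you have made there.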
Alternatively, if you like working on the command line, instead of running this notebook, you can run the python file update_course.py on the command line. You should move this file outside of the repository and save it after changing the value of the string coursefolder to your specific local folder name.
Author: Jay Gopalakrishnan
License: ©2020. CC-BY-SA