Exercise: Google's toy graph

At SNAP (Stanford Network Analysis Project), you will find a large graph data set (with over 5 million edges) called the Google Web Graph. (This graph was released by google for a programming competition.) Download this graph, examine the file, and guess the format. You will need to load this data into your computer's memory to solve this exercise. Think carefully about what tools you would use so as not to run out of memory.

Task: Setting restart probability $r= 1-0.85,$ compute the pageranks of all vertices on this graph. Reuse the power method function you wrote in Exercise: Power method for large graphs.

In [1]:
import os
import urllib
import shutil
import numpy as np
In [2]:
# The file is located here: 
url = "https://snap.stanford.edu/data/web-Google.txt.gz"

# Download and copy it here using the code below:
f =  '../../data_external/web-Google.txt.gz' 

if not os.path.exists(f): 
    r = urllib.request.urlopen(url)
    fo = open(f, 'wb')
    shutil.copyfileobj(r, fo)
    fo.close()