In this exercise you will apply PCA to a large library of facial images to extract the dominant patterns across the images. The dataset, called Labeled Faces in the Wild, or LFW (source), is popular in computer vision and facial recognition. It consists of over a thousand 62 x 47 pixel face images from the internet, the first few of which are displayed below.
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.datasets import fetch_lfw_people # this will download images if
faces = fetch_lfw_people(min_faces_per_person=60) # you don't already have them
fig, ax = plt.subplots(4, 7, figsize=(12, 10))
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='pink')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])
Task 1: We refer to the principal components of face image datasets as eigenfaces. Display the first 28 eigenfaces of this dataset. (They will have little resemblance to the first 28 images displayed above.)
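A minimal sketch of one possible approach, assuming scikit-learn's PCA class and the 62 x 47 image shape noted above (the colormap and grid layout simply mirror the plotting code earlier):

from sklearn.decomposition import PCA

pca = PCA(n_components=28)                 # keep only the first 28 principal components
pca.fit(faces.data)                        # faces.data holds one flattened image per row

fig, ax = plt.subplots(4, 7, figsize=(12, 10))
for i, axi in enumerate(ax.flat):
    # each principal component is a flattened image; reshape to 62 x 47 to view it
    axi.imshow(pca.components_[i].reshape(faces.images[0].shape), cmap='pink')
    axi.set(xticks=[], yticks=[], xlabel=f'eigenface {i}')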
Task 2: Let $N$ be the least number of dimensions to which you can reduce the dataset without exceeding 5% relative error in the Frobenius norm. Find $N$. (This requires you to combine what you learnt in the SVD lecture on the Frobenius norm of the error in best low-rank approximation with what you just learnt in the PCA lecture.)
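One way to combine the two facts, sketched below under the assumption that the error is measured on the mean-centered data matrix that PCA factorizes: the best rank-$k$ approximation has Frobenius error $\sqrt{\sum_{i>k} \sigma_i^2}$, so the relative error is $\sqrt{\sum_{i>k} \sigma_i^2 \,/\, \sum_i \sigma_i^2}$, and $N$ is the smallest $k$ for which this drops to 5% or below.

from sklearn.decomposition import PCA

pca_full = PCA().fit(faces.data)           # full PCA: all singular values of the centered data
sigma2 = pca_full.singular_values_**2      # squared singular values, largest first

# relative Frobenius error after keeping the first k components, for k = 1, 2, ...
rel_err = np.sqrt(np.maximum(0.0, 1 - np.cumsum(sigma2) / sigma2.sum()))
N = int(np.argmax(rel_err <= 0.05)) + 1    # least k whose relative error is at most 5%
print(N, rel_err[N-1])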
Task 3: Repeat PCA restricted to $N$ eigenfaces (with $N$ as in Task 2), this time holding back the last seven images in the dataset. Compute the representations of these seven held-out images in terms of the $N$ eigenfaces. How do the resulting reconstructions compare visually with the original seven images?
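A sketch of one reading of this task, assuming N holds the value found in Task 2: fit PCA on all but the last seven images, project the held-out images onto the $N$ eigenfaces, and map them back to pixel space for a side-by-side comparison.

from sklearn.decomposition import PCA

train = faces.data[:-7]                    # everything except the last seven images
held_out = faces.data[-7:]                 # the seven images withheld from the fit

pca_N = PCA(n_components=N).fit(train)
coords = pca_N.transform(held_out)         # N-dimensional representations of the held-out images
reconstructed = pca_N.inverse_transform(coords)

fig, ax = plt.subplots(2, 7, figsize=(12, 4))
for i in range(7):
    ax[0, i].imshow(held_out[i].reshape(faces.images[0].shape), cmap='pink')
    ax[1, i].imshow(reconstructed[i].reshape(faces.images[0].shape), cmap='pink')
    ax[0, i].set(xticks=[], yticks=[])
    ax[1, i].set(xticks=[], yticks=[])
ax[0, 0].set_ylabel('original')
ax[1, 0].set_ylabel('reconstructed')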
Task 4: Restricting to only the images of Ariel Sharon and Hugo Chavez, represent (and plot) them as points in a three-dimensional space whose axes are the principal axes 4, 5, and 6. Do you see the points somewhat clustered into two groups? (The principal directions 0, 1, 2, and 3 are excluded in this task since they seem to reflect lighting, shadows, and generic facial features, and so are unlikely to be useful for distinguishing individuals.)
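A sketch under one interpretation: fit PCA on the whole dataset (so the principal axes are the same ones discussed above), then plot only the Sharon and Chavez images against axes 4, 5, and 6 (0-indexed, matching the numbering in the task).

from mpl_toolkits.mplot3d import Axes3D    # noqa: F401  (only needed on older matplotlib)
from sklearn.decomposition import PCA

pca7 = PCA(n_components=7).fit(faces.data) # enough components to include axis 6
proj = pca7.transform(faces.data)

fig = plt.figure(figsize=(8, 6))
ax3 = fig.add_subplot(111, projection='3d')
for name, color in [('Ariel Sharon', 'tab:blue'), ('Hugo Chavez', 'tab:red')]:
    idx = faces.target == list(faces.target_names).index(name)
    ax3.scatter(proj[idx, 4], proj[idx, 5], proj[idx, 6], color=color, label=name)
ax3.set_xlabel('principal axis 4')
ax3.set_ylabel('principal axis 5')
ax3.set_zlabel('principal axis 6')
ax3.legend()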