Exercise: Ovarian cancer data

Download the data file ovariancancer.npy and copy it into the data_external folder. This file contains data on 216 patients, the first 121 of whom have ovarian cancer, while the remaining 95 do not. For each patient, the expression of some biomarkers is provided through 4000 spectroscopic measurements. The original data source is ccr.cancer.gov. The data is also packaged with Matlab®, which has an online documentation page on it.

High-dimensional biological and genetic datasets are often highly correlated, i.e., patients can be expected to have significant overlap in genes and biomarkers. Such datasets therefore generally benefit from PCA and dimensionality reduction. In this exercise, you will work with a realistic dataset that exemplifies such a reduction.

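%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
import numpy as np

# Rows are patients, columns are spectroscopic measurements
X = np.load('../../data_external/ovariancancer.npy')
X.shape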

(216, 4000)

Task 1: Project the 4000-variable data onto its first 3 principal components and view the projections in a three-dimensional plot.
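As a starting point, here is a minimal sketch of one way to compute such a projection, using the SVD of the centered data. It runs on a small synthetic array rather than on the ovarian cancer data, and the names `pca_project`, `X_demo`, and `Z` are hypothetical helpers introduced only for this illustration. Other routes (for example, a library PCA class) would work equally well.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the first k principal components.

    Hypothetical helper for illustration: centers each column, takes the
    SVD, and returns the coordinates along the top-k right singular vectors.
    """
    Xc = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # shape (n_samples, k)

# Synthetic stand-in (an assumption, NOT the real dataset):
# 216 samples, 50 correlated variables driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(216, 3))
mixing = rng.normal(size=(3, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(216, 50))

Z = pca_project(X_demo, 3)
print(Z.shape)
```

The three columns of `Z` can then be passed to a 3D scatter plot, for instance `ax.scatter(Z[:, 0], Z[:, 1], Z[:, 2])` on an axis created with `projection='3d'`. For the exercise, coloring the first 121 points differently from the remaining 95 makes it easier to see how the two patient groups separate.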

Task 2: Plot the cumulative explained variance for this dataset. What is the percentage of variance lost in restricting the data from 4000 to 3 dimensions? How many dimensions are needed to keep 95% of the variance?
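The quantities asked for here can be read off the singular values of the centered data. The sketch below shows one possible approach on a synthetic array, not on the ovarian cancer data. The function names `cumulative_explained_variance` and `n_components_for` and the array `X_demo` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative fraction of total variance captured by the leading components."""
    Xc = X - X.mean(axis=0)                       # center each variable
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    var = s ** 2                                  # proportional to component variances
    return np.cumsum(var) / var.sum()

def n_components_for(cum, threshold):
    """Smallest number of components whose cumulative fraction reaches threshold.

    Assumes the threshold is actually reached somewhere in cum.
    """
    return int(np.argmax(cum >= threshold)) + 1

# Synthetic stand-in (an assumption, NOT the real dataset).
rng = np.random.default_rng(0)
latent = rng.normal(size=(216, 3))
mixing = rng.normal(size=(3, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(216, 50))

cum = cumulative_explained_variance(X_demo)
lost_with_3 = 1.0 - cum[2]                        # fraction of variance lost with 3 components
n95 = n_components_for(cum, 0.95)
print(lost_with_3, n95)
```

Plotting `cum` against the component count, e.g. `plt.plot(np.arange(1, len(cum) + 1), cum)`, gives the cumulative explained variance curve. Multiplying `lost_with_3` by 100 expresses the loss as a percentage.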