Exercise: Word vectors

The following corpus contains statements by two Republican presidents, a quote from the Bible, and three quotes from the internet.

In [1]:
c = {
    'With malice toward none, with charity for all ...' +
    'let us strive on to finish the work we are in ... ' +
    'to do all which may achieve and cherish a just and lasting peace, ' +
    'among ourselves, and with all nations.',

    'There is NO WAY (ZERO!) that Mail-In Ballots ' +
    'will be anything less than substantially fraudulent.',

    'In 1998, Oregon became the first state in the US ' +
    'to conduct all voting exclusively by mail.',

    'Over the last two decades, about 0.00006% of total ' +
    'vote-by-mail votes cast were fraudulent.',

    'Trump voted by mail in the Florida primary.',

    'Wherefore laying aside all malice, and all guile, and ' +
    'hypocrisies, and envies, and all evil speakings',
}

Task 1: Use scikit-learn's CountVectorizer to make the term-document matrix, particularly noting what the rows and columns correspond to (and compare with the LSA lecture). Display it as a data frame labeled with words and document keys. Does CountVectorizer lemmatize the words?

Task 2: Combine CountVectorizer (see its docstring for help) with a tokenizer function you write using spaCy's lemmatization (per what you learnt in the LSA lecture). Remake the term-document matrix. Display your answer. (Your matrix size will depend on whether you used the stop_words='english' argument of CountVectorizer, and may even depend on which version of spaCy you are using, since lemmatization has changed across versions.)
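The plumbing for a custom tokenizer can be sketched without spaCy. In the sketch below, `toy_lemma_tokenizer` and its lookup table `TOY_LEMMAS` are hypothetical stand-ins for a spaCy-based lemmatizer; only the `tokenizer` and `token_pattern` arguments of `CountVectorizer` are real API. In an actual solution, the function body would instead run the text through a spaCy pipeline and return each token's lemma.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lookup table standing in for spaCy's lemmatizer.
TOY_LEMMAS = {"cats": "cat", "chased": "chase", "mice": "mouse"}

def toy_lemma_tokenizer(text):
    """Split text into alphabetic tokens and map each to its toy lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [TOY_LEMMAS.get(t, t) for t in tokens]

# Made-up documents for illustration only.
toy_docs = ["Cats chased mice", "A cat will chase a mouse"]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = CountVectorizer(tokenizer=toy_lemma_tokenizer, token_pattern=None)
X = vectorizer.fit_transform(toy_docs)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

With this toy lemmatizer, "cats" and "cat" share one column, which is the effect the custom tokenizer is meant to achieve.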

Task 3: Use LSA to compute three-dimensional representations of all documents and words using your term-document matrix from Task 2. Print out your vector representation of vote (which will obviously depend on the matrix).
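One common way to obtain such representations is a truncated singular value decomposition. The sketch below applies it to a small made-up term-document matrix (words in rows, documents in columns) with k = 2; for the exercise, k would be 3. Scaling the word and document coordinates by the singular values is one convention among several and is an assumption here, so check it against the lecture's definition.

```python
import numpy as np

# Made-up 5-word by 3-document count matrix (words in rows).
A = np.array([
    [2, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 2],
], dtype=float)

k = 2  # number of latent dimensions to keep

# Thin SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Assumed convention: scale coordinates by the leading singular values.
word_vecs = U[:, :k] * s[:k]     # one k-dimensional row per word
doc_vecs = Vt[:k, :].T * s[:k]   # one k-dimensional row per document

# Rank-k approximation of A built from the kept components.
A_k = word_vecs @ Vt[:k, :]

print(word_vecs.shape, doc_vecs.shape)
print(np.linalg.norm(A - A_k), s[k])
```

The last line prints the Frobenius error of the rank-k approximation next to the first discarded singular value; with only one singular value dropped, the two agree.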

Task 4: Write a function to compute the cosine of the angle between the spans of two word vectors. Compute the cosine of the angle between malice and vote. Compute the cosine of the angle between mail and vote.
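A minimal sketch of such a function using NumPy follows. The name `cosine` and the choice to raise an error on a zero vector are illustrative. It returns the signed cosine between the two vectors; since the span of a nonzero vector is a line, the angle between two spans lies between 0 and 90 degrees, and its cosine is the absolute value of this quantity.

```python
import numpy as np

def cosine(u, v):
    """Return the cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norms = np.linalg.norm(u) * np.linalg.norm(v)
    if norms == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return float(u @ v / norms)

# Orthogonal vectors, parallel vectors, and opposite vectors.
print(cosine([1, 0, 0], [0, 1, 0]))
print(cosine([1, 2, 3], [2, 4, 6]))
print(abs(cosine([1, 0], [-1, 0])))  # absolute value: angle between spans
```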

Task 5: To moderate the influence of words that appear very frequently, the TF-IDF matrix is often used instead of the term-document matrix. The term frequency-inverse document frequency (TF-IDF) matrix weights the word counts by a measure of how often the words appear across the documents, according to a formula found in the scikit-learn user guide. Compute the TF-IDF matrix for the above corpus using TfidfVectorizer.
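To make the weighting concrete, the sketch below recomputes scikit-learn's default TF-IDF by hand on a made-up corpus and compares the result with `TfidfVectorizer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 for n documents, and each document's row is then scaled to unit Euclidean length. Note that rows are documents here, as in scikit-learn's output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration only.
toy_docs = ["red apple green apple", "green banana", "red cherry red banana"]

# Raw counts: rows = documents, columns = words.
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then normalize each document row to unit length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```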

Task 6: Recompute the two cosines of Task 4, now using the TF-IDF matrix of Task 5 and compare.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd