The following corpus contains statements by two Republican presidents, a quote from the Bible, and three quotes from the internet.
c = {
    'Lincoln1865':
        'With malice toward none, with charity for all ...' +
        'let us strive on to finish the work we are in ... ' +
        'to do all which may achieve and cherish a just and lasting peace, ' +
        'among ourselves, and with all nations.',
    'TrumpMay26':
        'There is NO WAY (ZERO!) that Mail-In Ballots ' +
        'will be anything less than substantially fraudulent.',
    'Wikipedia':
        'In 1998, Oregon became the first state in the US ' +
        'to conduct all voting exclusively by mail.',
    'FortuneMay26':
        'Over the last two decades, about 0.00006% of total ' +
        'vote-by-mail votes cast were fraudulent.',
    'TheHillApr07':
        'Trump voted by mail in the Florida primary.',
    'KingJamesBible':
        'Wherefore laying aside all malice, and all guile, and ' +
        'hypocrisies, and envies, and all evil speakings',
}
Task 1: Use scikit-learn's CountVectorizer to make the term-document matrix, particularly noting what the rows and columns correspond to (and compare with the LSA lecture). Display it as a data frame labeled with words and document keys. Does CountVectorizer lemmatize the words?
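A minimal sketch of one way to start is below (assuming a scikit-learn version that provides get_feature_names_out; the names cv, X and df1 are arbitrary). The transpose makes rows correspond to words and columns to documents, as in the LSA lecture.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer()
X = cv.fit_transform(c.values())            # documents x words (sparse)
df1 = pd.DataFrame(X.T.toarray(),           # transpose: rows = words, columns = documents
                   index=cv.get_feature_names_out(),
                   columns=c.keys())
# Hint: checking whether e.g. 'vote', 'voted' and 'votes' appear as separate rows
# helps answer the lemmatization question.
df1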
Task 2: Combine CountVectorizer (see its doc string for help) with a tokenizer function you write using spacy's lemmatization (per what you learnt in the LSA lecture). Remake the term-document matrix. Display your answer. (Your matrix size will depend on whether you used the stop_words='english' argument of CountVectorizer, and may even depend on which version of spacy you are using, since lemmatization has changed across versions.)
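One possible sketch, assuming the spacy model en_core_web_sm is installed; the helper name spacy_lemmatize and the choice to keep only alphabetic tokens are illustrative, not prescribed.

import spacy
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

nlp = spacy.load('en_core_web_sm')

def spacy_lemmatize(text):
    # return spacy lemmas, dropping punctuation and numbers
    return [tok.lemma_ for tok in nlp(text) if tok.is_alpha]

cv_lem = CountVectorizer(tokenizer=spacy_lemmatize, stop_words='english')
X2 = cv_lem.fit_transform(c.values())
df2 = pd.DataFrame(X2.T.toarray(),          # rows = lemmas, columns = documents
                   index=cv_lem.get_feature_names_out(),
                   columns=c.keys())
df2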
Task 3: Use LSA to compute three-dimensional representations of all documents and words using your term-document matrix from Task 2. Print out your vector representation of vote (which will obviously depend on the matrix).
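One common LSA convention, scaling the three leading singular vectors by the singular values, is sketched below; it builds on df2 from the Task 2 sketch and assumes 'vote' survived lemmatization and stop-word removal into the vocabulary.

import numpy as np

A = df2.to_numpy()                          # term-document matrix (words x documents)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
words3 = U[:, :k] * s[:k]                   # 3D representations of words
docs3 = Vt[:k, :].T * s[:k]                 # 3D representations of documents

print(words3[df2.index.get_loc('vote')])    # the vector for 'vote'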
Task 4: Write a function to compute the cosine of the angle between the spans of two word vectors. Compute the cosine of the angle between malice and vote. Compute the cosine of the angle between mail and vote.
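A sketch along these lines is below; the angle between the spans (one-dimensional subspaces) of two vectors lies in [0, π/2], hence the absolute value, and word_vec is a hypothetical helper over the arrays from the Task 3 sketch.

import numpy as np

def span_cosine(u, v):
    # cosine of the angle between span{u} and span{v}
    return abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_vec(w):
    # 3D LSA representation of word w (from the Task 3 sketch)
    return words3[df2.index.get_loc(w)]

print(span_cosine(word_vec('malice'), word_vec('vote')))
print(span_cosine(word_vec('mail'), word_vec('vote')))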
Task 5: In order to moderate the influence of words that appear very frequently, the TF-IDF matrix is often used instead of the term-document matrix. The term frequency-inverse document frequency (TF-IDF) matrix weights the word counts by a measure of how often they appear in the documents, according to a formula given in the scikit-learn user guide. Compute the TF-IDF matrix for the above corpus using TfidfVectorizer.
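A sketch reusing the Task 2 tokenizer so the TF-IDF vocabulary matches the earlier one; by default TfidfVectorizer applies smoothed idf weights and l2 normalization of each document, as described in the user guide.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tv = TfidfVectorizer(tokenizer=spacy_lemmatize, stop_words='english')
Xt = tv.fit_transform(c.values())
dft = pd.DataFrame(Xt.T.toarray(),          # rows = words, columns = documents
                   index=tv.get_feature_names_out(),
                   columns=c.keys())
dft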
Task 6: Recompute the two cosines of Task 4, now using the TF-IDF matrix of Task 5, and compare the results.
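Under one reading of this task (repeat the Task 3 reduction on the TF-IDF matrix, then reuse the Task 4 helper), a sketch could be:

import numpy as np

Ut, st, Vtt = np.linalg.svd(dft.to_numpy(), full_matrices=False)
words3_tfidf = Ut[:, :3] * st[:3]           # 3D word representations from the TF-IDF matrix

def word_vec_tfidf(w):
    return words3_tfidf[dft.index.get_loc(w)]

print(span_cosine(word_vec_tfidf('malice'), word_vec_tfidf('vote')))
print(span_cosine(word_vec_tfidf('mail'), word_vec_tfidf('vote')))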
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd