The following corpus contains statements by two Republican presidents, a quote from the Bible, and three quotes from the internet.

In [1]:

```
c = {
    'Lincoln1865':
        'With malice toward none, with charity for all ...' +
        'let us strive on to finish the work we are in ... ' +
        'to do all which may achieve and cherish a just and lasting peace, ' +
        'among ourselves, and with all nations.',
    'TrumpMay26':
        'There is NO WAY (ZERO!) that Mail-In Ballots ' +
        'will be anything less than substantially fraudulent.',
    'Wikipedia':
        'In 1998, Oregon became the first state in the US ' +
        'to conduct all voting exclusively by mail.',
    'FortuneMay26':
        'Over the last two decades, about 0.00006% of total ' +
        'vote-by-mail votes cast were fraudulent.',
    'TheHillApr07':
        'Trump voted by mail in the Florida primary.',
    'KingJamesBible':
        'Wherefore laying aside all malice, and all guile, and ' +
        'hypocrisies, and envies, and all evil speakings',
}
```

**Task 1:** Use scikit-learn's `CountVectorizer` to make the term-document matrix, particularly noting what the rows and columns correspond to (and compare with the LSA lecture). Display it as a data frame labeled with words and document keys. Does `CountVectorizer` lemmatize the words?

**Task 2:** Combine `CountVectorizer` (see its doc string for help) with a tokenizer function you write using spaCy's lemmatization (per what you learnt in the LSA lecture). Remake the term-document matrix. Display your answer. (Your matrix size will depend on whether you used the `stop_words='english'` argument of `CountVectorizer`, and may even depend on which version of spaCy you are using, since lemmatization has changed across versions.)
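
The mechanics of plugging a custom tokenizer into `CountVectorizer` can be sketched without spaCy. The example below is only an illustration under stated assumptions: `TOY_LEMMAS` and `toy_lemma_tokenizer` are hypothetical stand-ins for a real lemmatizer, and the two sentences are made up. For the task itself, the body of the tokenizer would instead return spaCy lemmas (for example via each token's `lemma_` attribute).

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for a real lemmatizer: a tiny hand-written lookup table.
TOY_LEMMAS = {'voted': 'vote', 'votes': 'vote', 'ballots': 'ballot'}

def toy_lemma_tokenizer(text):
    """Split text into alphabetic tokens and map each one to its toy lemma."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

docs = ['Trump voted by mail.', 'Votes cast by mail ballots.']

# CountVectorizer calls the tokenizer on each document instead of
# using its default token pattern.
vectorizer = CountVectorizer(tokenizer=toy_lemma_tokenizer)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

With this tokenizer, `voted` and `Votes` are both counted under the single feature `vote`.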

**Task 3:** Use LSA to compute three-dimensional representations of all documents and words using your term-document matrix from Task 2. Print out your vector representation of `vote` (which will obviously depend on the matrix).
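
One way to obtain such representations is a truncated SVD, sketched below with NumPy on a hypothetical 5-word by 4-document count matrix `A` (the numbers are made up). Scaling the singular vectors by the singular values is an assumed convention here; the LSA lecture may define the word and document vectors differently, so adjust accordingly.

```python
import numpy as np

# Hypothetical count matrix: rows are words, columns are documents.
A = np.array([
    [1, 0, 0, 1],
    [0, 2, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [2, 0, 1, 0],
], dtype=float)

k = 3  # number of dimensions to keep

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Assumed convention: scale the leading singular vectors by the singular values.
word_vecs = U[:, :k] * s[:k]    # one k-dimensional row per word
doc_vecs = Vt[:k, :].T * s[:k]  # one k-dimensional row per document

# Rank-k approximation of A built from the same factors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(word_vecs.shape, doc_vecs.shape)
```

With a labeled data frame, the row of `word_vecs` for a particular word can be looked up by the position of that word in the data frame's index.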

**Task 4:** Write a function to compute the cosine of the angle between the spans of two word vectors. Compute the cosine of the angle between `malice` and `vote`. Compute the cosine of the angle between `mail` and `vote`.
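
A minimal sketch of such a function follows. It assumes that "the angle between the spans" means the angle between the two lines through the origin, so the result does not depend on the sign of either vector; that reading is why the absolute value appears. The function name and test vectors are illustrative only.

```python
import numpy as np

def cosine_between_spans(u, v):
    """Cosine of the angle between the lines spanned by nonzero vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Taking the absolute value makes the result independent of the sign
    # of u or v (an assumed reading of "angle between the spans").
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Vectors at 45 degrees to each other.
print(cosine_between_spans([1, 0, 0], [1, 1, 0]))
# Opposite directions span the same line.
print(cosine_between_spans([1, 0, 0], [-2, 0, 0]))
```

If the signed cosine between the vectors themselves is wanted instead, dropping `abs` gives the usual cosine similarity.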

**Task 5:** To moderate the influence of words that appear very frequently, the TF-IDF matrix is often used instead of the term-document matrix. The term frequency-inverse document frequency (TF-IDF) matrix weights the word counts by a measure of how often they appear in the documents, according to a formula found in the scikit-learn user guide. Compute the TF-IDF matrix for the above corpus using `TfidfVectorizer`.
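
To make the weighting concrete, the sketch below recomputes the default `TfidfVectorizer` output by hand on a hypothetical three-document corpus. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency of a word is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the word, and each document's weighted counts are then scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; rows of the matrices below are documents.
docs = ['vote by mail', 'mail fraud', 'vote vote vote']

# Raw counts, shape (number of documents, number of distinct words).
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency

# Weight the counts, then normalize each document row to unit length.
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, tfidf))
```

As with `CountVectorizer`, the rows here correspond to documents, so a transpose is needed to match a term-document layout.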

**Task 6:** Recompute the two cosines of Task 4, now using the TF-IDF matrix of Task 5, and compare them with the values obtained in Task 4.

In [2]:

```
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
```