Latent Semantic Analysis

June 1, 2020

In the study of information retrieval systems, a fundamental question is how to extract documents from a large collection in response to a user query. A simplistic way is to pick out all documents that contain the query words. Is there a more "intelligent" way? Documents usually have interrelated concepts, and if a query could be matched to a concept, perhaps the results extracted would look more intelligent. Documents are written in natural language, using copious amounts of words, yet the number of topics that people write about is usually much smaller than the number of words they use. Latent Semantic Analysis (LSA) is a technique to associate concepts to a space of much lower dimension than a space of words, in order to help with the complex task of information retrieval.

Of course, a number of details have to be worked out. How can one associate words to a vector space? How can one identify topics in this space? How can one represent queries? It should therefore not be surprising that this is a whole field of study in itself: see e.g., [MRS]. Yet, we are able to take a peek into this machinery because the essential mathematical tool used in LSA is something you already know, namely the SVD.

I'm sure yesterday's news is very much on your mind, with the best and the worst of humanity on display: shocking police violence and a successful astronaut launch dominated the headlines. Having failed to get the news out of my mind, I am going to use sentences from current news to introduce LSA.

The next graph, obtained in this lecture from LSA's interpretation of four news headlines mapped into a two-dimensional space, may well be a representation of the country's current state. Today's lecture will show you how to analyze text and graphically display words and their apparent connections like those shown below.

[Figure: a two-dimensional plot of word vectors derived from four news headlines; the word "nation" lies between the cluster of violence-related words and the cluster of space-exploration words.]

If this is a proxy for the country's current state, where we go from here seems critical in this moment.

Natural language processing

Using a few headlines, we make a corpus of text documents, stored as a python dictionary called c below, to illustrate the basics of LSA.

In [1]:
c = {'May31': 
     'Two crises convulse a nation: a pandemic and police violence', 
          
     'May30a': 
     'Nation’s first astronaut launch to orbit from home soil in nearly a decade', 
    
     'May30b': 
     'Death of George Floyd at the hands of police set off protests',

     'May27':
     'SpaceX launch of NASA astronauts is postponed over weather'}

In this corpus, c['May31'] is a document, and the dictionary c holds three more documents. Each document is composed of many words, or terms. We need to simplify the complexities of natural language to be able to compute anything. With my apologies to the writers among you, we proceed by taking the view that the order of words, declensions and conjugations, and often-used words like articles and prepositions are all immaterial. We then view concepts as merely associations of the remaining root words, associations marked by their joint appearances in documents. LSA is useful only under the assumption that, as the corpus of documents becomes large, words that are close in meaning occur in similar documents.

Applying the above-mentioned language simplifications to even a small corpus is a lot of work, if you try to do it from scratch. But thankfully, there are several python modules that excel in natural language processing (NLP). Below, I will use spaCy, one of the recent additions to the python NLP tool set. (Please install it and also make sure to install their English dataset en_core_web_sm, say by python3 -m spacy download en_core_web_sm, before proceeding.)

In [2]:
import spacy
from spacy import displacy

# Install dataset: python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')  

Consider the first sentence in our corpus.

In [3]:
doc0 = nlp('Two crises convulse a nation')

The spacy module is able to process sentences and identify nouns, verbs, direct objects, and their interrelationships. In the cell below, after processing a sentence, the result is saved to an SVG file. The saved image is then displayed in the next cell.

In [4]:
svg = displacy.render(doc0, style="dep", jupyter=False)
with open('../figs/sentence0.svg', 'w') as f: f.write(svg)

Within a Jupyter notebook, one may also directly render the resulting image (without needing to save the image into a file) by specifying jupyter=True instead. Here is an example.

In [5]:
doc1 = nlp('SpaceX launch of NASA astronauts is postponed over weather')
displacy.render(doc1, style='ent', jupyter=True, options={'distance':90})         
[displacy entity view: "SpaceX launch of NASA astronauts is postponed over weather", with NASA highlighted as an ORG entity]

Just in case you are not reading this in a Jupyter notebook and the image does not render on your reading device, I am reproducing the image that displacy generated:

[Image: displacy's entity annotation of the sentence above, with NASA marked as an ORG entity]

As you can see from the annotated sentence, the module can even identify some named entities in the real world: it knows about NASA, but it still does not know about SpaceX! (We will fix this later in this lecture by adding our own named entity terms.)

We shall use the package's capabilities for tokenization and lemmatization. Tokenization is the process of dividing a sentence or a document into words via demarcation characters like white spaces. Lemmatization is the process of identifying the so-called "lemma" of a word, allowing us to group together inflected forms of the word into a single item. Here is the result of tokenization and lemmatization on the above sentence. Note how the originally found words "astronauts" and "postponed" have changed in the output.

In [6]:
[w.lemma_ for w in doc1 if not w.is_stop]
Out[6]:
['spacex', 'launch', 'NASA', 'astronaut', 'postpone', 'weather']

Here we have also removed stop words, a collection of the most common words in a language as previously identified and categorized by the NLP program. In the above example, the words "of", "is", and "over" have been removed. You can view spacy's collection of all stop words if you use the following import statement.

from spacy.lang.en.stop_words import STOP_WORDS
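
For instance, assuming that import has been executed, here is a quick optional peek at the collection (the exact count and contents depend on your spacy version):

print(len(STOP_WORDS))           # how many stop words this spacy version knows
print(sorted(STOP_WORDS)[:10])   # a small alphabetical sample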

Term-document matrix

The important mathematical object for LSA is the term-document matrix, a matrix whose rows correspond to terms, whose columns correspond to documents, and whose element at position $(t, d)$ is 1 if the document in column $d$ contains the term in row $t$, and is 0 otherwise. (You will find variations on this matrix in the literature; e.g., the transpose, or refinements beyond 0/1 entries, are often used.) Let's make this matrix with a quick hack (where we have now also asked spacy to ignore punctuation). The matrix will be displayed as a pandas data frame to easily visualize the term and document labels of its rows and columns.

In [7]:
import pandas as pd
from scipy.sparse import lil_matrix

d = {}
for j, dok in enumerate(c.keys()):
    tokens = [w.lemma_ for w in nlp(c[dok])
              if not w.is_stop and w.pos_ != 'PUNCT']
    for t in tokens:
        d[t] = d.setdefault(t, [])
        d[t] += [j]
A = lil_matrix((len(d.keys()), len(c.keys())), dtype=int)
for i, t in enumerate(d.keys()):
    for j in d[t]:
        A[i, j] = 1   
Adf = pd.DataFrame(A.toarray(), index=d.keys(), columns=c.keys()); Adf
Out[7]:
May31 May30a May30b May27
crisis 1 0 0 0
convulse 1 0 0 0
nation 1 1 0 0
pandemic 1 0 0 0
police 1 0 1 0
violence 1 0 0 0
astronaut 0 1 0 1
launch 0 1 0 1
orbit 0 1 0 0
home 0 1 0 0
soil 0 1 0 0
nearly 0 1 0 0
decade 0 1 0 0
death 0 0 1 0
George 0 0 1 0
Floyd 0 0 1 0
hand 0 0 1 0
set 0 0 1 0
protest 0 0 1 0
spacex 0 0 0 1
NASA 0 0 0 1
postpone 0 0 0 1
weather 0 0 0 1

We might want to have a combination of first and last names treated as a single entity, but the code is not yet smart enough to do that. We'll fix that later, after introducing the idea of LSA. For the moment, note how words have been represented as row vectors and documents as column vectors. This is enough to understand the basics of LSA, as we see next.
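
For example, here is a quick optional peek, using ordinary pandas indexing on the data frame Adf made above, at one such row vector and one such column vector:

Adf.loc['nation']    # row vector of the term "nation": which documents contain it
Adf['May31']         # column vector of the document 'May31': which terms occur in it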

The idea of LSA

The idea is to perform an SVD of the term-document matrix and use its low-rank approximation, with a rank $k$ much less than the number of words. The dominant singular vectors may then be expected to capture patterns in the association of words. Of course, this is not an exact technique, but it does give us something numerical to work with for analysis of large amounts of textual data. For our example of the 4-document corpus, we shall use the best rank-2 approximation (as discussed in the SVD lecture), the difference now being that we don't actually need the low-rank matrix, but rather the SVD components that go into it.

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set();
from numpy.linalg import norm
from scipy.linalg import svd
In [9]:
u, s, vt = svd(A.toarray())
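
Although only the SVD factors are needed below, here is a quick optional check (a sketch, not part of the retrieval pipeline) that assembles the best rank-2 approximation from these factors and measures how far it is from A:

A2 = u[:, :2] @ np.diag(s[:2]) @ vt[:2, :]   # best rank-2 approximation of A
print(norm(A.toarray() - A2))                # its distance from A in the Frobenius norm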

Here is the first important step in creating mathematical objects to represent documents. Working with the best rank-$k$ approximation, we use the first $k$ right singular vectors to represent each document as a $k$-vector.

In [10]:
k = 2                                   # Limit to rank k
Vt = vt[:k, :]
pd.DataFrame(Vt, columns=c.keys())      # Documents as k-vectors
Out[10]:
May31 May30a May30b May27
0 -0.269907 -0.829243 -0.109002 -0.477101
1 0.458490 -0.149538 0.854138 -0.194611

The second important step is to represent words (or terms) as mathematical objects in the same space. Unlike documents, the words/terms are represented by the first $k$ left singular vectors, weighted by the associated singular values. The first five word tokens are displayed below as vectors.

In [11]:
US = u[:, :k] @ np.diag(s[:k])
usp = pd.DataFrame(US, index=d.keys()) # Words as k-vectors
usp.head()
Out[11]:
0 1
crisis -0.269907 0.458490
convulse -0.269907 0.458490
nation -1.099150 0.308952
pandemic -0.269907 0.458490
police -0.378909 1.312628

Many words are mapped to the same point in such a small example. In other words, there is not enough data in our small corpus to distinguish between such words.

Nonetheless, even in our very small dataset, it is very interesting to see the associations between words in terms of how different the word vectors are. Ignoring the magnitudes of the word vectors, one may measure how different two word vectors (both drawn from the origin) are by the angle they make with each other, or equivalently by the cosine of that angle: two vectors of the same magnitude are farther apart when the cosine of their angle is smaller. Recall that the cosine of the angle between two unit vectors is easy to compute, since it equals their dot product.
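
Since this computation recurs below, here is a one-line helper equivalent to the inline computations in the following cells (shown only for clarity; the cells below do not use it):

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return u.dot(v) / (norm(u) * norm(v))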

In [12]:
astronaut = usp.loc['astronaut', :].to_numpy()
crisis    = usp.loc['crisis', :].to_numpy()
police    = usp.loc['police', :].to_numpy()

Here is an example of an uncanny association the program has made:

The word crisis is closer to police than to astronaut! This conclusion follows from the two cosine computations below.

In [13]:
crisis.dot(police) / norm(police) / norm(crisis)
Out[13]:
0.9686558216875333
In [14]:
crisis.dot(astronaut) / norm(astronaut) / norm(crisis)
Out[14]:
0.27103529721595343

Let's dig into this a bit more. In our small example, since words are two-dimensional vectors, we can plot them to see how they are dispersed in terms of angles measured from the origin. Below, the origin is marked as a red star, and points representing the terminal point of word vectors are annotated with the word.

In [15]:
w = {}; us = np.round(US, 8) # w[(x,y)] = list of words at that point
usr = list(set([tuple(us[i, :]) for i in range(us.shape[0])]))
for i in range(len(usr)):
    w[usr[i]] = []
    for j in range(usp.shape[0]):
        if norm(usp.iloc[j, :] - usr[i]) < 1e-6:
            w[usr[i]] += [usp.index[j]]            
fig = plt.figure(figsize=(10, 8)); ax = fig.gca()
ax.arrow(0, 0, crisis[0], crisis[1],       width=0.015, alpha=0.3)
ax.arrow(0, 0, police[0], police[1],       width=0.015, alpha=0.3)
ax.arrow(0, 0, astronaut[0], astronaut[1], width=0.015, alpha=0.3)
ax.scatter(US[: , 0], US[: ,1], alpha=0.5)
ax.scatter(0, 0,  color='r', marker='*', s=150, alpha=0.6);
for i, key in enumerate(w.keys()):
    ax.annotate(', '.join(w[key]), (key[0], key[1]))
ax.set_xlim((-1.5, 0.7)); ax.set_ylim((-0.5, 1.5));
ax.set_title('Alignment of Word Vectors');

The first takeaway from this figure is that the angles the word vectors make are clearly in accordance with the previous cosine computations.

The second is more enigmatic. In our small corpus of four sentences, there were two categories of news, one of violence and one of exploration. While we as humans can instinctively make that categorization, it is uncanny that some mathematics and a few simple lines of code can separate the words associated with the two categories into different areas of a "word space". The word that appears somewhat in the middle of the two categories is nation, as it ought to be. (The same figure, after a rotation, modification of arrows, and cleaned-up positioning, is what I presented at the beginning of the lecture.) You should now have an idea of why LSA can be useful when applied to a large corpus with many more words, documents, and hidden associations (or latent semantics).

Language is complex

Let me return to the news headlines. During this entire spring term, bad news has been accumulating: the pandemic and its repercussions battering our country, highlighting and amplifying many of our systemic problems, and then yet another instance of police violence. All this made the few glorious moments last weekend especially precious. When SpaceX lifted NASA astronauts Bob Behnken and Doug Hurley into orbit on a reusable rocket that returned to an autonomous droneship, it was a moment of reassurance that our science, industry, and innovation remain peerless. Let me now focus on this bit of positive news and add more sentences on these exciting developments to our text corpus.

In [16]:
c.update(
{
 'May30Launch':
 'Go NASA! Go SpaceX! Godspeed, Bob and Doug!', 
    
 'NYTimes':
 'NASA and SpaceX officials more often than not ' + 
 'just call the pilots of this historic mission Bob and Doug.',
 
 'May30NASAblog':
 'The first stage of the SpaceX rocket has landed ' + 
 'successfully on the droneship, Of Course I Still Love You.',

 'May31NYTimes':
 'After a 19 hour trip, NASA astronauts Bob and Doug ' + 
 'successfully docked their capsule and entered the space station.',             
})

Do you see the complexities of dealing with real examples of natural language?

The ocean droneship, controlled by an autonomous robot to help the rocket land, has a curious name: "Of Course I Still Love You". Standard tokenization would simply split the name into its component words; it would be better to keep it as a single entity, and we will do so below using spacy's facilities. But before that, just in case you don't know: the ship's curious name is taken from the novel The Player of Games by Iain M. Banks, and Elon Musk named his droneship in tribute to Banks. Let me add a sentence from Musk and another from Banks's novel to our text corpus.

In [17]:
c.update(
{     
 '2015Musk':     
 'West Coast droneship under construction will ' + 
 'be named Of Course I Still Love You',
 
 'IainBanks':    
 'These friends of yours are ships. ' + 
 'Yes, both of them. ' + 
 'What are they called? ' + 
 'Of Course I Still Love You and Just Read The Instructions. ' + 
 'They are not warships? ' + 
 'With names like that?'
})

To deal with text items like the droneship name, we need to use the phrase matching capabilities of spacy. Three examples of terms to match are added to a TerminologyList below. Spacy also does some default phrase matching, e.g., it identifies the phrase "nearly a decade" as a temporal unit. It is up to us whether we want to use the entire phrase as a token or not. Below, we will modify the tokenization step to keep all phrases as tokens with _ in place of white space so we can recognize them easily.

In [18]:
from spacy.matcher import PhraseMatcher

terms = ['SpaceX', 
         'Of Course I Still Love You',
         'Just Read The Instructions']

patterns = [nlp.make_doc(text) for text in terms]

matcher = PhraseMatcher(nlp.vocab)
matcher.add('TerminologyList', None, *patterns)

Next, we use a slicing feature (called Span) of spacy to capture the matched phrases as tokens. We also use the ents attribute provided by spacy to add named entities (a real-world object with a name) to the document object.

In [19]:
from spacy.tokens import Span

def tokensfromdoc(doc):
    d = nlp(doc)
    matches = matcher(d)
    for match_id, start, end in matches:
        term = Span(d, start, end, label='myterms')
        d.ents = list(d.ents) + [term]            
    tokens = [w.lemma_ for w in d 
              # no pronouns
              if w.pos_ != 'PRON'   \
              # no punctuations
              and w.pos_ != 'PUNCT' \
              # not Beginning of a named entity
              and w.ent_iob_ != 'B' \
              # not Inside a named entity
              and w.ent_iob_ != 'I' \
              # not a stop word
              and not w.is_stop]
    tokens += [de.string.rstrip().replace(' ', '_') for de in d.ents]  
    return tokens

def dictokens(corpora):
    d = {}
    for j, dok in enumerate(corpora.keys()):
        for t in tokensfromdoc(corpora[dok]):
            d[t] = d.setdefault(t, [])
            d[t] += [j]
    return d

The above function dictokens makes a dictionary whose keys are the lemmatized words and whose values are the lists of documents (by number) in which each word appears. This can be used to make the term-document matrix as we did for the initial example.
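
As a quick optional check that the phrase matching works as intended, one can apply tokensfromdoc to one of the newly added documents; the droneship name should come out as a single underscore-joined token:

tokensfromdoc(c['May30NASAblog'])   # should include 'Of_Course_I_Still_Love_You'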

In [20]:
def tdmatrix(d, corpora):    
    A = lil_matrix((len(d.keys()), len(corpora.keys())), dtype=int)
    for i, t in enumerate(d.keys()):
        for j in d[t]:
            A[i, j] = 1   
    return A
In [22]:
d = dictokens(c)
A = tdmatrix(d, c)
Adf = pd.DataFrame(A.toarray(), index=d.keys(), columns=c.keys())

This array is now a bit too big to meaningfully display here, but here are a few elements of one row, which now displays the droneship name as a single token.

In [23]:
Adf.loc[['Of_Course_I_Still_Love_You'], 'NYTimes':].T
Out[23]:
Of_Course_I_Still_Love_You
NYTimes 0
May30NASAblog 1
May31NYTimes 0
2015Musk 1
IainBanks 1

Queries and retrieval

Returning to the question of information retrieval posed at the beginning of the lecture, let's consider how to handle queries. Free text query is a form of query popular in internet searches, where query terms are typed in without any connecting operators. The query terms can be any collection of words extracted from the corpus. A query vector can be made by taking the mean of these query word vectors and normalizing it to a unit vector. (Again, this is not a foolproof strategy, but it is a simple prescription that often works well.) The cosine separation between the query vector and each document vector is then computed. The most relevant documents are considered to be the ones whose document vectors make the smallest angle with the query vector, so they are returned first in the output list. Here is a quick implementation suitable for small datasets.

In [24]:
def retrieve(querytokns, W, Vt, c):

    """Given a list of query word token numbers "querytokns",
    all word vectors "W" and all document vectors "Vt.T"
    extracted from a corpus c, retrieve the documents
    relevant to the query."""

    q = W[querytokns, :].mean(axis=0)    # query vector = mean of the query word vectors
    q /= norm(q)                         # normalize it to a unit vector
    idx = np.argsort(Vt.T @ q)[::-1]     # documents sorted by decreasing alignment with q
    kl = list(c.keys())
    keys = [kl[i] for i in idx]          # document labels, most relevant first
    docs = [c[k] for k in keys]          # document texts, in the same order
    return docs, keys, idx

To use this on our current corpus example, let's make the word and document vectors first.

In [25]:
uu, ss, vvt = svd(A.toarray())   # SVD & rank k approximation
k = 4
U = uu[:, :k]; S = ss[:k]; 
Vt = vvt[:k, :]                  # Document vectors
W = uu[:, :k] @ np.diag(ss[:k])  # Word vectors 

Here is an example of a query with two words, astronaut and first, and the first three matching documents generated by the above strategy.

In [26]:
myquery = np.where((Adf.index=='astronaut') | (Adf.index=='first'))[0]
docs, keys, idx = retrieve(myquery, W, Vt, c)
docs[:3]
Out[26]:
['Nation’s first astronaut launch to orbit from home soil in nearly a decade',
 'SpaceX launch of NASA astronauts is postponed over weather',
 'The first stage of the SpaceX rocket has landed successfully on the droneship, Of Course I Still Love You.']

The first result contains both search words, while the other two each contain one of the two search words. Below is another example, where, somewhat surprisingly, a document without the query word (but certainly one we would consider relevant) is listed within the top three matches.

In [27]:
myquery = np.where(Adf.index=='droneship')[0]
docs, keys, idx = retrieve(myquery, W, Vt, c)
docs[:3]
Out[27]:
['The first stage of the SpaceX rocket has landed successfully on the droneship, Of Course I Still Love You.',
 'These friends of yours are ships. Yes, both of them. What are they called? Of Course I Still Love You and Just Read The Instructions. They are not warships? With names like that?',
 'West Coast droneship under construction will be named Of Course I Still Love You']
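
If you are curious how this ranking comes about, the scores that retrieve sorts are just the inner products of the normalized query vector with the document vectors. Here is an optional peek at them for the droneship query, repeating the internals of retrieve:

q = W[myquery, :].mean(axis=0)    # query vector for the single word 'droneship'
q /= norm(q)                      # normalize it to a unit vector
pd.Series(Vt.T @ q, index=list(c.keys())).sort_values(ascending=False)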


Let me conclude this introduction to the subject of text analysis and information retrieval by noting that the concept of mapping words to vectors is finding increasingly significant uses, such as in automatic translation. I have tried to present ideas in minimal examples, but you should be aware that there are many extensions in the literature. Some extensions are easy to see as emerging from computational experience. An example is a generalization that we will see in an exercise that modifies the term-document matrix to account for the number of times a term occurs in a document. The resulting matrix will have frequency-weighted entries, not just 0 and 1 as above. This is built into scikit-learn's text analysis facilities, which we shall use in the exercise.
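
As a small preview of those facilities (a minimal sketch, assuming scikit-learn is installed; note that it uses scikit-learn's own tokenizer rather than the spacy pipeline above), a frequency-weighted term-document structure can be obtained directly from the raw documents:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()                   # counts how many times each term occurs in each document
X = cv.fit_transform(list(c.values()))   # sparse matrix: rows are documents, columns are terms
X.shape                                  # note: this is the transpose layout of our matrix A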