5.1. Vector Space Model

Vector Space Models map arbitrary inputs to numeric vectors of fixed length. For a given task, you are free to define a set of \(N\) relevant features, which can be extracted from the input. Each of the \(N\) feature-extraction functions returns how often the corresponding feature appears in the input. Each component of the vector representation belongs to one feature, and the value of this component is the count of this feature in the input.

5.1.1. Bag-of-Words Model

In the general case the vector space model yields a vector whose components are the frequencies of pre-defined features in the given input. In the special case of text (= documents), a vector space model is applied where the features are defined to be all words of the vocabulary \(V\). I.e. each component of the resulting vector corresponds to a word \(w \in V\) and the value of the component is the frequency of this word in the given document. This vector space model for texts is the so-called Bag of Words (BoW) model, and the frequency of a word in a given document is denoted term frequency. Accordingly, a set of documents is modelled by a Bag of Words matrix, whose rows belong to documents and whose columns belong to words.
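The construction of such a BoW matrix can be sketched in a few lines of Python, e.g. with scikit-learn's CountVectorizer. The three toy documents below are made up for illustration; depending on the scikit-learn version, get_feature_names() may have to be used instead of get_feature_names_out().

```python
# Minimal BoW sketch with scikit-learn (toy documents, for illustration only)
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()           # features = all words of the vocabulary V
bow = vectorizer.fit_transform(docs)     # sparse matrix: rows = documents, columns = words

print(vectorizer.get_feature_names_out())  # the vocabulary
print(bow.toarray())                       # term frequencies tf(i, j)
```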

5.1.2. Bag-of-Words Variants

The entries of the BoW matrix, as introduced above, are the term frequencies. I.e. the entry in row \(i\), column \(j\), denoted \(tf(i,j)\), states how often the term (word) of column \(j\) appears in document \(i\).

Another option is the binary BoW. Here, the binary entry in row \(i\), column \(j\) just indicates if term \(j\) appears in document \(i\). The entry has value 1 if the term appears at least once, otherwise it is 0.

TF-IDF BoW: The drawback of using the term frequency \(tf(i,j)\) as matrix entry is that all terms are weighted equally; in particular, rare words with a strong semantic focus are weighted in the same way as very frequent words such as articles. TF-IDF is a weighted term frequency (TF). The weights are the inverse document frequencies (IDF). Actually, there are different definitions for the calculation of TF-IDF. A common definition is

\[ \mbox{tf-idf}(i,j) = tf(i,j) \cdot \log(\frac{N}{df_j}), \]

where \(tf(i,j)\) is the frequency of term \(j\) in document \(i\), \(N\) is the total number of documents and \(df_j\) is the number of documents which contain term \(j\). For words which occur in all documents,

\[ \log(\frac{N}{df_j}) = \log(\frac{N}{N}) = 0, \]

i.e. such words are disregarded in a BoW with TF-IDF entries. In contrast, words with a very strong semantic focus usually appear in only a few documents. Then the small value of \(df_j\) yields a high IDF, i.e. the term frequency of such a word is weighted strongly.
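The TF-IDF definition above can be applied directly with numpy; the sketch below uses a small, made-up term-frequency matrix. Note that library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing variants of the formula.

```python
import numpy as np

# toy term-frequency matrix: rows = documents, columns = terms
tf = np.array([[1, 1, 1, 1],
               [4, 5, 0, 0],
               [0, 0, 1, 1]])

N = tf.shape[0]                     # total number of documents
df = np.count_nonzero(tf, axis=0)   # df_j: number of documents containing term j
idf = np.log(N / df)                # log(N / df_j)
tfidf = tf * idf                    # tf-idf(i, j) = tf(i, j) * log(N / df_j)

print(np.round(tfidf, 3))
```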

5.1.3. One-Hot-Encoding of Words

In the extreme case of a document which contains only a single word, the corresponding tf-based BoW vector has only one component of value 1 (in the column which belongs to this word); all other entries are zero. This is actually a common conventional numeric encoding of words, the so-called One-Hot-Encoding.
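A One-Hot-Encoding can be generated directly from the position of the word in the vocabulary. The vocabulary and the encoded word in the sketch below are chosen arbitrarily:

```python
import numpy as np

vocab = ["car", "house", "lake", "vehicle"]   # toy vocabulary
word = "lake"

one_hot = np.zeros(len(vocab), dtype=int)     # vector of length |V|
one_hot[vocab.index(word)] = 1                # 1 at the position of the word
print(one_hot)                                # [0 0 1 0]
```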

5.1.4. BoW-based document similarity

Numeric vector representations of documents are not only required for Machine-Learning-based NLP tasks. Another important application category is Information Retrieval (IR). Information Retrieval deals with algorithms and models for searching information in large document collections. Web search, e.g. www.google.com, is only one example of IR. In such document-search applications the user defines a query, usually in terms of one or more words. The task of the IR system is then to

  • return relevant documents, which match the query

  • rank the returned documents, such that the most important is at the top of the search-result

Challenges in this context are:

  • How to deduce what the user actually wants, given only a few query-words?

  • How to calculate the relevance of a document with respect to the query words?

The first question will be addressed further below, when Distributional Semantic Models, in particular Word Embeddings, are introduced. The second question is answered now.

The conventional approach for document search is to

  1. model all documents in the index as numerical vectors, e.g. by BoW

  2. model the query as a numerical vector in the same way as the documents are modelled

  3. determine the most relevant documents by just determining the document-vectors, which have the smallest distance to the query-vector.

This means that the question of relevance is solved by determining the nearest vectors in a vector space. An example is given below. Here we assume that there are only 3 documents in the index and only 4 different words occurring in these documents. Document 2, for example, contains word 1 with a frequency of 4 and word 2 with a frequency of 5. The query consists of word 1 and word 2. The BoW matrix and the attached query vector are:

             word 1   word 2   word 3   word 4
document 1      1        1        1        1
document 2      4        5        0        0
document 3      0        0        1        1
Query           1        1        0        0

Given these vector-representations, it is easy to determine the distances between each document and the query.

The obvious type of distance is the Euclidean Distance: For two vectors, \(\underline{a}=(a_1,\ldots,a_n)\) and \(\underline{b}=(b_1,\ldots,b_n)\), the Euclidean Distance is defined to be

\[ d_E(\underline{a},\underline{b})=\sqrt{\sum_{i=1}^n (a_i-b_i)^2} \]

Similarity and Distance are inverse to each other, i.e. the similarity between vectors increases with decreasing distance and vice versa. For each distance-measure a corresponding similarity-measure can be defined. E.g. the Euclidean-distance-based similarity measure is

\[ s_E(\underline{a},\underline{b})=\frac{1}{1+d_E(\underline{a},\underline{b})} \]

Now let’s determine the Euclidean distance between the query and the 3 documents in the example above:

Euclidean distance between query and document 1:

\[ d_E(\underline{q},\underline{d}_1)=\sqrt{(1-1)^2+(1-1)^2+(1-0)^2+(1-0)^2} = \sqrt{2} = 1.41 \]

Euclidean distance between query and document 2:

\[ d_E(\underline{q},\underline{d}_2)=\sqrt{(4-1)^2+(5-1)^2+(0-0)^2+(0-0)^2} = \sqrt{25} = 5.00 \]

Euclidean distance between query and document 3:

\[ d_E(\underline{q},\underline{d}_3)=\sqrt{(0-1)^2+(0-1)^2+(1-0)^2+(1-0)^2} = \sqrt{4} = 2.00 \]

Comparing these 3 distances, one can conclude that document 1 has the smallest distance (and the highest similarity) to the query and is therefore the best match.
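These three distances can be verified numerically, e.g. with numpy:

```python
import numpy as np

q  = np.array([1, 1, 0, 0])   # query vector
d1 = np.array([1, 1, 1, 1])
d2 = np.array([4, 5, 0, 0])
d3 = np.array([0, 0, 1, 1])

for name, d in [("document 1", d1), ("document 2", d2), ("document 3", d3)]:
    dist = np.linalg.norm(q - d)   # Euclidean distance d_E
    sim = 1 / (1 + dist)           # corresponding similarity s_E
    print(f"{name}: distance = {dist:.2f}, similarity = {sim:.2f}")
```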

Is this what we expect?

No! Document 2 contains the query words not only once but with a much higher frequency. One would expect that this stronger prevalence of the query words implies that document 2 is more relevant.

So what went wrong?

The answer is that the Euclidean distance is simply the wrong distance measure for this type of application. In a query each word typically occurs only once, whereas relevant documents may contain the query words many times. The Euclidean distance therefore penalizes longer documents with higher word counts, even though they may be highly relevant.

The solution to this problem is

  • either normalize all vectors - document vectors and query vector - to unit length,

  • or apply another distance measure

The standard similarity-measure for BoW vectors is the Cosine Similarity \(s_C(\underline{a},\underline{b})\), which is calculated as follows:

\[ s_C(\underline{a},\underline{b})=\frac{\underline{a} \cdot \underline{b}^T}{\left| \left| \underline{a} \right| \right| \ \cdot \ \left| \left| \underline{b} \right| \right|} \]

From the cosine-similarity measure the cosine-distance can be calculated as follows:

\[ d_C(\underline{a},\underline{b})= 1-s_C(\underline{a},\underline{b}). \]

For the query-example above, the Cosine-Similarities are:

Cosine Similarity between query and document 1:

\[ s_C(\underline{q},\underline{d}_1)=\frac{1 \cdot 1 + 1 \cdot 1 + 1 \cdot 0 + 1 \cdot 0}{\sqrt{4} \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} = 0.707 \]

Cosine Similarity between query and document 2:

\[ s_C(\underline{q},\underline{d}_2)=\frac{4 \cdot 1 + 5 \cdot 1 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{41} \cdot \sqrt{2}} = \frac{9}{\sqrt{82}} = 0.994 \]

Cosine Similarity between query and document 3:

\[ s_C(\underline{q},\underline{d}_3)=\frac{0 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 + 1 \cdot 0}{\sqrt{2} \cdot \sqrt{2}} = \frac{0}{2} = 0 \]

These calculated similarities match our subjective expectation: the similarity between document 3 and the query \(q\) is 0 (the lowest possible value), since they have no word in common. The similarity between document 2 and the query \(q\) is close to the maximum similarity value of 1, since both query words appear with a high frequency in this document.
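The cosine similarities can again be checked with a few lines of numpy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity s_C between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([1, 1, 0, 0])
docs = {"document 1": np.array([1, 1, 1, 1]),
        "document 2": np.array([4, 5, 0, 0]),
        "document 3": np.array([0, 0, 1, 1])}

for name, d in docs.items():
    print(f"{name}: cosine similarity = {cosine_similarity(q, d):.3f}")
```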

Another similarity measure is the Pearson Correlation \(s_P(\underline{a},\underline{b})\). The Pearson correlation coefficient measures linearity, i.e. its maximum value of \(s_P(\underline{a},\underline{b})=1\) is obtained, if there is a linear correlation between the two vectors. Pearson correlation is calculated as follows:

\[ s_P(\underline{a},\underline{b})=\frac{\underline{a}_d \cdot \underline{b}_d^T}{\left| \left| \underline{a}_d \right| \right| \ \cdot \ \left| \left| \underline{b}_d \right| \right|}, \]

where

\[ \underline{a}_d=(a_1-\overline{a}, \ a_2-\overline{a}, \ldots , a_n-\overline{a}) \]

and

\[ \underline{b}_d=(b_1-\overline{b}, \ b_2-\overline{b}, \ldots , b_n-\overline{b}) \]

and

\(\overline{a}\) is the mean over the components in \(\underline{a}\) and \(\overline{b}\) is the mean over the components in \(\underline{b}\).

The corresponding distance measure is:

\[ d_P(\underline{a},\underline{b})= 1-s_P(\underline{a},\underline{b}). \]

There exist many more distance and similarity measures, see e.g. scipy.spatial.distance or Distance Measures in Data Science.
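For example, the measures used above are available as functions in scipy.spatial.distance. Since scipy returns distances, the cosine and Pearson similarities are obtained as 1 minus the corresponding distance:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, correlation

q  = np.array([1, 1, 0, 0])
d2 = np.array([4, 5, 0, 0])

print(euclidean(q, d2))        # Euclidean distance d_E
print(1 - cosine(q, d2))       # cosine similarity s_C = 1 - cosine distance
print(1 - correlation(q, d2))  # Pearson correlation s_P = 1 - correlation distance
```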

5.1.5. BoW Drawbacks

The BoW representation of documents and the One-Hot-Encoding of single words, as described above, are methods to map words and documents to numeric vectors, which can be applied as input for arbitrary Machine Learning algorithms. However, these representations suffer from crucial drawbacks:

  1. The vectors are usually very long - their length is given by the number of words in the vocabulary. Moreover, the vectors are quite sparse, since the set of words appearing in one document is usually only a very small part of the set of all words in the vocabulary.

  2. Semantic relations between words are not modelled. This means that in this model there is no information about the fact that word car is more related to word vehicle than to word lake.

  3. In the BoW model of documents word order is totally ignored. E.g. the model cannot distinguish whether the word not appeared immediately before the word good or before the word bad.

All of these drawbacks can be solved by

  • applying Distributional Semantic Models for calculating better vector representations of words, and

  • passing these vector representations of words to the input of neural network architectures such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs) or Transformers (see later chapters of this lecture).

5.2. Distributional Semantic Models

The linguistic theory of distributional semantics is based on the hypothesis that words which occur in similar contexts have similar meaning. J.R. Firth formulated this assumption in his famous sentence [Fir57]:

You shall know a word by the company it keeps

Since computers can easily determine the co-occurrence statistics of words in large corpora, the theory of distributional semantics provides a promising opportunity to automatically learn semantic relations. The learned semantic representations are called Distributional Semantic Models (DSM). They represent each word as a numerical vector, such that words which appear frequently in similar contexts are represented by similar vectors. In the figure below the arrow represents the DSM. Since there are many different DSMs, there are many different approaches to implement this transformation from the hypothesis of distributional semantics to the word vector space.

Mapping the hypothesis of distributional semantics to a word-vector-space

The field of DSMs can be categorized into the classes of count-based models and prediction-based models. Recently, considerable attention has been focused on the question of which of these classes is superior [PSM14].

5.2.1. Count-based DSM

DSMs map words to numeric vectors, such that semantically related words, i.e. words which appear frequently in a similar context, have similar numeric vectors. The first question which arises from this definition is What is context? In all DSMs, introduced in this lecture, the context of a target word \(w\) is considered to be the sequence of \(L\) previous and \(L\) following words, where the context-length \(L\) is a parameter.

I.e. in the word-sequence

\[ \ldots,w_{i-L},\ldots,w_{i-1},w_i,w_{i+1},\ldots,w_{i+L},\ldots \]

the words \(w_{i-L},\ldots,w_{i-1},w_{i+1},\ldots,w_{i+L}\) constitute the context of target word \(w_i\) and

\[ \left\{(w_i,w_{i+j})\right\}, \, j \in \{-L,\ldots,-1,1,\ldots,L\} \]

is the corresponding set of word-context-pairs w.r.t. target word \(w_i\).

The most common numeric vector representation of count-based DSMs can be derived from the word-co-occurrence matrix. In this matrix each row belongs to a target word and each column belongs to a context word. Hence, the matrix usually has \(\mid V \mid\) rows and the same number of columns [1]. The entry in row \(i\), column \(j\) is the number of times context word \(c_j\) appears in the context of target word \(w_i\). This frequency is denoted by \(\#(w_i,c_j)\). The structure of such a word-co-occurrence matrix is given below:

Word-Co-Occurrence Matrix

In order to determine this matrix, a large corpus of contiguous text is required. Then for all target-context pairs the corresponding count \(\#(w_i,c_j)\) must be determined. Once the matrix is complete, the numeric vector representation of word \(w_i\) is just the i-th row of this matrix.
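A minimal sketch of how such a word-co-occurrence matrix can be counted from a tokenised corpus is given below; the toy corpus and the context length \(L=2\) are chosen arbitrarily:

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()   # toy corpus, already tokenised
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
L = 2                                                  # context length

C = np.zeros((len(vocab), len(vocab)), dtype=int)      # co-occurrence counts #(w_i, c_j)
for i, w in enumerate(corpus):
    for j in range(max(0, i - L), min(len(corpus), i + L + 1)):
        if j != i:
            C[index[w], index[corpus[j]]] += 1

print(vocab)
print(C)          # row i is the count-based DSM vector of word vocab[i]
```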

With respect to the drawbacks of BoW vectors mentioned in subsection BoW Drawbacks, the count-based DSM vectors provide the important advantage of modelling semantic relations between words: pairs of semantically related words are closer to each other than unrelated words. However, the count-based DSM vectors, as introduced so far, still suffer from the drawback of long and sparse vectors. This drawback can be eliminated by applying a dimensionality reduction such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), which are introduced later in this lecture.

5.2.1.1. Variants of count-based DSMs

Count-based DSMs differ in the following parameters:

  • Context-Modelling: In any case the rows in the word-context matrix are the vector representations of the corresponding words, i.e. each row uniquely corresponds to a word. The columns describe the context, but different types of context can be considered, e.g. the context can be defined by the previous words only, or by the previous and following words. Moreover, the window size, which defines the number of surrounding words considered to be within the context, is an important parameter.

  • Preprocessing: Different types of preprocessing, e.g. stop-word filtering, high-frequency cut-off, normalisation or lemmatization, can be applied to the given corpus. Different preprocessing techniques yield different word-context matrices.

  • Weighting Scheme: The entries \(w_{ij}\) in the word-context matrix somehow measure the association between target word \(i\) and context word \(j\). In the simplest case \(w_{ij}\) is just the frequency of word \(j\) in the context of word \(i\). However, many different alternatives for defining the entries of the word-co-occurrence matrix exist; a PPMI computation is sketched in the code example after this list. For example

    • the conditional probability \(P(c_j|w_i)\), which is determined by

    \[ P(c_j|w_i)=\frac{\#(w_i,c_j)}{\#(w_i)}, \]
    • the pointwise mutual information (PMI) of \(w_i\) and \(c_j\)

    \[ PMI(w_i,c_j)=\log_2 \frac{P(w_i,c_j)}{P(w_i)P(c_j)} = \log_2 \frac{\#(w_i,c_j) |D|}{\#(w_i) \#(c_j)}, \]

    where \(D\) is the training corpus and \(\mid D \mid\) is the number of words in this corpus,

    • the positive pointwise mutual information (PPMI):

      \[ PPMI(w_i,c_j) = \max\left(PMI(w_i,c_j),0 \right). \]
  • Dimensionality Reduction: Depending on the definition of context the word vectors may be very sparse. Transforming these sparse vectors to a lower dimensional space yields a lower complexity, but may also yield better generalisation of the model. Different dimensionality reduction schemes such as PCA, SVD or autoencoder yield different word-vector spaces.

  • Size of word vectors: The size of the word vectors depends on context modelling, preprocessing and dimensionality reduction.

  • Similarity-/Distance-Measure: In count-based word spaces, different metrics to measure the similarity between the word vectors can be applied, e.g. cosine similarity or Hellinger distance. The performance of the word-space model strongly depends on the applied similarity measure.
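As announced above, the following sketch converts a raw word-co-occurrence count matrix into a PPMI matrix. It assumes a count matrix C as computed in the co-occurrence example above and uses base-2 logarithms as in the PMI definition:

```python
import numpy as np

def ppmi(C):
    """Convert a raw word-co-occurrence count matrix into a PPMI matrix."""
    total = C.sum()
    p_wc = C / total                          # joint probabilities P(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)     # marginals P(w_i)
    p_c = p_wc.sum(axis=0, keepdims=True)     # marginals P(c_j)
    with np.errstate(divide="ignore"):        # log2(0) occurs for zero counts
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)                 # PPMI = max(PMI, 0)

# Example with the co-occurrence matrix C from the previous sketch:
# print(np.round(ppmi(C), 2))
```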

5.2.2. Prediction-based DSM

In 2013 Mikolov et al. published their milestone paper Efficient Estimation of Word Representations in Vector Space [MSC+13]. They proposed quite simple neural network architectures to efficiently create DSM word embeddings: CBOW and Skip-Gram. These architectures are better known as Word2Vec. In both techniques a neural network is trained on a pseudo-task. After training, the network itself is usually not of interest. However, the learned weights of the input layer constitute the word embeddings, which can then be applied to a large field of NLP tasks, e.g. document classification.

5.2.2.1. Continuous Bag-Of-Words (CBOW)

The idea of CBOW is to predict the target word \(w_i\), given the \(N\) context words \(w_{i-N/2},\ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+N/2}\). In order to learn such a predictor a large but unlabeled corpus is required. The extraction of training samples from a corpus is sketched in the picture below:

CBOW Training Data

In this example a context length of \(N=4\) has been applied. The first training-element consists of

  • the \(N=4\) input-words (happy,families,all,alike)

  • the target word are.

In order to obtain the second training sample, the window of length \(N+1\) is just shifted one position to the right. The concrete architecture for CBOW is shown in the picture below. At the input the \(N\) context words are one-hot-encoded. The fully-connected projection layer maps the context words to a vector representation of the context. This vector representation is the input of a softmax output layer. The output layer has as many neurons as there are words in the vocabulary \(V\). Each neuron uniquely corresponds to a word of the vocabulary and outputs an estimate of the probability that this word appears as target for the current context words at the input.

CBOW: the neural network is trained to predict likelihoods of possible target words, given the context words at the input of the network. At the input the context words are represented in one-hot-encoded form. The learned word vector is made up of the weights of the connections from the one-hot-encoded word to the linear projection layer.

After training the CBOW network, the vector representation of word \(w\) consists of the weights from the one-hot-encoded word \(w\) at the input of the network to the neurons in the projection layer. I.e. the number of neurons in the projection layer defines the length of the word embedding.
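The extraction of CBOW training samples, as depicted above, can be sketched as follows; the toy sentence corresponds to the example above and \(N=4\):

```python
corpus = "happy families are all alike every unhappy family is unhappy in its own way".split()
N = 4                                   # number of context words (N/2 left, N/2 right)

samples = []
for i in range(N // 2, len(corpus) - N // 2):
    context = corpus[i - N // 2 : i] + corpus[i + 1 : i + N // 2 + 1]
    target = corpus[i]
    samples.append((context, target))

print(samples[0])   # (['happy', 'families', 'all', 'alike'], 'are')
```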

5.2.2.2. Skip-Gram

Skip-Gram is similar to CBOW, but with a reversed prediction task: for a given target word at the input, the Skip-Gram model predicts words which are likely to occur in the context of this target word. Again, the context is defined by the \(N\) neighbouring words. The extraction of training samples from a corpus is sketched in the picture below:

Skipgram Training Data

Again a context length of \(N=4\) has been applied. The first training-element consists of

  • the first target word (happy) as input to the network

  • the first context word (families) as network-output.

The concrete architecture for Skip-Gram is shown in the picture below. At the input the target word is one-hot-encoded. The fully-connected projection layer outputs the current vector representation of the target word. This vector representation is the input of a softmax output layer. The output layer has as many neurons as there are words in the vocabulary \(V\). Each neuron uniquely corresponds to a word of the vocabulary and outputs an estimate of the probability that this word appears in the context of the current target word at the input.

Skipgram: the neural network is trained to predict likelihoods of possible context words, given the target word at the input of the network. At the input the target word is represented in one-hot-encoded form. The learned word vector is made up of the weights of the connections from the one-hot-encoded word to the linear projection layer.
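In practice CBOW and Skip-Gram are rarely implemented from scratch; libraries such as gensim provide both architectures. The following usage sketch assumes gensim version 4.x (parameter names differ in older versions) and a tiny toy corpus of two tokenised sentences:

```python
from gensim.models import Word2Vec

sentences = [
    "happy families are all alike".split(),
    "every unhappy family is unhappy in its own way".split(),
]   # in practice: a large tokenised corpus

model = Word2Vec(
    sentences,
    vector_size=100,   # length of the word embeddings (size of the projection layer)
    window=2,          # context length L
    min_count=1,
    sg=1,              # sg=1: Skip-Gram, sg=0: CBOW
)

print(model.wv["happy"])                # learned embedding of the word "happy"
print(model.wv.most_similar("happy"))   # nearest neighbours in the embedding space
```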

5.2.2.3. GloVe

In light of the two different approaches to DSMs, count-based and prediction-based models, Pennington et al. developed in [PSM14] an approach called Global Vectors (GloVe), which claims to combine the advantages of both DSM types. The advantage of count-based models is that they capture global co-occurrence statistics. Prediction-based models do not operate on the global co-occurrence statistics, but scan context windows across the entire corpus. On the other hand, prediction-based models have demonstrated in many evaluations that they are capable of learning linguistic patterns as linear relationships between the word vectors, indicating a vector space structure which reflects linguistic semantics.

GloVe integrates the advantages of both approaches by minimizing a weighted least-squares loss function, which contains the difference between the scalar product of a word vector \(w_i\) and a context vector \(\tilde{w}_j\) and the logarithm of the word-co-occurrence-matrix entry \(\#(w_i,c_j)\) (see [PSM14] for more details).
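Written out, and denoting the co-occurrence count \(\#(w_i,c_j)\) by \(X_{ij}\) as in [PSM14], this loss has the form

\[ J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \]

where \(b_i\) and \(\tilde{b}_j\) are bias terms and \(f\) is a weighting function which damps the influence of very rare and very frequent co-occurrences (see [PSM14] for its exact definition).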

5.2.2.4. FastText

Another word embedding model is fastText from Facebook, which was introduced in 2017 [BGJM16] [2]. fastText is based on the Skip-Gram architecture, but instead of using the words themselves at the input, it applies character sequences of length \(n\) (n-grams on character level). For example, for \(n=3\) the word fastText would be represented by the following set of character-level 3-grams:

\[ fas, ast, stT, tTe, Tex, ext \]
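This decomposition can be reproduced with a small helper function. Note that the original fastText implementation additionally adds the boundary symbols < and > to each word and uses several n-gram lengths, which is omitted here to match the simplified example:

```python
def char_ngrams(word, n=3):
    """Return all character n-grams of a word (no boundary symbols)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("fastText"))   # ['fas', 'ast', 'stT', 'tTe', 'Tex', 'ext']
```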

The advantages of this approach over word embeddings that work on word level are:

  • The morphology of words is taken into account, which is especially important in languages with large vocabularies and many rare words - such as German.

  • Prefixes, suffixes and compound words can be better understood

  • for words which do not appear in the training data, it is likely that sub-n-grams are available

5.2.3. DSM Downstream tasks

By applying a DSM one can reliably determine, for a given query word, a set of semantically or syntactically related words by nearest-neighbour search in the vector space. In the picture below (Source: [CWB+11]), for some query words (in the topmost row of the table), the words whose vectors are closest to the vector of the query word are listed. For example, the word vectors which are closest to the vector of France are those of Austria, Belgium, Germany, …

Each column contains the nearest neighbours of the topmost word of the column

Moreover, as shown in the picture below (Source: [MSC+13]), also semantic relations between pairs of words can be determined by subtracting one vector from the other. For example, subtracting the vector of man from the vector of woman yields another vector, which represents the female-male relation. Adding this vector to the vector of king results in (approximately) the vector of queen. Hence word embeddings can solve questions like man is to woman as king is to ? [3].

Prediction-based DSM are also able to model relations.
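Such analogy questions can be posed directly to pre-trained embeddings, e.g. via gensim's most_similar() method. The sketch below loads pre-trained Google-News word2vec vectors via gensim's downloader module (a large download) and is meant as an illustration only:

```python
import gensim.downloader as api

# load pre-trained word2vec vectors (large download, approx. 1.6 GB)
wv = api.load("word2vec-google-news-300")

# "man is to woman as king is to ?"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# for well-trained embeddings the top answer is typically 'queen'
```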

The ability to model semantic similarity and semantic relations has made Word Embeddings an essential building block for many NLP applications (downstream tasks) [Bak18], e.g.

  • Noun Phrase Chunking

  • Named Entity Recognition (NER)

  • Sentiment Analysis: Determine sentiment in sentences and text

  • Syntax Parsing: Determine the syntax tree of a sentence

  • Semantic Role Labeling (SRL): see e.g. [JM09] for a definition of semantic roles

  • Negation Scope Detection

  • POS-Tagging

  • Text Classification

  • Metaphor Detection

  • Paraphrase Detection

  • Textual Entailment Detection: Determine whether a text part is an entailment of another

  • Automatic Translation

All in all, since their breakthrough in 2013 ([MSC+13]), prediction-based Word Embeddings have revolutionized many NLP applications.


[1] One may have different vocabularies for target words (\(V_W\)) and context words (\(V_C\)). However, often they coincide, i.e. \(V = V_W = V_C\).

[2] Actually, fastText was already introduced in 2016, but as a text classifier rather than a word embedding. In 2017 it was adapted as a word embedding.

[3] It has been shown that such relations can be determined by prediction-based DSMs, but not with conventional count-based DSMs.