# Implementation of BoWs

- Author:      Johannes Maucher
- Last update: 05.11.2020

This notebook demonstrates how documents can be represented in a vector space model. With this type of model,

1. similarities between documents 
2. similarities between documents and a query 

can easily be calculated.

## Read documents from a textfile 

It is assumed that a set of documents is stored in a text file, e.g. in [MultiNewsFeeds2014-09-12.txt](../Data/MultiNewsFeeds2014-09-12.txt). The individual documents are separated by line breaks. In this case the documents can be read into the list _listOfNews_ as follows:

In [23]:
filename="../Data/MultiNewsFeeds2014-09-12.txt"
#filename="../Data/MultiNewsFeeds2016-10-14.txt"
listOfNews=[]
with open(filename,"r",encoding="utf-8") as fin:
    for line in fin:
        line = line.strip()
        print(line)
        listOfNews.append(line)
print("Number of Lines:  ",len(listOfNews)) # no explicit close() required: the with-block has already closed the file

﻿Kommunikation: Gut kommunizieren macht glücklich    Wie kaum ein anderer hat der Psychologe Friedemann Schulz von Thun untersucht, was Kommunikation ausmacht. In seinem neuen Buch entwickelt er eine Quintessenz davon.
Ukraine-Krise: Bundesregierung toleriert ukrainischen Mauerbau    Im Kanzleramt hat man Verständnis für die Pläne der ukrainischen Regierung, eine Mauer entlang der Grenze zu Russland zu bauen. Dies sei allein Entscheidung der Ukraine.
Radfahren: Eine Deutsche bringt die Chinesen aufs Fixie    Einst war Peking Fahrrad-Welthauptstadt. Dann kamen die Autos. Ines Brunn kämpft dagegen: Die Inhaberin eines Fixie-Shops will die Chinesen wieder aufs Rad bringen.
Zurück in die Wildnis : Jenseits allen Komforts    Drei Jahre lang suchte der Fotograf Antoine Bruy Aussteiger in Europa. Seine Bilder zeigen Menschen, die der Konsum- und Leistungsgesellschaft den Rücken gekehrt haben.
Antisemitismus: Zentralrat beklagt "Schockwellen von Judenhass"    Judenfeindliche Hetze wie bei Prot
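
The first line of the output above starts with an invisible byte order mark (BOM, `\ufeff`), which `strip()` does not remove because it is not whitespace. The following is a minimal sketch of one way to avoid this, assuming the file is UTF-8 encoded: the codec `utf-8-sig` also reads plain UTF-8, but drops a leading BOM. The helper name `read_documents` and the temporary demo file are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def read_documents(path):
    """Return one stripped, non-empty document per line of the file."""
    # "utf-8-sig" decodes plain UTF-8 as well, but removes a leading BOM if present
    with open(path, "r", encoding="utf-8-sig") as fin:
        return [line.strip() for line in fin if line.strip()]

# Demo on a small temporary file that is written with a BOM (hypothetical data)
with tempfile.NamedTemporaryFile("w", encoding="utf-8-sig", suffix=".txt", delete=False) as tmp:
    tmp.write("Kommunikation: Gut kommunizieren\nUkraine-Krise: Mauerbau\n")
    demo_path = tmp.name

demo_docs = read_documents(demo_path)
os.remove(demo_path)
print(demo_docs)
```

Note that the remaining cells of this notebook were executed on the list read with `encoding="utf-8"`, so their printed outputs still contain the BOM-prefixed token.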

## Split documents into words, normalize and remove stopwords
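In _listOfNews_ each document is stored as a single string. Each of these document strings is now split into a list of words. All words are converted to lower case, surrounding punctuation is stripped and stop-words are removed. Note that a token which consists only of punctuation (e.g. a stand-alone colon) is reduced to an empty string. This is why an empty token shows up in document 3 below.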

In [24]:
from nltk.corpus import stopwords
stopwordlist=stopwords.words('german')
docWords = [[word.strip('?!.:",') for word in document.lower().split() 
             if word.strip('?!.:",') not in stopwordlist] for document in listOfNews]
#print(docWords)

Display the list of words of the first 5 documents:

In [25]:
idx=0
for doc in docWords[:5]:
    print('------ document %d ----------'%idx)
    for d in doc:
        print(d)
    idx+=1

------ document 0 ----------
﻿kommunikation
gut
kommunizieren
macht
glücklich
kaum
psychologe
friedemann
schulz
thun
untersucht
kommunikation
ausmacht
neuen
buch
entwickelt
quintessenz
davon
------ document 1 ----------
ukraine-krise
bundesregierung
toleriert
ukrainischen
mauerbau
kanzleramt
verständnis
pläne
ukrainischen
regierung
mauer
entlang
grenze
russland
bauen
sei
allein
entscheidung
ukraine
------ document 2 ----------
radfahren
deutsche
bringt
chinesen
aufs
fixie
einst
peking
fahrrad-welthauptstadt
kamen
autos
ines
brunn
kämpft
dagegen
inhaberin
fixie-shops
chinesen
aufs
rad
bringen
------ document 3 ----------
zurück
wildnis

jenseits
komforts
drei
jahre
lang
suchte
fotograf
antoine
bruy
aussteiger
europa
bilder
zeigen
menschen
konsum-
leistungsgesellschaft
rücken
gekehrt
------ document 4 ----------
antisemitismus
zentralrat
beklagt
schockwellen
judenhass
judenfeindliche
hetze
protesten
israel
juden
tief
getroffen
sagte
vorsitzende
zentralrats
kundgebung
zeichen
setzen
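
The empty token in document 3 above can be avoided by discarding words that are empty after punctuation has been stripped. The following is a minimal sketch of such a tokenizer. The function name `tokenize` and the tiny stopword set `DEMO_STOPWORDS` are assumptions made to keep the example self-contained; in the notebook the German stopword list from nltk would be passed instead.

```python
PUNCTUATION = '?!.:",'
# Tiny stand-in for the nltk stopword list (assumption, for demonstration only)
DEMO_STOPWORDS = {"in", "die", "der", "und"}

def tokenize(document, stopwords=DEMO_STOPWORDS):
    """Lower-case, split on whitespace, strip punctuation, drop stopwords and empty tokens."""
    tokens = []
    for raw in document.lower().split():
        word = raw.strip(PUNCTUATION)
        # `word` is empty if the raw token consisted only of punctuation
        if word and word not in stopwords:
            tokens.append(word)
    return tokens

demo_tokens = tokenize("Zurück in die Wildnis : Jenseits allen Komforts")
print(demo_tokens)
```

With this variant the dictionary built in the next section would no longer contain an entry for the empty string, so the word IDs would differ from the ones printed in this notebook.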


## Generate Dictionary
The elements of the list _docWords_ are themselves lists. Each of these lists contains all relevant words of a document. The set of all relevant words in the document collection, i.e. the relevant words which appear in at least one document, is stored in a [gensim-dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html). In the dictionary a unique integer ID is assigned to each of the relevant words:

In [26]:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(docWords)
dictionary.save('multiNews.dict') # store the dictionary, for future reference
print(dictionary.token2id)

{'ausmacht': 0, 'buch': 1, 'davon': 2, 'entwickelt': 3, 'friedemann': 4, 'glücklich': 5, 'gut': 6, 'kaum': 7, 'kommunikation': 8, 'kommunizieren': 9, 'macht': 10, 'neuen': 11, 'psychologe': 12, 'quintessenz': 13, 'schulz': 14, 'thun': 15, 'untersucht': 16, '\ufeffkommunikation': 17, 'allein': 18, 'bauen': 19, 'bundesregierung': 20, 'entlang': 21, 'entscheidung': 22, 'grenze': 23, 'kanzleramt': 24, 'mauer': 25, 'mauerbau': 26, 'pläne': 27, 'regierung': 28, 'russland': 29, 'sei': 30, 'toleriert': 31, 'ukraine': 32, 'ukraine-krise': 33, 'ukrainischen': 34, 'verständnis': 35, 'aufs': 36, 'autos': 37, 'bringen': 38, 'bringt': 39, 'brunn': 40, 'chinesen': 41, 'dagegen': 42, 'deutsche': 43, 'einst': 44, 'fahrrad-welthauptstadt': 45, 'fixie': 46, 'fixie-shops': 47, 'ines': 48, 'inhaberin': 49, 'kamen': 50, 'kämpft': 51, 'peking': 52, 'rad': 53, 'radfahren': 54, '': 55, 'antoine': 56, 'aussteiger': 57, 'bilder': 58, 'bruy': 59, 'drei': 60, 'europa': 61, 'fotograf': 62, 'gekehrt': 63, 'jahre': 6

In [27]:
print("Total number of documents in the dictionary: ",dictionary.num_docs)
print("Total number of corpus positions: ",dictionary.num_pos)
print("Total number of non-zeros in the BoW-Matrix: ",dictionary.num_nnz)
print("Total number of different words in the dictionary: ",len(dictionary))

Total number of documents in the dictionary:  48
Total number of corpus positions:  975
Total number of non-zeros in the BoW-Matrix:  914
Total number of different words in the dictionary:  811
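
To make the meaning of these four numbers explicit, the following sketch recomputes them in plain Python for a small, hypothetical collection of tokenized documents. The function `dictionary_stats` is not part of gensim. It assigns IDs in order of first occurrence, so the concrete IDs may differ from those gensim assigns, but the counts follow the same definitions: corpus positions are all processed tokens, non-zeros are the distinct words per document summed over all documents.

```python
def dictionary_stats(tokenized_docs):
    """Compute word IDs and the corpus statistics discussed above."""
    token2id = {}
    num_pos = 0   # total number of tokens in all documents
    num_nnz = 0   # sum over documents of the number of distinct words
    for doc in tokenized_docs:
        num_pos += len(doc)
        num_nnz += len(set(doc))
        for token in doc:
            # assign the next free ID to a word that has not been seen before
            token2id.setdefault(token, len(token2id))
    return {
        "token2id": token2id,
        "num_docs": len(tokenized_docs),
        "num_pos": num_pos,
        "num_nnz": num_nnz,
    }

# Hypothetical mini-corpus of two tokenized documents
demo_stats = dictionary_stats([["ukraine", "mauer", "ukraine"], ["mauer", "grenze"]])
print(demo_stats)
```

In the demo, 5 tokens are processed, but only 4 entries of the 2 x 3 BoW matrix are non-zero, because _ukraine_ occurs twice in the first document.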


## Bag of Words (BoW) representation
Now arbitrary text-strings can be efficiently represented with respect to this dictionary. E.g. the code snippet below demonstrates how the text string _"putin beschützt russen"_ is represented as a list of tuples. The first element of such a tuple is the dictionary index of a word in the text-string and the second element defines how often this word occurs in the text-string. The list contains only tuples for words which occur both in the text-string and in the dictionary. This representation is called **sparse Bag of Words** representation (sparse because it contains only the non-zero elements).

In [28]:
newDoc = "putin beschützt russen"
newVec = dictionary.doc2bow(newDoc.lower().split())
print("Sparse BoW representation of %s: %s"%(newDoc,newVec))
for idx,freq in newVec:
    print("Index %d refers to word %s. Frequency of this word in the document is %d"%(idx,dictionary[idx],freq))

Sparse BoW representation of putin beschützt russen: [(189, 1), (397, 1), (729, 1)]
Index 189 refers to word beschützt. Frequency of this word in the document is 1
Index 397 refers to word putin. Frequency of this word in the document is 1
Index 729 refers to word russen. Frequency of this word in the document is 1


From this output we infer that

* the word at index 189 in the dictionary, which is _beschützt_ , appears once in the text-string _newDoc_
* the word at index 397 in the dictionary, which is _putin_ , appears once in the text-string _newDoc_
* the word at index 729 in the dictionary, which is _russen_ , appears once in the text-string _newDoc_ .

The text-string _"schottland stimmt ab"_ is represented as a list of 2 tuples (see code snippet below). The first tuple says that the word at index 229 (_ab_) appears once, the second tuple says that the word at index 807 (_schottland_) also appears once in the text-string. Since the word _stimmt_ does not appear in the dictionary, there is no corresponding tuple for this word in the list.


In [29]:
newDoc2 = "schottland stimmt ab"
newVec2 = dictionary.doc2bow(newDoc2.lower().split())
print("Sparse BoW representation of %s: %s"%(newDoc2,newVec2))
for idx,freq in newVec2:
    print("Index %d refers to word %s. Frequency of this word in the document is %d"%(idx,dictionary[idx],freq))

Sparse BoW representation of schottland stimmt ab: [(229, 1), (807, 1)]
Index 229 refers to word ab. Frequency of this word in the document is 1
Index 807 refers to word schottland. Frequency of this word in the document is 1
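
The behaviour observed above, in particular that unknown words are silently dropped, can be sketched in a few lines of plain Python. The function `doc2bow_sketch` is a hypothetical re-implementation for illustration, not the gensim method. The two word IDs are copied from the output above.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Return a sorted list of (word ID, frequency) tuples for all known tokens."""
    # words which are not contained in the dictionary are ignored
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Excerpt of the dictionary: IDs as printed above, 'stimmt' is not contained
demo_token2id = {"ab": 229, "schottland": 807}
demo_bow = doc2bow_sketch("schottland stimmt ab".lower().split(), demo_token2id)
print(demo_bow)  # [(229, 1), (807, 1)]
```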


## Efficient Corpus Representation
A corpus is a collection of documents. Such corpora may be annotated with meta-information, e.g. each word is tagged with its part of speech (POS-Tag). In this notebook, the list _docWords_ is a corpus without any annotations. So far this corpus has been used to build the dictionary. In practical NLP tasks corpora are usually very large and therefore require an efficient representation. Using the already generated dictionary, each document (list of relevant words in a document) in the list _docWords_ can be transformed to its sparse BoW representation.

In [30]:
corpus = [dictionary.doc2bow(doc) for doc in docWords]
corpora.MmCorpus.serialize('multiNews.mm', corpus)
print("------------------------- First 10 documents of the corpus ---------------------------------")
idx=0
for d in corpus[0:10]:
    print("-------------document %d ---------------" %idx)
    print(d)
    idx+=1

------------------------- First 10 documents of the corpus ---------------------------------
-------------document 0 ---------------
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
-------------document 1 ---------------
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1)]
-------------document 2 ---------------
[(36, 2), (37, 1), (38, 1), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1)]
-------------document 3 ---------------
[(55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1)]
-------------document 4 ---------------
[(76, 1), (77, 1), (78, 1), (

## Similarity Analysis
Typical information retrieval tasks include determining similarities between the documents of a collection or between a query and a collection of documents. gensim supports fast similarity calculation and search. For this, a **cosine-similarity index** of the given corpus is first calculated as follows:

In [31]:
index = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary))

Now, assume that for a given query, e.g. _"putin beschützt russen"_ the best matching document in the corpus must be determined. The sparse BoW representation of this query has already been calculated and stored in the variable _newVec_. The similarity between this query and all documents in the corpus can be calculated as follows:

In [32]:
sims = index[newVec]
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.13608277), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.12309149), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.0), (29, 0.0), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.0), (41, 0.0), (42, 0.0), (43, 0.23570226), (44, 0.0), (45, 0.0), (46, 0.0), (47, 0.0)]


The tuples in this output contain as first element the index of the document in the corpus. The second element is the cosine-similarity between this corpus-document and the query. 
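
For reference, the cosine-similarity between two sparse BoW vectors $a$ and $b$ is $\frac{a \cdot b}{\|a\| \, \|b\|}$. The following sketch computes it directly on lists of `(index, weight)` tuples. The function `sparse_cosine` and the document vector `hypothetical_doc` are assumptions for illustration; the vector is not one of the documents of the corpus.

```python
import math

def sparse_cosine(vec_a, vec_b):
    """Cosine-similarity of two sparse vectors given as lists of (index, weight) tuples."""
    a, b = dict(vec_a), dict(vec_b)
    # only indices which occur in both vectors contribute to the dot product
    dot = sum(weight * b[idx] for idx, weight in a.items() if idx in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [(189, 1), (397, 1), (729, 1)]               # sparse BoW of the query from above
hypothetical_doc = [(5, 1), (6, 1), (7, 1), (397, 1)]  # made-up document sharing one word
demo_sim = sparse_cosine(query, hypothetical_doc)
print(demo_sim)
```

Here exactly one word is shared, so the similarity is $1/(\sqrt{3}\cdot\sqrt{4})$.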

In order to get the document indices sorted by increasing similarity, the `argsort()`-method can be applied as shown below. The last value in this list is the index of the most similar document:

In [33]:
print(sims.argsort())

[ 0 25 26 27 28 29 30 31 32 33 24 34 36 37 38 39 40 41 42 44 45 35 46 47
 21  1  2  3  4  5  6  7  8  9 22 10 12 13 14 15 16 17 18 19 20 23 11 43]


In this example _document 43_ matches the query best. The cosine-similarity between the query and _document 43_ is _0.2357_.
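
Often only the few best matches are of interest. A minimal sketch, assuming a NumPy array of similarity values like the one returned by the index: sorting the negated values yields the indices in order of decreasing similarity. The array `sims_demo` is a short hypothetical example, not the similarity vector from above.

```python
import numpy as np

# Hypothetical similarity values of a query to 5 documents
sims_demo = np.array([0.0, 0.13608277, 0.0, 0.12309149, 0.23570226])

top_k = 3
# argsort of the negated values sorts by decreasing similarity
best = np.argsort(-sims_demo)[:top_k]
for doc_idx in best:
    print("document %d: similarity %.4f" % (doc_idx, sims_demo[doc_idx]))
```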

Question: Manually verify the calculated similarity value between the query and _document 43_.

In the same way the similarity between documents in the corpus can be calculated. E.g. the similarity between _document 1_ and all other documents in the corpus is determined as follows:

In [34]:
sims = index[corpus[1]]
print((list(enumerate(sims))))
print(sims.argsort())

[(0, 0.0), (1, 1.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.05143445), (14, 0.0), (15, 0.15877683), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.044543542), (27, 0.0727393), (28, 0.0), (29, 0.0), (30, 0.05006262), (31, 0.048795003), (32, 0.0), (33, 0.04761905), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.04364358), (41, 0.045501575), (42, 0.0), (43, 0.0), (44, 0.0), (45, 0.0), (46, 0.0), (47, 0.0)]
[ 0 22 46 24 25 28 29 32 34 35 36 37 38 39 42 43 44 45 21 20 23 18  2  3
  4  5  6  7  8  9 19 11 10 12 14 16 17 47 40 26 41 33 31 30 13 27 15  1]


Thus _document 15_ is the document most similar to _document 1_. As can easily be verified, both documents refer to the same topic (the crisis in Ukraine).

## TF-IDF representation
So far, the BoW representation of the documents has been based on the _term frequency (tf)_. This value measures how often a term (word) appears in a document. If document similarity is calculated on such a tf-based BoW representation, common words, which appear in many documents but carry little semantic content, have a strong impact on the similarity value. In most cases this is a drawback, since similarity should be based on terms with a high semantic focus. Such semantically meaningful words usually appear in only a few documents. The _term frequency-inverse document frequency (tf-idf)_ measure does not only count the frequency of a term in a document, but also assigns a higher weight to terms which occur in only a few documents of the corpus.

In _gensim_ the _tf-idf_ model of a corpus can be calculated as follows:

In [35]:
tfidf = models.TfidfModel(corpus)

The _tf-idf_-representations of the first 3 documents in the corpus are:

In [36]:
idx=0
for d in corpus[:3]:
    print("-------------tf-idf BoW of document %d ---------------" %idx)
    print(tfidf[d])
    idx+=1

-------------tf-idf BoW of document 0 ---------------
[(0, 0.24353425982511087), (1, 0.1999289070955868), (2, 0.24353425982511087), (3, 0.24353425982511087), (4, 0.24353425982511087), (5, 0.24353425982511087), (6, 0.24353425982511087), (7, 0.24353425982511087), (8, 0.24353425982511087), (9, 0.24353425982511087), (10, 0.1999289070955868), (11, 0.1744214109180963), (12, 0.24353425982511087), (13, 0.24353425982511087), (14, 0.24353425982511087), (15, 0.24353425982511087), (16, 0.24353425982511087), (17, 0.24353425982511087)]
-------------tf-idf BoW of document 1 ---------------
[(18, 0.23522128928414127), (19, 0.23522128928414127), (20, 0.16846758720673122), (21, 0.23522128928414127), (22, 0.23522128928414127), (23, 0.23522128928414127), (24, 0.23522128928414127), (25, 0.23522128928414127), (26, 0.23522128928414127), (27, 0.19310439248245848), (28, 0.19310439248245848), (29, 0.16846758720673122), (30, 0.16846758720673122), (31, 0.23522128928414127), (32, 0.19310439248245848), (33, 0.16846

In this representation the second element in the tuples is not the term frequency, but the _tf-idf_ value. Note that the default configuration of [tf-idf in gensim](http://radimrehurek.com/gensim/models/tfidfmodel.html) calculates tf-idf values such that each document-vector has a norm of _1_. The tf-idf model without normalization is generated at the end of this section.
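
The normalization can be reproduced by hand. The sketch below assumes the weighting $tf \cdot \log_2(N/df)$ followed by a division by the Euclidean norm of the document vector. The function `gensim_style_tfidf` is a hypothetical helper, not a gensim function. The document frequencies for document 0 are assumptions (most words occur in 1 document, the words with IDs 1 and 10 in 2 documents and the word with ID 11 in 3 documents), chosen because they are consistent with the values printed above.

```python
import math

def gensim_style_tfidf(term_freqs, doc_freqs, num_docs, normalize=True):
    """tf * log2(N/df) per term ID, optionally scaled to unit Euclidean length."""
    weights = {t: tf * math.log2(num_docs / doc_freqs[t]) for t, tf in term_freqs.items()}
    if normalize:
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
    return weights

# Document 0: 18 different words, each with term frequency 1
term_freqs = {t: 1 for t in range(18)}
# Assumed document frequencies (see text above)
doc_freqs = {t: 1 for t in range(18)}
doc_freqs[1] = 2
doc_freqs[10] = 2
doc_freqs[11] = 3

weights = gensim_style_tfidf(term_freqs, doc_freqs, num_docs=48)
print(round(weights[0], 4), round(weights[1], 4), round(weights[11], 4))  # 0.2435 0.1999 0.1744
```

The three printed values agree with the entries for the word IDs 0, 1 and 11 in the tf-idf BoW of document 0.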

Question: Find the maximum tf-idf value in these 3 documents. To which word does this maximum value belong? How often does this word occur in the document?

The _tf-idf_-representation of the text-string _"putin beschützt russen"_ is determined as follows:

In [37]:
newVecTfIdf = tfidf[newVec]
print("tf BoW representation of %s is:\n %s"%(newDoc,newVec))
print("tf-idf BoW representation of %s is:\n %s"%(newDoc,newVecTfIdf))

tf BoW representation of putin beschützt russen is:
 [(189, 1), (397, 1), (729, 1)]
tf-idf BoW representation of putin beschützt russen is:
 [(189, 0.6115372747558391), (397, 0.5020401609118563), (729, 0.6115372747558391)]


Question: Explain the different values in the tfidf BoW representation _newVecTfIdf_. 

**TF-IDF-Model without normalization:**

In [38]:
tfidfnoNorm = models.TfidfModel(corpus,normalize=False)

Display tf-idf BoW of first 3 documents:

In [40]:
idx=0
for d in corpus[:3]:
    print("-------------tf-idf BoW of document %d ---------------" %idx)
    print(tfidfnoNorm[d])
    idx+=1

-------------tf-idf BoW of document 0 ---------------
[(0, 5.584962500721157), (1, 4.584962500721157), (2, 5.584962500721157), (3, 5.584962500721157), (4, 5.584962500721157), (5, 5.584962500721157), (6, 5.584962500721157), (7, 5.584962500721157), (8, 5.584962500721157), (9, 5.584962500721157), (10, 4.584962500721157), (11, 4.0), (12, 5.584962500721157), (13, 5.584962500721157), (14, 5.584962500721157), (15, 5.584962500721157), (16, 5.584962500721157), (17, 5.584962500721157)]
-------------tf-idf BoW of document 1 ---------------
[(18, 5.584962500721157), (19, 5.584962500721157), (20, 4.0), (21, 5.584962500721157), (22, 5.584962500721157), (23, 5.584962500721157), (24, 5.584962500721157), (25, 5.584962500721157), (26, 5.584962500721157), (27, 4.584962500721157), (28, 4.584962500721157), (29, 4.0), (30, 4.0), (31, 5.584962500721157), (32, 4.584962500721157), (33, 4.0), (34, 11.169925001442314), (35, 5.584962500721157)]
-------------tf-idf BoW of document 2 ---------------
[(36, 11.169925

Verify the tf-idf values calculated in the code cell above by applying the tf-idf formula explicitly:

In [41]:
import numpy as np
tf=1 #term frequency
NumDocs=dictionary.num_docs #number of documents
df=1 #number of documents in which the word appears
tfidfValue=tf*np.log2(float(NumDocs)/df) #own variable name, so that the TfidfModel stored in tfidf is not overwritten
print(tfidfValue)

5.584962500721156


## Tokenisation and Document models with Keras
This section demonstrates how [Keras](https://keras.io/api/preprocessing/text/) can be applied for tokenisation and BoW document-modelling. I.e. no new techniques are introduced here. Instead it is shown how Keras can be applied to implement already known procedures. This is useful, because Keras will be applied later on to implement Neural Networks.

### Tokenisation

#### Text collections as lists of strings
Tokens are atomic text elements. Depending on the NLP task and the selected approach to solve this task, tokens can either be
* characters
* words (uni-grams)
* n-grams

Single texts are often represented as variables of type `string`. Collections of texts are then represented as lists of strings.
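
The three token types listed above can be illustrated with a few lines of plain Python. This is a minimal sketch which splits on whitespace only. The helper names `char_tokens`, `word_tokens` and `ngrams` are hypothetical and are not used elsewhere in this notebook.

```python
def char_tokens(text):
    """Each character is a token."""
    return list(text)

def word_tokens(text):
    """Each whitespace-separated word (uni-gram) is a token."""
    return text.split()

def ngrams(tokens, n):
    """All sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

demo_text = "Theresa May to hold talks"
demo_words = word_tokens(demo_text)
demo_bigrams = ngrams(demo_words, 2)
print(char_tokens(demo_text)[:7])
print(demo_words)
print(demo_bigrams)
```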

Below, a collection of 3 texts is generated as a list of `string`-variables:

In [45]:
text1="""Florida shooting: Nikolas Cruz confesses to police Nikolas Cruz is said
to have killed 17 people before escaping and visiting a McDonalds."""
text2="""Winter Olympics: Great Britain's Dom Parsons wins skeleton bronze medal
Dom Parsons claims Great Britain's first medal of 2018 Winter Olympics with bronze in the men's skeleton."""
text3="""Theresa May to hold talks with Angela Merkel in Berlin
The prime minister's visit comes amid calls for the UK to say what it wants from Brexit."""

In [46]:
print(text1)

Florida shooting: Nikolas Cruz confesses to police Nikolas Cruz is said
to have killed 17 people before escaping and visiting a McDonalds.


In [47]:
textlist=[text1,text2,text3]

#### Keras class Tokenizer

In Keras, methods for preprocessing texts are contained in `keras.preprocessing.text`. From this module, we apply the `Tokenizer`-class to
* transform words to integers, i.e. generating a word-index
* represent texts as sequences of integers
* represent collections of texts in a Bag-of-Words (BOW)-matrix

In [48]:
from keras.preprocessing import text

Using TensorFlow backend.


Generate a `Tokenizer`-object and fit it on the given list of texts:

In [49]:
tokenizer=text.Tokenizer()
tokenizer.fit_on_texts(textlist)

The `Tokenizer`-class accepts a list of arguments, which can be configured at initialisation of a `Tokenizer`-object. The default-values are printed below:

In [50]:
print("Configured maximum number of words in the vocabulary: ",tokenizer.num_words) #Maximum number of words to regard in the vocabulary
print("Configured filters: ",tokenizer.filters) #characters to ignore in tokenization
print("Map all characters to lower case: ",tokenizer.lower) #Mapping of characters to lower-case
print("Tokenisation on character level: ",tokenizer.char_level) #whether tokens are words or characters

Configured maximum number of words in the vocabulary:  None
Configured filters:  !"#$%&()*+,-./:;<=>?@[\]^_`{|}~	

Map all characters to lower case:  True
Tokenisation on character level:  False


In [51]:
print("Number of documents: ",tokenizer.document_count)

Number of documents:  3


Similar to the `dictionary` in gensim (see above), the Keras `Tokenizer` provides a word-index, which uniquely maps each word to an integer:

In [52]:
print("Index of words: ",tokenizer.word_index)

Index of words:  {'to': 1, 'the': 2, 'nikolas': 3, 'cruz': 4, 'winter': 5, 'olympics': 6, 'great': 7, "britain's": 8, 'dom': 9, 'parsons': 10, 'skeleton': 11, 'bronze': 12, 'medal': 13, 'with': 14, 'in': 15, 'florida': 16, 'shooting': 17, 'confesses': 18, 'police': 19, 'is': 20, 'said': 21, 'have': 22, 'killed': 23, '17': 24, 'people': 25, 'before': 26, 'escaping': 27, 'and': 28, 'visiting': 29, 'a': 30, 'mcdonalds': 31, 'wins': 32, 'claims': 33, 'first': 34, 'of': 35, '2018': 36, "men's": 37, 'theresa': 38, 'may': 39, 'hold': 40, 'talks': 41, 'angela': 42, 'merkel': 43, 'berlin': 44, 'prime': 45, "minister's": 46, 'visit': 47, 'comes': 48, 'amid': 49, 'calls': 50, 'for': 51, 'uk': 52, 'say': 53, 'what': 54, 'it': 55, 'wants': 56, 'from': 57, 'brexit': 58}
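
The output above is consistent with a word-index that starts at 1 and is ordered by decreasing word frequency in the entire collection, where words of equal frequency keep the order of their first occurrence. The following plain-Python sketch mimics this behaviour under these assumptions. The functions `simple_tokens` and `build_word_index` are hypothetical helpers, not part of Keras, and the filter string is copied from the default configuration printed above.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    """Lower-case the text, replace filtered characters by blanks and split on whitespace."""
    table = str.maketrans({c: " " for c in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer >= 1, most frequent word first."""
    counts = Counter()
    for t in texts:
        counts.update(simple_tokens(t))
    # sorted() is stable: words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: idx + 1 for idx, (word, _) in enumerate(ranked)}

demo_index = build_word_index(["To be or not to be", "To do."])
print(demo_index)  # {'to': 1, 'be': 2, 'or': 3, 'not': 4, 'do': 5}
```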


The attribute `word_docs` contains for each word the number of documents in which the word appears:

In [50]:
print("Number of docs, in which word appears: ",tokenizer.word_docs)

Number of docs, in which word appears:  defaultdict(<class 'int'>, {'a': 1, 'people': 1, 'cruz': 1, 'killed': 1, 'florida': 1, 'escaping': 1, 'mcdonalds': 1, 'to': 2, 'shooting': 1, 'and': 1, 'police': 1, 'have': 1, 'confesses': 1, 'nikolas': 1, 'before': 1, 'visiting': 1, 'is': 1, '17': 1, 'said': 1, 'the': 2, 'winter': 1, 'olympics': 1, 'dom': 1, 'medal': 1, 'of': 1, "britain's": 1, 'parsons': 1, 'first': 1, 'with': 2, "men's": 1, 'in': 2, 'wins': 1, 'skeleton': 1, 'great': 1, '2018': 1, 'bronze': 1, 'claims': 1, 'wants': 1, 'theresa': 1, 'uk': 1, 'it': 1, 'hold': 1, 'brexit': 1, 'merkel': 1, "minister's": 1, 'calls': 1, 'prime': 1, 'comes': 1, 'berlin': 1, 'talks': 1, 'what': 1, 'say': 1, 'for': 1, 'visit': 1, 'angela': 1, 'amid': 1, 'may': 1, 'from': 1})


### Represent texts as sequences of word-indices:

The following representation of texts as sequences of word-indices is a common input to Neural Networks implemented in Keras.

In [53]:
textSeqs=tokenizer.texts_to_sequences(textlist)
for i,ts in enumerate(textSeqs):
    print("text %d sequence: "%i,ts)

text 0 sequence:  [16, 17, 3, 4, 18, 1, 19, 3, 4, 20, 21, 1, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
text 1 sequence:  [5, 6, 7, 8, 9, 10, 32, 11, 12, 13, 9, 10, 33, 7, 8, 34, 13, 35, 36, 5, 6, 14, 12, 15, 2, 37, 11]
text 2 sequence:  [38, 39, 1, 40, 41, 14, 42, 43, 15, 44, 2, 45, 46, 47, 48, 49, 50, 51, 2, 52, 1, 53, 54, 55, 56, 57, 58]


### Represent text-collection as binary BoW:
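A Bag-of-Words matrix of a document collection contains $N$ rows and $|V|$ columns, where $N$ is the number of documents in the collection and $|V|$ is the size of the vocabulary, i.e. the number of different words in the entire document collection. Note that the matrices generated by the Keras `Tokenizer` below have $|V|+1$ columns, because the word-index starts at 1 and column 0 is not assigned to any word.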

The entry $x_{i,j}$ of the BoW-Matrix indicates the **relevance of word $j$ in document $i$**.

In this lecture 3 different types of **word-relevance** are considered:

1. **Binary BoW:** Entry $x_{i,j}$ is *1* if word $j$ appears in document $i$, otherwise 0.
2. **Count-based BoW:** Entry $x_{i,j}$ is the frequency of word $j$ in document $i$.
3. **Tf-idf-based BoW:** Entry $x_{i,j}$ is the tf-idf of word $j$ with respect to document $i$.

The BoW-representation of texts is a common input to conventional Machine Learning algorithms (not Neural Networks like CNN and RNN).
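
The first two types of word-relevance can be derived directly from sequences of word-indices. The following NumPy sketch builds a binary or a count-based BoW matrix with one extra column for the unused index 0. The function `sequences_to_bow` and the two demo sequences are hypothetical; they only illustrate the definitions given above.

```python
import numpy as np

def sequences_to_bow(sequences, vocab_size, mode="binary"):
    """Build a BoW matrix of shape (number of documents, vocab_size + 1)."""
    matrix = np.zeros((len(sequences), vocab_size + 1))
    for row, seq in enumerate(sequences):
        for word_idx in seq:
            if mode == "count":
                matrix[row, word_idx] += 1   # frequency of the word in the document
            else:
                matrix[row, word_idx] = 1    # word appears at least once
    return matrix

demo_seqs = [[1, 2, 1], [3]]   # hypothetical index sequences of two documents
bow_binary = sequences_to_bow(demo_seqs, vocab_size=3)
bow_count = sequences_to_bow(demo_seqs, vocab_size=3, mode="count")
print(bow_binary)
print(bow_count)
```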

#### Binary BoW

In [54]:
print(tokenizer.texts_to_matrix(textlist))

[[0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]


#### Count-based BoW
Represent text-collection as BoW with word-counts:

In [55]:
print(tokenizer.texts_to_matrix(textlist,mode="count"))

[[0. 2. 0. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]


#### Tf-idf-based BoW
In the BoW representation above, the term frequency (tf) has been applied. This value measures how often a term (word) appears in a document. If document similarity is calculated on such a tf-based BoW representation, common words, which appear in many documents but carry little semantic content, have a strong impact on the similarity value. In most cases this is a drawback, since similarity should be based on terms with a high semantic focus. Such semantically meaningful words usually appear in only a few documents. The term frequency-inverse document frequency (tf-idf) measure does not only count the frequency of a term in a document, but also assigns a higher weight to terms which occur in only a few documents of the corpus.
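
The tf-idf weighting applied by the Keras `Tokenizer` differs from the gensim default used earlier in this notebook. The values which `texts_to_matrix` yields in mode `tfidf` for the 3 texts of this section are consistent with the weighting $(1+\ln(c)) \cdot \ln\left(1+\frac{N}{1+df}\right)$, where $c$ is the count of the word in the document, $df$ the number of documents which contain the word and $N$ the number of documents. The sketch below evaluates this formula; the function name `keras_style_tfidf` is hypothetical.

```python
import math

def keras_style_tfidf(count, doc_freq, num_docs):
    """Weight of a word with the given count in a document; 0 if the word does not occur."""
    if count == 0:
        return 0.0
    tf = 1 + math.log(count)
    idf = math.log(1 + num_docs / (1 + doc_freq))
    return tf * idf

# 'to' occurs twice in the first text and appears in 2 of the 3 texts
print(keras_style_tfidf(count=2, doc_freq=2, num_docs=3))
# 'nikolas' occurs twice in the first text and appears in 1 of the 3 texts
print(keras_style_tfidf(count=2, doc_freq=1, num_docs=3))
# a word which occurs once and appears in 1 of the 3 texts
print(keras_style_tfidf(count=1, doc_freq=1, num_docs=3))
```

In contrast to the normalized gensim representation, these weights are not scaled to unit length per document.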

In [56]:
print(tokenizer.texts_to_matrix(textlist,mode="tfidf"))

[[0.         1.17360019 0.         1.55141507 1.55141507 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.91629073 0.91629073
  0.91629073 0.91629073 0.91629073 0.91629073 0.91629073 0.91629073
  0.91629073 0.91629073 0.91629073 0.91629073 0.91629073 0.91629073
  0.91629073 0.91629073 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.69314718 0.         0.         1.55141507
  1.55141507 1.55141507 1.55141507 1.55141507 1.55141507 1.55141507
  1.55141507 1.55141507 0.69314718 0.69314718 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.91629073 0.91629073 0.91629073 0.916