Introduction¶
Author: Prof. Dr. Johannes Maucher
Institution: Stuttgart Media University
Document Version: 0.9 (Incomplete DRAFT !!!)
Last Update: 23.09.2021
Lecture Contents¶
Introduction
What is NLP?
Challenges of NLP
Applications
Accessing Text
from local files
from online files
from HTML
from RSS Feeds
from Tweets
Preprocessing
Segmentation into words and sentences
Regular Expressions
Normalisation
Stemming and Lemmatisation
Error Correction / Levenshtein Distance
PoS-Tagging
Part of Speech
Tagsets
Tagging-Algorithms
N-Gram Language Models
Applications of LM
Probability of word sequences
Smoothing
Evaluation of LMs
Vector Representations of Words
Word-Embedding
Word2Vec: CBOW and Skip-gram
Learning word-embeddings
Apply pretrained word-embeddings
Document Models and Similarities
Bag-of-Words
Similarity Measures
Binary Count, Count, TF-IDF
Applying gensim/Keras for BoW
Topic Extraction
Latent Semantic Indexing (LSI)
LSI Topic Extraction with gensim
Text Classification
Recap: ML in general
Evaluation metrics
Naive Bayes Classifier
BoW plus conventional ML (sklearn)
Neural Networks
Recap: Feedforward Nets (MLP)
Recap: CNN
Recurrent Neural Networks
Keras implementation of LSTM and CNN
Attention and Self-Attention
Sequence-to-Sequence architectures
Language Modelling
Machine Translation
Attention- and Self-Attention Layer
Transformer
Encoder-Decoder architectures
Transformer
BERT
Apply BERT from Tensorflow-Hub
What is NLP?¶
Natural Language Processing (NLP) strives to build computers that can understand and generate natural language. Since computers usually understand only formal languages (programming languages, assembler, etc.), NLP techniques must provide the transformation from natural language to a formal language and vice versa.

This lecture focuses on the direction from natural language to formal language. However, later chapters also explain techniques for automatic language generation. In any case, only natural language in written form is considered. Speech recognition, i.e. the process of transforming speech audio signals into written text, is not within the scope of this lecture.
As a science, NLP is a subfield of Artificial Intelligence, which itself belongs to Computer Science. In the past, linguistic knowledge was a key component of NLP.

The old approach to NLP, the so-called rule-based approach, represents linguistic rules in a formal language and parses text according to these rules. In this way, e.g., the syntactic structure of sentences can be derived, and from the syntactic structure the semantics can be inferred.
The enormous success of NLP during the last few years is based on data-based approaches, which increasingly replace the old rule-based approach. The idea of this approach is to learn language statistics from large amounts of digitally available text (corpora). For this, modern Machine Learning (ML) techniques, such as Deep Neural Networks, are applied. The learned statistics can then be applied, e.g., for Part-of-Speech Tagging, Named-Entity Recognition, Text Summarisation, Semantic Analysis, Language Translation, Text Generation, Question Answering, Dialogue Systems and many other NLP tasks.
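The core idea of learning language statistics can be illustrated with a toy sketch: the snippet below counts bigram frequencies in a tiny invented corpus using only the Python standard library; real systems learn far richer statistics from corpora of billions of words.

```python
# A toy illustration of learning language statistics from a corpus:
# bigram counts with the Python standard library. The "corpus" below
# is invented; real systems use corpora with billions of words.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the mat ."
tokens = corpus.split()

# Count all adjacent word pairs (bigrams).
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(bigram_counts.most_common(3))
# e.g. [(('sat', 'on'), 2), (('on', 'the'), 2), (('the', 'mat'), 2)]
```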
As the picture below illustrates, rule-based approaches require the expert knowledge of linguists, whereas data-based approaches require large amounts of data, ML algorithms and powerful hardware.

The following statement by Fred Jelinek expresses the increasing dominance of data-based approaches:
Every time I fire a linguist, the performance of the speech recognizer goes up.
—Fred Jelinek [1]
Example
Consider the NLP task of spam classification. In a rule-based approach one would define rules like if the text contains Viagra then class = spam, if the sender address is on a given blacklist then class = spam, etc. In a data-based approach such rules are not required. Instead, a large corpus of e-mails, each labeled as either spam or ham, is required. A Machine Learning algorithm, e.g. a Naive Bayes classifier, learns a statistical model from the given training data. The learned model can then be applied for spam classification, as sketched below.
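A minimal sketch of this data-based approach, using scikit-learn's CountVectorizer (a Bag-of-Words representation, covered later in the lecture) and a Multinomial Naive Bayes classifier; the four-mail training corpus is invented purely for illustration:

```python
# Minimal data-based spam classification with scikit-learn.
# The tiny e-mail corpus is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mails = [
    "cheap viagra buy now",       # spam
    "win money click here",       # spam
    "meeting agenda for monday",  # ham
    "lecture notes attached",     # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Represent each mail as a word-count vector (Bag-of-Words).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)

# Learn a statistical model from the labeled training data.
clf = MultinomialNB()
clf.fit(X, labels)

# Apply the learned model to a new, unseen mail.
print(clf.predict(vectorizer.transform(["buy cheap money now"])))  # -> ['spam']
```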
NLP Process Chain and Lecture Contents¶
In order to realise an NLP task, one usually has to implement a chain of processing steps for accessing, storing and preprocessing text before the task-specific model can be learned and applied. The following figure sketches such a processing chain in general.

This processing chain defines the content of this lecture:
Methods for accessing text from different types of sources
Text preprocessing like segmentation, normalisation, PoS-tagging, etc.
Models for representing words and texts.
Statistical Language Models.
Architectures for implementing NLP tasks such as word-completion, auto-correction, information retrieval, document classification, automatic translation, automatic text generation, etc.
The lecture has a practical focus, i.e. for most of the techniques the implementation in Python is demonstrated. A first minimal example of such a processing chain is sketched below.
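The following sketch runs a few early steps of the chain with NLTK: segmentation into sentences and words, normalisation to lower case, and PoS-tagging. It assumes that the NLTK data packages punkt and averaged_perceptron_tagger have been downloaded; each step is treated in depth in its own chapter.

```python
# A minimal sketch of the first steps of the NLP processing chain.
# Requires: nltk.download('punkt') and
#           nltk.download('averaged_perceptron_tagger')
import nltk

raw = "NLP is fun. It combines linguistics and machine learning."

sentences = nltk.sent_tokenize(raw)       # segmentation into sentences
words = nltk.word_tokenize(sentences[0])  # segmentation into words
normalised = [w.lower() for w in words]   # normalisation
tagged = nltk.pos_tag(words)              # PoS-tagging

print(sentences)
print(normalised)
print(tagged)
```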
The challenge of ambiguity¶
In contrast to formal languages (e.g. programming languages), natural language is ambiguous on different levels:
Segmentation: Shall Stamford Bridge be segmented into two words? Is Web 2.0 one expression? …
Homonyms (ball) and Synonyms (bike and bicycle)
Part of Speech: Is love a verb, an adjective or a noun? (see the sketch after this list)
Syntax: The sentence John saw the man on the mountain with a telescope has multiple syntax trees.
Semantics: We saw her duck (we saw the duck that belongs to her, or we saw how she ducked)
Ambiguity of pronouns, e.g. The bottle fell into the glass. It broke.
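Part-of-Speech ambiguity can be made tangible with NLTK's pretrained tagger, which resolves the word love differently depending on its context. A small sketch (the exact tags may vary with the tagger version):

```python
# PoS ambiguity: the same word "love" receives different tags
# depending on its sentential context.
# Requires: nltk.download('averaged_perceptron_tagger')
import nltk

for sentence in ["I love this lecture .", "My love for music is strong ."]:
    print(nltk.pos_tag(sentence.split()))
# Typically "love" is tagged as a verb (VBP) in the first sentence
# and as a noun (NN) in the second.
```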
Some popular NLP applications¶
Spam Filter / Document Classification
Sentiment Analysis / Trend Analysis
Automatic Correction of Words and Syntax (e.g. in Word)
Auto completion (WhatsApp, Search Engine)
Information Retrieval / Web Search
Automatic Text Generation: automated sports reporting, OpenAI's GPT.
Text Summarisation
Automatic Translation
Question Answering, Dialogue Systems, Digital Assistants, Chatbots
Personal Profiling, e.g. for employment: ZEIT article on automated recruiting
Political orientation, e.g. for election campaigns: Cambridge Analytica and Michael Kosinski
Recommender Systems
[1] Frederick Jelinek (18 November 1932 – 14 September 2010) was a Czech-American researcher in information theory, automatic speech recognition, and natural language processing.