Introduction

  • Author: Prof. Dr. Johannes Maucher

  • Institution: Stuttgart Media University

  • Document Version: 0.9 (Incomplete DRAFT !!!)

  • Last Update: 23.09.2021

Lecture Contents

Introduction

  • What is NLP?

  • Challenges of NLP

  • Applications

Accessing Text

  • from local files

  • from online files

  • from HTML

  • from RSS Feeds

  • from Tweets

Preprocessing

  • Segmentation into words and sentences

  • Regular Expressions

  • Normalisation

  • Stemming and Lemmatisation

  • Error Correction / Levenshtein Distance

PoS-Tagging

  • Part of Speech

  • Tagsets

  • Tagging-Algorithms

N-Gram Language Models

  • Applications of LM

  • Probability of word sequences

  • Smoothing

  • Evaluation of LMs

Vector Representations of Words

  • Word-Embedding

  • Word2Vec: CBOW and Skip-gram

  • Learning word-embeddings

  • Apply pretrained word-embeddings

Document Models and Similarities

  • Bag-of-Words

  • Similarity Measures

  • Binary Count, Count, TF-IDF

  • Applying gensim/Keras for BoW

Topic Extraction

  • Latent Semantic Indexing (LSI)

  • LSI Topic Extraction with gensim

Text Classification

  • Recap: ML in general

  • Evaluation metrics

  • Naive Bayes Classifier

  • BoW plus conventional ML (sklearn)

Neural Networks

  • Recap: Feedforward Nets (MLP)

  • Recap: CNN

  • Recurrent Neural Networks

  • Keras implementation of LSTM and CNN

Attention and Self-Attention

  • Sequence-to-Sequence architectures

  • Language Modelling

  • Machine Translation

  • Attention- and Self-Attention Layer

Transformer

  • Encoder-Decoder architectures

  • Transformer

  • BERT

  • Apply BERT from TensorFlow Hub

What is NLP?

Natural Language Processing (NLP) strives to enable computers to understand and generate natural language. Since computers usually only understand formal languages (programming languages, assembler, etc.), NLP techniques must provide the transformation from natural language to a formal language and vice versa.

Transformation between natural and formal language

This lecture focuses on the direction from natural language to formal language. However, later chapters also explain techniques for automatic language generation. In any case, only natural language in written form is considered. Speech recognition, i.e. the process of transforming speech audio signals into written text, is not in the scope of this lecture.

As a science, NLP is a subfield of Artificial Intelligence, which itself belongs to Computer Science. In the past, linguistic knowledge was a key component of NLP.

Sciences used by NLP

The old approach to NLP, the so-called rule-based approach, represents linguistic rules in a formal language and parses text according to these rules. In this way, e.g., the syntactic structure of sentences can be derived, and semantics can be inferred from this syntactic structure.
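As a minimal sketch of this rule-based approach (using nltk; the toy grammar and sentence below are invented for illustration), a context-free grammar encodes a handful of linguistic rules, and a parser derives the syntactic structure of a sentence from them:

```python
import nltk

# Linguistic rules written down in a formal language (a toy CFG).
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N  -> 'man' | 'telescope' | 'mountain'
    V  -> 'saw'
    P  -> 'with' | 'on'
""")

parser = nltk.ChartParser(grammar)
sentence = "the man saw a man with a telescope".split()

# Parsing derives the syntactic structure; ambiguity shows up as several trees.
for tree in parser.parse(sentence):
    print(tree)
```

Note that the parser returns two trees for this sentence, one per possible attachment of the prepositional phrase; this already foreshadows the ambiguity problem discussed below.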

The enormous success of NLP during the last few years is based on data-based approaches, which increasingly replace the older rule-based approach. The idea is to learn language statistics from large amounts of digitally available text (corpora). For this, modern Machine Learning (ML) techniques, such as Deep Neural Networks, are applied. The learned statistics can then be applied, e.g., for Part-of-Speech Tagging, Named-Entity Recognition, Text Summarisation, Semantic Analysis, Language Translation, Text Generation, Question Answering, Dialogue Systems and many other NLP tasks.
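To give a first impression of what "learning language statistics from a corpus" can mean in its very simplest form, the following sketch (assuming nltk and its bundled gutenberg corpus are available) counts word bigrams in a novel; the chapter on N-Gram Language Models builds on exactly such counts:

```python
import nltk
from collections import Counter

nltk.download("gutenberg", quiet=True)  # a small free corpus shipped with nltk

# Learn simple language statistics from a corpus: here, bigram counts,
# i.e. how often one word follows another in Jane Austen's "Emma".
words = [w.lower() for w in nltk.corpus.gutenberg.words("austen-emma.txt")
         if w.isalpha()]
bigram_counts = Counter(nltk.bigrams(words))

print(bigram_counts.most_common(5))  # the most frequent word pairs
```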

As the picture below shows, the rule-based approach requires the expert knowledge of linguists, whereas data-based approaches require large amounts of data, ML algorithms and performant hardware.

Rule-based and data-based approach

The following statement by Fred Jelinek expresses the increasing dominance of data-based approaches:

Every time I fire a linguist, the performance of the speech recognizer goes up.

—Fred Jelinek

NLP Process Chain and Lecture Contents

In order to realize NLP tasks, one usually has to implement a chain of processing steps for accessing, storing and preprocessing text before the task-specific model can be learned and applied. The following figure sketches such a processing chain in general.

NLP Processing Chain

This processing chain defines the content of this lecture:

  1. Methods for accessing text from different types of sources.

  2. Text preprocessing like segmentation, normalisation, POS-tagging, etc.

  3. Models for representing words and texts.

  4. Statistical Language Models.

  5. Architectures for implementing NLP tasks such as word-completion, auto-correction, information retrieval, document classification, automatic translation, automatic text generation, etc.

The lecture has a practical focus, i.e. for most of the techniques an implementation in Python is demonstrated.
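To make this concrete, here is a minimal sketch of the first two steps of such a chain (assuming nltk is installed; the Project Gutenberg URL is just an example and may change):

```python
import urllib.request
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models for segmentation

# 1. Access: read a plain-text book from an online source.
url = "https://www.gutenberg.org/files/2554/2554-0.txt"
raw = urllib.request.urlopen(url).read().decode("utf-8")

# 2. Preprocess: segmentation into sentences and words, then normalisation.
sentences = nltk.sent_tokenize(raw)
tokens = nltk.word_tokenize(raw)
normalised = [t.lower() for t in tokens if t.isalpha()]

# Later chapters build models (BoW, embeddings, LMs, ...) on top of this output.
print(len(sentences), "sentences,", len(normalised), "word tokens")
```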

The challenge of ambiguity

In contrast to many formal languages (programming languages), natural language is ambiguous on different levels:

  • Segmentation: Shall Stamford Bridge be segmented into two words? Is Web 2.0 one expression? …

  • Homonyms (ball) and Synonyms (bike and bicycle)

  • Part-of-Speech: Is love a verb, an adjective or a noun? (see the tagging sketch after this list)

  • Syntax: The sentence John saw the man on the mountain with a telescope has multiple syntax trees.

  • Semantic: We saw her duck (either We saw her duck, the animal, or We saw how she ducked)

  • Ambiguity of pronouns, e.g. The bottle fell into the glass. It broke.
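For instance, the Part-of-Speech ambiguity of love from the list above can be explored with nltk's pretrained tagger, which uses learned statistics to resolve the ambiguity from context (a minimal sketch; the exact tags may vary with the nltk version):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# The same word "love" is tagged differently depending on its context:
for sentence in ["I love natural language", "Love is a powerful feeling"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Typically: 'love' -> VBP (verb) in the first sentence,
#            'Love' -> NN  (noun) in the second.
```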