1. Access and Preprocess Text

The first step of any NLP processing chain is to access text. Some sources already provide text in a clean form; others, e.g. websites, contain not only raw text but also markup, images, tables, etc. In the latter case, the challenge of preprocessing is to separate the raw text from the irrelevant parts. Preprocessing also includes segmentation, i.e. the transformation of a possibly long text string into a list of sentences or words.

This chapter demonstrates how raw text can be accessed from:

local text files
online text files
online APIs, e.g. Twitter
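As a minimal sketch of the first two access paths, the following uses only the Python standard library. The function names `read_local_text` and `read_online_text` are hypothetical helpers introduced here for illustration, and UTF-8 is an assumed default encoding; real sources may need a different one. API access, e.g. to Twitter, typically requires credentials and a client library and is not sketched here.

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_text(path, encoding="utf-8"):
    """Return the full content of a local text file as one string."""
    return Path(path).read_text(encoding=encoding)


def read_online_text(url, encoding="utf-8", timeout=10):
    """Download a text resource and decode its bytes into a string."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


# Hypothetical usage (path and URL are placeholders, not real resources):
# text = read_local_text("data/sample.txt")
# text = read_online_text("https://example.org/sample.txt")
```

Both helpers return a single, possibly long string, which is the input expected by the segmentation step.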

Moreover, the process of crawling raw text from the following sources is shown:

HTML files
RSS feeds
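The idea of separating raw text from markup can be sketched with the standard library alone. This is an illustrative assumption, not necessarily the tooling used later in the chapter: `TextExtractor`, `html_to_text` and `rss_items` are hypothetical names, the HTML part simply drops the content of `<script>` and `<style>` elements, and the RSS part assumes a plain RSS 2.0 feed without XML namespaces on `item`, `title` and `link`.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def rss_items(rss_xml):
    """Return (title, link) pairs for every <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]
```

A typical crawl would first read the feed, collect the item links with `rss_items`, download each linked page, and pass its HTML to `html_to_text`.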

Corresponding preprocessing methods, e.g. for the segmentation of strings into lists of words and lists of sentences, are also demonstrated.
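Segmentation can be illustrated with a deliberately naive regular-expression sketch. The helper names `split_sentences` and `split_words` are hypothetical, and the rules are assumptions chosen for brevity: a sentence ends at `.`, `!` or `?` followed by whitespace, and a word is any run of word characters. Abbreviations such as "e.g." are therefore split incorrectly; dedicated tokenizers, for example those in NLTK, handle such cases more robustly.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Return all runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", sentence)


text = "NLP starts with text. Is it clean? Often not!"
sentences = split_sentences(text)
words = [split_words(sentence) for sentence in sentences]
```

Here `sentences` is a list of strings and `words` is a list of word lists, one per sentence.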
