1. Access and Preprocess Text

The first step of any NLP processing chain is to access text. Some sources already provide text in a clean form. Others, such as websites, contain not only raw text but also markup, images, tables, etc. In the latter case, the challenge of preprocessing is to separate the raw text from the irrelevant parts. Preprocessing also includes segmentation, i.e. the transformation of a possibly long text string into a list of sentences or words.

This chapter demonstrates how raw text can be accessed from:

  • local text-files

  • online text-files

  • online APIs, such as Twitter
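
As a minimal sketch of the first two access paths, the following uses only the Python standard library. The function names `read_local_text` and `read_online_text`, the sample file name and the UTF-8 default are assumptions made for illustration, not names taken from this chapter. API access is left out here because it usually requires credentials.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_local_text(path, encoding="utf-8"):
    """Return the full content of a local text file as one string."""
    return Path(path).read_text(encoding=encoding)


def read_online_text(url, encoding="utf-8", timeout=10):
    """Download a text file and decode the received bytes to a string.

    The encoding is an assumption; a real source may use a different one.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


# Demonstrate the local case with a temporary file, so no network is needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"  # hypothetical file name
    sample_path.write_text("NLP starts with text access.", encoding="utf-8")
    local_text = read_local_text(sample_path)
```

In both cases the result is a single Python string, which is the form the later preprocessing steps operate on.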

Moreover, the chapter shows how raw text can be crawled from:

  • HTML files

  • RSS feeds
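
To illustrate what separating raw text from markup can look like, here is a hedged sketch based on the standard-library modules `html.parser` and `xml.etree.ElementTree`. The class `TextExtractor`, the helpers `html_to_text` and `rss_titles`, and the two sample documents are hypothetical. Dedicated crawling and parsing libraries may be the better choice in practice.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def rss_titles(rss_xml):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]


# Small made-up inputs, so the example runs without network access.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>News</h1><p>Raw <b>text</b> only.</p></body></html>"
)
sample_rss = (
    "<rss version='2.0'><channel><title>Feed</title>"
    "<item><title>First post</title></item>"
    "<item><title>Second post</title></item>"
    "</channel></rss>"
)

page_text = html_to_text(sample_html)
titles = rss_titles(sample_rss)
```

Joining all text nodes with a single space is a simplification: it discards the document structure and can place a space before punctuation that directly follows an inline tag.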

Corresponding preprocessing methods, e.g. for the segmentation of strings into lists of words and lists of sentences, are also demonstrated.
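
Segmentation can be sketched with two regular expressions. This is a deliberately naive approach under simplifying assumptions: a sentence is assumed to end at '.', '!' or '?' followed by whitespace, so abbreviations such as "e.g." would be split incorrectly. The helper names `split_sentences` and `split_words` are hypothetical, and dedicated tokenizers handle such cases more robustly.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Return word tokens; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Text must be accessed first. Then it is segmented! Is that all?"
sentences = split_sentences(text)
words = split_words(sentences[0])
```

Here `sentences` is a list of three strings and `words` is the token list of the first sentence, with the final period as a separate token.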