3.1. PoS Tagsets

Depending on the language and the NLP task, different tagsets can be applied. Popular examples are the Penn Treebank and Brown tagsets for English and the Stuttgart-Tübingen-Tagset (STTS) for German.

NLP tools (e.g. NLTK) sometimes apply a coarse Universal Tagset instead:

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big, local
ADP	adposition	on, of, at, with, by, into, under
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
NOUN	noun	year, home, costs, time, education
NUM	number	twenty-four, fourth, 1991, 14:24
PRON	pronoun	he, their, her, its, my, I, us
PRT	particle	at, on, out, over, per, that, up, with
VERB	verb	is, say, told, given, playing, would
.	punctuation marks	. , ; !
X	other	ersatz, esprit, dunno, univeristy

Some tagsets distinguish a large number of tags, others only a few. The resolution depends on:

the NLP task: for some tasks a fine-grained differentiation is not required
the language: if a language is quite irregular, it does not make sense to distinguish PoS in a fine-grained manner, because a tagger would have to cover all the irregular cases, which may be too complex. For example, German has no single rule for distinguishing singular from plural nouns. Therefore the Stuttgart-Tübingen-Tagset does not distinguish these two noun categories. English, in contrast, has such a rule (append s), which applies in nearly all cases. Therefore English tagsets differentiate these two cases.
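
The difference in granularity can be illustrated by collapsing fine-grained tags into the coarse universal tags from the table above. The following is only a sketch: the dictionary `FINE_TO_UNIVERSAL` and the function `to_universal` are hypothetical names, and the mapping is a small hand-picked subset of Penn-Treebank-style tags, not a complete tagset conversion.

```python
# Hypothetical, hand-picked subset of a fine-to-coarse tag mapping.
# Fine tags follow Penn-Treebank-style naming, coarse tags are universal tags.
FINE_TO_UNIVERSAL = {
    "NN": "NOUN",    # singular noun
    "NNS": "NOUN",   # plural noun
    "VBD": "VERB",   # verb, past tense
    "VBG": "VERB",   # verb, gerund / present participle
    "JJ": "ADJ",     # adjective
    "DT": "DET",     # determiner
    "IN": "ADP",     # preposition
    "CD": "NUM",     # cardinal number
}

def to_universal(tagged_words, mapping=FINE_TO_UNIVERSAL):
    """Replace each fine-grained tag by its coarse tag; unknown tags become 'X'."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged_words]

fine = [("the", "DT"), ("dogs", "NNS"), ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
coarse = to_universal(fine)
print(coarse)
```

After the mapping, singular and plural nouns share the tag NOUN, so the number distinction carried by the fine-grained tags is lost.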

3.2. Algorithms for PoS Tagging

PoS-tagging can be implemented with a rule-based or a data-based approach. As for other NLP methods, the rule-based approach is the conventional one. It does not require a training data set, but it requires expert knowledge. Today, data-based approaches are superior if enough labeled training data is available. Data-based approaches do not require expert knowledge.

In this section, a rule-based approach for tagging is described first. Then simple data-based methods, the Unigram- and the N-Gram-Tagger, are introduced. The currently best performing PoS-taggers learn the tagging rules from large amounts of PoS-tagged training data by applying machine learning algorithms.

3.2.1. Rule-based Tagging

Rule-based tagging requires linguistic knowledge about the PoS and about patterns that can be used to determine the PoS of a given word. The PoS of a word depends not only on the word itself (e.g. prefixes and suffixes, word length) but also on the surrounding words. Therefore, rules on the word itself (e.g. does the word end with ing?) and rules on the surrounding words (e.g. is the previous word a determiner such as the?) must be defined. An example of a small set of rules is given below. This small set contains rules only on the word itself:

#1. Define patterns: (regular expression, tag) pairs, tried in order; the first match wins
patterns = [
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # simple past
     (r'.*es$', 'VBZ'),                 # 3rd singular present
     (r'.*ould$', 'MD'),                # modals
     (r'.*\'s$', 'NN$'),                # possessive nouns
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (decimal point escaped)
     (r'^the$', 'DT'),                  # determiner (anchored, so "then" does not match)
     (r'^in$', 'IN'),                   # preposition (anchored, so "inside" does not match)
     (r'.*', 'NN')                      # nouns (default)
]

Once the rule set (the list of patterns) is defined, it can be applied, e.g. by a RegexpTagger object from the NLTK package, as follows:

#2. Generate RegexpTagger
from nltk import RegexpTagger
regexp_tagger = RegexpTagger(patterns)
#3. Tag a sentence. Note that the string that contains the sentence must be split into words
regexp_tagger.tag("5 friends have been singing in the rain".split())

The output of this simple tagger is shown below. Note that have and been match only the default rule and are therefore wrongly tagged as nouns:

[('5', 'CD'),
 ('friends', 'NNS'),
 ('have', 'NN'),
 ('been', 'NN'),
 ('singing', 'VBG'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('rain', 'NN')]
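
The rule set above only looks at the word itself. How a rule on the surrounding words could be added is sketched below in plain Python, without NLTK. This is an illustration under assumptions: the shortened pattern list `WORD_PATTERNS`, the helper names `tag_by_word` and `tag_with_context`, and the context rule itself (a word tagged VBG or VBD directly after a determiner is retagged as NN) are hypothetical and not part of the lecture's rule set.

```python
import re

# Word-level rules (a shortened version of the pattern list), tried in order.
WORD_PATTERNS = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ed$", "VBD"),    # simple past
    (r"^the$", "DT"),     # determiner
    (r".*", "NN"),        # nouns (default)
]

def tag_by_word(word):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in WORD_PATTERNS:
        if re.match(pattern, word):
            return tag

def tag_with_context(words):
    """Tag by word-level rules, then apply one hypothetical context rule:
    a word tagged VBG or VBD directly after a determiner is retagged as NN."""
    tagged = []
    for word in words:
        tag = tag_by_word(word)
        if tagged and tagged[-1][1] == "DT" and tag in ("VBG", "VBD"):
            tag = "NN"
        tagged.append((word, tag))
    return tagged

result = tag_with_context("the building collapsed".split())
print(result)
```

Here building would be tagged VBG by the word-level rules alone; the context rule corrects it to NN because the preceding word is a determiner.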

3.2.2. Unigram Tagger

In the context of this lecture a unigram is just a single word. A unigram tagger is probably the simplest data-based tagger. Like all data-based taggers, it requires a labeled training data set (corpus), from which it learns a mapping from a single word to its PoS:

\[ word \rightarrow PoS(word), \quad \forall word \in V, \]

where \(V\) is the applied vocabulary.

Training: For training a Unigram-Tagger a large PoS-tagged corpus is required. Such corpora are publicly available for almost all common languages, e.g. the Brown Corpus for English and the Tiger Corpus for German. In such corpora each word is associated with its PoS, as can be seen in the following sentence from the Brown corpus:

[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]

During training the Unigram-Tagger determines for each word which PoS-tag is associated most often with it in the training corpus. The result of the training is a two-column table: the first column contains the word, the second its most frequent PoS:

Word	Most Frequent Tag
control	noun
run	verb
love	verb
red	adjective
:	:

Tagging: The learned mapping is the two-column table of word and associated most frequent PoS. This table can be applied to tag each word with its PoS.

Properties: A unigram tagger is simple to train and to apply. However, it has the drawback that only the word itself, but not its context, is used to determine the tag. Consequently, a word is always tagged with the same PoS, independent of its context. Unigram tagging is wrong whenever the correct PoS of a word in its context is not the PoS that appeared most often with this word in the training corpus.
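
Training and tagging with such a table can be sketched in a few lines of plain Python. The function names `train_unigram` and `tag_unigram` are hypothetical, and the toy training corpus is invented for illustration (it is not taken from the Brown Corpus).

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(words, table, default=None):
    """Look up each word; words missing from the table get the default tag."""
    return [(word, table.get(word, default)) for word in words]

# Toy training corpus (invented for illustration).
train = [
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
]
table = train_unigram(train)
tagged = tag_unigram(["the", "run"], table)
print(tagged)
```

Because run occurs more often as a verb in the toy corpus, it is tagged VERB even after the, which is exactly the weakness described above. NLTK provides a ready-made implementation of this idea in its `UnigramTagger` class.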

3.2.3. N-Gram Tagger

The Unigram-Tagger ignores the context of the word. However, the previous words, or rather the PoS of the previous words, may provide much information for assigning the correct PoS-tag. For example, if the word before run is an article, then the PoS-tag of run is probably noun, whereas if the predecessor of run is a pronoun, run's PoS-tag is more likely verb.

A Bigram-Tagger assigns the PoS-tag of the current word by taking into account the current word itself and the PoS-tag of the preceding word.

\[ (PoS(word_{i-1}),word_i) \rightarrow PoS(word_i), \quad \forall word_i \in V. \]

More generally, an N-Gram-Tagger assigns the PoS-tag of the current word by taking into account the current word itself and the PoS-tags of the N-1 preceding words:

\[ (PoS(word_{i-N+1}),\ldots,PoS(word_{i-1}),word_i) \rightarrow PoS(word_i), \quad \forall word_i \in V. \]

**Figure:** A 3-Gram-Tagger determines the PoS-tag of the current word by taking into account the current word and the PoS-tags of the 2 preceding words.

Training an N-Gram-Tagger: As for the Unigram-Tagger, a large PoS-tagged corpus is required. During training the N-Gram-Tagger determines for each combination of word plus \(N-1\) preceding PoS-tags in the corpus which PoS-tag is associated most often with the word. The result of the training is a table of \(N+1\) columns: the first \(N-1\) columns contain the PoS-tags of the preceding words, followed by a column with the current word and a column with the most frequent PoS-tag for this combination. For example, for a Bigram-Tagger (\(N=2\)) the table structure is as follows:

PoS-Tag of previous word	Word	Most Frequent Tag
article	control	noun
pronoun	control	verb
pronoun	run	verb
article	run	noun
pronoun	love	verb
article	love	adjective
:	:	:

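Tagging: The learned mapping is the \(N+1\)-column table. To tag a text, the words are processed from left to right: for each word, the PoS-tags already assigned to the \(N-1\) preceding words are combined with the word itself, and the tag stored in the table for this combination is assigned.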
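
For the Bigram-Tagger, training and left-to-right tagging might look as follows in plain Python. This is a minimal sketch: the names `train_bigram`, `tag_bigram` and the sentence-start marker `START` are assumptions made for the example, and the toy corpus is invented.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed marker for "no previous tag" at the sentence start

def train_bigram(tagged_sents):
    """Map (previous tag, word) to the tag seen most often in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def tag_bigram(words, table):
    """Tag left to right; each lookup uses the tag just assigned to the previous word."""
    tagged, prev_tag = [], START
    for word in words:
        tag = table.get((prev_tag, word))  # None if the combination is unseen
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Toy training corpus (invented for illustration).
train = [
    [("they", "PRON"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("love", "VERB")],
]
bigram_table = train_bigram(train)
print(tag_bigram(["the", "run"], bigram_table))
print(tag_bigram(["they", "run"], bigram_table))
print(tag_bigram(["the", "love"], bigram_table))
```

In contrast to a unigram lookup, run now receives different tags after the and after they. The last call shows the price: the combination of a determiner followed by love never occurred in the toy corpus, so no tag can be assigned.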

Properties: The larger the value \(N\), the more context is taken into account and the higher the probability that the correct PoS-tag is assigned. However, with increasing \(N\) the number of possible combinations of PoS-tag sequence and word grows exponentially: for a tagset \(T\) and a vocabulary \(V\) there are up to \(|T|^{N-1} \cdot |V|\) of them. Therefore the probability increases that the text to be tagged contains a combination that did not occur in the training corpus and is therefore not listed in the mapping table. What should be done in the case of such an unknown combination? A standard solution is to train a sequence of N-Gram-Taggers with varying \(N\), for example a Unigram-, Bigram-, 3-Gram- and 4-Gram-Tagger. For tagging, the 4-Gram-Tagger is applied first. If it faces a combination (sequence of 3 PoS-tags plus the following word) that is not in its table, a Backup-Tagger (called backoff tagger in NLTK), here the 3-Gram-Tagger, is applied to this combination. If the corresponding combination of 2 PoS-tags plus word is also missing from the table of the 3-Gram-Tagger, the next Backup-Tagger, the Bigram-Tagger, is applied, and so on.
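
The backup chain can be sketched for the shortest case: a Bigram-Tagger that falls back to a Unigram-Tagger and finally to a default tag. The names `train_tables`, `tag_with_backoff`, `most_frequent` and `START`, the default tag NOUN, and the toy corpus are all assumptions made for this illustration.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed marker for "no previous tag" at the sentence start

def most_frequent(counts):
    """Reduce {context: Counter of tags} to {context: most frequent tag}."""
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def train_tables(tagged_sents):
    """Return (unigram table, bigram table) learned from tagged sentences."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return most_frequent(uni), most_frequent(bi)

def tag_with_backoff(words, uni_table, bi_table, default="NOUN"):
    """Try the bigram table first, then the unigram table, then a default tag."""
    tagged, prev_tag = [], START
    for word in words:
        tag = bi_table.get((prev_tag, word))
        if tag is None:  # unseen combination: back off to the unigram table
            tag = uni_table.get(word, default)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Toy training corpus (invented for illustration).
train = [
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("love", "VERB"), ("music", "NOUN")],
]
uni_table, bi_table = train_tables(train)
print(tag_with_backoff(["we", "love", "the", "run"], uni_table, bi_table))
print(tag_with_backoff(["dogs", "run"], uni_table, bi_table))
```

In the first sentence the appears after a verb, a combination missing from the bigram table, so its tag comes from the unigram table; run is then tagged in context as a noun. In NLTK the same chaining is expressed by passing a tagger as the `backoff` argument of another tagger, e.g. a `BigramTagger` that backs off to a `UnigramTagger`, which in turn backs off to a `DefaultTagger`.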

In the next section the application of all taggers described above is demonstrated. Moreover, a previous notebook has already shown how TextBlob can be applied for PoS-tagging.
