Tuesday, June 19, 2012

Stemming or Lemmatizing Words


From wiki (http://en.wikipedia.org/wiki/Stemming): Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. For example, the words "stemmer", "stemming", "stemmed" are based on the word "stem".
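To make the idea concrete, here is a deliberately naive suffix stripper. It is only a sketch: the suffix list and the function name naive_stem are made up for illustration, and it is not the Porter or Snowball algorithm, both of which apply many more rules.

```python
# Naive illustration only: real stemmers (Porter, Snowball) apply many more rules.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical, hand-picked suffix list


def naive_stem(word):
    """Strip one known suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    # "stemm" -> "stem": collapse a doubled final consonant.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


for w in ["stemmer", "stemming", "stemmed", "stem"]:
    print(w, "->", naive_stem(w))
```

All four words reduce to "stem" here, but a rule set this crude breaks quickly on other input (it would turn "ball" into "bal"), which is why the published algorithms are worth using.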

Several algorithms exist for stemming words, e.g. the Porter Stemming Algorithm (http://tartarus.org/~martin/PorterStemmer/) and the Snowball stemming algorithms. Here we use a lemmatizer instead, which maps a word to its dictionary form (lemma) rather than stripping suffixes: the Python NLTK package's WordNetLemmatizer, which uses WordNet's built-in morphy function.

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'

>>> lmtzr.lemmatize('feet')
'foot'

>>> lmtzr.lemmatize('stemmed')
'stemmed'

>>> lmtzr.lemmatize('stemmed','v')
'stem'

The WordNet lemmatizer recognizes only four parts of speech (ADJ, ADV, NOUN, and VERB), identified by one-letter codes: {'a': 'adj', 'n': 'noun', 'r': 'adv', 'v': 'verb'}

It treats every word passed to it as a noun unless told otherwise, which is why 'stemmed' came back unchanged above until we passed 'v'. We can get a word's part of speech from NLTK's part-of-speech tagger (nltk.pos_tag), which labels words with Treebank tags and so has its own nomenclature. In the Treebank tagset the noun tags all start with NN, the verb tags with VB, the adjective tags with JJ, and the adverb tags with RB.

So we convert from one set of labels, the Treebank tags, to the other, the WordNet codes:

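import nltk

# Map the first two letters of a Treebank tag to a WordNet part-of-speech code.
wordnet_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}

tokens = nltk.word_tokenize("stemmer stemming stemmed")
tagged = nltk.pos_tag(tokens)
for word, tag in tagged:
    # Show the word with the Treebank tag assigned to it.
    print(word, tag)
    # Tags with no WordNet equivalent fall back to the lemmatizer's default (noun).
    pos = wordnet_tag.get(tag[:2], 'n')
    print(word, ":", lmtzr.lemmatize(word, pos))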
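
The tag conversion can also be pulled out into a small helper so that it can be checked without running the tagger. This is only a sketch: the name treebank_to_wordnet and its default argument are made up here, and the mapping is the same four-entry dictionary used above.

```python
# Same mapping as in the post: first two letters of a Treebank tag -> WordNet code.
WORDNET_TAG = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}


def treebank_to_wordnet(tag, default='n'):
    """Return the WordNet POS code for a Treebank tag (hypothetical helper)."""
    # Only the two-letter prefix matters, so NNS, NNP, VBD, JJR, ... all resolve.
    return WORDNET_TAG.get(tag[:2], default)


for tag in ['NNS', 'VBD', 'JJR', 'RB', 'DT']:
    print(tag, '->', treebank_to_wordnet(tag))
```

Tags outside the four groups, such as DT for determiners, come back as the default 'n', which matches how the lemmatizer behaves when no part of speech is given.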
