Tuesday, June 19, 2012

Keyword Extraction

Keyword extraction is a difficult problem in natural language processing. There are good discussions of the topic on Stack Overflow and on Wikipedia.


In this post I will talk about two ways to extract keywords from a large chunk of text.


The first script (text2term_topia.py) is based on a Python package available at http://pypi.python.org/pypi/topia.termextract/ . Please install the package before you run the code.

# text2term_topia.py
# coding: utf-8

from topia.termextract import extract

"""
install the package at http://pypi.python.org/pypi/topia.termextract/ for this to work
"""

def extract_keyword(text):
    extractor = extract.TermExtractor()
    try:
        taggedTerms = sorted(extractor(text))
    except Exception:
        taggedTerms = []
    # keep only the term string from each tagged tuple
    terms = []
    for tterms in taggedTerms:
        terms.append(tterms[0])
    return terms

def main():
    text = 'University of California, Davis (also referred to as UCD, UC Davis, or Davis) is a public teaching and research university established in 1905 and located in Davis, California, USA. Spanning over 5,300 acres (2,100 ha), the campus is the largest within the University of California system and third largest by enrollment.[6] The Carnegie Foundation classifies UC Davis as a comprehensive doctoral research university with a medical program, veterinary program, and very high research activity.'

    fetched_keywords = extract_keyword(text)
    print fetched_keywords

if __name__ == "__main__":
    main()


The second script (text2term_yahoo.py) uses Yahoo!'s Term Extraction service. It requires an app_id from Yahoo! to run successfully. For more, read here: http://developer.yahoo.com/search/content/V1/termExtraction.html




# text2term_yahoo.py


import simplejson, urllib, sys

APP_ID = '' #INSERT YOUR APP_ID HERE
EXTRACT_BASE = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'

class YahooSearchError(Exception):
    pass

def extract(context,query='',**kwargs):
    kwargs.update({
        'appid': APP_ID,
        'context': context,
        'output': 'json'
    })
    url = EXTRACT_BASE + '?' + urllib.urlencode(kwargs)
    result = simplejson.load(urllib.urlopen(url))
    if 'Error' in result:
        # An error occurred; raise an exception
        raise YahooSearchError, result['Error']
    return result['ResultSet']

def extract_keyword(text):
    try:
        info = extract(text)
        if 'Result' in info:
            return info['Result']
        else:
            return []
    except YahooSearchError, e:
        print e, "\nAn API error occurred."
        sys.exit()
    except IOError:
        print "A network IO error occurred."
        sys.exit()

def main():
    text = 'University of California, Davis (also referred to as UCD, UC Davis, or Davis) is a public teaching and research university established in 1905 and located in Davis, California, USA. Spanning over 5,300 acres (2,100 ha), the campus is the largest within the University of California system and third largest by enrollment.[6] The Carnegie Foundation classifies UC Davis as a comprehensive doctoral research university with a medical program, veterinary program, and very high research activity.'
    print extract_keyword(text)

if __name__ == "__main__":
    main()

The Yahoo!-based code is limited to 3,000 queries per 24 hours, which is a disadvantage, but it returns a small number of high-quality (i.e. less noisy) keywords.

The topia-based code, on the other hand, runs locally on the machine and can handle as many queries as we want. Unfortunately, the number of keywords it extracts is large and often noisy, and the output may need further processing to separate out symbols and numbers. Despite this disadvantage, and given the constraint of only 3,000 queries per day with the Yahoo! engine, I mostly prefer the topia term extractor for my work.
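For instance, a small post-processing pass like the one below (a rough sketch of my own, not part of the topia package; it simply keeps terms containing at least one letter) can drop extracted terms that are purely numbers or symbols:

import re

def clean_terms(terms):
    # keep only terms that contain at least one alphabetic character
    return [term.strip() for term in terms if re.search('[a-zA-Z]', term)]

>>> clean_terms(['UC Davis', '5,300', 'research university', '[6]'])
['UC Davis', 'research university']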


Stemming or Lemmatizing Words


From wiki (http://en.wikipedia.org/wiki/Stemming): Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. For example, the words "stemmer", "stemming", "stemmed" are based on the word "stem".

Multiple algorithms exist to stem words, e.g. the Porter stemming algorithm (http://tartarus.org/~martin/PorterStemmer/), the Snowball stemming algorithms, etc. (a short Porter stemmer example appears at the end of this section for comparison). We are going to use the Python NLTK package's implementation of a lemmatizer, which uses WordNet's built-in morphy function.

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'

>>> lmtzr.lemmatize('feet')
'foot'

>>> lmtzr.lemmatize('stemmed')
'stemmed'

>>> lmtzr.lemmatize('stemmed','v')
'stem'
The wordnet lemmatizer only knows four parts of speech (ADJ, ADV, NOUN, and VERB): {'a': 'adj', 'n': 'noun', 'r': 'adv', 'v': 'verb'}
The WordNet lemmatizer treats the part of speech of a word as noun unless specifically told otherwise. We can obtain a word's part of speech from NLTK's treebank-style POS tagger, which uses its own nomenclature for parts of speech. For example, the noun tags in the treebank tagset all start with NN, the verb tags all start with VB, the adjective tags start with JJ, and the adverb tags start with RB.

So, we will convert from one set of labels, i.e. the treebank tags, to the other, i.e. the WordNet tags:

import nltk

# map the first two letters of a treebank tag to the corresponding wordnet pos tag
wordnet_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}

tokens = nltk.word_tokenize("stemmer stemming stemmed")
tagged = nltk.pos_tag(tokens)
for t in tagged:
    print t[0], t[1]  # the word and its treebank tag
    try:
        print t[0], ":", lmtzr.lemmatize(t[0], wordnet_tag[t[1][:2]])
    except KeyError:
        # tag prefix not in our map; fall back to the default (noun)
        print t[0], ":", lmtzr.lemmatize(t[0])

Removing stop words


Frequently occurring words, or words that don't add value to the overall goal of the processing, need to be removed from the text. The definition of stop words is highly dependent on the context. Please look up the wiki page on stop words to learn more about the concept: http://en.wikipedia.org/wiki/Stop_words

In this post, we are going to define stop words as used by MySQL version 5.6 (http://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html)

We read the MySQL stop words from a text file (see the P.S. at the end of this section for how to create it) and remove them from our text.

stop_word_file = open("mysql_5.6_stopwords.txt")
stop_word_set = set(stop_word_file.read().split("\n"))
stop_word_file.close()

def remove_stopwords(wordlist):
     return set(wordlist).difference(stop_word_set)

>>> line = "Football refers to a number of sports that involve kicking a ball with the foot to score a goal"

>>> print remove_stopwords(line.split(" "))
set(['a', 'ball', 'goal', 'Football', 'number', 'sports', 'involve', 'kicking', 'score', 'foot', 'refers'])


In this code, the function is remove_stopwords(wordlist). The input parameter is a list of words, and the function returns a set of words with the words found in the stop word list removed. Due to the properties of the set data structure, the words in the result are unordered.

If the exact input string with the stop words removed is required, we can modify the above code as follows:

stop_word_file = open("mysql_5.6_stopwords.txt")
stop_word_set = set(stop_word_file.read().split("\n"))
stop_word_file.close()

def remove_stopwords(line):
     output = []
     for l in line.split(" "):
          if l not in stop_word_set:
               output.append(l)
     return " ".join(output)

>>> line = "Football refers to a number of sports that involve kicking a ball with the foot to score a goal"

>>> print remove_stopwords(line)
Football refers a number sports involve kicking a ball foot score a goal

The first function will be significantly faster than the second due to its use of set operations. Select the one that suits your requirements.
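If you want to verify the speed difference, a rough sketch with the timeit module looks like the following (I assume here that the two versions have been renamed remove_stopwords_set and remove_stopwords_line so that both can live in one file; the actual numbers depend on your machine and the length of the input):

import timeit

line = "Football refers to a number of sports that involve kicking a ball with the foot to score a goal"

# time 10,000 calls of each version
print timeit.timeit(lambda: remove_stopwords_set(line.split(" ")), number=10000)
print timeit.timeit(lambda: remove_stopwords_line(line), number=10000)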




P.S.: Blogger doesn't support uploading files (another reason why WordPress is better!). The best option is to copy-paste the words from http://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html into a text file named mysql_5.6_stopwords.txt, making sure one word appears on each line. Without correct formatting, the above code will break.
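If the copy-pasted list ends up with several words on one line, a small one-off snippet like this (the raw file name is just a placeholder) can rewrite it in the expected format:

raw = open("mysql_stopwords_raw.txt").read()   # the pasted, unformatted list
words = sorted(set(raw.split()))               # split on any whitespace and de-duplicate
out = open("mysql_5.6_stopwords.txt", "w")
out.write("\n".join(words))
out.close()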



Removing punctuations


Python's string module has a built-in constant containing all the punctuation characters:

>>> from string import punctuation
>>> print punctuation
!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

Code to remove punctuation(s) from a given string:

from string import punctuation

def removePunctuation(text, replacement='', exclude=''):
    # replace every punctuation character, except those listed in exclude
    for p in set(punctuation).difference(set(exclude)):
        text = text.replace(p, replacement)
    return text

>>> removePunctuation("Hello World!!",' ')
"Hello World  "

>>> removePunctuation("Hello World!!")
"Hello World"

>>> removePunctuation("Hello-World!!",'  ','!')
"Hello World!!"

The replacement parameter replaces the punctuation characters with the given string; the default is an empty string.

The exclude parameter provides a way to retain specific punctuation marks. For example, when cleaning a paragraph of text, we might want to retain the full stop (.). The exclude parameter takes a string containing all the punctuation characters that should be kept.
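For example, stripping everything except the full stops from a sentence would look like this (the output shown is what I expect from the function above):

>>> removePunctuation("Dr. Smith's lab, room #42.", '', '.')
"Dr. Smiths lab room 42."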

Removing multiple whitespace

import re
s = "The   fox jumped   over    the log."
s = re.sub(r"\s{2,}", " ", s)


This can also be done using lists, which for some reason the Python community really favors over all other methods:


s = "The  fox  jumped  over   the log."
s = filter(None,s.split())
s = " ".join(s)

Sunday, June 3, 2012

Change font matplotlib

How to change fonts in matplotlib code:

import matplotlib

# customization using the formats given at: http://matplotlib.sourceforge.net/users/customizing.html

font = {'family': 'Trebuchet MS',  # options: 'serif' (e.g. Times), 'sans-serif' (e.g. Helvetica), 'cursive' (e.g. Zapf-Chancery), 'fantasy' (e.g. Western), 'monospace' (e.g. Courier)
        'style': 'normal',         # options: normal (or roman), italic or oblique
        'weight': 'normal',        # options: normal, bold, bolder, lighter, 100, 200, 300, ... 900
        'size': 10}
matplotlib.rc('font', **font)

axes_font = {'labelsize': 16,        # fontsize of the x and y labels (default: medium)
             'titlesize': 'medium'}  # fontsize of the axes title
matplotlib.rc('axes', **axes_font)
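To check that the settings took effect, a quick test plot after the rc calls above (the data, labels, and file name are arbitrary):

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('x values')   # drawn with the axes labelsize set above
plt.ylabel('y values')
plt.title('font customization test')
plt.savefig('font_test.png')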

Yep, that's all !!