Tuesday, June 19, 2012

Keyword Extraction

Keyword extraction is a very difficult problem in natural language processing. There are good discussions of the topic on Stack Overflow and on Wikipedia.


In this post I will talk about two ways to extract keywords from a large chunk of text.


The first script (text2term_topia.py) is based on a Python package available at http://pypi.python.org/pypi/topia.termextract/ . Please install this package before you run the code.

#text2term_topia.py

#coding: utf-8

from topia.termextract import extract

"""
install the package at http://pypi.python.org/pypi/topia.termextract/ for this to work
"""

def extract_keyword(text):
    # The extractor returns tuples whose first element is the extracted term.
    extractor = extract.TermExtractor()
    try:
        taggedTerms = sorted(extractor(text))
    except Exception:
        taggedTerms = []
    # Keep only the term strings and drop the counts.
    terms = []
    for tterms in taggedTerms:
        terms.append(tterms[0])
    return terms

def main():
    text = 'University of California, Davis (also referred to as UCD, UC Davis, or Davis) is a public teaching and research university established in 1905 and located in Davis, California, USA. Spanning over 5,300 acres (2,100 ha), the campus is the largest within the University of California system and third largest by enrollment.[6] The Carnegie Foundation classifies UC Davis as a comprehensive doctoral research university with a medical program, veterinary program, and very high research activity.'

    fetched_keywords = extract_keyword(text)
    print fetched_keywords

if __name__ == "__main__":
    main()


The second script (text2term_yahoo.py) uses Yahoo!'s Term Extraction service to extract keywords. This piece of code requires an app_id from Yahoo! to run successfully. For more, read here: http://developer.yahoo.com/search/content/V1/termExtraction.html




# text2term_yahoo.py


import simplejson, urllib, sys

APP_ID = '' #INSERT YOUR APP_ID HERE
EXTRACT_BASE = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'

class YahooSearchError(Exception):
    pass

def extract(context, query='', **kwargs):
    # 'context' is the text to analyze; 'query' is an optional hint for the service.
    kwargs.update({
        'appid': APP_ID,
        'context': context,
        'output': 'json'
    })
    if query:
        kwargs['query'] = query
    url = EXTRACT_BASE + '?' + urllib.urlencode(kwargs)
    result = simplejson.load(urllib.urlopen(url))
    if 'Error' in result:
        # An error occurred; raise an exception
        raise YahooSearchError(result['Error'])
    return result['ResultSet']

def extract_keyword(text):
    try:
        info = extract(text)
        if 'Result' in info:
            return info['Result']
        else:
            return []
    except YahooSearchError, e:
        print e, "\nAn API error occurred."
        sys.exit()
    except IOError:
        print "A network IO error occurred."
        sys.exit()

def main():
    text = 'University of California, Davis (also referred to as UCD, UC Davis, or Davis) is a public teaching and research university established in 1905 and located in Davis, California, USA. Spanning over 5,300 acres (2,100 ha), the campus is the largest within the University of California system and third largest by enrollment.[6] The Carnegie Foundation classifies UC Davis as a comprehensive doctoral research university with a medical program, veterinary program, and very high research activity.'
    print extract_keyword(text)

if __name__ == "__main__":
    main()

The Yahoo!-based code can run only 3,000 queries every 24 hours, which is a disadvantage, but it returns a small, very high-quality (i.e., less noisy) set of keywords.
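If you do use the Yahoo! engine, a simple way to stay under that quota is to keep a local counter of the day's requests. The following is only a minimal sketch of that idea; the file name, reset logic, and limit constant are my own choices and not part of Yahoo!'s API.

# quota_guard.py
# A minimal bookkeeping sketch (my own, not part of the Yahoo! API) for
# staying under a daily request limit.

import json, os, time

QUOTA_FILE = 'yahoo_quota.json'   # hypothetical local state file
DAILY_LIMIT = 3000

def allow_request():
    """Return True (and record the request) if we are still under today's limit."""
    today = time.strftime('%Y-%m-%d')
    state = {'day': today, 'count': 0}
    if os.path.exists(QUOTA_FILE):
        state = json.load(open(QUOTA_FILE))
        if state.get('day') != today:
            state = {'day': today, 'count': 0}   # a new day, reset the counter
    if state['count'] >= DAILY_LIMIT:
        return False
    state['count'] += 1
    json.dump(state, open(QUOTA_FILE, 'w'))
    return True

Calling allow_request() before each call to extract_keyword() lets you skip (or delay) queries once the limit is reached.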

The topia-based code, on the other hand, runs locally on the machine and can handle as many queries as we want. Unfortunately, the number of keywords it extracts is large and often noisy, and the output may need further processing to separate out symbols and numbers. Despite this disadvantage, and given the constraint of only 3,000 queries per day with the Yahoo! engine, I mostly prefer the topia term extractor for my work.
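As a rough illustration of that post-processing step, here is a minimal sketch that drops terms consisting mostly of digits or symbols. The regular expression and length threshold are my own heuristics, not part of topia.termextract.

# filter_terms.py
# A minimal post-processing sketch (my own heuristic, not part of
# topia.termextract): keep only terms that look like ordinary words.

import re

WORDLIKE = re.compile(r"^[A-Za-z][A-Za-z .'-]*$")   # letters, spaces, a few separators

def clean_terms(terms, min_length=3):
    """Drop very short terms and terms containing digits or other symbols."""
    cleaned = []
    for term in terms:
        term = term.strip()
        if len(term) >= min_length and WORDLIKE.match(term):
            cleaned.append(term)
    return cleaned

if __name__ == "__main__":
    # e.g. terms like '5,300 acres' or '[6]' are filtered out
    sample = ['UC Davis', '5,300 acres', '[6]', 'research university', '1905']
    print clean_terms(sample)

The output of extract_keyword() can be passed through clean_terms() before any further use.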

