In this post I will talk about two ways to extract keywords from a large chunk of text.
The first script (text2term_topia.py) is based on the topia.termextract Python package, available at http://pypi.python.org/pypi/topia.termextract/ . Please install this package before you run the script.
#text2term_topia.py
#coding: utf-8
from topia.termextract import extract
"""
install the package at http://pypi.python.org/pypi/topia.termextract/ for this to work
"""
def extract_keyword(text):
    extractor = extract.TermExtractor()
    try:
        # Each tagged term is a tuple whose first element is the term itself
        taggedTerms = sorted(extractor(text))
    except Exception:
        # Fall back to an empty result if the extraction fails
        taggedTerms = []
    # Keep only the term strings
    terms = []
    for tterms in taggedTerms:
        terms.append(tterms[0])
    return terms

def main():
    text = 'University of California, Davis (also referred to as UCD, UC Davis, or Davis) is a public teaching and research university established in 1905 and located in Davis, California, USA. Spanning over 5,300 acres (2,100 ha), the campus is the largest within the University of California system and third largest by enrollment.[6] The Carnegie Foundation classifies UC Davis as a comprehensive doctoral research university with a medical program, veterinary program, and very high research activity.'
    fetched_keywords = extract_keyword(text)
    print(fetched_keywords)

if __name__ == "__main__":
    main()
The second script (text2term_yahoo.py) sends the text to the Yahoo! term extraction web service. Insert your application ID in APP_ID before you run it.
# text2term_yahoo.py
import simplejson, urllib, sys
APP_ID = '' #INSERT YOUR APP_ID HERE
EXTRACT_BASE = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'
class YahooSearchError(Exception):
    pass

def extract(context, query='', **kwargs):
    # Add the required request parameters to any extra ones passed in
    kwargs.update({
        'appid': APP_ID,
        'context': context,
        'output': 'json'
    })
    url = EXTRACT_BASE + '?' + urllib.urlencode(kwargs)
    result = simplejson.load(urllib.urlopen(url))
    if 'Error' in result:
        # An error occurred; raise an exception
        raise YahooSearchError(result['Error'])
    return result['ResultSet']

def extract_keyword(text):
    try:
        info = extract(text)
        if 'Result' in info:
            return info['Result']
        else:
            return []
    except YahooSearchError as e:
        print("%s\nAn API error occurred." % e)
        sys.exit()
    except IOError:
        print("A network IO error occurred.")
        sys.exit()

def main():
    text = 'University of California, Davis (also referred to as UCD, UC Davis, or Davis) is a public teaching and research university established in 1905 and located in Davis, California, USA. Spanning over 5,300 acres (2,100 ha), the campus is the largest within the University of California system and third largest by enrollment.[6] The Carnegie Foundation classifies UC Davis as a comprehensive doctoral research university with a medical program, veterinary program, and very high research activity.'
    print(extract_keyword(text))

if __name__ == "__main__":
    main()
The Yahoo!-based code can run only 3000 queries every 24 hours, which is a disadvantage, but it returns a small number of very high-quality (i.e. less noisy) keywords.
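One way to live with the daily limit is to count requests locally and stop before the quota runs out. The sketch below is only an illustration and is not part of text2term_yahoo.py: QueryBudget is a hypothetical class, and it assumes the 3000-query limit applies to a rolling 24-hour window, which may not match how the service actually counts.

```python
import time
from collections import deque


class QueryBudget(object):
    """Hypothetical helper: track the calls made in the last `window` seconds."""

    def __init__(self, limit=3000, window=24 * 60 * 60, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the class can be tested
        self.calls = deque()    # timestamps of recent calls, oldest first

    def remaining(self):
        # Forget calls that have fallen out of the window
        cutoff = self.clock() - self.window
        while self.calls and self.calls[0] <= cutoff:
            self.calls.popleft()
        return self.limit - len(self.calls)

    def spend(self):
        # Record one call; return False if the budget is already used up
        if self.remaining() <= 0:
            return False
        self.calls.append(self.clock())
        return True
```

A caller would check budget.spend() before each request and postpone or skip the request when it returns False.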
The topia-based code, however, runs locally on the machine and can handle as many queries as we want. Unfortunately, it extracts a large number of keywords that are often noisy and may need further processing to separate out symbols and numbers. Despite this disadvantage, given the limit of only 3000 queries per day with the Yahoo! engine, I mostly prefer to use the topia term extractor for my work.
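The further processing mentioned above can be fairly light. The following is a minimal sketch, not part of either script: clean_keywords is a hypothetical helper that trims stray punctuation, drops terms that contain no letters (pure numbers or symbols), and removes case-insensitive duplicates. These rules are an assumption and will probably need tuning for real data.

```python
import string


def clean_keywords(terms, min_length=2):
    """Hypothetical helper: drop numeric/symbol-only terms and duplicates."""
    cleaned = []
    seen = set()
    for term in terms:
        # Trim punctuation and whitespace from both ends of the term
        term = term.strip(string.punctuation + string.whitespace)
        # Skip terms that are too short or contain no letters at all
        if len(term) < min_length or not any(ch.isalpha() for ch in term):
            continue
        # Skip terms already seen, ignoring case
        key = term.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    return cleaned


sample = ['1905', '[6]', 'UC Davis', 'uc davis', '%', 'Carnegie Foundation']
print(clean_keywords(sample))
# prints ['UC Davis', 'Carnegie Foundation']
```

The list returned by extract_keyword in text2term_topia.py can be passed straight to this helper, since it is a plain list of term strings.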