Tuesday, June 19, 2012

Removing stop words


Words that occur frequently, or that add no value to the overall goal of the processing, often need to be removed from a text. What counts as a stop word depends heavily on the context. For more on the concept, see the Wikipedia page on stop words: http://en.wikipedia.org/wiki/Stop_words

In this post, we define stop words as the list used by MySQL version 5.6 (http://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html).

We will load the MySQL stop words from a text file (see the P.S. for how to create it) and remove them from our text.

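# Load the stop words, one per line, into a set for fast membership tests.
with open("mysql_5.6_stopwords.txt") as stopword_file:
    stop_word_set = set(stopword_file.read().splitlines())

def remove_stopwords(wordlist):
    # Set difference drops stop words, duplicates, and the original order.
    return set(wordlist).difference(stop_word_set)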

>>> line = "Football refers to a number of sports that involve kicking a ball with the foot to score a goal"

>>> print remove_stopwords(line.split(" "))
set(['a', 'ball', 'goal', 'Football', 'number', 'sports', 'involve', 'kicking', 'score', 'foot', 'refers'])


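In this code, the function is remove_stopwords(wordlist). Its input is a list of words, and it returns a set containing the words that are not in the stop word list. Because a set is unordered and holds each element only once, the result loses both the sentence order and any repeated words: 'a' appears only once in the output above although it occurs three times in the sentence. Matching is also case-sensitive.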
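
To see what the set-based version does to duplicates and ordering without needing the stop word file, here is a minimal sketch. The three-word demo_stop_words set is a stand-in assumed for illustration, not the MySQL list, and the code is written for Python 3.

```python
# Hypothetical stand-in for the MySQL list, so the sketch runs without the file.
demo_stop_words = {"to", "of", "the"}

def remove_stopwords(wordlist):
    # Same set difference as above, against the stand-in set.
    return set(wordlist).difference(demo_stop_words)

words = "the cat sat next to the other cat".split(" ")
result = remove_stopwords(words)

# "cat" appeared twice in the input but only once in the result,
# and iteration order over the set is not the sentence order.
# sorted() gives a stable, alphabetical view for display.
print(sorted(result))  # ['cat', 'next', 'other', 'sat']
```

If a stable order matters only for display, sorting the returned set as shown is usually enough; if the original sentence order matters, the set-based approach is the wrong tool.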

If the original input string with the stop words removed is required, preserving word order and duplicates, we can modify the above code as follows:

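# Load the stop words, one per line, into a set for fast membership tests.
with open("mysql_5.6_stopwords.txt") as stopword_file:
    stop_word_set = set(stopword_file.read().splitlines())

def remove_stopwords(line):
    # Keep every word that is not a stop word, preserving order and duplicates.
    output = []
    for word in line.split(" "):
        if word not in stop_word_set:
            output.append(word)
    return " ".join(output)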

>>> line = "Football refers to a number of sports that involve kicking a ball with the foot to score a goal"

>>> print remove_stopwords(line)
Football refers a number sports involve kicking a ball foot score a goal

The first function will usually be faster than the second, since the set difference runs as a single built-in operation rather than a Python-level loop, but it discards word order and duplicates. Select the one that suits your requirement.
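
How large the difference is depends on the input and the Python version, so it is worth measuring rather than assuming. Below is a rough sketch using the standard timeit module. The small stand-in stop word set and the function names remove_as_set and remove_as_string are assumptions made for illustration, and no timings are shown because they vary by machine.

```python
import timeit

# Hypothetical stand-in for the MySQL list, so the sketch runs without the file.
demo_stop_words = {"to", "of", "that", "with", "the"}

line = ("Football refers to a number of sports that involve "
        "kicking a ball with the foot to score a goal")

def remove_as_set(wordlist):
    # Set difference: unordered, duplicates collapsed.
    return set(wordlist).difference(demo_stop_words)

def remove_as_string(text):
    # Word-by-word filter: keeps order and duplicates.
    return " ".join(w for w in text.split(" ") if w not in demo_stop_words)

words = line.split(" ")
set_seconds = timeit.timeit(lambda: remove_as_set(words), number=10000)
string_seconds = timeit.timeit(lambda: remove_as_string(line), number=10000)
print("set version:    %.4f s" % set_seconds)
print("string version: %.4f s" % string_seconds)
```

Note that the string version also pays for splitting and joining, so this compares the two functions as written in the post rather than the lookup cost alone.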




P.S.: Blogger doesn't support uploading files. Another reason why WordPress is better! Best option: copy and paste the words from http://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html into a text file named mysql_5.6_stopwords.txt, making sure that exactly one word appears on each line. If the file is not formatted this way, the above code will not match the stop words correctly.
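
One way to get that format from copy-pasted text is a few lines of Python. This is only a sketch under assumptions: one_word_per_line and save_stop_words are helper names made up here, and the pasted string is a short sample standing in for the full list.

```python
def one_word_per_line(raw_text):
    # split() with no argument breaks on any run of spaces, tabs, or newlines,
    # so the column layout of the pasted text does not matter.
    return "\n".join(raw_text.split())

def save_stop_words(raw_text, path):
    # Write the cleaned list to the given path (hypothetical helper).
    with open(path, "w") as out_file:
        out_file.write(one_word_per_line(raw_text))

# A short sample of pasted text; the real input would be the full list.
pasted = "a's      able    about\nabove   according  accordingly"
formatted = one_word_per_line(pasted)
print(formatted)
```

Calling save_stop_words(pasted_text, "mysql_5.6_stopwords.txt") with the full pasted list would then produce a file that the loading code above can read.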


