NLTK Preprocessor

NLTK(*, strip_multiple_whitespaces=True, remove_punctuation=False, stopwords_removal=False, url_tagging=False, lemmatization=False, stemming=False, pos_tag=False, lang='english')

Bases: NLP

Interface to the NLTK library for natural language processing features.

Examples:

  • Strip multiple whitespaces from running text
>>> nltk_obj = NLTK(strip_multiple_whitespaces=True)
>>> nltk_obj.process('This   has  a lot  of   spaces')
['This', 'has', 'a', 'lot', 'of', 'spaces']
  • Remove punctuation from running text
>>> nltk_obj = NLTK(remove_punctuation=True)
>>> nltk_obj.process("Hello there. How are you? I'm fine, thanks.")
['Hello', 'there', 'How', 'are', 'you', 'I', 'm', 'fine', 'thanks']
  • Remove stopwords from running text
>>> nltk_obj = NLTK(stopwords_removal=True)
>>> nltk_obj.process("The striped bats are hanging on their feet for the best")
['striped', 'bats', 'hanging', 'feet', 'best']
  • Replace URL with a normalized token <URL>
>>> nltk_obj = NLTK(url_tagging=True)
>>> nltk_obj.process("This is facebook http://facebook.com and github https://github.com")
['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
  • Perform lemmatization on running text
>>> nltk_obj = NLTK(lemmatization=True)
>>> nltk_obj.process("The striped bats are hanging on their feet for best")
['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
  • Perform stemming on running text
>>> nltk_obj = NLTK(stemming=True)
>>> nltk_obj.process("These unbelievable abnormous objects")
['these', 'unbeliev', 'abnorm', 'object']
  • Label each token in the running text with its POS tag
>>> nltk_obj = NLTK(pos_tag=True)
>>> nltk_obj.process("Facebook was fined by Hewlett Packard for spending 100€")
['Facebook_NNP', 'was_VBD', 'fined_VBN', 'by_IN', 'Hewlett_NNP', 'Packard_NNP', 'for_IN', 'spending_VBG', '100€_CD']
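The whitespace-stripping and URL-tagging steps shown above can be sketched with plain regular expressions. This is a minimal illustration of the technique, not the library's actual implementation; the whitespace-based tokenization and the URL pattern below are assumptions inferred from the example outputs:

```python
import re

# Assumed URL pattern: anything starting with http:// or https://
URL_PATTERN = re.compile(r'https?://\S+')

def strip_multiple_whitespaces(text: str) -> str:
    # Collapse any run of whitespace characters into a single space
    return re.sub(r'\s+', ' ', text).strip()

def tag_urls(tokens: list) -> list:
    # Replace any token that looks like a URL with the normalized <URL> token
    return ['<URL>' if URL_PATTERN.fullmatch(tok) else tok for tok in tokens]

print(strip_multiple_whitespaces('This   has  a lot  of   spaces').split())
# ['This', 'has', 'a', 'lot', 'of', 'spaces']

tokens = 'This is facebook http://facebook.com and github https://github.com'.split()
print(tag_urls(tokens))
# ['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
```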
PARAMETER DESCRIPTION
strip_multiple_whitespaces

If set to True, any run of multiple whitespaces will be reduced to a single whitespace

TYPE: bool DEFAULT: True

remove_punctuation

If set to True, all punctuation from the running text will be removed

TYPE: bool DEFAULT: False

stopwords_removal

If set to True, all stopwords from the running text will be removed

TYPE: bool DEFAULT: False

url_tagging

If set to True, all URLs in the running text will be replaced with the <URL> token

TYPE: bool DEFAULT: False

lemmatization

If set to True, each token in the running text will be brought to its lemma

TYPE: bool DEFAULT: False

stemming

If set to True, each token in the running text will be brought to its stem

TYPE: bool DEFAULT: False

pos_tag

If set to True, each token in the running text will be labeled with its POS tag in the form token_TAG

TYPE: bool DEFAULT: False

lang

Language of the running text

TYPE: str DEFAULT: 'english'
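When several flags are enabled, the corresponding steps apply in sequence to the same text. A hedged sketch of how punctuation removal and stopword removal might compose (the tiny hard-coded stopword set below is an illustrative stand-in for NLTK's full `stopwords` corpus, not the library's own code):

```python
import string

# Illustrative stand-in for NLTK's English stopword list
STOPWORDS = {'the', 'are', 'on', 'their', 'for', 'is', 'a', 'of'}

def remove_punctuation(text: str) -> str:
    # Drop every ASCII punctuation character before tokenizing
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_stopwords(tokens: list) -> list:
    # Filter out tokens whose lowercase form is a stopword
    return [t for t in tokens if t.lower() not in STOPWORDS]

text = "The striped bats are hanging on their feet for the best"
print(remove_stopwords(remove_punctuation(text).split()))
# ['striped', 'bats', 'hanging', 'feet', 'best']
```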

Source code in clayrs/content_analyzer/information_processor/nltk_processor.py
def __init__(self, *,
             strip_multiple_whitespaces: bool = True,
             remove_punctuation: bool = False,
             stopwords_removal: bool = False,
             url_tagging: bool = False,
             lemmatization: bool = False,
             stemming: bool = False,
             pos_tag: bool = False,
             lang: str = 'english'):

    if not NLTK._corpus_downloaded:
        self.__download_corpus()
        NLTK._corpus_downloaded = True

    self.stopwords_removal = stopwords_removal

    self.stemming = stemming
    self.stemmer = SnowballStemmer(language=lang)

    self.lemmatization = lemmatization
    self.lemmatizer = WordNetLemmatizer()

    self.strip_multiple_whitespaces = strip_multiple_whitespaces
    self.url_tagging = url_tagging
    self.remove_punctuation = remove_punctuation
    self.pos_tag = pos_tag
    self.__full_lang_code = lang