NLTK Preprocessor

NLTK(*, strip_multiple_whitespaces=True, remove_punctuation=False, stopwords_removal=False, url_tagging=False, lemmatization=False, stemming=False, pos_tag=False, lang='english')

Bases: NLP

Interface to the NLTK library for natural language processing features.

Examples:

  • Strip multiple whitespaces from running text
>>> nltk_obj = NLTK(strip_multiple_whitespaces=True)
>>> nltk_obj.process('This   has  a lot  of   spaces')
['This', 'has', 'a', 'lot', 'of', 'spaces']
  • Remove punctuation from running text
>>> nltk_obj = NLTK(remove_punctuation=True)
>>> nltk_obj.process("Hello there. How are you? I'm fine, thanks.")
['Hello', 'there', 'How', 'are', 'you', 'I', 'm', 'fine', 'thanks']
  • Remove stopwords from running text
>>> nltk_obj = NLTK(stopwords_removal=True)
>>> nltk_obj.process("The striped bats are hanging on their feet for the best")
['striped', 'bats', 'hanging', 'feet', 'best']
  • Replace URL with a normalized token <URL>
>>> nltk_obj = NLTK(url_tagging=True)
>>> nltk_obj.process("This is facebook http://facebook.com and github https://github.com")
['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
  • Perform lemmatization on running text
>>> nltk_obj = NLTK(lemmatization=True)
>>> nltk_obj.process("The striped bats are hanging on their feet for best")
['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
  • Perform stemming on running text
>>> nltk_obj = NLTK(stemming=True)
>>> nltk_obj.process("These unbelievable abnormous objects")
['these', 'unbeliev', 'abnorm', 'object']
  • Label each token in the running text with its POS tag
>>> nltk_obj = NLTK(pos_tag=True)
>>> nltk_obj.process("Facebook was fined by Hewlett Packard for spending 100€")
['Facebook_NNP', 'was_VBD', 'fined_VBN', 'by_IN', 'Hewlett_NNP', 'Packard_NNP', 'for_IN', 'spending_VBG', '100€_CD']
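The whitespace-stripping and URL-tagging steps shown above can be sketched with plain regular expressions. This is a minimal illustration of the technique, not the library's actual implementation; the whitespace-based tokenization and the URL pattern below are assumptions inferred from the example outputs:

```python
import re

# Assumed URL pattern: anything starting with http:// or https://
URL_PATTERN = re.compile(r'https?://\S+')

def strip_multiple_whitespaces(text: str) -> str:
    # Collapse any run of whitespace characters into a single space
    return re.sub(r'\s+', ' ', text).strip()

def tag_urls(tokens: list) -> list:
    # Replace any token that looks like a URL with the normalized <URL> token
    return ['<URL>' if URL_PATTERN.fullmatch(tok) else tok for tok in tokens]

print(strip_multiple_whitespaces('This   has  a lot  of   spaces').split())
# ['This', 'has', 'a', 'lot', 'of', 'spaces']

tokens = 'This is facebook http://facebook.com and github https://github.com'.split()
print(tag_urls(tokens))
# ['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
```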
PARAMETER DESCRIPTION
strip_multiple_whitespaces

If set to True, any run of multiple whitespaces will be reduced to a single whitespace

TYPE: bool DEFAULT: True

remove_punctuation

If set to True, all punctuation from the running text will be removed

TYPE: bool DEFAULT: False

stopwords_removal

If set to True, all stopwords from the running text will be removed

TYPE: bool DEFAULT: False

url_tagging

If set to True, all URLs in the running text will be replaced with the <URL> token

TYPE: bool DEFAULT: False

lemmatization

If set to True, each token in the running text will be brought to its lemma

TYPE: bool DEFAULT: False

stemming

If set to True, each token in the running text will be brought to its stem

TYPE: bool DEFAULT: False

pos_tag

If set to True, each token in the running text will be labeled with its POS tag in the form token_TAG

TYPE: bool DEFAULT: False

lang

Language of the running text

TYPE: str DEFAULT: 'english'
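When several flags are enabled, the corresponding steps apply in sequence to the same text. A hedged sketch of how punctuation removal and stopword removal might compose (the tiny hard-coded stopword set below is an illustrative stand-in for NLTK's full `stopwords` corpus, not the library's own code):

```python
import string

# Illustrative stand-in for NLTK's English stopword list
STOPWORDS = {'the', 'are', 'on', 'their', 'for', 'is', 'a', 'of'}

def remove_punctuation(text: str) -> str:
    # Drop every ASCII punctuation character before tokenizing
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_stopwords(tokens: list) -> list:
    # Filter out tokens whose lowercase form is a stopword
    return [t for t in tokens if t.lower() not in STOPWORDS]

text = "The striped bats are hanging on their feet for the best"
print(remove_stopwords(remove_punctuation(text).split()))
# ['striped', 'bats', 'hanging', 'feet', 'best']
```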

Source code in clayrs/content_analyzer/information_processor/nltk_processor.py
def __init__(self, *,
             strip_multiple_whitespaces: bool = True,
             remove_punctuation: bool = False,
             stopwords_removal: bool = False,
             url_tagging: bool = False,
             lemmatization: bool = False,
             stemming: bool = False,
             pos_tag: bool = False,
             lang: str = 'english'):

    if not NLTK._corpus_downloaded:
        self.__download_corpus()
        NLTK._corpus_downloaded = True

    self.stopwords_removal = stopwords_removal

    self.stemming = stemming
    self.stemmer = SnowballStemmer(language=lang)

    self.lemmatization = lemmatization
    self.lemmatizer = WordNetLemmatizer()

    self.strip_multiple_whitespaces = strip_multiple_whitespaces
    self.url_tagging = url_tagging
    self.remove_punctuation = remove_punctuation
    self.pos_tag = pos_tag
    self.__full_lang_code = lang