NLTK Preprocessor
NLTK(*, strip_multiple_whitespaces=True, remove_punctuation=False, stopwords_removal=False, url_tagging=False, lemmatization=False, stemming=False, pos_tag=False, lang='english')
Bases: NLP
Interface to the NLTK library for natural language processing features.
Examples:
- Strip multiple whitespaces from running text
>>> nltk_obj = NLTK(strip_multiple_whitespaces=True)
>>> nltk_obj.process('This   has  a lot   of    spaces')
['This', 'has', 'a', 'lot', 'of', 'spaces']
- Remove punctuation from running text
>>> nltk_obj = NLTK(remove_punctuation=True)
>>> nltk_obj.process("Hello there. How are you? I'm fine, thanks.")
["Hello", "there", "How", "are", "you", "I", "m", "fine", "thanks"]
- Remove stopwords from running text
>>> nltk_obj = NLTK(stopwords_removal=True)
>>> nltk_obj.process("The striped bats are hanging on their feet for the best")
["striped", "bats", "hanging", "feet", "best"]
- Replace each URL with the normalized token <URL>
>>> nltk_obj = NLTK(url_tagging=True)
>>> nltk_obj.process("This is facebook http://facebook.com and github https://github.com")
['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
- Perform lemmatization on running text
>>> nltk_obj = NLTK(lemmatization=True)
>>> nltk_obj.process("The striped bats are hanging on their feet for best")
["The", "strip", "bat", "be", "hang", "on", "their", "foot", "for", "best"]
- Perform stemming on running text
>>> nltk_obj = NLTK(stemming=True)
>>> nltk_obj.process("These unbelievable abnormous objects")
['these', 'unbeliev', 'abnorm', 'object']
- Label each token in the running text with its POS tag
>>> nltk_obj = NLTK(pos_tag=True)
>>> nltk_obj.process("Facebook was fined by Hewlett Packard for spending 100€")
['Facebook_NNP', 'was_VBD', 'fined_VBN', 'by_IN', 'Hewlett_NNP', 'Packard_NNP', 'for_IN', 'spending_VBG',
'100€_CD']
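The simpler options above can be approximated with the standard library alone. The sketch below is not the ClayRS implementation, which relies on NLTK and may tokenize differently. It only illustrates the idea behind whitespace stripping, punctuation removal and URL tagging. The function name `sketch_process` and the URL pattern are assumptions made for this illustration.

```python
import re
import string

# Assumption: a URL is any whitespace-delimited chunk starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")

# Translation table mapping every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def sketch_process(text, *, remove_punctuation=False, url_tagging=False):
    """Hypothetical helper: split text into tokens, optionally tagging URLs
    and removing punctuation."""
    tokens = []
    # str.split() with no argument also collapses runs of whitespace
    for chunk in text.split():
        if url_tagging and URL_PATTERN.fullmatch(chunk):
            # URLs are replaced before punctuation handling so the token survives
            tokens.append("<URL>")
        elif remove_punctuation:
            # Punctuation becomes a space, so "I'm" splits into "I" and "m"
            tokens.extend(chunk.translate(PUNCT_TO_SPACE).split())
        else:
            tokens.append(chunk)
    return tokens


print(sketch_process("This   has  a lot   of    spaces"))
print(sketch_process("Hello there. How are you? I'm fine, thanks.", remove_punctuation=True))
print(sketch_process("This is facebook http://facebook.com and github https://github.com",
                     url_tagging=True))
```

For these three inputs the sketch prints the same token lists as the corresponding examples above. Stopword removal, lemmatization, stemming and POS tagging need NLTK resources and are not covered here.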
PARAMETER | DESCRIPTION |
---|---|
strip_multiple_whitespaces | If set to True, every run of multiple whitespaces is reduced to a single whitespace. TYPE: bool DEFAULT: True |
remove_punctuation | If set to True, all punctuation is removed from the running text. TYPE: bool DEFAULT: False |
stopwords_removal | If set to True, all stopwords are removed from the running text. TYPE: bool DEFAULT: False |
url_tagging | If set to True, every URL in the running text is replaced with the <URL> token. TYPE: bool DEFAULT: False |
lemmatization | If set to True, each token in the running text is reduced to its lemma. TYPE: bool DEFAULT: False |
stemming | If set to True, each token in the running text is reduced to its stem. TYPE: bool DEFAULT: False |
pos_tag | If set to True, each token in the running text is labeled with its POS tag in the form token_tag (for example, Facebook_NNP). TYPE: bool DEFAULT: False |
lang | Language of the running text. TYPE: str DEFAULT: 'english' |
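The POS-tagged examples on this page show each token joined to its tag with an underscore. As a minimal sketch of that output format, the snippet below turns (token, tag) pairs into such labels. The pairs are written by hand here, in the shape that `nltk.pos_tag` returns. The helper name `label_tokens` is hypothetical and is not part of ClayRS.

```python
from typing import List, Tuple


def label_tokens(tagged: List[Tuple[str, str]]) -> List[str]:
    """Hypothetical helper: join each token with its POS tag using an underscore."""
    return [f"{token}_{tag}" for token, tag in tagged]


# Hand-written (token, tag) pairs, used instead of running a real tagger
tagged_pairs = [("Facebook", "NNP"), ("was", "VBD"), ("fined", "VBN")]
print(label_tokens(tagged_pairs))  # ['Facebook_NNP', 'was_VBD', 'fined_VBN']
```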
Source code in clayrs/content_analyzer/information_processor/nltk_processor.py