Spacy preprocessor
Spacy(model='en_core_web_sm', *, strip_multiple_whitespaces=True, remove_punctuation=False, stopwords_removal=False, new_stopwords=None, not_stopwords=None, lemmatization=False, url_tagging=False, named_entity_recognition=False)
Bases: NLP
Interface to the Spacy library for natural language processing features
Examples:
- Strip multiple whitespaces from running text
>>> spacy_obj = Spacy(strip_multiple_whitespaces=True)
>>> spacy_obj.process('This has  a lot  of   spaces')
['This', 'has', 'a', 'lot', 'of', 'spaces']
- Remove punctuation from running text
>>> spacy_obj = Spacy(remove_punctuation=True)
>>> spacy_obj.process("Hello there. How are you? I'm fine, thanks.")
["Hello", "there", "How", "are", "you", "I", "'m", "fine", "thanks"]
- Remove stopwords using default stopwords corpus of spacy from running text
>>> spacy_obj = Spacy(stopwords_removal=True)
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["striped", "bats", "hanging", "feet", "best"]
- Remove stopwords using default stopwords corpus of spacy + the new_stopwords list from running text
>>> spacy_obj = Spacy(stopwords_removal=True, new_stopwords=['bats', 'best'])
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["striped", "hanging", "feet"]
- Remove stopwords using default stopwords corpus of spacy - the not_stopwords list from running text
>>> spacy_obj = Spacy(stopwords_removal=True, not_stopwords=['The', 'the', 'on'])
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["The", "striped", "bats", "hanging", "on", "feet", "the", "best"]
- Replace URLs with the normalized token <URL>
>>> spacy_obj = Spacy(url_tagging=True)
>>> spacy_obj.process("This is facebook http://facebook.com and github https://github.com")
['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
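Internally, URL detection is delegated to spaCy's tokenizer. As a rough, self-contained sketch of the behaviour shown above (the regex and the names `URL_RE` and `tag_urls` are illustrative stand-ins, not part of the ClayRS API):

```python
import re

# Simplified illustration of url_tagging: spaCy's tokenizer detects URLs far
# more robustly; this regex only matches explicit http(s) tokens.
URL_RE = re.compile(r"^https?://\S+$")

def tag_urls(tokens):
    """Replace any token that looks like a URL with the normalized <URL> token."""
    return ["<URL>" if URL_RE.match(t) else t for t in tokens]

print(tag_urls(["This", "is", "facebook", "http://facebook.com",
                "and", "github", "https://github.com"]))
# ['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
```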
- Perform lemmatization on running text
>>> spacy_obj = Spacy(lemmatization=True)
>>> spacy_obj.process("The striped bats are hanging on their feet for best")
["The", "strip", "bat", "be", "hang", "on", "their", "foot", "for", "best"]
- Perform NER on running text (NEs will be tagged with BIO tagging)
>>> spacy_obj = Spacy(named_entity_recognition=True)
>>> spacy_obj.process("Facebook was fined by Hewlett Packard for spending 100€")
["Facebook", "was", "fined", "by", "<Hewlett_ORG_B>", "<Packard_ORG_I>", "for", "spending",
"<100_MONEY_B>", "<€_MONEY_I>"]
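Since the BIO-tagged tokens follow the `<token_TAG_B>` / `<token_TAG_I>` pattern shown above, they can be grouped back into entity spans with a small helper. This function is hypothetical (not part of ClayRS), included only to clarify how the tagging scheme reads:

```python
import re

# Hypothetical helper: group BIO-tagged tokens of the form <text_LABEL_B> /
# <text_LABEL_I> into (entity text, label) pairs. Untagged tokens are skipped.
def group_entities(tokens):
    entities = []
    for tok in tokens:
        m = re.fullmatch(r"<(.+)_([A-Z]+)_([BI])>", tok)
        if not m:
            continue  # plain token, carries no entity label
        text, label, bio = m.groups()
        if bio == "B" or not entities or entities[-1][1] != label:
            entities.append([text, label])  # start a new entity span
        else:
            entities[-1][0] += " " + text   # continue the current span
    return [tuple(e) for e in entities]

tagged = ["Facebook", "was", "fined", "by", "<Hewlett_ORG_B>", "<Packard_ORG_I>",
          "for", "spending", "<100_MONEY_B>", "<€_MONEY_I>"]
print(group_entities(tagged))
# [('Hewlett Packard', 'ORG'), ('100 €', 'MONEY')]
```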
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Spacy model that will be used to perform nlp operations. It will be downloaded if not present locally. **TYPE:** `str` |
| `strip_multiple_whitespaces` | If set to True, all multiple whitespaces will be reduced to a single whitespace. **TYPE:** `bool` |
| `remove_punctuation` | If set to True, all punctuation will be removed from the running text. **TYPE:** `bool` |
| `stopwords_removal` | If set to True, all stopwords will be removed from the running text. **TYPE:** `bool` |
| `new_stopwords` | List which contains custom defined stopwords that will also be removed if `stopwords_removal=True`. **TYPE:** `Optional[List[str]]` |
| `not_stopwords` | List which contains custom defined stopwords that will not be considered as such, and therefore won't be removed if `stopwords_removal=True`. **TYPE:** `Optional[List[str]]` |
| `url_tagging` | If set to True, all URLs in the running text will be replaced with the `<URL>` token. **TYPE:** `bool` |
| `lemmatization` | If set to True, each token in the running text will be brought to its lemma. **TYPE:** `bool` |
| `named_entity_recognition` | If set to True, recognized named entities will be labeled with BIO tagging in the form `<token_TAG_B>` / `<token_TAG_I>`. **TYPE:** `bool` |
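The interaction of the three stopword parameters can be pictured as simple set arithmetic: the effective stopword set is spaCy's default corpus, plus `new_stopwords`, minus `not_stopwords`. The sketch below is an illustration of that logic, not ClayRS internals, and the tiny `default` set stands in for spaCy's real corpus:

```python
# Illustrative sketch (not the ClayRS implementation) of how the effective
# stopword set is derived from the three stopword-related parameters.
def effective_stopwords(default, new_stopwords=None, not_stopwords=None):
    combined = set(default) | set(new_stopwords or [])
    return combined - set(not_stopwords or [])

default = {"the", "are", "on", "their", "for"}  # tiny stand-in for spaCy's corpus
print(sorted(effective_stopwords(default,
                                 new_stopwords=["bats", "best"],
                                 not_stopwords=["the", "on"])))
# ['are', 'bats', 'best', 'for', 'their']
```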
Source code in clayrs/content_analyzer/information_processor/spacy_processor.py
process(field_data)
| PARAMETER | DESCRIPTION |
|---|---|
| `field_data` | Content to be processed. **TYPE:** `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `field_data` | List of str, or dict in case of named entity recognition |
Source code in clayrs/content_analyzer/information_processor/spacy_processor.py