Spacy preprocessor

Spacy(model='en_core_web_sm', *, strip_multiple_whitespaces=True, remove_punctuation=False, stopwords_removal=False, new_stopwords=None, not_stopwords=None, lemmatization=False, url_tagging=False, named_entity_recognition=False)

Bases: NLP

Interface to the spaCy library for natural language processing features.

Examples:

  • Strip multiple whitespaces from running text
>>> spacy_obj = Spacy(strip_multiple_whitespaces=True)
>>> spacy_obj.process('This   has  a lot  of   spaces')
['This', 'has', 'a', 'lot', 'of', 'spaces']
  • Remove punctuation from running text
>>> spacy_obj = Spacy(remove_punctuation=True)
>>> spacy_obj.process("Hello there. How are you? I'm fine, thanks.")
["Hello", "there", "How", "are", "you", "I", "'m", "fine", "thanks"]
  • Remove stopwords from running text using spaCy's default stopwords corpus
>>> spacy_obj = Spacy(stopwords_removal=True)
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["striped", "bats", "hanging", "feet", "best"]
  • Remove stopwords from running text using spaCy's default stopwords corpus plus the new_stopwords list
>>> spacy_obj = Spacy(stopwords_removal=True, new_stopwords=['bats', 'best'])
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["striped", "hanging", "feet"]
  • Remove stopwords from running text using spaCy's default stopwords corpus minus the not_stopwords list
>>> spacy_obj = Spacy(stopwords_removal=True, not_stopwords=['The', 'the', 'on'])
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["The", "striped", "bats", "hanging", "on", "feet", "the", "best"]
  • Replace URLs with the normalized token <URL>
>>> spacy_obj = Spacy(url_tagging=True)
>>> spacy_obj.process("This is facebook http://facebook.com and github https://github.com")
['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
  • Perform lemmatization on running text
>>> spacy_obj = Spacy(lemmatization=True)
>>> spacy_obj.process("The striped bats are hanging on their feet for best")
["The", "strip", "bat", "be", "hang", "on", "their", "foot", "for", "best"]
  • Perform NER on running text (NEs will be tagged with BIO tagging)
>>> spacy_obj = Spacy(named_entity_recognition=True)
>>> spacy_obj.process("Facebook was fined by Hewlett Packard for spending 100€")
["Facebook", "was", "fined", "by", "<Hewlett_ORG_B>", "<Packard_ORG_I>", "for", "spending",
"<100_MONEY_B>", "<€_MONEY_I>"]
PARAMETER DESCRIPTION
model

Spacy model that will be used to perform nlp operations. It will be downloaded if not present locally

TYPE: str DEFAULT: 'en_core_web_sm'

strip_multiple_whitespaces

If set to True, all multiple whitespaces will be reduced to only one white space

TYPE: bool DEFAULT: True

remove_punctuation

If set to True, all punctuation from the running text will be removed

TYPE: bool DEFAULT: False

stopwords_removal

If set to True, all stopwords from the running text will be removed

TYPE: bool DEFAULT: False

new_stopwords

List which contains custom defined stopwords that will be removed if stopwords_removal=True

TYPE: List[str] DEFAULT: None

not_stopwords

List which contains custom defined stopwords that will not be considered as such, therefore won't be removed if stopwords_removal=True

TYPE: List[str] DEFAULT: None

url_tagging

If set to True, all urls in the running text will be replaced with the <URL> token

TYPE: bool DEFAULT: False

lemmatization

If set to True, each token in the running text will be brought to its lemma

TYPE: bool DEFAULT: False

named_entity_recognition

If set to True, recognized named entities will be labeled in the form <token_TAG_B> or <token_TAG_I>, according to the BIO tagging strategy

TYPE: bool DEFAULT: False
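
All of the options above can be combined in a single preprocessor; process() applies them in a fixed order (NER, punctuation removal, stopwords removal, lemmatization, URL tagging), as its source further below shows. A minimal sketch combining several options; the token list in the comment is illustrative and may differ slightly depending on the loaded spaCy model:

spacy_obj = Spacy(remove_punctuation=True,
                  stopwords_removal=True,
                  lemmatization=True,
                  url_tagging=True)
tokens = spacy_obj.process("The striped bats are hanging on http://facebook.com")
# expected something like: ['strip', 'bat', 'hang', '<URL>']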

Source code in clayrs/content_analyzer/information_processor/spacy_processor.py
def __init__(self, model: str = 'en_core_web_sm', *,
             strip_multiple_whitespaces: bool = True,
             remove_punctuation: bool = False,
             stopwords_removal: bool = False,
             new_stopwords: List[str] = None,
             not_stopwords: List[str] = None,
             lemmatization: bool = False,
             url_tagging: bool = False,
             named_entity_recognition: bool = False):

    self.model = model
    self.stopwords_removal = stopwords_removal
    self.lemmatization = lemmatization
    self.strip_multiple_whitespaces = strip_multiple_whitespaces
    self.url_tagging = url_tagging
    self.remove_punctuation = remove_punctuation
    self.named_entity_recognition = named_entity_recognition

    # download the model if it is not present locally; in any case, load it
    if model not in spacy.cli.info()['pipelines']:
        spacy.cli.download(model)
    self._nlp = spacy.load(model)

    # Add a custom rule to preserve the '<URL>' token and, in general,
    # any token wrapped in '<...>'
    prefixes = list(self._nlp.Defaults.prefixes)
    prefixes.remove('<')
    prefix_regex = spacy.util.compile_prefix_regex(prefixes)
    self._nlp.tokenizer.prefix_search = prefix_regex.search

    suffixes = list(self._nlp.Defaults.suffixes)
    suffixes.remove('>')
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    self._nlp.tokenizer.suffix_search = suffix_regex.search

    self.not_stopwords_list = not_stopwords
    if not_stopwords is not None:
        for stopword in not_stopwords:
            self._nlp.vocab[stopword].is_stop = False

    self.new_stopwords_list = new_stopwords
    if new_stopwords is not None:
        for stopword in new_stopwords:
            self._nlp.vocab[stopword].is_stop = True
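
The constructor above also tweaks spaCy's tokenizer so that tokens wrapped in '<...>' (such as the '<URL>' placeholder) are not split into '<', 'URL', '>'. A standalone sketch of the same adjustment with plain spaCy, assuming 'en_core_web_sm' is already installed (the tokenization shown in the comment is illustrative):

import spacy

nlp = spacy.load('en_core_web_sm')

# drop '<' from the prefix rules and '>' from the suffix rules,
# so '<URL>' survives tokenization as a single token
prefixes = [p for p in nlp.Defaults.prefixes if p != '<']
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search

suffixes = [s for s in nlp.Defaults.suffixes if s != '>']
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

print([token.text for token in nlp("This is facebook <URL>")])
# expected something like: ['This', 'is', 'facebook', '<URL>']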

process(field_data)

PARAMETER DESCRIPTION
field_data

content to be processed

TYPE: str

RETURNS DESCRIPTION
field_data

List of str, or dict in the case of named entity recognition

TYPE: List[str]
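
A minimal call with a default-constructed preprocessor, where only multiple-whitespace stripping is active (output illustrative):

>>> Spacy().process('Hello   world')
['Hello', 'world']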

Source code in clayrs/content_analyzer/information_processor/spacy_processor.py
def process(self, field_data: str) -> List[str]:
    """
    Args:
        field_data: content to be processed

    Returns:
        field_data: list of str or dict in case of named entity recognition

    """
    field_data = check_not_tokenized(field_data)
    if self.strip_multiple_whitespaces:
        field_data = self.__strip_multiple_whitespaces_operation(field_data)
    field_data = self.__tokenization_operation(field_data)
    if self.named_entity_recognition:
        field_data = self.__named_entity_recognition_operation(field_data)
    if self.remove_punctuation:
        field_data = self.__remove_punctuation(field_data)
    if self.stopwords_removal:
        field_data = self.__stopwords_removal_operation(field_data)
    if self.lemmatization:
        field_data = self.__lemmatization_operation(field_data)
    if self.url_tagging:
        field_data = self.__url_tagging_operation(field_data)

    return self.__token_to_string(field_data)
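
As the source above shows, named entity recognition is applied right after tokenization, before punctuation or stopwords are removed. A short usage sketch; the tags in the comment are illustrative and depend on the loaded model:

spacy_obj = Spacy(named_entity_recognition=True, remove_punctuation=True)
tokens = spacy_obj.process("Facebook was fined by Hewlett Packard.")
# expected something like:
# ['Facebook', 'was', 'fined', 'by', '<Hewlett_ORG_B>', '<Packard_ORG_I>']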