Spacy preprocessor

Spacy(model='en_core_web_sm', *, strip_multiple_whitespaces=True, remove_punctuation=False, stopwords_removal=False, new_stopwords=None, not_stopwords=None, lemmatization=False, url_tagging=False, named_entity_recognition=False)

Bases: NLP

Interface to the spaCy library for natural language processing features.

Examples:

  • Strip multiple whitespaces from running text
>>> spacy_obj = Spacy(strip_multiple_whitespaces=True)
>>> spacy_obj.process('This   has  a lot  of   spaces')
['This', 'has', 'a', 'lot', 'of', 'spaces']
  • Remove punctuation from running text
>>> spacy_obj = Spacy(remove_punctuation=True)
>>> spacy_obj.process("Hello there. How are you? I'm fine, thanks.")
["Hello", "there", "How", "are", "you", "I", "'m", "fine", "thanks"]
  • Remove stopwords from running text using spaCy's default stopwords corpus
>>> spacy_obj = Spacy(stopwords_removal=True)
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["striped", "bats", "hanging", "feet", "best"]
  • Remove stopwords from running text using spaCy's default stopwords corpus plus the new_stopwords list
>>> spacy_obj = Spacy(stopwords_removal=True, new_stopwords=['bats', 'best'])
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["striped", "hanging", "feet"]
  • Remove stopwords from running text using spaCy's default stopwords corpus minus the not_stopwords list
>>> spacy_obj = Spacy(stopwords_removal=True, not_stopwords=['The', 'the', 'on'])
>>> spacy_obj.process("The striped bats are hanging on their feet for the best")
["The", "striped", "bats", "hanging", "on", "feet", "the", "best"]
  • Replace URLs with the normalized token <URL>
>>> spacy_obj = Spacy(url_tagging=True)
>>> spacy_obj.process("This is facebook http://facebook.com and github https://github.com")
['This', 'is', 'facebook', '<URL>', 'and', 'github', '<URL>']
  • Perform lemmatization on running text
>>> spacy_obj = Spacy(lemmatization=True)
>>> spacy_obj.process("The striped bats are hanging on their feet for best")
["The", "strip", "bat", "be", "hang", "on", "their", "foot", "for", "best"]
  • Perform NER on running text (NEs will be tagged with BIO tagging)
>>> spacy_obj = Spacy(named_entity_recognition=True)
>>> spacy_obj.process("Facebook was fined by Hewlett Packard for spending 100€")
["Facebook", "was", "fined", "by", "<Hewlett_ORG_B>", "<Packard_ORG_I>", "for", "spending",
"<100_MONEY_B>", "<€_MONEY_I>"]
PARAMETER DESCRIPTION
model

Spacy model that will be used to perform nlp operations. It will be downloaded if not present locally

TYPE: str DEFAULT: 'en_core_web_sm'

strip_multiple_whitespaces

If set to True, all multiple whitespaces will be reduced to only one white space

TYPE: bool DEFAULT: True

remove_punctuation

If set to True, all punctuation from the running text will be removed

TYPE: bool DEFAULT: False

stopwords_removal

If set to True, all stopwords from the running text will be removed

TYPE: bool DEFAULT: False

new_stopwords

List which contains custom defined stopwords that will be removed if stopwords_removal=True

TYPE: List[str] DEFAULT: None

not_stopwords

List which contains custom defined stopwords that will not be considered as such, therefore won't be removed if stopwords_removal=True

TYPE: List[str] DEFAULT: None

url_tagging

If set to True, all urls in the running text will be replaced with the <URL> token

TYPE: bool DEFAULT: False

lemmatization

If set to True, each token in the running text will be brought to its lemma

TYPE: bool DEFAULT: False

named_entity_recognition

If set to True, recognized named entities will be labeled in the form <token_TAG_B> or <token_TAG_I>, according to the BIO tagging strategy

TYPE: bool DEFAULT: False
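
All of the options above can be combined in a single preprocessor; process() applies them in a fixed order (NER, punctuation removal, stopwords removal, lemmatization, URL tagging), as its source further below shows. A minimal sketch combining several options; the token list in the comment is illustrative and may differ slightly depending on the loaded spaCy model:

spacy_obj = Spacy(remove_punctuation=True,
                  stopwords_removal=True,
                  lemmatization=True,
                  url_tagging=True)
tokens = spacy_obj.process("The striped bats are hanging on http://facebook.com")
# expected something like: ['strip', 'bat', 'hang', '<URL>']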

Source code in clayrs/content_analyzer/information_processor/spacy_processor.py
def __init__(self, model: str = 'en_core_web_sm', *,
             strip_multiple_whitespaces: bool = True,
             remove_punctuation: bool = False,
             stopwords_removal: bool = False,
             new_stopwords: List[str] = None,
             not_stopwords: List[str] = None,
             lemmatization: bool = False,
             url_tagging: bool = False,
             named_entity_recognition: bool = False):

    self.model = model
    self.stopwords_removal = stopwords_removal
    self.lemmatization = lemmatization
    self.strip_multiple_whitespaces = strip_multiple_whitespaces
    self.url_tagging = url_tagging
    self.remove_punctuation = remove_punctuation
    self.named_entity_recognition = named_entity_recognition

    # download the model if it is not present locally; in any case, load it
    if model not in spacy.cli.info()['pipelines']:
        spacy.cli.download(model)
    self._nlp = spacy.load(model)

    # Add a custom rule to preserve the '<URL>' token and, in general,
    # any token wrapped in '<...>'
    prefixes = list(self._nlp.Defaults.prefixes)
    prefixes.remove('<')
    prefix_regex = spacy.util.compile_prefix_regex(prefixes)
    self._nlp.tokenizer.prefix_search = prefix_regex.search

    suffixes = list(self._nlp.Defaults.suffixes)
    suffixes.remove('>')
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    self._nlp.tokenizer.suffix_search = suffix_regex.search

    self.not_stopwords_list = not_stopwords
    if not_stopwords is not None:
        for stopword in not_stopwords:
            self._nlp.vocab[stopword].is_stop = False

    self.new_stopwords_list = new_stopwords
    if new_stopwords is not None:
        for stopword in new_stopwords:
            self._nlp.vocab[stopword].is_stop = True
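
The constructor above also tweaks spaCy's tokenizer so that tokens wrapped in '<...>' (such as the '<URL>' placeholder) are not split into '<', 'URL', '>'. A standalone sketch of the same adjustment with plain spaCy, assuming 'en_core_web_sm' is already installed (the tokenization shown in the comment is illustrative):

import spacy

nlp = spacy.load('en_core_web_sm')

# drop '<' from the prefix rules and '>' from the suffix rules,
# so '<URL>' survives tokenization as a single token
prefixes = [p for p in nlp.Defaults.prefixes if p != '<']
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search

suffixes = [s for s in nlp.Defaults.suffixes if s != '>']
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

print([token.text for token in nlp("This is facebook <URL>")])
# expected something like: ['This', 'is', 'facebook', '<URL>']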

process(field_data)

PARAMETER DESCRIPTION
field_data

content to be processed

TYPE: str

RETURNS DESCRIPTION
field_data

List of str, or dict in the case of named entity recognition

TYPE: List[str]
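
A minimal call with a default-constructed preprocessor, where only multiple-whitespace stripping is active (output illustrative):

>>> Spacy().process('Hello   world')
['Hello', 'world']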

Source code in clayrs/content_analyzer/information_processor/spacy_processor.py
def process(self, field_data: str) -> List[str]:
    """
    Args:
        field_data: content to be processed

    Returns:
        field_data: list of str or dict in case of named entity recognition

    """
    field_data = check_not_tokenized(field_data)
    if self.strip_multiple_whitespaces:
        field_data = self.__strip_multiple_whitespaces_operation(field_data)
    field_data = self.__tokenization_operation(field_data)
    if self.named_entity_recognition:
        field_data = self.__named_entity_recognition_operation(field_data)
    if self.remove_punctuation:
        field_data = self.__remove_punctuation(field_data)
    if self.stopwords_removal:
        field_data = self.__stopwords_removal_operation(field_data)
    if self.lemmatization:
        field_data = self.__lemmatization_operation(field_data)
    if self.url_tagging:
        field_data = self.__url_tagging_operation(field_data)

    return self.__token_to_string(field_data)
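
As the source above shows, named entity recognition is applied right after tokenization, before punctuation or stopwords are removed. A short usage sketch; the tags in the comment are illustrative and depend on the loaded model:

spacy_obj = Spacy(named_entity_recognition=True, remove_punctuation=True)
tokens = spacy_obj.process("Facebook was fined by Hewlett Packard.")
# expected something like:
# ['Facebook', 'was', 'fined', 'by', '<Hewlett_ORG_B>', '<Packard_ORG_I>']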