Ekphrasis Preprocessor

Ekphrasis(*, omit=None, normalize=None, unpack_contractions=False, unpack_hashtags=False, annotate=None, corrector=None, tokenizer=social_tokenizer_ekphrasis, segmenter=None, all_caps_tag=None, spell_correction=False, segmentation=False, dicts=None, spell_correct_elong=False)

Bases: NLP

Interface to the Ekphrasis library for natural language processing features

Examples:

  • Normalize email and percent tokens, omitting the email ones from the output:
>>> ek = Ekphrasis(omit=['email'], normalize=['email', 'percent'])
>>> ek.process("this is an email: alias@mail.com and this is a percent 23%")
['this', 'is', 'an', 'email', ':', 'and', 'this', 'is', 'a', 'percent', '<percent>']
  • Unpack contractions on running text:
>>> ek = Ekphrasis(unpack_contractions=True)
>>> ek.process("I can't do this because I won't and I shouldn't")
['i', 'can', 'not', 'do', 'this', 'because', 'i', 'will', 'not', 'and', 'i', 'should', 'not']
  • Unpack hashtags using statistics from the 'twitter' corpus:
>>> ek = Ekphrasis(unpack_hashtags=True, segmenter='twitter')
>>> ek.process("#next #gamedev #retrogaming #coolphoto no unpack")
['next', 'game', 'dev', 'retro', 'gaming', 'cool', 'photo', 'no', 'unpack']
  • Annotate words in CAPS and repeated tokens, using a single tag for CAPS words:
>>> ek = Ekphrasis(annotate=['allcaps', 'repeated'], all_caps_tag='single')
>>> ek.process("this is good !!! text and a SHOUTED one")
['this', 'is', 'good', '!', '<repeated>', 'text', 'and', 'a', 'shouted', '<allcaps>', 'one']
  • Perform segmentation using statistics from 'twitter' corpus:
>>> ek = Ekphrasis(segmentation=True, segmenter='twitter')
>>> ek.process("thewatercooler exponentialbackoff no segmentation")
['the', 'watercooler', 'exponential', 'back', 'off', 'no', 'segmentation']
  • Substitute words with custom tokens:
>>> ek = Ekphrasis(dicts=[{':)': '<happy>', ':(': '<sad>'}])
>>> ek.process("Hello :) how are you? :(")
['Hello', '<happy>', 'how', 'are', 'you', '?', '<sad>']
  • Perform spell correction on text and on elongated words by using statistics from default 'english' corpus:
>>> ek = Ekphrasis(spell_correction=True, spell_correct_elong=True)
>>> ek.process("This is huuuuge. The korrect way of doing tihngs is not the followingt")
["this", 'is', 'huge', '.', 'the', 'correct', "way", "of", "doing", "things", "is", 'not', 'the',
'following']
PARAMETER DESCRIPTION
omit

Choose which tokens you want to omit from the text.

Possible values: ['email', 'percent', 'money', 'phone', 'user','time', 'url', 'date', 'hashtag']

Important Notes:

1 - a token in this list must also be present in the `normalize`
    list to have any effect!
2 - put `url` at the front of the list if you plan to use it,
    since it interferes with the other regexes!
3 - if you use `hashtag`, then `unpack_hashtags` will
    automatically be set to False

TYPE: List DEFAULT: None

normalize

Choose which tokens you want to normalize in the text. Possible values: ['email', 'percent', 'money', 'phone', 'user', 'time', 'url', 'date', 'hashtag']

For example: myaddress@mysite.com -> <email>

Important Notes:

1 - put `url` at the front of the list if you plan to use it,
    since it interferes with the other regexes!
2 - if you use `hashtag`, then `unpack_hashtags` will
    automatically be set to False

TYPE: List DEFAULT: None

unpack_contractions

Replace English contractions in running text with their unshortened forms.

for example: can't -> can not, wouldn't -> would not, and so on

TYPE: bool DEFAULT: False

unpack_hashtags

Split a hashtag into its constituent words.

for example: #ilikedogs -> i like dogs

TYPE: bool DEFAULT: False

annotate

Add special tags to special tokens.

Possible values: ['hashtag', 'allcaps', 'elongated', 'repeated']

for example (with 'repeated'): good !!! -> good ! <repeated>

TYPE: List DEFAULT: None

corrector

Define which corpus statistics to use, between 'english' and 'twitter'. Be sure to set spell_correction to True if you want to perform spell correction on the running text.

TYPE: str DEFAULT: None

tokenizer

Callable that accepts a string and returns a list of strings. If no tokenizer is provided, the text will be tokenized on whitespace.

TYPE: Callable DEFAULT: social_tokenizer_ekphrasis
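
Any callable with this signature can be passed as tokenizer. A minimal sketch of such a callable (the function name and the regex are illustrative, not part of the library):

```python
import re
from typing import List

def whitespace_punct_tokenizer(text: str) -> List[str]:
    # Keep runs of word characters as tokens and split off
    # each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# whitespace_punct_tokenizer("Hello, world!") -> ['Hello', ',', 'world', '!']
```

A tokenizer like this could then be passed as Ekphrasis(tokenizer=whitespace_punct_tokenizer).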

segmenter

Define which corpus statistics to use, between 'english' and 'twitter'. Be sure to set segmentation to True if you want to perform segmentation on the running text.

TYPE: str DEFAULT: None

all_caps_tag

How to wrap capitalized words.

Note: applicable only when 'allcaps' is included in the annotate list.

Possible values ['single', 'wrap', 'every']:

- single: add a tag after the last capitalized word
    for example: "SHOUTED TEXT" -> "shouted text <allcaps>"
- wrap: wrap all words with opening and closing tags
    for example: "SHOUTED TEXT" -> "<allcaps> shouted text </allcaps>"
- every: add a tag after each word
    for example: "SHOUTED TEXT" -> "shouted <allcaps> text <allcaps>"

TYPE: str DEFAULT: None
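
The three modes can be illustrated with a small re-implementation (this is a sketch reproducing the documented behavior, not the library's actual code):

```python
from typing import List

def tag_allcaps(tokens: List[str], mode: str = "single") -> List[str]:
    # Illustrative re-implementation of the three all_caps_tag modes.
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if tokens[i].isalpha() and tokens[i].isupper():
            # Collect the whole run of capitalized words.
            run: List[str] = []
            while i < len(tokens) and tokens[i].isalpha() and tokens[i].isupper():
                run.append(tokens[i].lower())
                i += 1
            if mode == "single":      # one tag after the run
                out += run + ["<allcaps>"]
            elif mode == "wrap":      # opening and closing tags around the run
                out += ["<allcaps>"] + run + ["</allcaps>"]
            else:                     # "every": a tag after each word
                for word in run:
                    out += [word, "<allcaps>"]
        else:
            out.append(tokens[i])
            i += 1
    return out
```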

spell_correction

If set to True, running text will be spell-corrected using the statistics of the corpus set in the corrector parameter.

TYPE: bool DEFAULT: False

segmentation

If set to True, running text will be segmented using the statistics of the corpus set in the segmenter parameter.

for example: exponentialbackoff -> exponential back off

TYPE: bool DEFAULT: False

dicts

List of dictionaries mapping tokens to custom replacement expressions.

for example: :) -> <happy>

TYPE: List[Dict] DEFAULT: None

spell_correct_elong

Choose whether to perform spell correction after the normalization of elongated words.

Significantly affects performance (speed).

TYPE: bool DEFAULT: False


Source code in clayrs/content_analyzer/information_processor/ekphrasis_processor.py
def __init__(self, *,
             omit: List = None,
             normalize: List = None,
             unpack_contractions: bool = False,
             unpack_hashtags: bool = False,
             annotate: List = None,
             corrector: str = None,
             tokenizer: Callable = social_tokenizer_ekphrasis,
             segmenter: str = None,
             all_caps_tag: str = None,
             spell_correction: bool = False,
             segmentation: bool = False,
             dicts: List[Dict] = None,
             spell_correct_elong: bool = False):

    # ekphrasis has default values for arguments not passed. So if they are not evaluated in our class,
    # we simply don't pass them to ekphrasis
    kwargs_to_pass = {argument: arg_value for argument, arg_value in zip(locals().keys(), locals().values())
                      if argument != 'self' and arg_value is not None}

    self.text_processor = TextPreProcessor(**kwargs_to_pass)

    self.spell_correct_elong = spell_correct_elong

    self.sc = None
    if spell_correction is True:
        if corrector is not None:
            self.sc = SpellCorrector(corpus=corrector)
        else:
            self.sc = SpellCorrector()

    self.segmentation = segmentation
    self.ws = None
    if segmentation is True:
        if segmenter is not None:
            self.ws = Segmenter(corpus=segmenter)
        else:
            self.ws = Segmenter()

    self._repr_string = autorepr(self, inspect.currentframe())
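
The constructor forwards only the arguments the caller actually set to ekphrasis' TextPreProcessor, so the library can apply its own defaults for the rest. The filtering idiom can be sketched in isolation (the function name is illustrative):

```python
def build_kwargs(**params):
    # Keep only parameters the caller explicitly set (i.e. not None),
    # so the downstream library falls back to its own defaults
    # for everything else. Note that False is kept, since it is
    # an explicit value, not an unset one.
    return {name: value for name, value in params.items() if value is not None}

# build_kwargs(omit=['email'], normalize=None) -> {'omit': ['email']}
```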

process(field_data)

PARAMETER DESCRIPTION
field_data

Running text to be processed

TYPE: str

RETURNS DESCRIPTION
field_data

List of str representing the preprocessed running text

TYPE: List[str]

Source code in clayrs/content_analyzer/information_processor/ekphrasis_processor.py
def process(self, field_data: str) -> List[str]:
    """
    Args:
        field_data: Running text to be processed
    Returns:
        field_data: List of str representing running text preprocessed
    """
    field_data = self.text_processor.pre_process_doc(field_data)
    if self.sc is not None:
        field_data = self.__spell_check(field_data)
    if self.ws is not None:
        field_data = self.__word_segmenter(field_data)
    return field_data
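
The method is a pipeline of up to three stages, each applied only when configured. The same shape can be sketched with plain functions (all names here are illustrative, not part of the library):

```python
from typing import Callable, List, Optional

def run_pipeline(text: str,
                 tokenize: Callable[[str], List[str]],
                 spell_check: Optional[Callable[[List[str]], List[str]]] = None,
                 segment: Optional[Callable[[List[str]], List[str]]] = None) -> List[str]:
    # Mirrors the structure of Ekphrasis.process: mandatory
    # tokenization/normalization, then optional spell correction
    # and word segmentation applied only when they were configured.
    tokens = tokenize(text)
    if spell_check is not None:
        tokens = spell_check(tokens)
    if segment is not None:
        tokens = segment(tokens)
    return tokens
```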