Ekphrasis Preprocessor

Ekphrasis(*, omit=None, normalize=None, unpack_contractions=False, unpack_hashtags=False, annotate=None, corrector=None, tokenizer=social_tokenizer_ekphrasis, segmenter=None, all_caps_tag=None, spell_correction=False, segmentation=False, dicts=None, spell_correct_elong=False)

Bases: NLP

Interface to the Ekphrasis library for natural language processing features

Examples:

  • Normalize email and percent tokens, omitting the email ones from the output:
>>> ek = Ekphrasis(omit=['email'], normalize=['email', 'percent'])
>>> ek.process("this is an email: alias@mail.com and this is a percent 23%")
['this', 'is', 'an', 'email', ':', 'and', 'this', 'is', 'a', 'percent', '<percent>']
  • Unpack contractions on running text:
>>> ek = Ekphrasis(unpack_contractions=True)
>>> ek.process("I can't do this because I won't and I shouldn't")
['i', 'can', 'not', 'do', 'this', 'because', 'i', 'will', 'not', 'and', 'i', 'should', 'not']
  • Unpack hashtags using statistics from the 'twitter' corpus:
>>> ek = Ekphrasis(unpack_hashtags=True, segmenter='twitter')
>>> ek.process("#next #gamedev #retrogaming #coolphoto no unpack")
['next', 'game', 'dev', 'retro', 'gaming', 'cool', 'photo', 'no', 'unpack']
  • Annotate words in CAPS and repeated tokens, using a single tag for CAPS words:
>>> ek = Ekphrasis(annotate=['allcaps', 'repeated'], all_caps_tag='single')
>>> ek.process("this is good !!! text and a SHOUTED one")
['this', 'is', 'good', '!', '<repeated>', 'text', 'and', 'a', 'shouted', '<allcaps>', 'one']
  • Perform segmentation using statistics from 'twitter' corpus:
>>> ek = Ekphrasis(segmentation=True, segmenter='twitter')
>>> ek.process("thewatercooler exponentialbackoff no segmentation")
['the', 'watercooler', 'exponential', 'back', 'off', 'no', 'segmentation']
  • Substitute words with custom tokens:
>>> ek = Ekphrasis(dicts=[{':)': '<happy>', ':(': '<sad>'}])
>>> ek.process("Hello :) how are you? :(")
['Hello', '<happy>', 'how', 'are', 'you', '?', '<sad>']
  • Perform spell correction on text and on elongated words by using statistics from default 'english' corpus:
>>> ek = Ekphrasis(spell_correction=True, spell_correct_elong=True)
>>> ek.process("This is huuuuge. The korrect way of doing tihngs is not the followingt")
["this", 'is', 'huge', '.', 'the', 'correct', "way", "of", "doing", "things", "is", 'not', 'the',
'following']
PARAMETER DESCRIPTION
omit

Choose which tokens you want to omit from the text.

Possible values: ['email', 'percent', 'money', 'phone', 'user','time', 'url', 'date', 'hashtag']

Important Notes:

1 - a token in this list must also be present in the `normalize`
    list to have any effect!
2 - put `url` at the front of the list if you plan to use it,
    since it interferes with the other regexes!
3 - if you use `hashtag`, then `unpack_hashtags` will
    automatically be set to False

TYPE: List DEFAULT: None

normalize

Choose which tokens you want to normalize in the text. Possible values: ['email', 'percent', 'money', 'phone', 'user', 'time', 'url', 'date', 'hashtag']

For example: myaddress@mysite.com -> <email>

Important Notes:

1 - put `url` at the front of the list if you plan to use it,
    since it interferes with the other regexes!
2 - if you use `hashtag`, then `unpack_hashtags` will
    automatically be set to False

TYPE: List DEFAULT: None

unpack_contractions

Replace English contractions in running text with their unshortened forms.

for example: can't -> can not, wouldn't -> would not, and so on

TYPE: bool DEFAULT: False

unpack_hashtags

Split a hashtag into its constituent words.

for example: #ilikedogs -> i like dogs

TYPE: bool DEFAULT: False

annotate

Add special tags to special tokens.

Possible values: ['hashtag', 'allcaps', 'elongated', 'repeated']

for example (with 'repeated'): good !!! -> good ! <repeated>

TYPE: List DEFAULT: None

corrector

Define which corpus statistics to use, between 'english' and 'twitter'. Be sure to set spell_correction to True if you want to perform spell correction on the running text.

TYPE: str DEFAULT: None

tokenizer

Callable that accepts a string and returns a list of strings. If no tokenizer is provided, the text will be tokenized on whitespace.

TYPE: Callable DEFAULT: social_tokenizer_ekphrasis
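
Any callable with this signature can be passed as tokenizer. A minimal sketch of such a callable (the function name and the regex are illustrative, not part of the library):

```python
import re
from typing import List

def whitespace_punct_tokenizer(text: str) -> List[str]:
    # Keep runs of word characters as tokens and split off
    # each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# whitespace_punct_tokenizer("Hello, world!") -> ['Hello', ',', 'world', '!']
```

A tokenizer like this could then be passed as Ekphrasis(tokenizer=whitespace_punct_tokenizer).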

segmenter

Define which corpus statistics to use, between 'english' and 'twitter'. Be sure to set segmentation to True if you want to perform segmentation on the running text.

TYPE: str DEFAULT: None

all_caps_tag

How to wrap capitalized words.

Note: applicable only when 'allcaps' is included in the annotate list.

Possible values ['single', 'wrap', 'every']:

- single: add a tag after the last capitalized word
    for example: "SHOUTED TEXT" -> "shouted text <allcaps>"
- wrap: wrap all words with opening and closing tags
    for example: "SHOUTED TEXT" -> "<allcaps> shouted text </allcaps>"
- every: add a tag after each word
    for example: "SHOUTED TEXT" -> "shouted <allcaps> text <allcaps>"

TYPE: str DEFAULT: None
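
The three modes can be illustrated with a small re-implementation (this is a sketch reproducing the documented behavior, not the library's actual code):

```python
from typing import List

def tag_allcaps(tokens: List[str], mode: str = "single") -> List[str]:
    # Illustrative re-implementation of the three all_caps_tag modes.
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if tokens[i].isalpha() and tokens[i].isupper():
            # Collect the whole run of capitalized words.
            run: List[str] = []
            while i < len(tokens) and tokens[i].isalpha() and tokens[i].isupper():
                run.append(tokens[i].lower())
                i += 1
            if mode == "single":      # one tag after the run
                out += run + ["<allcaps>"]
            elif mode == "wrap":      # opening and closing tags around the run
                out += ["<allcaps>"] + run + ["</allcaps>"]
            else:                     # "every": a tag after each word
                for word in run:
                    out += [word, "<allcaps>"]
        else:
            out.append(tokens[i])
            i += 1
    return out
```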

spell_correction

If set to True, running text will be spell-corrected using the statistics of the corpus set in the corrector parameter.

TYPE: bool DEFAULT: False

segmentation

If set to True, running text will be segmented using the statistics of the corpus set in the segmenter parameter.

for example: exponentialbackoff -> exponential back off

TYPE: bool DEFAULT: False

dicts

List of dictionaries mapping tokens to custom replacement expressions.

for example: :) -> <happy>

TYPE: List[Dict] DEFAULT: None

spell_correct_elong

Choose whether to perform spell correction after the normalization of elongated words.

Significantly affects performance (speed).

TYPE: bool DEFAULT: False


Source code in clayrs/content_analyzer/information_processor/ekphrasis_processor.py
def __init__(self, *,
             omit: List = None,
             normalize: List = None,
             unpack_contractions: bool = False,
             unpack_hashtags: bool = False,
             annotate: List = None,
             corrector: str = None,
             tokenizer: Callable = social_tokenizer_ekphrasis,
             segmenter: str = None,
             all_caps_tag: str = None,
             spell_correction: bool = False,
             segmentation: bool = False,
             dicts: List[Dict] = None,
             spell_correct_elong: bool = False):

    # ekphrasis has default values for arguments not passed. So if they are not evaluated in our class,
    # we simply don't pass them to ekphrasis
    kwargs_to_pass = {argument: arg_value for argument, arg_value in zip(locals().keys(), locals().values())
                      if argument != 'self' and arg_value is not None}

    self.text_processor = TextPreProcessor(**kwargs_to_pass)

    self.spell_correct_elong = spell_correct_elong

    self.sc = None
    if spell_correction is True:
        if corrector is not None:
            self.sc = SpellCorrector(corpus=corrector)
        else:
            self.sc = SpellCorrector()

    self.segmentation = segmentation
    self.ws = None
    if segmentation is True:
        if segmenter is not None:
            self.ws = Segmenter(corpus=segmenter)
        else:
            self.ws = Segmenter()

    self._repr_string = autorepr(self, inspect.currentframe())
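
The constructor forwards only the arguments the caller actually set to ekphrasis' TextPreProcessor, so the library can apply its own defaults for the rest. The filtering idiom can be sketched in isolation (the function name is illustrative):

```python
def build_kwargs(**params):
    # Keep only parameters the caller explicitly set (i.e. not None),
    # so the downstream library falls back to its own defaults
    # for everything else. Note that False is kept, since it is
    # an explicit value, not an unset one.
    return {name: value for name, value in params.items() if value is not None}

# build_kwargs(omit=['email'], normalize=None) -> {'omit': ['email']}
```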

process(field_data)

PARAMETER DESCRIPTION
field_data

Running text to be processed

TYPE: str

RETURNS DESCRIPTION
field_data

List of str representing the preprocessed running text

TYPE: List[str]

Source code in clayrs/content_analyzer/information_processor/ekphrasis_processor.py
def process(self, field_data: str) -> List[str]:
    """
    Args:
        field_data: Running text to be processed
    Returns:
        field_data: List of str representing running text preprocessed
    """
    field_data = self.text_processor.pre_process_doc(field_data)
    if self.sc is not None:
        field_data = self.__spell_check(field_data)
    if self.ws is not None:
        field_data = self.__word_segmenter(field_data)
    return field_data
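
The method is a pipeline of up to three stages, each applied only when configured. The same shape can be sketched with plain functions (all names here are illustrative, not part of the library):

```python
from typing import Callable, List, Optional

def run_pipeline(text: str,
                 tokenize: Callable[[str], List[str]],
                 spell_check: Optional[Callable[[List[str]], List[str]]] = None,
                 segment: Optional[Callable[[List[str]], List[str]]] = None) -> List[str]:
    # Mirrors the structure of Ekphrasis.process: mandatory
    # tokenization/normalization, then optional spell correction
    # and word segmentation applied only when they were configured.
    tokens = tokenize(text)
    if spell_check is not None:
        tokens = spell_check(tokens)
    if segment is not None:
        tokens = segment(tokens)
    return tokens
```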