Ekphrasis Preprocessor
Ekphrasis(*, omit=None, normalize=None, unpack_contractions=False, unpack_hashtags=False, annotate=None, corrector=None, tokenizer=social_tokenizer_ekphrasis, segmenter=None, all_caps_tag=None, spell_correction=False, segmentation=False, dicts=None, spell_correct_elong=False)
Bases: NLP
Interface to the Ekphrasis library for natural language processing features
Examples:
- Normalize email and percentage tokens but omit email ones:
>>> ek = Ekphrasis(omit=['email'], normalize=['email', 'percent'])
>>> ek.process("this is an email: alias@mail.com and this is a percent 23%")
['this', 'is', 'an', 'email', ':', 'and', 'this', 'is', 'a', 'percent', '<percent>']
- Unpack contractions on running text:
>>> ek = Ekphrasis(unpack_contractions=True)
>>> ek.process("I can't do this because I won't and I shouldn't")
['i', 'can', 'not', 'do', 'this', 'because', 'i', 'will', 'not', 'and', 'i', 'should', 'not']
- Unpack hashtags using statistics from the 'twitter' corpus:
>>> ek = Ekphrasis(unpack_hashtags=True, segmenter='twitter')
>>> ek.process("#next #gamedev #retrogaming #coolphoto no unpack")
['next', 'game', 'dev', 'retro', 'gaming', 'cool', 'photo', 'no', 'unpack']
- Annotate all-caps words and repeated tokens, using a single tag for all-caps words:
>>> ek = Ekphrasis(annotate=['allcaps', 'repeated'], all_caps_tag='single')
>>> ek.process("this is good !!! text and a SHOUTED one")
['this', 'is', 'good', '!', '<repeated>', 'text', 'and', 'a', 'shouted', '<allcaps>', 'one']
- Perform segmentation using statistics from the 'twitter' corpus:
>>> ek = Ekphrasis(segmentation=True, segmenter='twitter')
>>> ek.process("thewatercooler exponentialbackoff no segmentation")
['the', 'watercooler', 'exponential', 'back', 'off', 'no', 'segmentation']
- Substitute words with custom tokens:
>>> ek = Ekphrasis(dicts=[{':)': '<happy>', ':(': '<sad>'}])
>>> ek.process("Hello :) how are you? :(")
['Hello', '<happy>', 'how', 'are', 'you', '?', '<sad>']
- Perform spell correction on text and on elongated words using statistics from the default 'english' corpus:
>>> ek = Ekphrasis(spell_correction=True, spell_correct_elong=True)
>>> ek.process("This is huuuuge. The korrect way of doing tihngs is not the followingt")
['this', 'is', 'huge', '.', 'the', 'correct', 'way', 'of', 'doing', 'things', 'is', 'not', 'the', 'following']
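The interplay of omit and normalize in the first example can be pictured with a small standalone sketch. It does not call Ekphrasis: the regular expressions and the function name normalize_and_omit are hypothetical simplifications, and it splits on whitespace only, so 'email:' stays a single token here, unlike the library output shown above.

```python
import re

# Hypothetical, heavily simplified patterns (assumption: the real library
# uses far more thorough expressions and a proper tokenizer).
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "percent": re.compile(r"^\d+(\.\d+)?%$"),
}


def normalize_and_omit(text, normalize=(), omit=()):
    """Drop tokens whose type is in `omit`; tag those whose type is in `normalize`."""
    result = []
    for token in text.split():
        # Find the first pattern that matches the token, if any.
        kind = next((name for name, rx in PATTERNS.items() if rx.match(token)), None)
        if kind in omit:
            continue  # omitted types disappear from the output
        if kind in normalize:
            result.append(f"<{kind}>")  # normalized types become a tag
        else:
            result.append(token)
    return result


tokens = normalize_and_omit(
    "this is an email: alias@mail.com and this is a percent 23%",
    normalize=["email", "percent"],
    omit=["email"],
)
print(tokens)
```

In this sketch a type listed in omit is removed even when it is also listed in normalize, which matches the first example, where the e-mail address vanishes and the percentage becomes '<percent>'.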
PARAMETER | DESCRIPTION |
---|---|
omit | Token types to omit from the text. Possible values: ['email', 'percent', 'money', 'phone', 'user', 'time', 'url', 'date', 'hashtag'] |
normalize | Token types to normalize in the text. Possible values: ['email', 'percent', 'money', 'phone', 'user', 'time', 'url', 'date', 'hashtag']. For example, with 'percent' normalized: 23% -> '<percent>' |
unpack_contractions | Replace English contractions in running text with their unshortened forms, for example: can't -> can not, wouldn't -> would not, and so on |
unpack_hashtags | Split a hashtag into its constituent words, for example: #ilikedogs -> i like dogs |
annotate | Add special tags to special tokens. Possible values: ['hashtag', 'allcaps', 'elongated', 'repeated']. For example, with 'repeated' annotated: !!! -> '!', '<repeated>' |
corrector | Define which corpus statistics to use [english, twitter]. Be sure to set |
tokenizer | Callable that accepts a string and returns a list of strings. If no tokenizer is provided, the text is tokenized on whitespace |
segmenter | Define which corpus statistics to use [english, twitter]. Be sure to set |
all_caps_tag | How to wrap the capitalized words. Note: applicable only when |
spell_correction | Whether to perform spell correction on the text. Significantly affects performance (speed). If set to True, running text will be spell corrected using statistics of corpus set in |
segmentation | Whether to segment running text, for example: exponentialbackoff -> exponential back off. If set to True, running text will be segmented using statistics of corpus set in |
spell_correct_elong | Whether to perform spell correction after the normalization of elongated words. Significantly affects performance (speed) |
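The tokenizer parameter is described as any callable that takes a string and returns a list of strings. A minimal sketch of such a callable follows; the name simple_tokenizer and its regular expression are illustrative assumptions, not part of the library, and the commented import path is only inferred from the source file location quoted on this page.

```python
import re
from typing import List


def simple_tokenizer(text: str) -> List[str]:
    """Return words (keeping apostrophes inside them) and runs of punctuation."""
    # \w+(?:'\w+)?  -> a word, optionally with one inner apostrophe (can't)
    # [^\w\s]+      -> a run of non-word, non-space characters (":)", "?")
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]+", text)


print(simple_tokenizer("Hello :) how are you?"))

# Assumed usage with the class documented here (requires the library):
# from clayrs.content_analyzer.information_processor.ekphrasis_processor import Ekphrasis
# ek = Ekphrasis(tokenizer=simple_tokenizer)
```

Keeping emoticon-like runs together as one token is a deliberate choice in this sketch, since options such as dicts map whole tokens like ':)' to custom tags.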
Source code in clayrs/content_analyzer/information_processor/ekphrasis_processor.py
process(field_data)
PARAMETER | DESCRIPTION |
---|---|
field_data | Running text to be processed |
RETURNS | DESCRIPTION |
---|---|
field_data | List of str representing the preprocessed running text |
Source code in clayrs/content_analyzer/information_processor/ekphrasis_processor.py
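Because process takes one running text and returns a list of str, applying it to several texts is a plain loop. The sketch below shows that shape with a stand-in object so it runs without the library; WhitespaceLowercase and process_all are hypothetical names introduced only for illustration.

```python
from typing import List, Protocol


class Preprocessor(Protocol):
    """Anything exposing process(field_data) -> list of str."""

    def process(self, field_data: str) -> List[str]: ...


class WhitespaceLowercase:
    """Hypothetical stand-in with the same `process` shape (not part of the library)."""

    def process(self, field_data: str) -> List[str]:
        return field_data.lower().split()


def process_all(preprocessor: Preprocessor, texts: List[str]) -> List[List[str]]:
    """Run `process` on each text and collect one token list per text."""
    return [preprocessor.process(text) for text in texts]


batches = process_all(WhitespaceLowercase(), ["Hello World", "no unpack"])
print(batches)
```

An Ekphrasis instance could presumably be passed in place of the stand-in, since it exposes the same process(field_data) method.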