Content Analyzer Config
ContentAnalyzerConfig(source, id, output_directory, field_dict=None, exogenous_representation_list=None, export_json=False)
Bases: ABC
Abstract class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer needs to complexly represent contents, i.e. how to preprocess them and how to represent them.
PARAMETER | DESCRIPTION |
---|---|
source | Raw data source wrapper which contains the original information about the contents to process |
id | Field of the raw source which uniquely identifies each content |
output_directory | Where the complexly represented contents will be serialized |
field_dict | Dictionary object which contains, for each field of the raw source to process, a FieldConfig object (e.g. …) |
exogenous_representation_list | List of ExogenousConfig objects |
export_json | If set to True, the complexly represented contents will also be serialized as human-readable JSON, in addition to the proprietary format of the framework |
Source code in clayrs/content_analyzer/config.py
exogenous_representation_list: List[ExogenousConfig]
property
Getter for the exogenous_representation_list
export_json: bool
property
Getter for the export_json parameter
id: List[str]
property
Getter for the id that represents the ids of the produced contents
output_directory
property
Getter for the output directory where the produced contents will be stored
source: RawInformationSource
property
Getter for the raw information source where the original contents are stored
add_multiple_config(field_name, config_list)
Method which adds multiple complex representations for the field_name of the raw source
Examples:
- Represent preprocessed field "Plot" of the raw source with a tf-idf technique using sklearn and a word embedding technique using Word2Vec. For the latter, no preprocessing operation will be applied
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_multiple_config("Plot",
...                                      [ca.FieldConfig(ca.SkLearnTfIdf(),
...                                                      preprocessing=ca.NLTK(stopwords_removal=True)),
...                                       ca.FieldConfig(ca.WordEmbeddingTechnique(ca.GensimWord2Vec()))])
PARAMETER | DESCRIPTION |
---|---|
field_name | Name of the field of the raw source which must be complexly represented |
config_list | List of FieldConfig objects |
Source code in clayrs/content_analyzer/config.py
add_multiple_exogenous(config_list)
Method which adds multiple exogenous representations which will be used to expand each content
Examples:
- Expand each content by using DBPedia as external source and local dataset as external source
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_multiple_exogenous(
...     [
...         ca.ExogenousConfig(
...             ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
...         ),
...         ca.ExogenousConfig(
...             ca.PropertiesFromDataset(field_name_list=['director'])
...         ),
...     ]
... )
PARAMETER | DESCRIPTION |
---|---|
config_list | List containing ExogenousConfig objects |
Source code in clayrs/content_analyzer/config.py
add_single_config(field_name, field_config)
Method which adds a single complex representation for the field_name of the raw source
Examples:
- Represent field "Plot" of the raw source with a tf-idf technique using sklearn
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_single_config("Plot", ca.FieldConfig(ca.SkLearnTfIdf()))
PARAMETER | DESCRIPTION |
---|---|
field_name | Name of the field of the raw source which must be complexly represented |
field_config | FieldConfig object describing the representation to add for the field |
Source code in clayrs/content_analyzer/config.py
add_single_exogenous(exogenous_config)
Method which adds a single exogenous representation which will be used to expand each content
Examples:
- Expand each content by using DBPedia as external source
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_single_exogenous(
...     ca.ExogenousConfig(
...         ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
...     )
... )
PARAMETER | DESCRIPTION |
---|---|
exogenous_config | ExogenousConfig object describing the exogenous representation to add |
Source code in clayrs/content_analyzer/config.py
get_configs_list(field_name)
Method which returns the list of all FieldConfig objects specified for the input field_name parameter
PARAMETER | DESCRIPTION |
---|---|
field_name | Name of the field for which the list of field configs will be retrieved |
RETURNS | DESCRIPTION |
---|---|
List[FieldConfig] | List containing all FieldConfig objects specified for the input field_name |
Source code in clayrs/content_analyzer/config.py
get_field_name_list()
Method which returns a list containing all the fields of the raw source for which at least one FieldConfig object has been assigned (i.e. at least one complex representation is specified)
RETURNS | DESCRIPTION |
---|---|
List[str] | List of all the fields of the raw source that must be complexly represented |
Source code in clayrs/content_analyzer/config.py
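The four field-related methods above (add_single_config, add_multiple_config, get_configs_list, get_field_name_list) can be pictured as bookkeeping over a dictionary that maps each field name to its list of configs. The following is only a minimal sketch of that idea, not the ClayRS implementation: the `FieldRegistry` class is hypothetical, plain strings stand in for FieldConfig objects, and returning an empty list for an unknown field is an assumption.

```python
from typing import Any, Dict, List


class FieldRegistry:
    """Hypothetical stand-in for the field bookkeeping of a config object."""

    def __init__(self) -> None:
        # field name -> configs for that field, in insertion order
        self._field_dict: Dict[str, List[Any]] = {}

    def add_single_config(self, field_name: str, field_config: Any) -> None:
        # Append one representation config to the field
        self._field_dict.setdefault(field_name, []).append(field_config)

    def add_multiple_config(self, field_name: str, config_list: List[Any]) -> None:
        # Append several representation configs to the field at once
        self._field_dict.setdefault(field_name, []).extend(config_list)

    def get_configs_list(self, field_name: str) -> List[Any]:
        # Assumption: an unknown field yields an empty list
        return list(self._field_dict.get(field_name, []))

    def get_field_name_list(self) -> List[str]:
        # Only fields with at least one config were ever inserted
        return list(self._field_dict)


registry = FieldRegistry()
registry.add_single_config("Plot", "tfidf")
registry.add_multiple_config("Plot", ["word2vec", "original"])
registry.add_single_config("Title", "original")
```

In this sketch a field shows up in `get_field_name_list()` only after at least one config has been added for it, which mirrors the description of that method.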
ExogenousConfig(exogenous_technique, id=None)
Class that represents the configuration for a single exogenous representation.
The config allows the user to specify an ExogenousPropertiesRetrieval technique to use to expand each content.
Unlike FieldConfig objects, an ExogenousConfig does not refer to a particular field but to the whole content itself.
You can use the id parameter to assign a custom id to the representation: the representation can then be referred to by that custom id, rather than by the positional integer assigned automatically by the framework.
- This will create an exogenous representation for the content by expanding it using DBPedia; said representation will be named 'test'
ExogenousConfig(DBPediaMappingTechnique('dbo:Film', 'Title', 'EN'), id='test')
- Same as the example above, but since no custom id was assigned, the exogenous representation can be referred to only with an integer (0 if it's the first exogenous representation specified for the contents, 1 if it's the second, etc.)
ExogenousConfig(DBPediaMappingTechnique('dbo:Film', 'Title', 'EN'))
PARAMETER | DESCRIPTION |
---|---|
exogenous_technique | Technique which will be used to expand each content with data from external sources. An example is the DBPediaMappingTechnique, which retrieves properties from DBPedia. |
id | Custom id that can be used later by the user to easily refer to the representation generated by this config. IDs for a single field should be unique and should only contain '_', '-' and alphanumeric characters. |
Source code in clayrs/content_analyzer/config.py
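The character rule for custom ids ('_', '-' and alphanumeric characters only) can be checked before a config is built. This is a hedged sketch rather than the validation ClayRS performs: the helper name `is_valid_custom_id` is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits
_VALID_ID = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_custom_id(candidate: str) -> bool:
    """Return True if the whole string uses only allowed id characters."""
    return _VALID_ID.fullmatch(candidate) is not None
```

With this helper, ids such as 'test' or 'field_example' pass, while ids containing spaces or path separators are rejected.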
exogenous_technique
property
Getter for the exogenous properties retrieval technique
id
property
Getter for the ExogenousConfig id
FieldConfig(content_technique=OriginalData(), preprocessing=None, postprocessing=None, memory_interface=None, id=None)
Class that represents the configuration for a single representation of a field. The configuration of a single representation is defined by a FieldContentProductionTechnique (e.g. an EmbeddingTechnique) that will be applied to the pre-processed data of said field.
To specify how to preprocess data, simply specify an InformationProcessor in the preprocessing parameter. Multiple InformationProcessor objects can be wrapped in a list: in this case, the field will be preprocessed by applying the operations of all the objects inside the list. If preprocessing is not defined, no preprocessing operations will be done on the field data.
You can use the id parameter to assign a custom id to the representation: the representation can then be referred to by that custom id, rather than by the positional integer assigned automatically by the framework.
There is also a memory_interface attribute which allows defining a data structure where the representation will be serialized (e.g. an Index).
Various configurations are possible depending on how the user wants to represent a particular field:
- This will produce a field representation using the SkLearnTfIdf technique on the field data preprocessed by NLTK by performing stopwords removal, and the name of the produced representation will be 'field_example'
FieldConfig(SkLearnTfIdf(), NLTK(stopwords_removal=True), id='field_example')
- This will produce the same result as above, but the id for the field representation defined by this config will be set by the ContentAnalyzer when the config is processed (integer 0 if it's the first representation specified for the field, 1 if it's the second, etc.)
FieldConfig(SkLearnTfIdf(), NLTK(stopwords_removal=True))
- This will produce a field representation using the SkLearnTfIdf technique on the field data without applying any preprocessing operation; the representation will not be stored directly in the content but in an index
FieldConfig(SkLearnTfIdf(), memory_interface=SearchIndex('/somedir'))
- In the following nothing will be done on the field data, it will be represented as is
FieldConfig()
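The preprocessing parameter described above accepts nothing, a single processor, or a list of processors. One way to picture this is to normalize the argument to a list and apply each processor in turn. The sketch below is illustrative only: plain callables stand in for InformationProcessor objects, the helper names are hypothetical, and applying the processors in list order is an assumption.

```python
from typing import Callable, List, Union

# Stand-in for an InformationProcessor: a function from text to text
Processor = Callable[[str], str]


def as_processor_list(preprocessing: Union[Processor, List[Processor], None]) -> List[Processor]:
    """Normalize None, a single processor, or a list into a list."""
    if preprocessing is None:
        return []
    if isinstance(preprocessing, list):
        return preprocessing
    return [preprocessing]


def preprocess(field_data: str,
               preprocessing: Union[Processor, List[Processor], None] = None) -> str:
    """Apply every processor to the field data (assumed: in list order)."""
    for processor in as_processor_list(preprocessing):
        field_data = processor(field_data)
    return field_data
```

When no processor is given the field data is returned unchanged, matching the rule that an undefined preprocessing means no preprocessing operations are performed.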
PARAMETER | DESCRIPTION |
---|---|
content_technique | Technique that will be applied to the field in order to produce a complex representation of said field |
preprocessing | Single InformationProcessor or a list of InformationProcessor objects that will be used to preprocess the field data |
memory_interface | Complex structure where the content representation can be serialized (an Index for example) |
id | Custom id that can be used later by the user to easily refer to the representation generated by this config. IDs for a single field should be unique and should only contain '_', '-' and alphanumeric characters. |
Source code in clayrs/content_analyzer/config.py
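The fallback from custom ids to positional integers can be sketched as follows. This is an assumption-laden illustration, not the framework's code: the function `effective_ids` is hypothetical, and it assumes the positional integer counts every representation of the field, including those that also carry a custom id.

```python
from typing import List, Optional, Union


def effective_ids(custom_ids: List[Optional[str]]) -> List[Union[str, int]]:
    """Return the id usable for each representation of a field.

    A custom id is kept as is; a missing one (None) falls back to the
    position of the representation in the list (0 for the first, etc.).
    """
    return [
        custom if custom is not None else position
        for position, custom in enumerate(custom_ids)
    ]
```

For instance, a field with one named config followed by two unnamed ones would be addressed as 'field_example', 1 and 2 under these assumptions.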
content_technique
property
Getter for the field content production technique of the field
id
property
Getter for the id of the field config
memory_interface
property
Getter for the index associated with the field config
postprocessing
property
Getter for the list of postprocessors of the field config
preprocessing
property
Getter for the list of preprocessors of the field config
ItemAnalyzerConfig
Bases: ContentAnalyzerConfig
Class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer needs to complexly represent contents, i.e. how to preprocess them and how to represent them. In particular, this class refers to items.
Examples:
>>> import clayrs.content_analyzer as ca
>>> raw_source = ca.JSONFile(json_path)
>>> movies_config = ca.ItemAnalyzerConfig(raw_source, id='movie_id', output_directory='movies_codified/')
>>> # add single field config
>>> movies_config.add_single_config('occupation', ca.FieldConfig(content_technique=ca.OriginalData()))
>>> # add single exogenous technique
>>> movies_config.add_single_exogenous(ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender'])))
UserAnalyzerConfig
Bases: ContentAnalyzerConfig
Class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer needs to complexly represent contents, i.e. how to preprocess them and how to represent them. In particular, this class refers to users.
Examples:
>>> import clayrs.content_analyzer as ca
>>> raw_source = ca.JSONFile(json_path)
>>> users_config = ca.UserAnalyzerConfig(raw_source, id='user_id', output_directory='users_codified/')
>>> # add single field config
>>> users_config.add_single_config('occupation', ca.FieldConfig(content_technique=ca.OriginalData()))
>>> # add single exogenous technique
>>> users_config.add_single_exogenous(ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender'])))
Content Analyzer Class
ContentAnalyzer(config, n_thread=1)
Class to which control of the content analysis phase is delegated. It uses the data stored in the configuration to create and serialize the contents the user wants to produce. It also checks that the configurations the user wants to run on the raw contents have unique ids (otherwise it would be impossible to refer to a particular field representation or exogenous representation).
PARAMETER | DESCRIPTION |
---|---|
config | Configuration for processing the item fields. This parameter makes it possible to customize the way in which the input data is processed. |
Source code in clayrs/content_analyzer/content_analyzer_main.py
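The uniqueness check on ids mentioned in the class description can be illustrated with a small helper that reports repeated custom ids per field. This is a sketch under stated assumptions, not the check ContentAnalyzer actually runs: `find_duplicate_ids` is a hypothetical name, and configs without a custom id (None) are assumed never to clash because they fall back to distinct positional integers.

```python
from collections import Counter
from typing import Dict, List, Optional


def find_duplicate_ids(ids_by_field: Dict[str, List[Optional[str]]]) -> Dict[str, List[str]]:
    """Map each field to the custom ids that appear more than once in it."""
    duplicates: Dict[str, List[str]] = {}
    for field_name, ids in ids_by_field.items():
        # Ignore configs without a custom id
        counts = Counter(custom for custom in ids if custom is not None)
        repeated = sorted(custom for custom, n in counts.items() if n > 1)
        if repeated:
            duplicates[field_name] = repeated
    return duplicates
```

An empty result means every custom id is unique within its field, so each representation can be referred to unambiguously.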
fit()
Processes the creation of the contents and serializes the contents. This method starts the content production process and initializes everything that will be used to create said contents, their fields and their representations
Source code in clayrs/content_analyzer/content_analyzer_main.py