Content Analyzer Config

ContentAnalyzerConfig(source, id, output_directory, field_dict=None, exogenous_representation_list=None, export_json=False)

Bases: ABC

Abstract class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer must produce complex representations of the contents, i.e. how to preprocess them and how to represent them.

PARAMETER DESCRIPTION
source

Raw data source wrapper which contains original information about contents to process

TYPE: RawInformationSource

id

Field of the raw source which represents each content uniquely.

TYPE: Union[str, List[str]]

output_directory

Where contents complexly represented will be serialized

TYPE: str

field_dict

Dictionary object which contains, for each field of the raw source to process, a list of FieldConfig objects (e.g. {'plot': [FieldConfig(SkLearnTfIdf())], 'genres': [FieldConfig(WhooshTfIdf())]})

TYPE: Dict[str, List[FieldConfig]] DEFAULT: None

exogenous_representation_list

Single ExogenousConfig object or list of ExogenousConfig objects that will be used to expand each content with data from external sources

TYPE: Union[ExogenousConfig, List[ExogenousConfig]] DEFAULT: None

export_json

If set to True, contents complexly represented will also be serialized as human-readable JSON, in addition to the proprietary format of the framework

TYPE: bool DEFAULT: False

Source code in clayrs/content_analyzer/config.py
def __init__(self, source: RawInformationSource,
             id: Union[str, List[str]],
             output_directory: str,
             field_dict: Dict[str, List[FieldConfig]] = None,
             exogenous_representation_list: Union[ExogenousConfig, List[ExogenousConfig]] = None,
             export_json: bool = False):
    if field_dict is None:
        field_dict = {}
    if exogenous_representation_list is None:
        exogenous_representation_list = []

    self.__source = source
    self.__id = id
    self.__output_directory = output_directory
    self.__field_dict = field_dict
    self.__exogenous_representation_list = exogenous_representation_list
    self.__export_json = export_json

    if not isinstance(self.__exogenous_representation_list, list):
        self.__exogenous_representation_list = [self.__exogenous_representation_list]

    if not isinstance(self.__id, list):
        self.__id = [self.__id]
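
The constructor above normalizes its arguments: `None` defaults become fresh containers, and a single `id` or `ExogenousConfig` is wrapped in a list. A minimal standalone sketch of that pattern follows. The names `as_list` and `normalize_args` are hypothetical helpers used only for illustration; they are not part of ClayRS.

```python
from typing import Any, Dict, List, Optional, Tuple, Union


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; return an existing list unchanged."""
    return value if isinstance(value, list) else [value]


def normalize_args(id: Union[str, List[str]],
                   field_dict: Optional[Dict[str, list]] = None,
                   exogenous: Any = None) -> Tuple[List[str], Dict[str, list], list]:
    # None defaults are replaced with fresh containers on every call,
    # so two configs never share the same mutable default object
    if field_dict is None:
        field_dict = {}
    if exogenous is None:
        exogenous = []
    return as_list(id), field_dict, as_list(exogenous)


ids, fields, exo = normalize_args("movie_id")
print(ids, fields, exo)
```

This is why the `id` property is typed as `List[str]` even when a plain string is passed to the constructor.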

exogenous_representation_list: List[ExogenousConfig] property

Getter for the exogenous_representation_list

export_json: bool property

Getter for the export_json parameter

id: List[str] property

Getter for the field name(s) of the raw source used to uniquely identify the produced contents

output_directory property

Getter for the output directory where the produced contents will be stored

source: RawInformationSource property

Getter for the raw information source where the original contents are stored

add_multiple_config(field_name, config_list)

Method which adds multiple complex representations for the field_name of the raw source

Examples:

  • Represent preprocessed field "Plot" of the raw source with a tf-idf technique using sklearn and a word embedding technique using Word2Vec. For the latter, no preprocessing operation will be applied
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_multiple_config("Plot",
>>>                                       [ca.FieldConfig(ca.SkLearnTfIdf(),
>>>                                                       preprocessing=ca.NLTK(stopwords_removal=True)),
>>>
>>>                                        ca.FieldConfig(ca.WordEmbeddingTechnique(ca.GensimWord2Vec()))])
PARAMETER DESCRIPTION
field_name

field name of the raw source which must be complexly represented

TYPE: str

config_list

List of FieldConfig objects specifying how to represent the field of the raw source

TYPE: List[FieldConfig]

Source code in clayrs/content_analyzer/config.py
def add_multiple_config(self, field_name: str, config_list: List[FieldConfig]):
    """
    Method which adds multiple complex representations for the `field_name` of the raw source

    Examples:

        * Represent preprocessed field "Plot" of the raw source with a tf-idf technique using sklearn and a word
        embedding technique using Word2Vec. For the latter, no preprocessing operation will be applied
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_multiple_config("Plot",
        >>>                                       [ca.FieldConfig(ca.SkLearnTfIdf(),
        >>>                                                       preprocessing=ca.NLTK(stopwords_removal=True)),
        >>>
        >>>                                        ca.FieldConfig(ca.WordEmbeddingTechnique(ca.GensimWord2Vec()))])

    Args:
        field_name: field name of the raw source which must be complexly represented
        config_list: List of `FieldConfig` objects specifying how to represent the field of the raw source
    """
    # If the field_name is not in the field_dict keys it means there is no list to append the FieldConfig to,
    # so a new list is instantiated
    if self.__field_dict.get(field_name) is not None:
        self.__field_dict[field_name].extend(config_list)
    else:
        self.__field_dict[field_name] = list()
        self.__field_dict[field_name].extend(config_list)
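
The method above maintains a dictionary that maps each field name to a list of configs, creating the list on first use. As a minimal sketch of the same idea, the if/else branch can be expressed with `dict.setdefault`. Plain strings stand in for `FieldConfig` objects here, and `add_multiple` is a hypothetical free function, not the ClayRS method.

```python
from typing import Dict, List

# Strings stand in for FieldConfig objects in this sketch
field_dict: Dict[str, List[str]] = {}


def add_multiple(field_name: str, configs: List[str]) -> None:
    # setdefault returns the list already stored under field_name,
    # or stores a new empty list and returns it
    field_dict.setdefault(field_name, []).extend(configs)


add_multiple("Plot", ["tfidf"])
add_multiple("Plot", ["word2vec", "original"])
add_multiple("Genre", ["original"])
print(field_dict)
```

Configs added in later calls are appended after the earlier ones, so the order of calls determines the position of each representation within its field.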

add_multiple_exogenous(config_list)

Method which adds multiple exogenous representations which will be used to expand each content

Examples:

  • Expand each content using both DBPedia and a local dataset as external sources
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_multiple_exogenous(
>>>     [
>>>         ca.ExogenousConfig(
>>>             ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
>>>         ),
>>>
>>>         ca.ExogenousConfig(
>>>             ca.PropertiesFromDataset(field_name_list=['director'])
>>>         ),
>>>     ]
>>> )
PARAMETER DESCRIPTION
config_list

List containing ExogenousConfig objects specifying how to expand each content

TYPE: List[ExogenousConfig]

Source code in clayrs/content_analyzer/config.py
def add_multiple_exogenous(self, config_list: List[ExogenousConfig]):
    """
    Method which adds multiple exogenous representations which will be used to expand each content

    Examples:

        * Expand each content using both DBPedia and a local dataset as external sources
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_multiple_exogenous(
        >>>     [
        >>>         ca.ExogenousConfig(
        >>>             ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
        >>>         ),
        >>>
        >>>         ca.ExogenousConfig(
        >>>             ca.PropertiesFromDataset(field_name_list=['director'])
        >>>         ),
        >>>     ]
        >>> )

    Args:
        config_list: List containing `ExogenousConfig` objects specifying how to expand each content
    """
    self.__exogenous_representation_list.extend(config_list)

add_single_config(field_name, field_config)

Method which adds a single complex representation for the field_name of the raw source

Examples:

  • Represent field "Plot" of the raw source with a tf-idf technique using sklearn
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_single_config("Plot", ca.FieldConfig(ca.SkLearnTfIdf()))
PARAMETER DESCRIPTION
field_name

field name of the raw source which must be complexly represented

TYPE: str

field_config

FieldConfig specifying how to represent the field of the raw source

TYPE: FieldConfig

Source code in clayrs/content_analyzer/config.py
def add_single_config(self, field_name: str, field_config: FieldConfig):
    """
    Method which adds a single complex representation for the `field_name` of the raw source

    Examples:

        * Represent field "Plot" of the raw source with a tf-idf technique using sklearn
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_single_config("Plot", ca.FieldConfig(ca.SkLearnTfIdf()))

    Args:
        field_name: field name of the raw source which must be complexly represented
        field_config: `FieldConfig` specifying how to represent the field of the raw source
    """
    # If the field_name is not in the field_dict keys it means there is no list to append the FieldConfig to,
    # so a new list is instantiated
    if self.__field_dict.get(field_name) is not None:
        self.__field_dict[field_name].append(field_config)
    else:
        self.__field_dict[field_name] = list()
        self.__field_dict[field_name].append(field_config)

add_single_exogenous(exogenous_config)

Method which adds a single exogenous representation which will be used to expand each content

Examples:

  • Expand each content by using DBPedia as external source
>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_single_exogenous(
>>>     ca.ExogenousConfig(
>>>         ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
>>>     )
>>> )
PARAMETER DESCRIPTION
exogenous_config

ExogenousConfig object specifying how to expand each content

TYPE: ExogenousConfig

Source code in clayrs/content_analyzer/config.py
def add_single_exogenous(self, exogenous_config: ExogenousConfig):
    """
    Method which adds a single exogenous representation which will be used to expand each content

    Examples:

        * Expand each content by using DBPedia as external source
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_single_exogenous(
        >>>     ca.ExogenousConfig(
        >>>         ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
        >>>     )
        >>> )

    Args:
        exogenous_config: `ExogenousConfig` object specifying how to expand each content
    """
    self.__exogenous_representation_list.append(exogenous_config)

get_configs_list(field_name)

Method which returns the list of all FieldConfig objects specified for the input field_name parameter

PARAMETER DESCRIPTION
field_name

Name of the field for which the list of field configs will be retrieved

TYPE: str

RETURNS DESCRIPTION
List[FieldConfig]

List containing all FieldConfig objects specified for the input field_name

Source code in clayrs/content_analyzer/config.py
def get_configs_list(self, field_name: str) -> List[FieldConfig]:
    """
    Method which returns the list of all `FieldConfig` objects specified for the input `field_name` parameter

    Args:
        field_name: Name of the field for which the list of field configs will be retrieved

    Returns:
        List containing all `FieldConfig` objects specified for the input `field_name`
    """
    return [config for config in self.__field_dict[field_name]]
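
Two consequences of the listing above are worth seeing in isolation: the list comprehension returns a new list, so mutating the result does not touch the stored configuration, and indexing the dictionary directly raises `KeyError` for a field that was never configured. The sketch below reproduces this with a plain dictionary of strings; it is a stand-in, not the ClayRS class.

```python
from typing import Dict, List

# Strings stand in for FieldConfig objects in this sketch
field_dict: Dict[str, List[str]] = {"Plot": ["tfidf", "word2vec"]}


def get_configs_list(field_name: str) -> List[str]:
    # Same comprehension as in the listing: builds a new list object
    return [config for config in field_dict[field_name]]


configs = get_configs_list("Plot")
configs.append("extra")      # mutates only the returned copy
print(field_dict["Plot"])    # the stored list is unchanged

try:
    get_configs_list("Title")    # this field was never configured
    unknown_field_failed = False
except KeyError:
    unknown_field_failed = True
print(unknown_field_failed)
```

The copy is shallow: the list is new, but the objects inside it are the same ones held by the configuration.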

get_field_name_list()

Method which returns a list containing all the fields of the raw source for which at least one FieldConfig object has been assigned (i.e. at least one complex representation is specified)

RETURNS DESCRIPTION
List[str]

List of all the fields of the raw source that must be complexly represented

Source code in clayrs/content_analyzer/config.py
def get_field_name_list(self) -> List[str]:
    """
    Method which returns a list containing all the fields of the raw source for which at least one `FieldConfig`
    object has been assigned (i.e. at least one complex representation is specified)

    Returns:
        List of all the fields of the raw source that must be complexly represented
    """
    return list(self.__field_dict.keys())

ExogenousConfig(exogenous_technique, id=None)

Class that represents the configuration for a single exogenous representation.

The config allows the user to specify an ExogenousPropertiesRetrieval technique to use to expand each content. Unlike a FieldConfig object, an ExogenousConfig does not refer to a particular field but to the whole content itself.

You can use the id parameter to assign a custom id to the representation: you can then refer to it by that custom id, rather than by the positional integer that the framework assigns automatically.

  • This will create an exogenous representation for the content by expanding it using DBPedia; the representation will be named 'test'
ExogenousConfig(DBPediaMappingTechnique('dbo:Film', 'Title', 'EN'), id='test')
  • Same as the example above, but since no custom id was assigned, the exogenous representation can be referred to only with an integer (0 if it's the first exogenous representation specified for the contents, 1 if it's the second, etc.)
ExogenousConfig(DBPediaMappingTechnique('dbo:Film', 'Title', 'EN'))
PARAMETER DESCRIPTION
exogenous_technique

Technique which will be used to expand each content with data from external sources. An example is the DBPediaMappingTechnique, which retrieves properties from DBPedia.

TYPE: ExogenousPropertiesRetrieval

id

Custom id that can be used later by the user to easily refer to the representation generated by this config. IDs should be unique and should only contain '_', '-' and alphanumeric characters

TYPE: str DEFAULT: None

Source code in clayrs/content_analyzer/config.py
def __init__(self, exogenous_technique: ExogenousPropertiesRetrieval, id: str = None):
    if id is not None:
        self._check_custom_id(id)

    self.__exogenous_technique = exogenous_technique
    self.__id = id
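
The body of `_check_custom_id` is not shown on this page. As a rough sketch of what such a check could look like, the function below enforces the documented rule that a custom id may only contain '_', '-' and alphanumeric characters. Everything here is an assumption: the names `check_custom_id` and `is_valid_id`, the choice of `ValueError`, the restriction to ASCII letters and digits, and the rejection of empty strings.

```python
import re

# Assumed rule: one or more ASCII letters, digits, '_' or '-'
_VALID_ID = re.compile(r"[A-Za-z0-9_-]+")


def check_custom_id(custom_id: str) -> None:
    """Raise ValueError if the id contains characters outside the documented set."""
    if _VALID_ID.fullmatch(custom_id) is None:
        raise ValueError(f"Invalid custom id: {custom_id!r}")


def is_valid_id(custom_id: str) -> bool:
    """Boolean wrapper around check_custom_id, convenient for quick checks."""
    try:
        check_custom_id(custom_id)
        return True
    except ValueError:
        return False


for candidate in ["test", "field_example-1", "bad id", "dbo:Film"]:
    print(candidate, is_valid_id(candidate))
```

Validating in the constructor means a malformed id fails when the config is created rather than later, during content production.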

exogenous_technique property

Getter for the exogenous properties retrieval technique

id property

Getter for the ExogenousConfig id

FieldConfig(content_technique=OriginalData(), preprocessing=None, postprocessing=None, memory_interface=None, id=None)

Class that represents the configuration for a single representation of a field. The configuration of a single representation is defined by a FieldContentProductionTechnique (e.g. an EmbeddingTechnique) that will be applied to the pre-processed data of said field.

To specify how to preprocess data, simply specify an InformationProcessor in the preprocessing parameter. Multiple InformationProcessor objects can be wrapped in a list: in this case, the field will be preprocessed by applying every object in the list. If preprocessing is not defined, no preprocessing operations will be done on the field data.

You can use the id parameter to assign a custom id to the representation: you can then refer to it by that custom id, rather than by the positional integer that the framework assigns automatically.

There is also a memory_interface attribute which lets you define a data structure where the representation will be serialized (e.g. an Index).

Various configurations are possible depending on how the user wants to represent a particular field:

  • This will produce a field representation using the SkLearnTfIdf technique on the field data preprocessed by NLTK with stopwords removal; the produced representation will be named 'field_example'
FieldConfig(SkLearnTfIdf(), NLTK(stopwords_removal=True), id='field_example')
  • This will produce the same result as above, but the id of the field representation will be set by the ContentAnalyzer when the config is processed (integer 0 if it's the first representation specified for the field, 1 if it's the second, etc.)
FieldConfig(SkLearnTfIdf(), NLTK(stopwords_removal=True))
  • This will produce a field representation using the SkLearnTfIdf technique on the field data without any preprocessing; the representation will not be stored directly in the content but in an index
FieldConfig(SkLearnTfIdf(), memory_interface=SearchIndex('/somedir'))
  • Here nothing will be done to the field data: it will be represented as is
FieldConfig()
PARAMETER DESCRIPTION
content_technique

Technique that will be applied to the field in order to produce a complex representation of said field

TYPE: FieldContentProductionTechnique DEFAULT: OriginalData()

preprocessing

Single InformationProcessor object or a list of InformationProcessor objects that will be used to preprocess field data before applying the content_technique

TYPE: Union[InformationProcessor, List[InformationProcessor]] DEFAULT: None

memory_interface

Complex structure where the content representation can be serialized (an Index, for example)

TYPE: InformationInterface DEFAULT: None

id

Custom id that can be used later by the user to easily refer to the representation generated by this config. IDs for a single field should be unique and should only contain '_', '-' and alphanumeric characters

TYPE: str DEFAULT: None

Source code in clayrs/content_analyzer/config.py
def __init__(self,
             content_technique: FieldContentProductionTechnique = OriginalData(),
             preprocessing: Union[InformationProcessor, List[InformationProcessor]] = None,
             postprocessing: Union[VisualPostProcessor, List[VisualPostProcessor]] = None,
             memory_interface: InformationInterface = None,
             id: str = None):

    if preprocessing is None:
        preprocessing = []

    if postprocessing is None:
        postprocessing = []

    if id is not None:
        self._check_custom_id(id)

    self.__content_technique = content_technique
    self.__preprocessing = preprocessing
    self.__postprocessing = postprocessing
    self.__memory_interface = memory_interface
    self.__id = id

    if not isinstance(self.__preprocessing, list):
        self.__preprocessing = [self.__preprocessing]

    if not isinstance(self.__postprocessing, list):
        self.__postprocessing = [self.__postprocessing]
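
The description above says a representation can be referred to either by its custom id or by a positional integer. The sketch below illustrates one way such a lookup could work; it is not the ClayRS lookup code. The function `resolve_representation` is hypothetical, and it assumes that a representation with a custom id can also still be reached by its position.

```python
from typing import List, Optional, Union


def resolve_representation(custom_ids: List[Optional[str]], key: Union[int, str]) -> int:
    """Return the position of the representation referred to by `key`.

    `custom_ids[i]` is the custom id of the i-th config of a field,
    or None if that config was created without an id.
    """
    if isinstance(key, int):
        if not 0 <= key < len(custom_ids):
            raise KeyError(f"No representation at position {key}")
        return key
    if key not in custom_ids:
        raise KeyError(f"No representation with id {key!r}")
    return custom_ids.index(key)


# Three configs for one field: only the first and third have a custom id
plot_ids = ["field_example", None, "w2v"]
print(resolve_representation(plot_ids, "w2v"))  # lookup by custom id
print(resolve_representation(plot_ids, 1))      # lookup by position
```

A custom id stays valid if configs are later added or reordered, whereas a positional integer depends on the order in which the configs were specified.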

content_technique property

Getter for the field content production technique of the field

id property

Getter for the id of the field config

memory_interface property

Getter for the index associated with the field config

postprocessing property

Getter for the list of postprocessors of the field config

preprocessing property

Getter for the list of preprocessors of the field config

ItemAnalyzerConfig

Bases: ContentAnalyzerConfig

Class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer needs to complexly represent contents, i.e. how to preprocess them and how to represent them. In particular, this class refers to items.

Examples:

>>> import clayrs.content_analyzer as ca
>>> raw_source = ca.JSONFile(json_path)
>>> movies_config = ca.ItemAnalyzerConfig(raw_source, id='movie_id', output_directory='movies_codified/')
>>> # add single field config
>>> movies_config.add_single_config('occupation', ca.FieldConfig(content_technique=ca.OriginalData()))
>>> # add single exogenous technique
>>> movies_config.add_single_exogenous(ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender'])))

UserAnalyzerConfig

Bases: ContentAnalyzerConfig

Class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer needs to complexly represent contents, i.e. how to preprocess them and how to represent them. In particular, this class refers to users.

Examples:

>>> import clayrs.content_analyzer as ca
>>> raw_source = ca.JSONFile(json_path)
>>> users_config = ca.UserAnalyzerConfig(raw_source, id='user_id', output_directory='users_codified/')
>>> # add single field config
>>> users_config.add_single_config('occupation', ca.FieldConfig(content_technique=ca.OriginalData()))
>>> # add single exogenous technique
>>> users_config.add_single_exogenous(ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender'])))

Content Analyzer Class

ContentAnalyzer(config, n_thread=1)

Class to which control of the content analysis phase is delegated. It uses the data stored in the configuration object to create and serialize the contents the user wants to produce. It also checks that the configurations the user wants to run on the raw contents have unique ids (otherwise it would be impossible to refer to a particular field representation or exogenous representation).

PARAMETER DESCRIPTION
config

Configuration specifying how the contents are processed. It lets the user customize the way in which the input data is processed.

TYPE: ContentAnalyzerConfig

n_thread

Number of threads used when serializing the produced contents

TYPE: int DEFAULT: 1

Source code in clayrs/content_analyzer/content_analyzer_main.py
def __init__(self, config: ContentAnalyzerConfig, n_thread: int = 1):
    self._config: ContentAnalyzerConfig = config
    self._n_thread = n_thread
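
The class description mentions a uniqueness check on the ids of the configured representations, but the check itself is not listed on this page. A minimal sketch of such a check is given below, assuming that ids must be unique within a single field and that configs without a custom id (`None`) never clash. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def find_duplicate_ids(ids: List[Optional[str]]) -> List[str]:
    """Return the custom ids that occur more than once; None entries are ignored."""
    counts = Counter(i for i in ids if i is not None)
    return sorted(i for i, n in counts.items() if n > 1)


def check_unique_ids(field_ids: Dict[str, List[Optional[str]]]) -> None:
    """Raise ValueError if any field has two configs with the same custom id."""
    for field_name, ids in field_ids.items():
        duplicates = find_duplicate_ids(ids)
        if duplicates:
            raise ValueError(f"Duplicate ids {duplicates} for field {field_name!r}")


# The same id in two different fields is accepted in this sketch
check_unique_ids({"Plot": ["tfidf", None, None], "Genre": ["tfidf"]})

try:
    check_unique_ids({"Plot": ["tfidf", "w2v", "tfidf"]})
    duplicate_detected = False
except ValueError as err:
    duplicate_detected = True
    print(err)
```

Running a check like this once, before any content is produced, avoids starting a long processing job that is certain to fail.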

fit()

Creates the contents and serializes them. This method starts the content production process and initializes everything that will be used to create said contents, their fields and their representations

Source code in clayrs/content_analyzer/content_analyzer_main.py
def fit(self):
    """
    Processes the creation of the contents and serializes the contents. This method starts the content production
    process and initializes everything that will be used to create said contents, their fields and their
    representations
    """
    # before starting the process, the content analyzer main checks that there are no duplicate id cases
    # both in the field dictionary and in the exogenous representation list
    # this is done now and not recursively for each content during the creation process, in order to avoid starting
    # an operation that is going to fail
    try:
        self.__check_field_dict()
        self.__check_exogenous_representation_list()
    except ValueError as e:
        raise e

    # creates the directory where the data will be serialized and overwrites it if it already exists
    output_path = self._config.output_directory
    if os.path.exists(output_path):
        shutil.rmtree(output_path)
    os.makedirs(output_path)

    contents_producer = ContentsProducer.get_instance()
    contents_producer.set_config(self._config)
    created_contents = contents_producer.create_contents()

    if self._config.export_json:
        json_path = os.path.join(self._config.output_directory, 'contents.json')
        with open(json_path, "w") as data:
            json.dump(created_contents, data, cls=ContentEncoder, indent=4)

    # with get_progbar(created_contents) as pbar:
    with get_iterator_thread(self._n_thread, self._serialize_content, created_contents,
                             keep_order=False, progress_bar=True, total=len(created_contents)) as pbar:
        pbar.set_description("Serializing contents")

        for _ in pbar:
            pass
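
One detail of `fit()` deserves attention: the output directory is deleted and recreated on every run, so anything already stored there is lost. The sketch below reproduces that step and the optional JSON export using only the standard library, inside a temporary directory. The helper `reset_directory` and the sample `contents` list are made up for illustration, and plain `json.dump` is used instead of the `ContentEncoder` from the listing.

```python
import json
import os
import shutil
import tempfile


def reset_directory(path: str) -> None:
    """Delete `path` (and everything in it) if it exists, then recreate it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)


base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "movies_codified")

# First run: the directory is created and a file is left behind
reset_directory(output_dir)
with open(os.path.join(output_dir, "stale.txt"), "w") as stale:
    stale.write("left over from a previous run")

# Second run: the whole directory is wiped before anything is written
reset_directory(output_dir)
files_after_reset = os.listdir(output_dir)

# Optional human-readable export, similar to the export_json branch
contents = [{"id": "1", "Plot": "placeholder text"}]
json_path = os.path.join(output_dir, "contents.json")
with open(json_path, "w") as data:
    json.dump(contents, data, indent=4)

with open(json_path) as data:
    reloaded = json.load(data)

print(files_after_reset, reloaded == contents)
shutil.rmtree(base_dir)  # clean up the temporary directory
```

Because of this behavior, `output_directory` should point to a folder dedicated to the analyzer output, not to one that also holds unrelated files.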