Content Analyzer Config

`ContentAnalyzerConfig(source, id, output_directory, field_dict=None, exogenous_representation_list=None, export_json=False)`

Bases: ABC

Abstract class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer must complexly represent contents, i.e. how to preprocess them and how to represent them.

PARAMETER	DESCRIPTION
`source`	Raw data source wrapper which contains original information about contents to process TYPE: `RawInformationSource`
`id`	Field of the raw source which represents each content uniquely. TYPE: `Union[str, List[str]]`
`output_directory`	Where contents complexly represented will be serialized TYPE: `str`
`field_dict`	Dictionary which contains, for each field of the raw source to process, a list of `FieldConfig` objects (e.g. `{'plot': [FieldConfig(SkLearnTfIdf())], 'genres': [FieldConfig(WhooshTfIdf())]}`) TYPE: `Dict[str, List[FieldConfig]]` DEFAULT: `None`
`exogenous_representation_list`	Single `ExogenousConfig` object or list of `ExogenousConfig` objects that will be used to expand each content with data from external sources TYPE: `Union[ExogenousConfig, List[ExogenousConfig]]` DEFAULT: `None`
`export_json`	If set to True, contents complexly represented will also be serialized in a human-readable JSON file, in addition to the proprietary format of the framework TYPE: `bool` DEFAULT: `False`

Source code in clayrs/content_analyzer/config.py

def __init__(self, source: RawInformationSource,
             id: Union[str, List[str]],
             output_directory: str,
             field_dict: Dict[str, List[FieldConfig]] = None,
             exogenous_representation_list: Union[ExogenousConfig, List[ExogenousConfig]] = None,
             export_json: bool = False):
    if field_dict is None:
        field_dict = {}
    if exogenous_representation_list is None:
        exogenous_representation_list = []

    self.__source = source
    self.__id = id
    self.__output_directory = output_directory
    self.__field_dict = field_dict
    self.__exogenous_representation_list = exogenous_representation_list
    self.__export_json = export_json

    if not isinstance(self.__exogenous_representation_list, list):
        self.__exogenous_representation_list = [self.__exogenous_representation_list]

    if not isinstance(self.__id, list):
        self.__id = [self.__id]
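
The constructor accepts either a single value or a list for `id` and `exogenous_representation_list`, and always stores a list. A minimal sketch of that normalization follows; `as_list` is a hypothetical helper written for this illustration, not a ClayRS function, and its `None` branch mirrors only the handling of `exogenous_representation_list`.

```python
from typing import List, TypeVar, Union

T = TypeVar("T")


def as_list(value: Union[T, List[T], None]) -> List[T]:
    """Hypothetical helper mirroring the None / single value / list handling above."""
    if value is None:
        # same default the constructor uses for exogenous_representation_list
        return []
    if not isinstance(value, list):
        # a single value is wrapped, so later code can always iterate
        return [value]
    return value


normalized_id = as_list("movie_id")
normalized_ids = as_list(["movie_id", "year"])
normalized_empty = as_list(None)
```

With this normalization, the `id` property can always be typed as `List[str]`, whatever form the caller passed in.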

`exogenous_representation_list: List[ExogenousConfig]` `property`

Getter for the exogenous_representation_list

`export_json: bool` `property`

Getter for the export_json parameter

`id: List[str]` `property`

Getter for the id that represents the ids of the produced contents

`output_directory` `property`

Getter for the output directory where the produced contents will be stored

`source: RawInformationSource` `property`

Getter for the raw information source where the original contents are stored

`add_multiple_config(field_name, config_list)`

Method which adds multiple complex representations for the field_name of the raw source

Examples:

Represent preprocessed field "Plot" of the raw source with a tf-idf technique using sklearn and a word embedding technique using Word2Vec. For the latter, no preprocessing operation will be applied.

>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_multiple_config("Plot",
>>>                                       [ca.FieldConfig(ca.SkLearnTfIdf(),
>>>                                                       preprocessing=ca.NLTK(stopwords_removal=True)),
>>>
>>>                                        ca.FieldConfig(ca.WordEmbeddingTechnique(ca.GensimWord2Vec()))])

PARAMETER	DESCRIPTION
`field_name`	Field name of the raw source which must be complexly represented TYPE: `str`
`config_list`	List of `FieldConfig` objects specifying how to represent the field of the raw source TYPE: `List[FieldConfig]`

Source code in clayrs/content_analyzer/config.py

def add_multiple_config(self, field_name: str, config_list: List[FieldConfig]):
    """
    Method which adds multiple complex representations for the `field_name` of the raw source

    Examples:

        * Represent preprocessed field "Plot" of the raw source with a tf-idf technique using sklearn and a word
        embedding technique using Word2Vec. For the latter, no preprocessing operation will be applied
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_multiple_config("Plot",
        >>>                                       [ca.FieldConfig(ca.SkLearnTfIdf(),
        >>>                                                       preprocessing=ca.NLTK(stopwords_removal=True)),
        >>>
        >>>                                        ca.FieldConfig(ca.WordEmbeddingTechnique(ca.GensimWord2Vec()))])

    Args:
        field_name: field name of the raw source which must be complexly represented
        config_list: List of `FieldConfig` objects specifying how to represent the field of the raw source
    """
    # If the field_name is not in the field_dict keys it means there is no list to append the FieldConfig to,
    # so a new list is instantiated
    if self.__field_dict.get(field_name) is not None:
        self.__field_dict[field_name].extend(config_list)
    else:
        self.__field_dict[field_name] = list()
        self.__field_dict[field_name].extend(config_list)
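
The "create the list if it is missing, then add to it" logic can also be written with `dict.setdefault`. The sketch below is a simplified stand-in, not the ClayRS class: `FieldRegistry` and its method names are hypothetical, and plain strings take the place of `FieldConfig` objects. It behaves the same way as long as every stored value is a list.

```python
from typing import Dict, List


class FieldRegistry:
    """Hypothetical stand-in for the part of the config that stores field configs."""

    def __init__(self) -> None:
        self._field_dict: Dict[str, List[object]] = {}

    def add_multiple(self, field_name: str, config_list: List[object]) -> None:
        # setdefault returns the existing list, or stores and returns a new empty one
        self._field_dict.setdefault(field_name, []).extend(config_list)

    def add_single(self, field_name: str, config: object) -> None:
        self._field_dict.setdefault(field_name, []).append(config)

    def configs_for(self, field_name: str) -> List[object]:
        return list(self._field_dict[field_name])


registry = FieldRegistry()
registry.add_multiple("Plot", ["tfidf", "word2vec"])
registry.add_single("Plot", "original")   # appended to the existing list
registry.add_single("Genre", "original")  # a new list is created first

plot_configs = registry.configs_for("Plot")
genre_configs = registry.configs_for("Genre")
```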

`add_multiple_exogenous(config_list)`

Method which adds multiple exogenous representations which will be used to expand each content

Examples:

Expand each content by using both DBPedia and a local dataset as external sources

>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_multiple_exogenous(
>>>     [
>>>         ca.ExogenousConfig(
>>>             ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
>>>         ),
>>>
>>>         ca.ExogenousConfig(
>>>             ca.PropertiesFromDataset(field_name_list=['director'])
>>>         ),
>>>     ]
>>> )

PARAMETER	DESCRIPTION
`config_list`	List containing `ExogenousConfig` objects specifying how to expand each content TYPE: `List[ExogenousConfig]`

Source code in clayrs/content_analyzer/config.py

def add_multiple_exogenous(self, config_list: List[ExogenousConfig]):
    """
    Method which adds multiple exogenous representations which will be used to expand each content

    Examples:

        * Expand each content by using DBPedia as external source and local dataset as external source
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_multiple_exogenous(
        >>>     [
        >>>         ca.ExogenousConfig(
        >>>             ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
        >>>         ),
        >>>
        >>>         ca.ExogenousConfig(
        >>>             ca.PropertiesFromDataset(field_name_list=['director'])
        >>>         ),
        >>>     ]
        >>> )

    Args:
        config_list: List containing `ExogenousConfig` objects specifying how to expand each content
    """
    self.__exogenous_representation_list.extend(config_list)
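
This method extends the internal list, while `add_single_exogenous` appends to it, so a list of configs must go through `add_multiple_exogenous`. The sketch below shows the difference on plain Python lists; the strings standing in for `ExogenousConfig` objects are an assumption of this illustration.

```python
# Plain strings stand in for ExogenousConfig objects (an assumption of this sketch).
configs = ["dbpedia_config", "dataset_config"]

extended: list = []
extended.extend(configs)  # what add_multiple_exogenous does with its argument

appended: list = []
appended.append(configs)  # what add_single_exogenous would do with the same list

flat_count = len(extended)    # two separate configs
nested_count = len(appended)  # one element, which is itself a list
```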

`add_single_config(field_name, field_config)`

Method which adds a single complex representation for the field_name of the raw source

Examples:

Represent field "Plot" of the raw source with a tf-idf technique using sklearn

>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_single_config("Plot", ca.FieldConfig(ca.SkLearnTfIdf()))

PARAMETER	DESCRIPTION
`field_name`	Field name of the raw source which must be complexly represented TYPE: `str`
`field_config`	`FieldConfig` specifying how to represent the field of the raw source TYPE: `FieldConfig`

Source code in clayrs/content_analyzer/config.py

def add_single_config(self, field_name: str, field_config: FieldConfig):
    """
    Method which adds a single complex representation for the `field_name` of the raw source

    Examples:

        * Represent field "Plot" of the raw source with a tf-idf technique using sklearn
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_single_config("Plot", ca.FieldConfig(ca.SkLearnTfIdf()))

    Args:
        field_name: field name of the raw source which must be complexly represented
        field_config: `FieldConfig` specifying how to represent the field of the raw source
    """
    # If the field_name is not in the field_dict keys it means there is no list to append the FieldConfig to,
    # so a new list is instantiated
    if self.__field_dict.get(field_name) is not None:
        self.__field_dict[field_name].append(field_config)
    else:
        self.__field_dict[field_name] = list()
        self.__field_dict[field_name].append(field_config)

`add_single_exogenous(exogenous_config)`

Method which adds a single exogenous representation which will be used to expand each content

Examples:

Expand each content by using DBPedia as external source

>>> import clayrs.content_analyzer as ca
>>> movies_ca_config.add_single_exogenous(
>>>     ca.ExogenousConfig(
>>>         ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
>>>     )
>>> )

PARAMETER	DESCRIPTION
`exogenous_config`	`ExogenousConfig` object specifying how to expand each content TYPE: `ExogenousConfig`

Source code in clayrs/content_analyzer/config.py

def add_single_exogenous(self, exogenous_config: ExogenousConfig):
    """
    Method which adds a single exogenous representation which will be used to expand each content

    Examples:

        * Expand each content by using DBPedia as external source
        >>> import clayrs.content_analyzer as ca
        >>> movies_ca_config.add_single_exogenous(
        >>>     ca.ExogenousConfig(
        >>>         ca.DBPediaMappingTechnique('dbo:Film', 'Title', 'EN')
        >>>     )
        >>> )

    Args:
        exogenous_config: `ExogenousConfig` object specifying how to expand each content
    """
    self.__exogenous_representation_list.append(exogenous_config)

`get_configs_list(field_name)`

Method which returns the list of all FieldConfig objects specified for the input field_name parameter

PARAMETER	DESCRIPTION
`field_name`	Name of the field for which the list of field configs will be retrieved TYPE: `str`

RETURNS	DESCRIPTION
`List[FieldConfig]`	List containing all `FieldConfig` objects specified for the input `field_name`

Source code in clayrs/content_analyzer/config.py

def get_configs_list(self, field_name: str) -> List[FieldConfig]:
    """
    Method which returns the list of all `FieldConfig` objects specified for the input `field_name` parameter

    Args:
        field_name: Name of the field for which the list of field configs will be retrieved

    Returns:
        List containing all `FieldConfig` objects specified for the input `field_name`
    """
    return [config for config in self.__field_dict[field_name]]
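
Two consequences of the list comprehension above are worth seeing on a plain dictionary: the caller receives a new list each time, and an unknown field name raises `KeyError`. In this sketch the module-level `field_dict` and the string placeholders are assumptions standing in for the private attribute and for `FieldConfig` objects.

```python
# Strings stand in for FieldConfig objects (an assumption of this sketch).
field_dict = {"Plot": ["tfidf_config", "embedding_config"]}


def get_configs_list(field_name: str) -> list:
    # Same expression as the method body: builds a new list on every call
    return [config for config in field_dict[field_name]]


returned = get_configs_list("Plot")
returned.append("extra_config")        # mutating the returned copy...
stored_after = field_dict["Plot"]      # ...leaves the stored list untouched

try:
    get_configs_list("Genre")          # no config was ever added for this field
    missing_raises = False
except KeyError:
    missing_raises = True
```

Checking `get_field_name_list()` first is one way to avoid the `KeyError` for fields that have no configuration.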

`get_field_name_list()`

Method which returns a list containing all the fields of the raw source for which at least one FieldConfig object has been assigned (i.e. at least one complex representation is specified)

RETURNS	DESCRIPTION
`List[str]`	List of all the fields of the raw source that must be complexly represented

Source code in clayrs/content_analyzer/config.py

def get_field_name_list(self) -> List[str]:
    """
    Method which returns a list containing all the fields of the raw source for which at least one `FieldConfig`
    object has been assigned (i.e. at least one complex representation is specified)

    Returns:
        List of all the fields of the raw source that must be complexly represented
    """
    return list(self.__field_dict.keys())

`ExogenousConfig(exogenous_technique, id=None)`

Class that represents the configuration for a single exogenous representation.

The config allows the user to specify an ExogenousPropertiesRetrieval technique to use to expand each content. Unlike a FieldConfig object, an ExogenousConfig does not refer to a particular field but to the whole content itself.

You can use the id parameter to assign a custom id for the representation: by doing so the user can freely refer to it by using the custom id given, rather than positional integers (which are given automatically by the framework).

This will create an exogenous representation for the content by expanding it using DBPedia; said representation will be named 'test':
```
ExogenousConfig(DBPediaMappingTechnique('dbo:Film', 'Title', 'EN'), id='test')
```
Same as the example above, but since no custom id was assigned, the exogenous representation can be referred to only with an integer (0 if it's the first exogenous representation specified for the contents, 1 if it's the second, etc.)

ExogenousConfig(DBPediaMappingTechnique('dbo:Film', 'Title', 'EN'))

PARAMETER	DESCRIPTION
`exogenous_technique`	Technique which will be used to expand each content with data from external sources. An example is the `DBPediaMappingTechnique`, which retrieves properties from DBPedia TYPE: `ExogenousPropertiesRetrieval`
`id`	Custom id that can be used later by the user to easily refer to the representation generated by this config. IDs should be unique and should contain only '_', '-' and alphanumeric characters TYPE: `str` DEFAULT: `None`

Source code in clayrs/content_analyzer/config.py

def __init__(self, exogenous_technique: ExogenousPropertiesRetrieval, id: str = None):
    if id is not None:
        self._check_custom_id(id)

    self.__exogenous_technique = exogenous_technique
    self.__id = id
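
The body of `_check_custom_id` is not shown on this page. As an illustration only, a check for the documented rule (ids made of '_', '-' and alphanumeric characters) could look like the sketch below; the function name, the regular expression, the ASCII-only restriction and the choice of `ValueError` are all assumptions.

```python
import re

# Assumed pattern: one or more of '_', '-' or ASCII alphanumeric characters
_VALID_ID = re.compile(r"[A-Za-z0-9_-]+")


def check_custom_id(custom_id: str) -> str:
    """Hypothetical check; not the framework's actual `_check_custom_id`."""
    if not _VALID_ID.fullmatch(custom_id):
        raise ValueError(
            f"Invalid id {custom_id!r}: only '_', '-' and alphanumeric characters are allowed"
        )
    return custom_id


accepted = check_custom_id("field_example-1")

try:
    check_custom_id("plot tfidf!")  # contains a space and '!'
    rejected = False
except ValueError:
    rejected = True
```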

`exogenous_technique` `property`

Getter for the exogenous properties retrieval technique

`id` `property`

Getter for the ExogenousConfig id

`FieldConfig(content_technique=OriginalData(), preprocessing=None, postprocessing=None, memory_interface=None, id=None)`

Class that represents the configuration for a single representation of a field. The configuration of a single representation is defined by a FieldContentProductionTechnique (e.g. an EmbeddingTechnique) that will be applied to the pre-processed data of said field.

To specify how to preprocess data, simply specify an InformationProcessor in the preprocessing parameter. Multiple InformationProcessor objects can be wrapped in a list: in this case, the field will be preprocessed by applying all the objects inside the list. If preprocessing is not defined, no preprocessing operations will be done on the field data.

You can use the id parameter to assign a custom id for the representation: by doing so the user can freely refer to it by using the custom id given, rather than positional integers (which are given automatically by the framework).

There is also a memory_interface attribute which lets you define a data structure where the representation will be serialized (e.g. an Index).

Various configurations are possible depending on how the user wants to represent a particular field:

This will produce a field representation using the SkLearnTfIdf technique on the field data preprocessed by NLTK by performing stopwords removal, and the name of the produced representation will be 'field_example'

FieldConfig(SkLearnTfIdf(), NLTK(stopwords_removal=True), id='field_example')

This will produce the same result as above but the id for the field representation defined by this config will be set by the ContentAnalyzer once it is being processed (integer 0 if it's the first representation specified for the field, 1 if it's the second, etc.)

FieldConfig(SkLearnTfIdf(), NLTK(stopwords_removal=True))

This will produce a field representation using the SkLearnTfIdf technique on the field data without applying any preprocessing operation; instead of being stored directly in the content, it will be stored in an index

FieldConfig(SkLearnTfIdf(), memory_interface=SearchIndex('/somedir'))

In the following nothing will be done on the field data, it will be represented as is

FieldConfig()
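
The relation between custom ids and positional integers described above can be sketched as follows. This is not ClayRS code: `representation_keys` is a hypothetical function, and it assumes that the positional integer is simply the index of the config in the field's list and that a custom id, when given, is an additional key.

```python
from typing import List, Optional, Union


def representation_keys(custom_ids: List[Optional[str]]) -> List[List[Union[int, str]]]:
    """For each config of a field, list the keys that could refer to its representation."""
    keys: List[List[Union[int, str]]] = []
    for position, custom_id in enumerate(custom_ids):
        entry: List[Union[int, str]] = [position]  # positional integer, always available
        if custom_id is not None:
            entry.append(custom_id)                # custom id, only when one was given
        keys.append(entry)
    return keys


# First config has id='field_example', second has no custom id
plot_keys = representation_keys(["field_example", None])
```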

PARAMETER	DESCRIPTION
`content_technique`	Technique that will be applied to the field in order to produce a complex representation of said field TYPE: `FieldContentProductionTechnique` DEFAULT: `OriginalData()`
`preprocessing`	Single `InformationProcessor` object or a list of `InformationProcessor` objects that will be used to preprocess field data before applying the `content_technique` TYPE: `Union[InformationProcessor, List[InformationProcessor]]` DEFAULT: `None`
`postprocessing`	Single `VisualPostProcessor` object or a list of `VisualPostProcessor` objects to use as postprocessors for the field config TYPE: `Union[VisualPostProcessor, List[VisualPostProcessor]]` DEFAULT: `None`
`memory_interface`	Complex structure where the content representation can be serialized (an Index for example) TYPE: `InformationInterface` DEFAULT: `None`
`id`	Custom id that can be used later by the user to easily refer to the representation generated by this config. IDs for a single field should be unique and should contain only '_', '-' and alphanumeric characters TYPE: `str` DEFAULT: `None`

Source code in clayrs/content_analyzer/config.py

def __init__(self,
             content_technique: FieldContentProductionTechnique = OriginalData(),
             preprocessing: Union[InformationProcessor, List[InformationProcessor]] = None,
             postprocessing: Union[VisualPostProcessor, List[VisualPostProcessor]] = None,
             memory_interface: InformationInterface = None,
             id: str = None):

    if preprocessing is None:
        preprocessing = []

    if postprocessing is None:
        postprocessing = []

    if id is not None:
        self._check_custom_id(id)

    self.__content_technique = content_technique
    self.__preprocessing = preprocessing
    self.__postprocessing = postprocessing
    self.__memory_interface = memory_interface
    self.__id = id

    if not isinstance(self.__preprocessing, list):
        self.__preprocessing = [self.__preprocessing]

    if not isinstance(self.__postprocessing, list):
        self.__postprocessing = [self.__postprocessing]

`content_technique` `property`

Getter for the field content production technique of the field

`id` `property`

Getter for the id of the field config

`memory_interface` `property`

Getter for the index associated with the field config

`postprocessing` `property`

Getter for the list of postprocessors of the field config

`preprocessing` `property`

Getter for the list of preprocessors of the field config

`ItemAnalyzerConfig`

Bases: ContentAnalyzerConfig

Class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer must complexly represent contents, i.e. how to preprocess them and how to represent them. In particular, this class refers to items.

Examples:

>>> import clayrs.content_analyzer as ca
>>> raw_source = ca.JSONFile(json_path)
>>> movies_config = ca.ItemAnalyzerConfig(raw_source, id='movie_id', output_directory='movies_codified/')
>>> # add single field config
>>> movies_config.add_single_config('occupation', ca.FieldConfig(content_technique=ca.OriginalData()))
>>> # add single exogenous technique
>>> movies_config.add_single_exogenous(ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender'])))

`UserAnalyzerConfig`

Bases: ContentAnalyzerConfig

Class that represents the configuration for the content analyzer. The configuration specifies how the Content Analyzer must complexly represent contents, i.e. how to preprocess them and how to represent them. In particular, this class refers to users.

Examples:

>>> import clayrs.content_analyzer as ca
>>> raw_source = ca.JSONFile(json_path)
>>> users_config = ca.UserAnalyzerConfig(raw_source, id='user_id', output_directory='users_codified/')
>>> # add single field config
>>> users_config.add_single_config('occupation', ca.FieldConfig(content_technique=ca.OriginalData()))
>>> # add single exogenous technique
>>> users_config.add_single_exogenous(ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender'])))

Content Analyzer Class

`ContentAnalyzer(config, n_thread=1)`

Class to which control of the content analysis phase is delegated. It uses the data stored in the configuration object to create and serialize the contents the user wants to produce. It also checks that the configurations the user wants to run on the raw contents have unique ids (otherwise it would be impossible to refer to a particular field representation or exogenous representation).

PARAMETER	DESCRIPTION
`config`	Configuration for processing the fields of the contents. This parameter makes it possible to customize the way in which the input data is processed TYPE: `ContentAnalyzerConfig`
`n_thread`	Number of threads used to serialize the produced contents TYPE: `int` DEFAULT: `1`

Source code in clayrs/content_analyzer/content_analyzer_main.py

def __init__(self, config: ContentAnalyzerConfig, n_thread: int = 1):
    self._config: ContentAnalyzerConfig = config
    self._n_thread = n_thread
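
The uniqueness check mentioned in the class description is performed by private methods whose bodies are not shown on this page. The sketch below is one possible way to detect repeated custom ids per field; `check_unique_ids`, its input format and its error message are hypothetical, and `ValueError` is chosen only because `fit()` handles that exception type.

```python
from collections import Counter
from typing import Dict, List, Optional


def check_unique_ids(field_ids: Dict[str, List[Optional[str]]]) -> None:
    """Hypothetical check: raise ValueError if a custom id is repeated within one field."""
    for field_name, ids in field_ids.items():
        # configs without a custom id (None) cannot clash with each other
        counts = Counter(custom_id for custom_id in ids if custom_id is not None)
        duplicates = sorted(custom_id for custom_id, n in counts.items() if n > 1)
        if duplicates:
            raise ValueError(f"Duplicate ids {duplicates} for field {field_name!r}")


# Passes: 'tfidf' appears once per field, and ids are only compared within a field
check_unique_ids({"Plot": ["tfidf", None, None], "Genre": ["tfidf"]})

try:
    check_unique_ids({"Plot": ["tfidf", "tfidf"]})
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```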

`fit()`

Creates the contents and serializes them. This method starts the content production process and initializes everything that will be used to create said contents, their fields and their representations.

Source code in clayrs/content_analyzer/content_analyzer_main.py

def fit(self):
    """
    Processes the creation of the contents and serializes the contents. This method starts the content production
    process and initializes everything that will be used to create said contents, their fields and their
    representations
    """
    # before starting the process, the content analyzer main checks that there are no duplicate id cases
    # both in the field dictionary and in the exogenous representation list
    # this is done now and not recursively for each content during the creation process, in order to avoid starting
    # an operation that is going to fail
    try:
        self.__check_field_dict()
        self.__check_exogenous_representation_list()
    except ValueError as e:
        raise e

    # creates the directory where the data will be serialized and overwrites it if it already exists
    output_path = self._config.output_directory
    if os.path.exists(output_path):
        shutil.rmtree(output_path)
    os.makedirs(output_path)

    contents_producer = ContentsProducer.get_instance()
    contents_producer.set_config(self._config)
    created_contents = contents_producer.create_contents()

    if self._config.export_json:
        json_path = os.path.join(self._config.output_directory, 'contents.json')
        with open(json_path, "w") as data:
            json.dump(created_contents, data, cls=ContentEncoder, indent=4)

    # with get_progbar(created_contents) as pbar:
    with get_iterator_thread(self._n_thread, self._serialize_content, created_contents,
                             keep_order=False, progress_bar=True, total=len(created_contents)) as pbar:
        pbar.set_description("Serializing contents")

        for _ in pbar:
            pass
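
The directory handling and the optional JSON export in `fit()` can be tried in isolation with the standard library. In this sketch `export_contents` and `SetEncoder` are hypothetical names: `SetEncoder` merely stands in for `ContentEncoder`, whose behavior is not shown on this page, and the sample content is made up for the illustration.

```python
import json
import os
import shutil
import tempfile


class SetEncoder(json.JSONEncoder):
    """Hypothetical encoder standing in for ContentEncoder: makes sets serializable."""

    def default(self, o):
        if isinstance(o, set):
            return sorted(o)
        return super().default(o)


def export_contents(output_path: str, contents: list) -> str:
    # Recreate the output directory from scratch, as fit() does
    if os.path.exists(output_path):
        shutil.rmtree(output_path)
    os.makedirs(output_path)

    json_path = os.path.join(output_path, "contents.json")
    with open(json_path, "w") as data:
        json.dump(contents, data, cls=SetEncoder, indent=4)
    return json_path


# Simulate an output directory left over from a previous run
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "movies_codified")
os.makedirs(output_dir)
with open(os.path.join(output_dir, "stale.txt"), "w") as stale:
    stale.write("left over from a previous run")

json_file = export_contents(output_dir, [{"content_id": "movie_1", "genres": {"Comedy"}}])
remaining_files = sorted(os.listdir(output_dir))
with open(json_file) as exported_file:
    exported = json.load(exported_file)
shutil.rmtree(base_dir)  # clean up the temporary directory
```

Because the directory is removed before being recreated, anything already stored under `output_directory` is lost when `fit()` runs, so it is safer to point it at a dedicated folder.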
