
Quickstart

Content Analyzer

The first thing to do is to import the Content Analyzer module; we will access its methods and classes via dot notation

import clayrs.content_analyzer as ca

Then, let's point to the source containing raw information to process

raw_source = ca.JSONFile('items_info.json')
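
If you don't have a dataset at hand, a minimal items_info.json can be created to follow along. The snippet below is a purely hypothetical sample: only the movielens_id and plot fields used later in this quickstart are filled in, and we assume the source can be a JSON array of objects (check the documentation for the exact layout JSONFile expects)

import json

# Hypothetical toy data: one object per item, with made-up plots
sample_items = [
    {'movielens_id': '1', 'plot': 'A cowboy doll feels threatened by a new spaceman toy.'},
    {'movielens_id': '2', 'plot': 'Two siblings discover a magical board game.'}
]

with open('items_info.json', 'w') as f:
    json.dump(sample_items, f)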

We can now start building the configuration for the items

Info

Note that the same operations that can be specified for items can also be specified for users via the ca.UserAnalyzerConfig class
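
For instance, a user configuration could look like the following sketch, assuming ca.UserAnalyzerConfig mirrors the parameters of the item configuration shown below (the file name and id field here are purely hypothetical)

users_ca_config = ca.UserAnalyzerConfig(
    source=ca.JSONFile('users_info.json'),  # hypothetical raw source
    id='user_id',                           # hypothetical id field
    output_directory='users_codified/'
)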

# Configuration of item representation
movies_ca_config = ca.ItemAnalyzerConfig(
    source=raw_source,
    id='movielens_id', # (1) 
    output_directory='movies_codified/' # (2) 
)
  1. The field in the raw source which uniquely identifies each item
  2. Directory where the complexly represented items will be serialized

Let's represent the plot field of each content with a TfIdf representation

  • Since the preprocessing parameter has been specified, each field is first preprocessed with the specified operations

    movies_ca_config.add_single_config(
        'plot',
        ca.FieldConfig(ca.SkLearnTfIdf(),
                       preprocessing=ca.NLTK(stopwords_removal=True,
                                             lemmatization=True),
                       id='tfidf')  # (1)
    )
    

  1. User-defined id for the representation
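
Multiple representations can be codified for the same field: just call add_single_config again with a different id. The following sketch reuses only the classes seen above; the 'tfidf_nostop' id is our own choice, and we assume repeated calls simply append representations

# Hypothetical second representation for 'plot': TF-IDF without lemmatization
movies_ca_config.add_single_config(
    'plot',
    ca.FieldConfig(ca.SkLearnTfIdf(),
                   preprocessing=ca.NLTK(stopwords_removal=True),
                   id='tfidf_nostop')
)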

To finalize the Content Analyzer part, let's instantiate the ContentAnalyzer class by passing the built configuration and calling its fit() method

ca.ContentAnalyzer(movies_ca_config).fit()
The items will be created with the specified representations and serialized
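
To double check that everything went smoothly, you can peek at the output directory; file names and formats are internal to ClayRS, so this is just for orientation

import os

# List whatever the Content Analyzer serialized to disk
print(os.listdir('movies_codified/'))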


RecSys

Similarly to the above, we must first import the RecSys module

import clayrs.recsys as rs

Then we load the rating frame from a TSV file

Info

In this case, the first three columns of our file are user_id, item_id and score, in this order

  • If your file has a different structure you must specify how to map the columns via parameters; check the documentation for more
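
If you need a toy ratings.tsv to follow along, one could be written like this (hypothetical users and scores; the three columns are written in the order stated above, and we assume no header row is needed)

import csv

# Hypothetical interactions: user_id, item_id, score (tab-separated)
rows = [
    ('u1', '1', 4.5),
    ('u1', '2', 3.0),
    ('u2', '1', 2.0),
    ('u2', '2', 5.0)
]

with open('ratings.tsv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)
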
ratings = ca.Ratings(ca.CSVFile('ratings.tsv', separator='\t'))

Let's split the loaded rating frame into train and test sets with the KFold technique

  • Since n_splits=2, train_list will contain two train sets and test_list will contain two test sets
    train_list, test_list = rs.KFoldPartitioning(n_splits=2).split_all(ratings)
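
As a quick sanity check, the two lists are parallel, with one (train, test) pair per fold

# With n_splits=2 we expect exactly two aligned train/test pairs
assert len(train_list) == len(test_list) == 2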
    

In order to recommend items to users, we must choose an algorithm to use

  • In this case we are using the CentroidVector algorithm, which will work on the representation of the plot field codified with the 'tfidf' id
  • You can freely choose which representation to use among all representations codified for the fields in the Content Analyzer phase
centroid_vec = rs.CentroidVector(
    {'plot': 'tfidf'}, # (1)
    similarity=rs.CosineSimilarity()
)
  1. We reference the representation specified for the 'plot' field via the custom id assigned in the Content Analyzer phase

Let's now compute the top-10 ranking for each user of the train set

  • By default the candidate items are those in the test set of the user, but you can change this behaviour with the methodology parameter

Since we used the KFold technique, we iterate over all train and test sets

result_list = []

for train_set, test_set in zip(train_list, test_list):

    cbrs = rs.ContentBasedRS(centroid_vec, train_set, 'movies_codified/')
    rank = cbrs.fit_rank(test_set, n_recs=10)

    result_list.append(rank)
result_list will contain two Rank objects in this case, one for each split


Evaluation module

Similarly to the Content Analyzer and RecSys modules, we must first import the evaluation module

import clayrs.evaluation as eva

The class responsible for evaluating recommendation lists is the EvalModel class. It needs the following parameters:

  • A list of computed rank/predictions (in case multiple splits must be evaluated)
  • A list of truths (in case multiple splits must be evaluated)
  • List of metrics to compute

Obviously, the list of computed ranks/predictions and the list of truths must have the same length: the rank/prediction in position \(i\) will be compared with the truth in position \(i\)

em = eva.EvalModel(
    pred_list=result_list,
    truth_list=test_list,
    metric_list=[
        eva.NDCG(),
        eva.Precision(),
        eva.RecallAtK(k=5)
    ]
)

Then simply call the fit() method of the instantiated object

  • It will return two pandas DataFrames: the first contains the metrics aggregated for the whole system, while the second contains the metrics computed per user (where possible)
sys_result, users_result = em.fit()
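
Since both results are plain pandas DataFrames, the usual pandas tooling applies

print(sys_result)           # metrics aggregated over the whole system
print(users_result.head())  # per-user metrics, first rows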

Note

Note that the EvalModel is able to evaluate recommendations generated by other tools/frameworks; check the documentation for more