Classifier Recommender

ClassifierRecommender(item_field, classifier, threshold=None, embedding_combiner=Centroid())

Bases: PerUserCBAlgorithm

Class that implements recommendation through a specified Classifier object. It's a ranking algorithm, so it can't do score prediction.

Examples:

  • A single field representation, the DecisionTree classifier from sklearn, threshold \(= 3\) (every item with rating score \(>= 3\) is considered positive)
>>> from clayrs import recsys as rs
>>> alg = rs.ClassifierRecommender({"Plot": 0}, rs.SkDecisionTree(), 3)
  • A single field representation, the KNN classifier from sklearn with custom parameters, threshold \(= 3\) (every item with rating score \(>= 3\) is considered positive)
>>> alg = rs.ClassifierRecommender({"Plot": 0}, rs.SkKNN(n_neighbors=3), 3)
  • Multiple field representations of the items, the KNN classifier from sklearn with custom parameters, threshold \(= None\) (every item with rating \(>=\) the mean rating of the user is considered positive)
>>> alg = rs.ClassifierRecommender(
...     item_field={"Plot": [0, "tfidf"],
...                 "Genre": [0, 1],
...                 "Director": "doc2vec"},
...     classifier=rs.SkKNN(n_neighbors=3),
...     threshold=None)
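
The examples above pass `item_field` values in three shapes: a single integer id, a single string id, and a list of ids. One way to treat them uniformly is to wrap every single id in a list. The helper below is a hypothetical illustration of that idea, not a ClayRS function:

```python
from typing import Dict, List, Union

RepresentationId = Union[int, str]


def normalize_item_field(
    item_field: Dict[str, Union[RepresentationId, List[RepresentationId]]]
) -> Dict[str, List[RepresentationId]]:
    """Hypothetical helper: wrap single representation ids in a list."""
    normalized = {}
    for field_name, rep_ids in item_field.items():
        if isinstance(rep_ids, list):
            normalized[field_name] = list(rep_ids)  # copy, keep the given order
        else:
            normalized[field_name] = [rep_ids]      # single id -> one-element list
    return normalized


item_field = {"Plot": [0, "tfidf"], "Genre": [0, 1], "Director": "doc2vec"}
normalized = normalize_item_field(item_field)
```

After normalisation every field maps to a list, so downstream code can always iterate over representation ids without checking the type first.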

Info

After instantiating the ClassifierRecommender algorithm, pass it to the constructor of a CBRS and then use its methods to compute the ranking for a single user or for multiple users:

Examples:

>>> cbrs = rs.ContentBasedRS(algorithm=alg, ...)
>>> cbrs.fit_rank(...)
>>> # ...
PARAMETER DESCRIPTION
item_field

dict where each key is the name of a field that contains the content to use, and each value is the id (or ids) of the representation(s) to use for that field. The value can be a single id or a list: use a list if you want to use multiple representations for a particular field.

TYPE: dict

classifier

Classifier that will be used. Must be an instance of a Classifier subclass (see Classifiers Implemented).

TYPE: Classifier

threshold

Threshold for the ratings. If a rating is greater than or equal to the threshold, the item is considered positive. If the threshold is not specified, the average score of all items rated by the user is used.

TYPE: float DEFAULT: None

embedding_combiner

CombiningTechnique used when an embedding representation is required as a single vector but is stored as a matrix (e.g. WordEmbedding representations have one vector for each word). By default, the Centroid of the rows of the matrix is computed.

TYPE: CombiningTechnique DEFAULT: Centroid()
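
As a rough illustration of what the default combiner does, the sketch below collapses a word-embedding matrix into one vector by averaging its rows. It uses plain NumPy on made-up values and assumes that "centroid" means the row-wise mean described above; it is not the ClayRS `Centroid` implementation.

```python
import numpy as np

# Hypothetical word-embedding representation: 3 words, 4 dimensions each
word_embeddings = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Centroid of the rows: average over the word axis, giving one vector per item
item_vector = word_embeddings.mean(axis=0)
```

The result has one entry per embedding dimension, which is the shape a classifier expects for a single training sample.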

Source code in clayrs/recsys/content_based_algorithm/classifier/classifier_recommender.py
def __init__(self, item_field: dict, classifier: Classifier, threshold: float = None,
             embedding_combiner: CombiningTechnique = Centroid()):
    super().__init__(item_field, threshold)

    self._classifier = classifier
    self._embedding_combiner = embedding_combiner
    self._labels: Optional[list] = None
    self._items_features: Optional[list] = None

fit_single_user()

Fit the classifier specified in the constructor with the features and labels extracted with the process_rated() method.

It uses private attributes to fit the classifier, so process_rated() must be called before this method.
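
The calling order matters because the training data travels between the two methods through private attributes. The toy class below is a stand-in written for this page (it is not ClayRS code) that mimics the contract; the explicit guard in `fit_single_user()` is an assumption added here to make the failure mode visible.

```python
from typing import List, Optional


class TwoPhaseFitter:
    """Toy stand-in (not ClayRS code) for the process_rated()/fit_single_user() contract."""

    def __init__(self) -> None:
        self._labels: Optional[List[int]] = None
        self._items_features: Optional[List[List[float]]] = None
        self.fitted_on: int = 0

    def process_rated(self, features: List[List[float]], labels: List[int]) -> None:
        # Phase 1: stash the training data in private attributes
        self._items_features = features
        self._labels = labels

    def fit_single_user(self) -> None:
        # Phase 2: consume the stashed data; fail loudly if phase 1 was skipped
        if self._items_features is None or self._labels is None:
            raise RuntimeError("process_rated() must be called before fit_single_user()")
        self.fitted_on = len(self._labels)  # a real classifier would be fitted here
        # Release the training data once it has been used
        self._items_features = None
        self._labels = None


fitter = TwoPhaseFitter()
fitter.process_rated([[0.1, 0.9], [0.8, 0.2]], [1, 0])
fitter.fit_single_user()
```

Because the stored data is cleared after fitting, calling `fit_single_user()` a second time without a new `process_rated()` call fails in this sketch.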

Source code in clayrs/recsys/content_based_algorithm/classifier/classifier_recommender.py
def fit_single_user(self):
    """
    Fit the classifier specified in the constructor with the features and labels
    extracted with the `process_rated()` method.

    It uses private attributes to fit the classifier, so `process_rated()` must be called
    before this method.
    """
    # Fuse the input if there are dicts, multiple representation, etc.
    fused_features = self.fuse_representations(self._items_features, self._embedding_combiner)

    self._classifier.fit(fused_features, self._labels)

    # clear the variables used to fit, since they are no longer needed
    self._items_features = None
    self._labels = None

predict_single_user(user_idx, train_ratings, available_loaded_items, filter_list=None)

ClassifierRecommender is not a score prediction algorithm: calling this method raises the NotPredictionAlg exception.

RAISES DESCRIPTION
NotPredictionAlg

exception raised since the ClassifierRecommender algorithm is not a score prediction algorithm

Source code in clayrs/recsys/content_based_algorithm/classifier/classifier_recommender.py
def predict_single_user(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsDict,
                        filter_list: List[str] = None) -> np.ndarray:
    """
    ClassifierRecommender is not a score prediction algorithm, calling this method will raise
    the `NotPredictionAlg` exception!

    Raises:
        NotPredictionAlg: exception raised since the ClassifierRecommender algorithm is not a
            score prediction algorithm
    """
    raise NotPredictionAlg("ClassifierRecommender is not a Score Prediction Algorithm!")

process_rated(user_idx, train_ratings, available_loaded_items)

Extracts features from the rated items and labels them. The extracted features are later used to fit the classifier.

Features and labels will be stored in private attributes of the class.

If there are no rated items available locally, or if there are only positive or only negative items, an exception is raised.
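
The labelling step can be sketched in isolation as follows. This is a simplified, standard-library illustration under stated assumptions: `label_ratings` is a hypothetical helper, `ValueError` stands in for the ClayRS exceptions, and feature extraction is left out entirely.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Optional, Tuple


def label_ratings(interactions: List[Tuple[str, float]],
                  threshold: Optional[float] = None) -> List[Tuple[str, int]]:
    """Hypothetical helper: turn (item_id, score) pairs into (item_id, label) pairs."""
    if not interactions:
        raise ValueError("the user has no ratings")

    # No explicit threshold: fall back to the user's mean rating
    if threshold is None:
        threshold = fmean(score for _, score in interactions)

    # Group scores per item; a list because an item may be rated more than once
    scores_by_item: Dict[str, List[float]] = defaultdict(list)
    for item_id, score in interactions:
        scores_by_item[item_id].append(score)

    # Iterate in sorted key order so the output is identical on every run
    labelled = [(item_id, 1 if score >= threshold else 0)
                for item_id in sorted(scores_by_item)
                for score in scores_by_item[item_id]]

    # A classifier needs both classes to learn anything useful
    labels = {label for _, label in labelled}
    if labels == {1}:
        raise ValueError("only positive items")
    if labels == {0}:
        raise ValueError("only negative items")
    return labelled


ratings = [("i2", 4.0), ("i1", 2.0), ("i3", 5.0), ("i1", 3.0)]
labelled_mean = label_ratings(ratings)              # threshold = mean = 3.5
labelled_fixed = label_ratings(ratings, threshold=3)
```

With the mean threshold (3.5) both ratings of `i1` are negative; with the fixed threshold of 3 the second rating of `i1` becomes positive, because the comparison is `>=`.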

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user (the user for which we must fit the algorithm)

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsDict

RAISES DESCRIPTION
EmptyUserRatings

Exception raised when the user does not appear in the train set

NoRatedItems

Exception raised when there isn't any item available locally rated by the user

OnlyPositiveItems

Exception raised when there are only positive items available locally for the user (Items that the user liked)

OnlyNegativeItems

Exception raised when there are only negative items available locally for the user (Items that the user disliked)

Source code in clayrs/recsys/content_based_algorithm/classifier/classifier_recommender.py
def process_rated(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsDict):
    """
    Function that extracts features from the rated items and labels them.
    The extracted features will be later used to fit the classifier.

    Features and labels will be stored in private attributes of the class.

    If there are no rated items available locally or if there are only positive/negative
    items, an exception is thrown.

    Args:
        user_idx: Mapped integer of the active user (the user for which we must fit the algorithm)
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents

    Raises:
        EmptyUserRatings: Exception raised when the user does not appear in the train set
        NoRatedItems: Exception raised when there isn't any item available locally
            rated by the user
        OnlyPositiveItems: Exception raised when there are only positive items available locally
            for the user (Items that the user liked)
        OnlyNegativeItems: Exception raised when there are only negative items available locally
            for the user (Items that the user disliked)
    """

    uir_user = train_ratings.get_user_interactions(user_idx)
    rated_items_id = train_ratings.item_map.convert_seq_int2str(uir_user[:, 1].astype(int))

    # a list since there could be duplicate interaction (eg bootstrap partitioning)
    items_scores_dict = defaultdict(list)

    for item_id, score in zip(rated_items_id, uir_user[:, 2]):
        items_scores_dict[item_id].append(score)

    items_scores_dict = dict(sorted(items_scores_dict.items()))  # sort dictionary based on key for reproducibility

    # Create list of all the available items that are useful for the user
    loaded_rated_items: List[Union[Content, None]] = available_loaded_items.get_list([item_id
                                                                                      for item_id
                                                                                      in rated_items_id])

    threshold = self.threshold
    if threshold is None:
        threshold = np.nanmean(uir_user[:, 2])

    # Assign label and extract features from the rated items
    labels = []
    items_features = []

    # we extract feature of each item sorted based on its key: IMPORTANT for reproducibility!!
    # otherwise the matrix we feed to sklearn will have input item in different rows each run!
    for item in loaded_rated_items:
        if item is not None:

            score_assigned = map(float, items_scores_dict[item.content_id])

            for score in score_assigned:
                items_features.append(self.extract_features_item(item))

                if score >= threshold:
                    labels.append(1)
                else:
                    labels.append(0)

    if len(uir_user[:, 1]) == 0:
        raise EmptyUserRatings("The user selected doesn't have any ratings!")

    if len(items_features) == 0:
        raise NoRatedItems("User {} - No rated item available locally!".format(user_idx))
    if 0 not in labels:
        raise OnlyPositiveItems("User {} - There are only positive items available locally!".format(user_idx))
    elif 1 not in labels:
        raise OnlyNegativeItems("User {} - There are only negative items available locally!".format(user_idx))

    self._labels = labels
    self._items_features = items_features

rank_single_user(user_idx, train_ratings, available_loaded_items, recs_number, filter_list)

Rank the top-n recommended items for the active user. The items to rank are controlled by the recs_number and filter_list parameters:

  • recs_number is the number of top-ranked items to return. If recs_number is None, all ranked items are returned
  • filter_list is the list of items to rank, given as string ids. They must be strings and not mapped integers, since items are serialized following their string representation

The filter_list parameter is usually the result of the filter_single() method of a Methodology object

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsDict

recs_number

number of the top ranked items to return, if None all ranked items will be returned

TYPE: Optional[int]

filter_list

list of the items to rank. Should contain string item ids

TYPE: List[str]

RETURNS DESCRIPTION
np.ndarray

uir matrix for a single user, containing user and item idxs (integer representation) in the first two columns and the ranking score in the third, sorted in decreasing order of score
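
The ranking step can be sketched with NumPy alone: take the probability of the "liked" class for every candidate item, sort it in decreasing order, and keep the first `recs_number` entries. All values below (user index, item indices, probabilities) are made up for illustration; in the real method the probabilities come from the fitted classifier.

```python
import numpy as np

user_idx = 7                                  # hypothetical mapped user index
item_idxs = np.array([10, 11, 12, 13])        # hypothetical mapped item indices
# Hypothetical predict_proba output: column 0 = disliked, column 1 = liked
class_prob = np.array([
    [0.8, 0.2],
    [0.1, 0.9],
    [0.5, 0.5],
    [0.3, 0.7],
])
recs_number = 2                               # None would keep every item

liked_prob = class_prob[:, 1]
# argsort is ascending, so reverse it and then cut to the top-n positions
top_positions = np.argsort(liked_prob)[::-1][:recs_number]

# One row per recommended item: [user index, item index, score]
uir_rank = np.column_stack((
    np.full(len(top_positions), user_idx),
    item_idxs[top_positions],
    liked_prob[top_positions],
))
```

Slicing with `[:None]` returns the whole array, which is why passing `recs_number=None` yields the full ranking without a special case.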

Source code in clayrs/recsys/content_based_algorithm/classifier/classifier_recommender.py
def rank_single_user(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsDict,
                     recs_number: Optional[int], filter_list: List[str]) -> np.ndarray:
    """
    Rank the top-n recommended items for the active user, where the top-n items to rank are controlled by the
    `recs_number` and `filter_list` parameter:

    * the former one is self-explanatory, the second is a list of items
    represented with their string ids. Must be necessarily strings and not their mapped integer since items are
    serialized following their string representation!

    If `recs_number` is `None`, all ranked items will be returned

    The filter list parameter is usually the result of the `filter_single()` method of a `Methodology` object

    Args:
        user_idx: Mapped integer of the active user
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents
        recs_number: number of the top ranked items to return, if None all ranked items will be returned
        filter_list: list of the items to rank. Should contain string item ids

    Returns:
        uir matrix for a single user containing user and item idxs (integer representation) with the ranked score
            as third dimension sorted in a decreasing order
    """

    uir_user = train_ratings.get_user_interactions(user_idx)
    if len(uir_user) == 0:
        raise EmptyUserRatings("The user selected doesn't have any ratings!")

    # Load items to predict
    items_to_predict = available_loaded_items.get_list(filter_list)

    # Extract features of the items to predict
    idx_items_to_predict = []
    features_items_to_predict = []
    for item in items_to_predict:
        if item is not None:
            idx_items_to_predict.append(item.content_id)
            features_items_to_predict.append(self.extract_features_item(item))

    if len(idx_items_to_predict) == 0:
        return np.array([])  # if no item to predict, empty rank is returned

    idx_items_to_predict = train_ratings.item_map.convert_seq_str2int(idx_items_to_predict)

    # Fuse the input if there are dicts, multiple representation, etc.
    fused_features_items_to_pred = self.fuse_representations(features_items_to_predict,
                                                             self._embedding_combiner)

    class_prob = self._classifier.predict_proba(fused_features_items_to_pred)

    # for each item we extract the probability that the item is liked (class 1)
    sorted_scores_idxs = np.argsort(class_prob[:, 1])[::-1][:recs_number]
    sorted_items = np.array(idx_items_to_predict)[sorted_scores_idxs]
    sorted_scores = class_prob[:, 1][sorted_scores_idxs]

    uir_rank = np.array([[user_idx, item_idx, score]
                         for item_idx, score in zip(sorted_items, sorted_scores)])

    return uir_rank

Classifiers Implemented

The following classifiers can be passed as the classifier parameter of the ClassifierRecommender class.

SkDecisionTree(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)

Bases: Classifier

Class that implements the Decision Tree Classifier from sklearn. The parameters one could pass are the same ones you would pass instantiating the classifier directly from sklearn

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/classifier/classifiers.py
def __init__(self, *,
             criterion: Any = "gini",
             splitter: Any = "best",
             max_depth: Any = None,
             min_samples_split: Any = 2,
             min_samples_leaf: Any = 1,
             min_weight_fraction_leaf: Any = 0.0,
             max_features: Any = None,
             random_state: Any = None,
             max_leaf_nodes: Any = None,
             min_impurity_decrease: Any = 0.0,
             class_weight: Any = None,
             ccp_alpha: Any = 0.0):
    clf = DecisionTreeClassifier(criterion=criterion, splitter=splitter, max_depth=max_depth,
                                 min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                 min_weight_fraction_leaf=min_weight_fraction_leaf, max_features=max_features,
                                 random_state=random_state, max_leaf_nodes=max_leaf_nodes,
                                 min_impurity_decrease=min_impurity_decrease, class_weight=class_weight,
                                 ccp_alpha=ccp_alpha)

    super().__init__(clf, inspect.currentframe())

SkGaussianProcess(kernel=None, *, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=None)

Bases: Classifier

Class that implements the Gaussian Process Classifier from sklearn. The parameters one could pass are the same ones you would pass instantiating the classifier directly from sklearn

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/classifier/classifiers.py
def __init__(self, kernel: Any = None,
             *,
             optimizer: Any = "fmin_l_bfgs_b",
             n_restarts_optimizer: Any = 0,
             max_iter_predict: Any = 100,
             warm_start: Any = False,
             copy_X_train: Any = True,
             random_state: Any = None,
             multi_class: Any = "one_vs_rest",
             n_jobs: Any = None):

    clf = GaussianProcessClassifier(kernel=kernel, optimizer=optimizer, n_restarts_optimizer=n_restarts_optimizer,
                                    max_iter_predict=max_iter_predict, warm_start=warm_start,
                                    copy_X_train=copy_X_train, random_state=random_state,
                                    multi_class=multi_class, n_jobs=n_jobs)

    super().__init__(clf, inspect.currentframe())

SkKNN(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

Bases: Classifier

Class that implements the KNN Classifier from sklearn. The parameters one could pass are the same ones you would pass instantiating the classifier KNN directly from sklearn.

Sklearn documentation: here

Since the sklearn KNN implementation uses n_neighbors = 5 by default, it can raise an exception when the training data contains fewer samples than that. For this reason, if the dataset is too small and n_neighbors has not been set manually, the n_neighbors parameter is adjusted dynamically according to the number of samples.
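
The exact adjustment rule is not shown in this section. A minimal sketch of the idea, assuming the simplest possible rule (never ask for more neighbours than there are training samples) and a hypothetical helper name, could look like this:

```python
def effective_n_neighbors(n_samples: int, default_n_neighbors: int = 5) -> int:
    """Hypothetical helper: never ask for more neighbours than training samples."""
    if n_samples < 1:
        raise ValueError("at least one training sample is required")
    return min(default_n_neighbors, n_samples)


k_small = effective_n_neighbors(n_samples=3)    # user with only 3 rated items
k_large = effective_n_neighbors(n_samples=40)   # enough samples: default is kept
```

A user with three rated items would be fitted with three neighbours in this sketch, while a user with forty keeps the default of five.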

Source code in clayrs/recsys/content_based_algorithm/classifier/classifiers.py
def __init__(self, n_neighbors: Any = 5,
             *,
             weights: Any = "uniform",
             algorithm: Any = "auto",
             leaf_size: Any = 30,
             p: Any = 2,
             metric: Any = "minkowski",
             metric_params: Any = None,
             n_jobs: Any = None):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, leaf_size=leaf_size,
                               p=p, metric=metric, metric_params=metric_params, n_jobs=n_jobs)

    super().__init__(clf, inspect.currentframe())

SkLogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

Bases: Classifier

Class that implements the Logistic Regression Classifier from sklearn. The parameters one could pass are the same ones you would pass instantiating the classifier directly from sklearn

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/classifier/classifiers.py
def __init__(self, penalty: Any = "l2",
             *,
             dual: Any = False,
             tol: Any = 1e-4,
             C: Any = 1.0,
             fit_intercept: Any = True,
             intercept_scaling: Any = 1,
             class_weight: Any = None,
             random_state: Any = None,
             solver: Any = "lbfgs",
             max_iter: Any = 100,
             multi_class: Any = "auto",
             verbose: Any = 0,
             warm_start: Any = False,
             n_jobs: Any = None,
             l1_ratio: Any = None):
    clf = LogisticRegression(penalty=penalty, dual=dual, tol=tol, C=C, fit_intercept=fit_intercept,
                             intercept_scaling=intercept_scaling, class_weight=class_weight,
                             random_state=random_state, solver=solver, max_iter=max_iter,
                             multi_class=multi_class, verbose=verbose, warm_start=warm_start,
                             n_jobs=n_jobs, l1_ratio=l1_ratio)

    super().__init__(clf, inspect.currentframe())

SkRandomForest(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

Bases: Classifier

Class that implements the Random Forest Classifier from sklearn. The parameters one could pass are the same ones you would pass instantiating the classifier directly from sklearn

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/classifier/classifiers.py
def __init__(self, n_estimators: Any = 100,
             *,
             criterion: Any = "gini",
             max_depth: Any = None,
             min_samples_split: Any = 2,
             min_samples_leaf: Any = 1,
             min_weight_fraction_leaf: Any = 0.0,
             max_features: Any = "auto",
             max_leaf_nodes: Any = None,
             min_impurity_decrease: Any = 0.0,
             bootstrap: Any = True,
             oob_score: Any = False,
             n_jobs: Any = None,
             random_state: Any = None,
             verbose: Any = 0,
             warm_start: Any = False,
             class_weight: Any = None,
             ccp_alpha: Any = 0.0,
             max_samples: Any = None):
    clf = RandomForestClassifier(n_estimators=n_estimators, criterion=criterion, max_depth=max_depth,
                                 min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                 min_weight_fraction_leaf=min_weight_fraction_leaf, max_features=max_features,
                                 max_leaf_nodes=max_leaf_nodes, min_impurity_decrease=min_impurity_decrease,
                                 bootstrap=bootstrap, oob_score=oob_score, n_jobs=n_jobs, random_state=random_state,
                                 verbose=verbose, warm_start=warm_start, class_weight=class_weight,
                                 ccp_alpha=ccp_alpha, max_samples=max_samples)

    super().__init__(clf, inspect.currentframe())

SkSVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

Bases: Classifier

Class that implements the SVC Classifier from sklearn. The parameters one could pass are the same ones you would pass instantiating the classifier SVC directly from sklearn.

Sklearn documentation: here

The only sklearn parameter that cannot be passed is 'probability': it is always set to True, because otherwise SVC would not provide predict_proba, which ClassifierRecommender uses to rank items.

Source code in clayrs/recsys/content_based_algorithm/classifier/classifiers.py
def __init__(self,
             *,
             C: Any = 1.0,
             kernel: Any = "rbf",
             degree: Any = 3,
             gamma: Any = "scale",
             coef0: Any = 0.0,
             shrinking: Any = True,
             tol: Any = 1e-3,
             cache_size: Any = 200,
             class_weight: Any = None,
             verbose: Any = False,
             max_iter: Any = -1,
             decision_function_shape: Any = "ovr",
             break_ties: Any = False,
             random_state: Any = None):

    # Force the probability parameter to True, otherwise SVC won't provide predict_proba
    clf = SVC(C=C, kernel=kernel, degree=degree, gamma=gamma, coef0=coef0, shrinking=shrinking, tol=tol,
              cache_size=cache_size, class_weight=class_weight, verbose=verbose, max_iter=max_iter,
              decision_function_shape=decision_function_shape, break_ties=break_ties, random_state=random_state,
              probability=True)

    super().__init__(clf, inspect.currentframe())