
Linear Predictor

LinearPredictor(item_field, regressor, only_greater_eq=None, embedding_combiner=Centroid())

Bases: PerUserCBAlgorithm

Class that implements recommendation through a specified linear predictor. It's a score prediction algorithm, so it can predict what rating a user would give to an unseen item. As such, it's also a ranking algorithm (it simply ranks unseen items in descending order by predicted rating).

Examples:

  • Interested in a single field representation, LinearRegression regressor from sklearn
>>> from clayrs import recsys as rs
>>> alg = rs.LinearPredictor({"Plot": 0}, rs.SkLinearRegression())
  • Interested in a single field representation, Ridge regressor from sklearn with custom parameters
>>> alg = rs.LinearPredictor({"Plot": 0}, rs.SkRidge(alpha=0.8))
  • Interested in multiple field representations of the items, Ridge regressor from sklearn with custom parameters, only_greater_eq \(= 2\) (every item with rating \(< 2\) will be discarded and not considered in the ranking/score prediction task)
>>> alg = rs.LinearPredictor(
>>>                         item_field={"Plot": [0, "tfidf"],
>>>                                     "Genre": [0, 1],
>>>                                     "Director": "doc2vec"},
>>>                         regressor=rs.SkRidge(alpha=0.8),
>>>                         only_greater_eq=2)

Info

After instantiating the LinearPredictor algorithm, pass it in the initialization of a CBRS and then use its methods to predict ratings or compute rankings for a single user or multiple users:

Examples:

>>> cbrs = rs.ContentBasedRS(algorithm=alg, ...)
>>> cbrs.fit_predict(...)
>>> cbrs.fit_rank(...)
>>> # ...
PARAMETER DESCRIPTION
item_field

dict where the key is the name of the field that contains the content to use and the value is the id(s) of the representation(s) to use for that field. The value can be a string or a list; use a list if you want to use multiple representations for a particular field.

TYPE: dict

regressor

regressor that will be used. Can be any object of the Regressor class.

TYPE: Regressor

only_greater_eq

Threshold for the ratings. Only items with a rating greater than or equal to the threshold will be considered; items with a lower rating will be discarded. If None, no item will be filtered out

TYPE: float DEFAULT: None

embedding_combiner

CombiningTechnique used when an embedding representation must be used but it is in matrix form rather than a single vector (e.g. WordEmbedding representations have one vector for each word). By default, the Centroid of the rows of the matrix is computed

TYPE: CombiningTechnique DEFAULT: Centroid()
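The Centroid combiner can be pictured as averaging the rows of the embedding matrix into a single vector the regressor can consume. A minimal sketch with numpy, where the 4x3 matrix stands in for a hypothetical WordEmbedding representation of a 4-word field:

```python
import numpy as np

# Hypothetical WordEmbedding representation of a 4-word field:
# one 3-dimensional vector per word
word_matrix = np.array([[1.0, 2.0, 3.0],
                        [3.0, 2.0, 1.0],
                        [2.0, 2.0, 2.0],
                        [2.0, 2.0, 2.0]])

# Centroid collapses the matrix into a single vector by
# averaging its rows, so it can be fed to the regressor
single_vector = word_matrix.mean(axis=0)
print(single_vector)  # [2. 2. 2.]
```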

Source code in clayrs/recsys/content_based_algorithm/regressor/linear_predictor.py
def __init__(self, item_field: dict, regressor: Regressor, only_greater_eq: float = None,
             embedding_combiner: CombiningTechnique = Centroid()):
    super().__init__(item_field, only_greater_eq)
    self._regressor = regressor
    self._labels: Optional[list] = None
    self._items_features: Optional[list] = None
    self._embedding_combiner = embedding_combiner

fit_single_user()

Fit the regressor specified in the constructor with the features and labels (rating scores) extracted with the process_rated() method.

It uses private attributes to fit the regressor, so process_rated() must be called before this method.

Source code in clayrs/recsys/content_based_algorithm/regressor/linear_predictor.py
def fit_single_user(self):
    """
    Fit the regressor specified in the constructor with the features and labels (rating scores)
    extracted with the `process_rated()` method.

    It uses private attributes to fit the regressor, so `process_rated()` must be called
    before this method.
    """
    # Fuse the input if there are dicts, multiple representation, etc.
    fused_features = self.fuse_representations(self._items_features, self._embedding_combiner)

    self._regressor.fit(fused_features, self._labels)

    # we delete variables used to fit since will no longer be used
    self._labels = None
    self._items_features = None

predict_single_user(user_idx, train_ratings, available_loaded_items, filter_list)

Predicts how much a user will like unrated items.

The filter_list parameter is usually the result of the filter_single() method of a Methodology object: a list of items represented by their string ids. The ids must necessarily be strings and not their mapped integers, since items are serialized following their string representation!

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsDict

filter_list

list of the items to rank. Should contain string item ids

TYPE: List[str]

RETURNS DESCRIPTION
np.ndarray

uir matrix for a single user containing user and item idxs (integer representation) with the predicted score as the third column
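The returned uir matrix is a plain 2-D numpy array with one row per predicted item and three columns: user idx, item idx, predicted score. A sketch with made-up idxs and scores, mirroring how the source builds the output:

```python
import numpy as np

# Hypothetical output of predict_single_user for user 0
# over three candidate items (item idxs 10, 11, 12)
user_idx = 0
idx_items_to_predict = [10, 11, 12]
score_labels = [3.5, 4.2, 2.1]

# One row per item: [user idx, item idx, predicted score]
uir_pred = np.array([[user_idx, item_idx, score]
                     for item_idx, score in zip(idx_items_to_predict, score_labels)])
print(uir_pred.shape)  # (3, 3)
```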

Source code in clayrs/recsys/content_based_algorithm/regressor/linear_predictor.py
def predict_single_user(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsDict,
                        filter_list: List[str]) -> np.ndarray:
    """
    Predicts how much a user will like unrated items.

    The filter list parameter is usually the result of the `filter_single()` method of a `Methodology` object, and
    is a list of items represented with their string ids. Must be necessarily strings and not their mapped integer
    since items are serialized following their string representation!

    Args:
        user_idx: Mapped integer of the active user
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents
        filter_list: list of the items to rank. Should contain string item ids

    Returns:
        uir matrix for a single user containing user and item idxs (integer representation) with the predicted score
            as third dimension
    """

    idx_items_to_predict, score_labels = self._common_prediction_process(user_idx, train_ratings,
                                                                         available_loaded_items,
                                                                         filter_list)
    if len(score_labels) != 0:
        # Build the output data
        uir_pred = np.array(
            [[user_idx, item_idx, score] for item_idx, score in zip(idx_items_to_predict, score_labels)])
    else:
        uir_pred = np.array([])

    return uir_pred

process_rated(user_idx, train_ratings, available_loaded_items)

Function that extracts features from rated items and labels them. The extracted features will be later used to fit the regressor.

Features and labels (in this case the rating score) will be stored in private attributes of the class.

If there are no rated items available locally, an exception is thrown.

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user (the user for which we must fit the algorithm)

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsDict

RAISES DESCRIPTION
EmptyUserRatings

Exception raised when the user does not appear in the train set

NoRatedItems

Exception raised when there isn't any item available locally rated by the user

Source code in clayrs/recsys/content_based_algorithm/regressor/linear_predictor.py
def process_rated(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsDict):
    """
    Function that extracts features from rated item and labels them.
    The extracted features will be later used to fit the regressor.

    Features and labels (in this case the rating score) will be stored in private attributes of the class.

    IF there are no rated items available locally, an exception is thrown.

    Args:
        user_idx: Mapped integer of the active user (the user for which we must fit the algorithm)
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents

    Raises:
        EmptyUserRatings: Exception raised when the user does not appear in the train set
        NoRatedItems: Exception raised when there isn't any item available locally
            rated by the user
    """
    uir_user = train_ratings.get_user_interactions(user_idx)
    rated_items_id = train_ratings.item_map.convert_seq_int2str(uir_user[:, 1].astype(int))

    # a list since there could be duplicate interaction (eg bootstrap partitioning)
    items_scores_dict = defaultdict(list)

    for item_id, score in zip(rated_items_id, uir_user[:, 2]):
        items_scores_dict[item_id].append(score)

    items_scores_dict = dict(sorted(items_scores_dict.items()))  # sort dictionary based on key for reproducibility

    # Create list of all the available items that are useful for the user
    loaded_rated_items: List[Union[Content, None]] = available_loaded_items.get_list([item_id
                                                                                      for item_id
                                                                                      in rated_items_id])

    # Assign label and extract features from the rated items
    labels = []
    items_features = []

    for item in loaded_rated_items:
        if item is not None:

            score_assigned = map(float, items_scores_dict[item.content_id])

            for score in score_assigned:
                if self.threshold is None or score >= self.threshold:
                    items_features.append(self.extract_features_item(item))
                    labels.append(score)

    if len(uir_user[:, 1]) == 0:
        raise EmptyUserRatings("The user selected doesn't have any ratings!")

    if len(items_features) == 0:
        raise NoRatedItems("User {} - No rated item available locally!".format(user_idx))

    self._labels = labels
    self._items_features = items_features

rank_single_user(user_idx, train_ratings, available_loaded_items, recs_number, filter_list)

Rank the top-n recommended items for the active user, where the items to rank are controlled by the recs_number and filter_list parameters:

  • the former is the number of recommendations to return, the latter is a list of items represented by their string ids. The ids must necessarily be strings and not their mapped integers, since items are serialized following their string representation!

If recs_number is None, all ranked items will be returned

The filter_list parameter is usually the result of the filter_single() method of a Methodology object

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsDict

recs_number

number of the top ranked items to return, if None all ranked items will be returned

TYPE: Optional[int]

filter_list

list of the items to rank. Should contain string item ids

TYPE: List[str]

RETURNS DESCRIPTION
np.ndarray

uir matrix for a single user containing user and item idxs (integer representation) with the ranked score as the third column, sorted in decreasing order
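The top-n selection follows the same argsort-and-reverse logic as the source code above. A small sketch with hypothetical item idxs and scores:

```python
import numpy as np

# Hypothetical predicted scores for items with idxs 10, 11, 12
idx_items_to_predict = [10, 11, 12]
score_labels = np.array([3.5, 4.2, 2.1])
recs_number = 2  # top-2; None would keep all ranked items

# Same logic as the source: argsort ascending, reversed to get
# descending order, truncated to the requested number of recs
sorted_scores_idxs = np.argsort(score_labels)[::-1][:recs_number]
sorted_items = np.array(idx_items_to_predict)[sorted_scores_idxs]
print(sorted_items)  # [11 10]
```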

Source code in clayrs/recsys/content_based_algorithm/regressor/linear_predictor.py
def rank_single_user(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsDict,
                     recs_number: Optional[int], filter_list: List[str]) -> np.ndarray:
    """
    Rank the top-n recommended items for the active user, where the top-n items to rank are controlled by the
    `recs_number` and `filter_list` parameter:

    * the former one is self-explanatory, the second is a list of items
    represented with their string ids. Must be necessarily strings and not their mapped integer since items are
    serialized following their string representation!

    If `recs_number` is `None`, all ranked items will be returned

    The filter list parameter is usually the result of the `filter_single()` method of a `Methodology` object

    Args:
        user_idx: Mapped integer of the active user
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents
        recs_number: number of the top ranked items to return, if None all ranked items will be returned
        filter_list: list of the items to rank. Should contain string item ids

    Returns:
        uir matrix for a single user containing user and item idxs (integer representation) with the ranked score
            as third dimension sorted in a decreasing order
    """

    # Predict the rating for the items and sort them in descending order
    idx_items_to_predict, score_labels = self._common_prediction_process(user_idx, train_ratings,
                                                                         available_loaded_items,
                                                                         filter_list)

    if len(score_labels) != 0:
        sorted_scores_idxs = np.argsort(score_labels)[::-1][:recs_number]
        sorted_items = np.array(idx_items_to_predict)[sorted_scores_idxs]
        sorted_scores = score_labels[sorted_scores_idxs]

        # we construct the output data
        uir_rank = np.array([[user_idx, item_idx, score] for item_idx, score in zip(sorted_items, sorted_scores)])
    else:
        uir_rank = np.array([])

    return uir_rank

Regressors Implemented

The following are the regressors you can use in the regressor parameter of the LinearPredictor class

SkARDRegression(*, n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, compute_score=False, threshold_lambda=10000.0, fit_intercept=True, normalize='deprecated', copy_X=True, verbose=False)

Bases: Regressor

Class that implements the ARD regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor ARD directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, *,
             n_iter: Any = 300,
             tol: Any = 1.0e-3,
             alpha_1: Any = 1.0e-6,
             alpha_2: Any = 1.0e-6,
             lambda_1: Any = 1.0e-6,
             lambda_2: Any = 1.0e-6,
             compute_score: Any = False,
             threshold_lambda: Any = 1.0e4,
             fit_intercept: Any = True,
             normalize: Any = "deprecated",
             copy_X: Any = True,
             verbose: Any = False):
    model = ARDRegression(n_iter=n_iter, tol=tol, alpha_1=alpha_1, alpha_2=alpha_2, lambda_1=lambda_1,
                          lambda_2=lambda_2, compute_score=compute_score, threshold_lambda=threshold_lambda,
                          fit_intercept=fit_intercept, normalize=normalize, copy_X=copy_X, verbose=verbose)
    super().__init__(model, inspect.currentframe())

SkBayesianRidge(*, n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, alpha_init=None, lambda_init=None, compute_score=False, fit_intercept=True, normalize='deprecated', copy_X=True, verbose=False)

Bases: Regressor

Class that implements the BayesianRidge regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor BayesianRidge directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, *,
             n_iter: Any = 300,
             tol: Any = 1.0e-3,
             alpha_1: Any = 1.0e-6,
             alpha_2: Any = 1.0e-6,
             lambda_1: Any = 1.0e-6,
             lambda_2: Any = 1.0e-6,
             alpha_init: Any = None,
             lambda_init: Any = None,
             compute_score: Any = False,
             fit_intercept: Any = True,
             normalize: Any = "deprecated",
             copy_X: Any = True,
             verbose: Any = False):
    model = BayesianRidge(n_iter=n_iter, tol=tol, alpha_1=alpha_1, alpha_2=alpha_2, lambda_1=lambda_1,
                          lambda_2=lambda_2, alpha_init=alpha_init, lambda_init=lambda_init,
                          compute_score=compute_score, fit_intercept=fit_intercept, normalize=normalize,
                          copy_X=copy_X, verbose=verbose)

    super().__init__(model, inspect.currentframe())

SkHuberRegressor(*, epsilon=1.35, max_iter=100, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)

Bases: Regressor

Class that implements the Huber regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor Huber directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, *,
             epsilon: Any = 1.35,
             max_iter: Any = 100,
             alpha: Any = 0.0001,
             warm_start: Any = False,
             fit_intercept: Any = True,
             tol: Any = 1e-05):
    model = HuberRegressor(epsilon=epsilon, max_iter=max_iter, alpha=alpha,
                           warm_start=warm_start, fit_intercept=fit_intercept, tol=tol)

    super().__init__(model, inspect.currentframe())

SkLinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

Bases: Regressor

Class that implements the LinearRegression regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor LinearRegression directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, *, fit_intercept: Any = True,
             normalize: Any = "deprecated",
             copy_X: Any = True,
             n_jobs: Any = None,
             positive: Any = False):
    model = LinearRegression(fit_intercept=fit_intercept, normalize=normalize, copy_X=copy_X, n_jobs=n_jobs,
                             positive=positive)

    super().__init__(model, inspect.currentframe())

SkPassiveAggressiveRegressor(*, C=1.0, fit_intercept=True, max_iter=1000, tol=0.001, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, shuffle=True, verbose=0, loss='epsilon_insensitive', epsilon=DEFAULT_EPSILON, random_state=None, warm_start=False, average=False)

Bases: Regressor

Class that implements the PassiveAggressive regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor PassiveAggressive directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, *,
             C: Any = 1.0,
             fit_intercept: Any = True,
             max_iter: Any = 1000,
             tol: Any = 1e-3,
             early_stopping: Any = False,
             validation_fraction: Any = 0.1,
             n_iter_no_change: Any = 5,
             shuffle: Any = True,
             verbose: Any = 0,
             loss: Any = "epsilon_insensitive",
             epsilon: Any = DEFAULT_EPSILON,
             random_state: Any = None,
             warm_start: Any = False,
             average: Any = False):

    model = PassiveAggressiveRegressor(C=C, fit_intercept=fit_intercept, max_iter=max_iter, tol=tol,
                                       early_stopping=early_stopping, validation_fraction=validation_fraction,
                                       n_iter_no_change=n_iter_no_change, shuffle=shuffle, verbose=verbose,
                                       loss=loss, epsilon=epsilon, random_state=random_state, warm_start=warm_start,
                                       average=average)
    super().__init__(model, inspect.currentframe())

SkRidge(alpha=1.0, *, fit_intercept=True, normalize='deprecated', copy_X=True, max_iter=None, tol=0.001, solver='auto', positive=False, random_state=None)

Bases: Regressor

Class that implements the Ridge regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor Ridge directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, alpha: Any = 1.0,
             *,
             fit_intercept: Any = True,
             normalize: Any = "deprecated",
             copy_X: Any = True,
             max_iter: Any = None,
             tol: Any = 1e-3,
             solver: Any = "auto",
             positive: Any = False,
             random_state: Any = None):
    model = Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, copy_X=copy_X,
                  max_iter=max_iter, tol=tol, solver=solver, positive=positive, random_state=random_state)

    super().__init__(model, inspect.currentframe())

SkSGDRegressor(loss='squared_error', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=DEFAULT_EPSILON, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)

Bases: Regressor

Class that implements the SGD regressor from sklearn. The parameters one could pass are the same ones you would pass instantiating the regressor SGD directly from sklearn.

Sklearn documentation: here

Source code in clayrs/recsys/content_based_algorithm/regressor/regressors.py
def __init__(self, loss: Any = "squared_error",
             *,
             penalty: Any = "l2",
             alpha: Any = 0.0001,
             l1_ratio: Any = 0.15,
             fit_intercept: Any = True,
             max_iter: Any = 1000,
             tol: Any = 1e-3,
             shuffle: Any = True,
             verbose: Any = 0,
             epsilon: Any = DEFAULT_EPSILON,
             random_state: Any = None,
             learning_rate: Any = "invscaling",
             eta0: Any = 0.01,
             power_t: Any = 0.25,
             early_stopping: Any = False,
             validation_fraction: Any = 0.1,
             n_iter_no_change: Any = 5,
             warm_start: Any = False,
             average: Any = False):
    model = SGDRegressor(loss=loss, penalty=penalty, alpha=alpha, l1_ratio=l1_ratio, fit_intercept=fit_intercept,
                         max_iter=max_iter, tol=tol, shuffle=shuffle, verbose=verbose, epsilon=epsilon,
                         random_state=random_state, learning_rate=learning_rate, eta0=eta0, power_t=power_t,
                         early_stopping=early_stopping, validation_fraction=validation_fraction,
                         n_iter_no_change=n_iter_no_change, warm_start=warm_start, average=average)
    super().__init__(model, inspect.currentframe())