Index Query

IndexQuery(item_field, classic_similarity=True, threshold=None)

Bases: PerUserCBAlgorithm

Class for the search engine recommender using an index. It first builds a query from the specified representation(s) of the positive items, then runs that query against the index: every item receives a score of "closeness" to the query, and this score is used to rank the items.

Use textual representation(s) only: both the query and the search are meaningful only on text.

Examples:

  • Interested in a single field representation, classic TF-IDF similarity, threshold \(= 3\) (every item with rating \(>= 3\) will be considered positive)
>>> from clayrs import recsys as rs
>>> alg = rs.IndexQuery({"Plot": 0}, threshold=3)
  • Interested in multiple field representations of the items, BM25 similarity, threshold \(= None\) (every item with rating \(>=\) the mean rating of the user will be considered positive)
>>> alg = rs.IndexQuery(
...     item_field={"Plot": [0, "original_text"],
...                 "Genre": [0, 1],
...                 "Director": "preprocessed_text"},
...     classic_similarity=False,
...     threshold=None)

Info

After instantiating the IndexQuery algorithm, pass it to the initialization of a CBRS, then use the CBRS methods to calculate the ranking for a single user or for multiple users:

Examples:

>>> cbrs = rs.ContentBasedRS(algorithm=alg, ...)
>>> cbrs.fit_rank(...)
>>> # ...
PARAMETER DESCRIPTION
item_field

dict where each key is the name of a field that contains the content to use, and each value is the id (or ids) of the representation(s) to use for that field. Use textual representation(s) only. A value can be a single representation id (an integer index or a string, as in the examples above) or a list of ids when multiple representations of the same field should be used.

TYPE: dict

classic_similarity

True to use the classic TF-IDF implementation in Whoosh, False to use BM25F

TYPE: bool DEFAULT: True

threshold

Threshold for the ratings. If a rating is greater than or equal to the threshold, the item is considered positive. If the threshold is not specified, the mean rating of all items rated by the user is used.

TYPE: float DEFAULT: None
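
The threshold rule can be illustrated with a minimal, standard-library-only sketch. The helper `positive_ratings` is hypothetical and not part of ClayRS; it only mirrors the inclusive comparison and the fallback to the user's mean rating described above, skipping NaN values in the same spirit as a NaN-aware mean.

```python
import math
from statistics import mean
from typing import List, Optional


def positive_ratings(ratings: List[float], threshold: Optional[float] = None) -> List[float]:
    """Return the ratings counted as positive (hypothetical helper, not a ClayRS API)."""
    # Ignore missing ratings (NaN) before computing the mean
    valid = [r for r in ratings if not math.isnan(r)]
    if threshold is None:
        # No explicit threshold: fall back to the user's own mean rating
        threshold = mean(valid)
    # The comparison is inclusive: a rating equal to the threshold is positive
    return [r for r in valid if r >= threshold]


user_ratings = [5.0, 3.0, 1.0, 4.0]
print(positive_ratings(user_ratings, threshold=3))  # [5.0, 3.0, 4.0]
print(positive_ratings(user_ratings))               # mean is 3.25, so [5.0, 4.0]
```

With an explicit threshold of 3 the rating 3.0 is kept, while with the mean-based threshold (3.25) it is dropped.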

Source code in clayrs/recsys/content_based_algorithm/index_query/index_query.py
def __init__(self, item_field: dict, classic_similarity: bool = True, threshold: float = None):
    super().__init__(item_field, threshold)
    self._string_query: Optional[str] = None
    self._scores: Optional[list] = None
    # list of (doc_id, doc_data) tuples, filled in by process_rated()
    self._positive_user_docs: Optional[list] = None
    self._classic_similarity: bool = classic_similarity

fit_single_user()

The fit process for the IndexQuery consists of building a query using ONLY the features of the positive items (items that the user liked). The terms of each positive item are boosted by the rating the user gave to that item.

This method uses extracted features of the positive items stored in a private attribute, so process_rated() must be called before this method.

The built query will also be stored in a private attribute.
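
As a rough illustration of the query format, the sketch below rebuilds the same string outside the class. The function name `build_query` and the sample document are made up for this example; the field-group and `^score` boost layout follows the source listing of `fit_single_user()`.

```python
from typing import Dict, List, Tuple


def build_query(positive_docs: List[Tuple[str, Dict[str, str]]], scores: List[float]) -> str:
    """Build a boosted query string from positive documents (illustrative helper)."""
    parts = ["("]
    for (_doc_id, doc_data), score in zip(positive_docs, scores):
        parts.append("(")
        for field_name, text in doc_data.items():
            if field_name == "content_id":
                continue  # the id is not searchable content
            # every term is followed by a space, as in the original loop
            terms = "".join(term + " " for term in text.split())
            parts.append(f"{field_name}:({terms}) ")
        # boost the whole document group by the rating the user gave
        parts.append(f")^{score} ")
    parts.append(") ")
    return "".join(parts)


# Made-up positive item with a single textual field
docs = [("item_1", {"content_id": "item_1", "Plot": "pet detective"})]
print(repr(build_query(docs, [4.0])))  # '((Plot:(pet detective ) )^4.0 ) '
```

Each positive document becomes one parenthesized group, so a higher rating gives all of that document's terms more weight in the search.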

Source code in clayrs/recsys/content_based_algorithm/index_query/index_query.py
def fit_single_user(self):
    """
    The fit process for the IndexQuery consists in building a query using the features of the positive items ONLY
    (items that the user liked). The terms relative to these 'positive' items are boosted by the
    rating he/she/it gave.

    This method uses extracted features of the positive items stored in a private attribute, so
    `process_rated()` must be called before this method.

    The built query will also be stored in a private attribute.
    """
    # For each field of each document one string (containing the name of the field and the data in it)
    # is created and added to the query.
    # Also each part of the query that refers to a document
    # is boosted by the score given by the user to said document
    string_query = "("
    for (doc_id, doc_data), score in zip(self._positive_user_docs, self._scores):
        string_query += "("
        for field_name in doc_data:
            if field_name == 'content_id':
                continue
            word_list = doc_data[field_name].split()
            string_query += field_name + ":("
            for term in word_list:
                string_query += term + " "
            string_query += ") "
        string_query += ")^" + str(score) + " "
    string_query += ") "

    self._string_query = string_query

predict_single_user(user_idx, train_ratings, available_loaded_items, filter_list)

IndexQuery is not a score prediction algorithm, so calling this method raises a NotPredictionAlg exception.

RAISES DESCRIPTION
NotPredictionAlg

exception raised since the IndexQuery algorithm is not a score prediction algorithm

Source code in clayrs/recsys/content_based_algorithm/index_query/index_query.py
def predict_single_user(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsIndex,
                        filter_list: List[str]) -> np.ndarray:
    """
    IndexQuery is not a Prediction Score Algorithm, so if this method is called,
    a NotPredictionAlg exception is raised

    Raises:
        NotPredictionAlg: exception raised since the IndexQuery algorithm is not a score prediction algorithm
    """
    raise NotPredictionAlg("IndexQuery is not a Score Prediction Algorithm!")

process_rated(user_idx, train_ratings, available_loaded_items)

Function that extracts features ONLY from the positively rated items of a user. The extracted features will be used to fit the algorithm (build the query).

The extracted features will be stored in private attributes of the class.

If there are no rated items available locally, or if there are only negative items, an exception is raised.

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user (the user for which we must fit the algorithm)

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsIndex

RAISES DESCRIPTION
EmptyUserRatings

Exception raised when the user does not appear in the train set

OnlyNegativeItems

Exception raised when there are only negative items available locally for the user (Items that the user disliked)
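
Before selecting positive items, the method groups the ratings by item (an item can be rated more than once, for example with bootstrap partitioning) and sorts the groups by item id for reproducibility. A minimal sketch of that step, using a hypothetical helper `group_scores` and made-up ids:

```python
from collections import defaultdict
from typing import Dict, List


def group_scores(item_ids: List[str], scores: List[float]) -> Dict[str, List[float]]:
    """Collect all scores per item id, ordered by id (illustrative helper)."""
    grouped = defaultdict(list)
    for item_id, score in zip(item_ids, scores):
        # a list per item, since the same interaction may appear more than once
        grouped[item_id].append(score)
    # sorting by key makes the iteration order deterministic
    return dict(sorted(grouped.items()))


grouped = group_scores(["i2", "i1", "i2"], [4.0, 2.0, 5.0])
print(grouped)  # {'i1': [2.0], 'i2': [4.0, 5.0]}
```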

Source code in clayrs/recsys/content_based_algorithm/index_query/index_query.py
def process_rated(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsIndex):
    """
    Function that extracts features ONLY from the positively rated items of a user.
    The extracted features will be used to fit the algorithm (build the query).

    The extracted features will be stored in private attributes of the class.

    If there are no rated items available locally, or if there are only negative
    items, an exception is raised.

    Args:
        user_idx: Mapped integer of the active user (the user for which we must fit the algorithm)
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents

    Raises:
        EmptyUserRatings: Exception raised when the user does not appear in the train set
        OnlyNegativeItems: Exception raised when there are only negative items available locally
            for the user (Items that the user disliked)
    """

    uir_user = train_ratings.get_user_interactions(user_idx)
    rated_items_id = train_ratings.item_map.convert_seq_int2str(uir_user[:, 1].astype(int))

    # a list since there could be duplicate interaction (eg bootstrap partitioning)
    items_scores_dict = defaultdict(list)

    for item_id, score in zip(rated_items_id, uir_user[:, 2]):
        items_scores_dict[item_id].append(score)

    items_scores_dict = dict(sorted(items_scores_dict.items()))  # sort dictionary based on key for reproducibility

    threshold = self.threshold
    if threshold is None:
        threshold = np.nanmean(uir_user[:, 2])

    # Initializes positive_user_docs which is a list that has tuples with document_id as first element and
    # a dictionary as second. The dictionary value has the name of the field as key
    # and its contents as value. By doing so we obtain the data of the fields while
    # also storing information regarding the field and the document where it was
    scores = []
    positive_user_docs = []

    ix = available_loaded_items.get_contents_interface()

    # we extract feature of each item sorted based on its key: IMPORTANT for reproducibility!!
    # we must convert keys (which are strings) to the respective int idx to build the uir
    for (item_id, (item_idx, score_list)) in zip(rated_items_id, items_scores_dict.items()):

        score_assigned = map(float, score_list)

        for score in score_assigned:
            if score >= threshold:
                # {item_id: {"item": item_dictionary, "score": item_score}}
                item_query = ix.query(item_id, results_number=1, classic_similarity=self._classic_similarity)
                if len(item_query) != 0:
                    item = item_query.pop(item_id).get('item')
                    scores.append(score)
                    positive_user_docs.append((item_idx, self._get_representations(item)))

    if len(uir_user[:, 1]) == 0:
        raise EmptyUserRatings("The user selected doesn't have any ratings!")

    if len(positive_user_docs) == 0:
        raise OnlyNegativeItems(f"User {user_idx} - There are no rated items available locally or there are only "
                                f"negative items available locally!")

    self._positive_user_docs = positive_user_docs
    self._scores = scores

rank_single_user(user_idx, train_ratings, available_loaded_items, recs_number, filter_list)

Rank the top-n recommended items for the active user. Which items are ranked, and how many are returned, is controlled by the recs_number and filter_list parameters:

  • recs_number is the number of top-ranked items to return. If it is None, all ranked items are returned.
  • filter_list is the list of items to rank, given as string ids. They must be strings, not the mapped integers, since items are serialized following their string representation.

The filter_list parameter is usually the result of the filter_single() method of a Methodology object.

PARAMETER DESCRIPTION
user_idx

Mapped integer of the active user

TYPE: int

train_ratings

Ratings object which contains the train set of each user

TYPE: Ratings

available_loaded_items

The LoadedContents interface which contains loaded contents

TYPE: LoadedContentsIndex

recs_number

number of the top ranked items to return, if None all ranked items will be returned

TYPE: Optional[int]

filter_list

list of the items to rank. Should contain string item ids

TYPE: List[str]

RETURNS DESCRIPTION
np.ndarray

uir matrix for a single user: each row holds the user idx, the item idx (both in integer representation) and the ranking score as third column, with rows sorted by score in decreasing order
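
The shape of the returned matrix can be sketched with plain lists. Everything here is illustrative: `build_uir_rank`, the item ids and the string-to-integer map are made up, and the search results are assumed to arrive already ordered by decreasing score.

```python
from typing import Dict, List


def build_uir_rank(user_idx: int,
                   score_docs: Dict[str, Dict[str, float]],
                   item_str2int: Dict[str, int]) -> List[List[float]]:
    """Turn search results into [user_idx, item_idx, score] rows (illustrative helper)."""
    # string item ids are converted to their mapped integer idx;
    # the row order follows the order of the search results
    return [[user_idx, item_str2int[item_id], result["score"]]
            for item_id, result in score_docs.items()]


# Assumed search output: string item id -> {"score": ...}, best match first
score_docs = {"i7": {"score": 2.5}, "i3": {"score": 1.2}}
item_map = {"i3": 0, "i7": 1}

rows = build_uir_rank(0, score_docs, item_map)
print(rows)  # [[0, 1, 2.5], [0, 0, 1.2]]
```

In the library the rows are wrapped in a NumPy array; plain lists are used here only to keep the sketch dependency-free.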

Source code in clayrs/recsys/content_based_algorithm/index_query/index_query.py
def rank_single_user(self, user_idx: int, train_ratings: Ratings, available_loaded_items: LoadedContentsIndex,
                     recs_number: Optional[int], filter_list: List[str]) -> np.ndarray:
    """
    Rank the top-n recommended items for the active user, where the top-n items to rank are controlled by the
    `recs_number` and `filter_list` parameter:

    * the former one is self-explanatory, the second is a list of items
    represented with their string ids. Must be necessarily strings and not their mapped integer since items are
    serialized following their string representation!

    If `recs_number` is `None`, all ranked items will be returned

    The filter list parameter is usually the result of the `filter_single()` method of a `Methodology` object

    Args:
        user_idx: Mapped integer of the active user
        train_ratings: `Ratings` object which contains the train set of each user
        available_loaded_items: The LoadedContents interface which contains loaded contents
        recs_number: number of the top ranked items to return, if None all ranked items will be returned
        filter_list: list of the items to rank. Should contain string item ids

    Returns:
        uir matrix for a single user containing user and item idxs (integer representation) with the ranked score
            as third dimension sorted in a decreasing order
    """
    uir_user = train_ratings.get_user_interactions(user_idx)
    if len(uir_user) == 0:
        raise EmptyUserRatings("The user selected doesn't have any ratings!")

    user_seen_items = train_ratings.item_map.convert_seq_int2str(uir_user[:, 1].astype(int))
    mask_list = self._build_mask_list(user_seen_items, filter_list)

    ix = available_loaded_items.get_contents_interface()
    score_docs = ix.query(self._string_query, recs_number, mask_list, filter_list, self._classic_similarity)

    # we must convert keys (which are strings) to the respective int idx to build the uir
    score_list_idxs = train_ratings.item_map.convert_seq_str2int(list(score_docs.keys()))

    # we construct the output data
    uir_rank = np.array([[user_idx, item_idx, score_docs[item_id]['score']]
                         for item_idx, item_id in zip(score_list_idxs, score_docs)])

    return uir_rank