
Ranking metrics

Ranking metrics evaluate the quality of the recommendation lists

Correlation(method='pearson', top_n=None)

Bases: RankingMetric

The Correlation metric calculates the correlation between the ranking of a user and its ideal ranking. The correlation methods currently implemented are:

  • pearson
  • kendall
  • spearman

Every correlation method is implemented by the pandas library, so refer to its documentation for more information

The correlation metric is calculated as such for the single user:

\[ Corr_u = Corr(ranking_u, ideal\_ranking_u) \]

Where:

  • \(ranking_u\) is the ranking of the user
  • \(ideal\_ranking_u\) is the ideal ranking for the user

The ideal ranking is calculated based on the ratings inside the ground truth of the user

The Correlation metric calculated for the entire system is simply the average of every \(Corr_u\):

\[ Corr_{sys} = \frac{\sum_{u} Corr_u}{|U|} \]

Where:

  • \(Corr_u\) is the correlation of the user \(u\)
  • \(U\) is the set of all users

The system average excludes NaN values.

It's also possible to specify a cutoff thanks to the 'top_n' parameter: if specified, only the first \(n\) results of the recommendation list will be used to calculate the correlation
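
For illustration, the per-user score can be reproduced with plain pandas (a minimal sketch with made-up data, not the library code): map every item to its position in the predicted ranking and in the ideal ranking, then correlate the two position series.

import pandas as pd

# Toy data for a single user
predicted_ranking = ["i4", "i1", "i3", "i2"]                  # best first
truth_ratings = {"i1": 5.0, "i2": 2.0, "i3": 4.0, "i4": 3.0}  # ground truth

# Ideal ranking: items sorted by their ground truth rating, best first
ideal_ranking = sorted(truth_ratings, key=truth_ratings.get, reverse=True)

# Position of every item in the predicted and in the ideal ranking
pred_pos = pd.Series({item: pos for pos, item in enumerate(predicted_ranking)})
ideal_pos = pd.Series({item: pos for pos, item in enumerate(ideal_ranking)})

# pandas offers 'pearson', 'kendall' and 'spearman'; the two Series are aligned on the item id
corr_u = pred_pos.corr(ideal_pos, method="spearman")          # 0.4 for this toy data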

PARAMETER DESCRIPTION
method

The correlation method to use. It must be 'pearson', 'kendall' or 'spearman', otherwise a ValueError exception is raised. By default it is 'pearson'

TYPE: str DEFAULT: 'pearson'

top_n

Cutoff parameter: if specified, only the first n items of the recommendation list will be used to calculate the correlation

TYPE: int DEFAULT: None

RAISES DESCRIPTION
ValueError

if an invalid method parameter is passed

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, method: str = 'pearson', top_n: int = None):
    valid = {'pearson', 'kendall', 'spearman'}
    self.__method = method.lower()

    if self.__method not in valid:
        raise ValueError("Method {} is not supported! Methods available:\n"
                         "{}".format(method, valid))

    self.__top_n = top_n

MAP(relevant_threshold=None)

Bases: RankingMetric

The \(MAP\) metric (Mean Average Precision) is a ranking metric computed by first calculating the \(AP\) (Average Precision) for each user and then taking its mean.

The \(AP\) is calculated as such for the single user:

\[ AP_u = \frac{1}{m_u}\sum_{i=1}^{N_u}P(i)\cdot rel(i) \]

Where:

  • \(m_u\) is the number of relevant items for the user \(u\)
  • \(N_u\) is the number of recommended items for the user \(u\)
  • \(P(i)\) is the precision computed at cutoff \(i\)
  • \(rel(i)\) is an indicator variable that says whether the i-th item is relevant (\(rel(i)=1\)) or not (\(rel(i)=0\))

After computing the \(AP\) for each user, we can compute the \(MAP\) for the whole system:

\[ MAP_{sys} = \frac{1}{|U|}\sum_{u}AP_u \]

This metric will return the \(AP\) computed for each user in the dataframe containing users' results, and the \(MAP\) computed for the whole system in the dataframe containing system results
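
As an illustration of the formulas above (a sketch with toy data, not the library implementation), \(AP_u\) and \(MAP_{sys}\) can be computed with plain Python and numpy:

import numpy as np

def average_precision(user_rank, user_relevant):
    # AP_u: precision at every position holding a relevant item, divided by m_u
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(user_rank, start=1):
        if item in user_relevant:
            hits += 1
            precision_sum += hits / i  # P(i) * rel(i), with rel(i) = 1
    return precision_sum / len(user_relevant) if user_relevant else np.nan

# Toy data: one recommendation list per user and the relevant items of each user
rankings = {"u1": ["i1", "i2", "i3"], "u2": ["i3", "i1", "i2"]}
relevant = {"u1": {"i1", "i3"}, "u2": {"i2"}}

ap_per_user = {u: average_precision(rankings[u], relevant[u]) for u in rankings}
map_sys = np.mean(list(ap_per_user.values()))  # mean of AP_u over all users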

PARAMETER DESCRIPTION
relevant_threshold

parameter needed to discern relevant items from non-relevant items for every user. If not specified, the mean rating score of every user will be used

TYPE: float DEFAULT: None

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, relevant_threshold: float = None):
    self.relevant_threshold = relevant_threshold

MAPAtK(k, relevant_threshold=None)

Bases: MAP

The \(MAP@K\) metric (Mean Average Precision at K) is a ranking metric computed by first calculating the \(AP@K\) (Average Precision at K) for each user and then taking its mean.

The \(AP@K\) is calculated as such for the single user:

\[ AP@K_u = \frac{1}{m_u}\sum_{i=1}^{K}P(i)\cdot rel(i) \]

Where:

  • \(m_u\) is the number of relevant items for the user \(u\)
  • \(K\) is the cutoff value
  • \(P(i)\) is the precision computed at cutoff \(i\)
  • \(rel(i)\) is an indicator variable that says whether the i-th item is relevant (\(rel(i)=1\)) or not (\(rel(i)=0\))

After computing the \(AP@K\) for each user, we can compute the \(MAP@K\) for the whole system:

\[ MAP@K_{sys} = \frac{1}{|U|}\sum_{u}AP@K_u \]

This metric will return the \(AP@K\) computed for each user in the dataframe containing users' results, and the \(MAP@K\) computed for the whole system in the dataframe containing system results
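
Reusing the average_precision sketch from the \(MAP\) section above (a hypothetical helper, not the library code), the cutoff simply truncates the recommendation list before the same computation:

def average_precision_at_k(user_rank, user_relevant, k):
    # Only the first K positions of the ranking can contribute to the sum
    return average_precision(user_rank[:k], user_relevant)

ap_at_2 = average_precision_at_k(["i1", "i2", "i3"], {"i1", "i3"}, k=2)  # -> 0.5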

PARAMETER DESCRIPTION
k

the cutoff parameter. It must be >= 1, otherwise a ValueError exception is raised

TYPE: int

relevant_threshold

parameter needed to discern relevant items from non-relevant items for every user. If not specified, the mean rating score of every user will be used

TYPE: float DEFAULT: None

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, k: int, relevant_threshold: float = None):
    super().__init__(relevant_threshold)
    self.k = k

MRR(relevant_threshold=None)

Bases: RankingMetric

The MRR (Mean Reciprocal Rank) metric is a system wide metric, so only its result will be returned and not those of every user. MRR is calculated as such:

\[ MRR_{sys} = \frac{1}{|Q|}\cdot\sum_{i=1}^{|Q|}\frac{1}{rank(i)} \]

Where:

  • \(Q\) is the set of recommendation lists
  • \(rank(i)\) is the position of the first relevant item in the i-th recommendation list

The MRR metric needs to discern relevant items from non-relevant ones: to do so, a custom relevant_threshold parameter can be passed and applied to every user, so that an item whose rating is >= relevant_threshold is considered relevant, otherwise it is not. If no relevant_threshold is passed, the mean rating score of every user will be used
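
A minimal sketch of the computation with toy data (an illustration, not the library's calc_reciprocal_rank shown further below): the reciprocal rank of each list is 1 divided by the position of its first relevant item, 0 if the list contains none, and MRR is the mean of these values.

import numpy as np

def reciprocal_rank(ranked_items, relevant_items):
    # 1 / rank of the first relevant item; 0 if the list holds no relevant item
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0

# Toy data: first relevant item at rank 2 in the first list, at rank 1 in the second
rr_values = [
    reciprocal_rank(["i5", "i1", "i3"], {"i1", "i3"}),  # 1/2
    reciprocal_rank(["i2", "i4"], {"i2"}),              # 1/1
]
mrr_sys = np.mean(rr_values)                            # (0.5 + 1.0) / 2 = 0.75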

PARAMETER DESCRIPTION
relevant_threshold

parameter needed to discern relevant items from non-relevant items for every user. If not specified, the mean rating score of every user will be used

TYPE: float DEFAULT: None

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, relevant_threshold: float = None):
    self.__relevant_threshold = relevant_threshold

calc_reciprocal_rank(user_predictions_items, user_truth_relevant_items)

Method which calculates the RR (Reciprocal Rank) for a single user

PARAMETER DESCRIPTION
user_predictions_items

list of ranked item ids for the user computed by the Recommender

TYPE: np.ndarray

user_truth_relevant_items

list of relevant item ids for the user in its truth set

TYPE: np.ndarray

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def calc_reciprocal_rank(self, user_predictions_items: np.ndarray, user_truth_relevant_items: np.ndarray):
    """
    Method which calculates the RR (Reciprocal Rank) for a single user

    Args:
        user_predictions_items: list of ranked item ids for the user computed by the Recommender
        user_truth_relevant_items: list of relevant item ids for the user in its truth set
    """

    # npi.indices returns, for each predicted item, its index in the relevant items (-1 if missing)
    common_idxs = npi.indices(user_truth_relevant_items, user_predictions_items, missing=-1)
    non_missing_idxs = np.where(common_idxs != -1)[0]  # [0] because np.where returns a tuple

    reciprocal_rank = 0
    if len(non_missing_idxs) != 0:
        reciprocal_rank = 1 / (non_missing_idxs[0] + 1)  # 1 / (1-based rank of the first relevant prediction)

    return reciprocal_rank

MRRAtK(k, relevant_threshold=None)

Bases: MRR

The MRR@K (Mean Reciprocal Rank at K) metric is a system wide metric, so only its result will be returned and not those of every user. MRR@K is calculated as such:

\[ MRR@K_{sys} = \frac{1}{|Q|}\cdot\sum_{i=1}^{|Q|}\frac{1}{rank@K(i)} \]

Where:

  • \(K\) is the cutoff parameter
  • \(Q\) is the set of recommendation lists
  • \(rank@K(i)\) is the position of the first relevant item among the first \(K\) recommendations of the i-th recommendation list (see the sketch after this list)
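
Reusing the reciprocal_rank sketch from the MRR section above (an illustration, not the library code), the cutoff only changes which part of each list is inspected:

def reciprocal_rank_at_k(ranked_items, relevant_items, k):
    # Only the first K recommendations can contribute to the reciprocal rank
    return reciprocal_rank(ranked_items[:k], relevant_items)

rr_at_2 = reciprocal_rank_at_k(["i5", "i4", "i1"], {"i1"}, k=2)  # first hit is at rank 3 -> 0.0
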
PARAMETER DESCRIPTION
k

the cutoff parameter. It must be >= 1, otherwise a ValueError exception is raised

TYPE: int

relevant_threshold

parameter needed to discern relevant items from non-relevant items for every user. If not specified, the mean rating score of every user will be used

TYPE: float DEFAULT: None

RAISES DESCRIPTION
ValueError

if an invalid cutoff parameter is passed (0 or negative)

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, k: int, relevant_threshold: float = None):
    if k < 1:
        raise ValueError('k={} not valid! k must be >= 1!'.format(k))
    self.__k = k
    super().__init__(relevant_threshold)

calc_reciprocal_rank(user_predictions_items, user_truth_relevant_items)

Method which calculates the RR (Reciprocal Rank) for a single user

PARAMETER DESCRIPTION
user_predictions_items

list of ranked item ids for the user computed by the Recommender

TYPE: np.ndarray

user_truth_relevant_items

list of relevant item ids for the user in its truth set

TYPE: np.ndarray

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def calc_reciprocal_rank(self, user_predictions_items: np.ndarray, user_truth_relevant_items: np.ndarray):
    """
    Method which calculates the RR (Reciprocal Rank) for a single user

    Args:
        user_predictions_items: list of ranked item ids for the user computed by the Recommender
        user_truth_relevant_items: list of relevant item ids for the user in its truth set
    """
    user_predictions_cut = user_predictions_items[:self.k]

    return super().calc_reciprocal_rank(user_predictions_cut, user_truth_relevant_items)

NDCG(gains='linear', discount_log=np.log2)

Bases: RankingMetric

The NDCG (Normalized Discounted Cumulative Gain) metric is calculated for the single user by first computing the DCG score using the following formula:

\[ DCG_{u}(scores_{u}) = \sum_{i}{\frac{f(r_i)}{\log_x(i + 2)}} \]

Where:

  • \(scores_{u}\) are the ground truth scores for predicted items, ordered according to the order of said items in the ranking for the user \(u\)
  • \(f\) is a gain function (linear or exponential, in particular)
  • \(x\) is the base of the logarithm
  • \(i\) is the index (starting from 0) of the truth score \(r_i\) in the list of scores \(scores_{u}\)

If \(f\) is "linear", then the truth score \(r\) is returned as is. Otherwise, in the "exponential" case, the following formula is applied to \(r\):

\[ f(r) = 2^{r} - 1 \]

The NDCG for a single user is then calculated using the following formula:

\[ NDCG_u(scores_{u}) = \frac{DCG_{u}(scores_{u})}{IDCG_{u}(scores_{u})} \]

Where:

  • \(IDCG_{u}\) is the DCG of the ideal ranking for the truth scores

So the basic idea is to compare the actual ranking with the ideal one

Finally, the NDCG of the entire system is calculated instead as such:

\[ NDCG_{sys} = \frac{\sum_{u} NDCG_u}{|U|} \]

Where:

  • \(NDCG_u\) is the NDCG calculated for user \(u\)
  • \(U\) is the set of all users

The system average excludes NaN values.
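
The formulas above can be sketched with plain numpy as follows (an illustration with toy scores, not the library implementation):

import numpy as np

def dcg(scores, gains_fn=lambda r: r, discount_log=np.log2):
    # scores: truth scores ordered as in the predicted ranking;
    # position i (0-based) is discounted by log_x(i + 2)
    positions = np.arange(len(scores))
    return float(np.sum(gains_fn(scores) / discount_log(positions + 2)))

def ndcg(scores, **kwargs):
    ideal = np.sort(scores)[::-1]  # ideal order: truth scores sorted descending
    idcg = dcg(ideal, **kwargs)
    return dcg(scores, **kwargs) / idcg if idcg != 0 else np.nan

# Truth scores of the recommended items, in the order produced by the recommender
scores_u = np.array([3.0, 5.0, 2.0, 4.0])
ndcg_linear = ndcg(scores_u)                              # linear gains, log base 2
ndcg_exp = ndcg(scores_u, gains_fn=lambda r: 2 ** r - 1)  # exponential gains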

PARAMETER DESCRIPTION
gains

type of gain function to use when calculating the DCG score; the possible options are "linear" or "exponential"

TYPE: str DEFAULT: 'linear'

discount_log

logarithm function to use when calculating the DCG score; by default, the numpy logarithm in base 2 is used

TYPE: Callable DEFAULT: np.log2

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, gains: str = "linear", discount_log: Callable = np.log2):
    self.gains = gains
    self.discount_log = discount_log

    if self.gains == "exponential":
        self.gains_fn = lambda r: 2 ** r - 1
    elif self.gains == "linear":
        self.gains_fn = lambda r: r
    else:
        raise ValueError("Invalid gains option.")

NDCGAtK(k, gains='linear', discount_log=np.log2)

Bases: NDCG

The NDCG@K (Normalized Discounted Cumulative Gain at K) metric is calculated for the single user by using the framework implementation of the NDCG but considering \(scores_{u}\) cut at the first \(k\) predictions
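
Under that description, and reusing the ndcg sketch from the NDCG section above (illustrative only), the cutoff amounts to truncating the scores before the computation:

import numpy as np

def ndcg_at_k(scores, k, **kwargs):
    # Only the truth scores of the first k recommended items are considered
    return ndcg(scores[:k], **kwargs)

ndcg_at_2 = ndcg_at_k(np.array([3.0, 5.0, 2.0, 4.0]), k=2)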

PARAMETER DESCRIPTION
k

the cutoff parameter

TYPE: int

gains

type of gain function to use when calculating the DCG score; the possible options are "linear" or "exponential"

TYPE: str DEFAULT: 'linear'

discount_log

logarithm function to use when calculating the DCG score; by default, the numpy logarithm in base 2 is used

TYPE: Callable DEFAULT: np.log2

Source code in clayrs/evaluation/metrics/ranking_metrics.py
def __init__(self, k: int, gains: str = "linear", discount_log: Callable = np.log2):
    super().__init__(gains, discount_log)

    self._k = k