
Fairness metrics

Fairness metrics evaluate how unbiased the recommendation lists are (e.g. unbiased towards popularity of the items)

CatalogCoverage(catalog, top_n=None, k=None)

Bases: PredictionCoverage

The Catalog Coverage metric measures, as a percentage, how many distinct items are being recommended in relation to all available items. It's a system-wide metric, so only a single result will be returned rather than one per user. It differs from Prediction Coverage in that it accepts additional parameters: if no parameter is passed, it reduces to a simple Prediction Coverage. The metric is calculated as such:

\[ CatalogCoverage_{sys} = \left(\frac{|\bigcup_{j=1...N}reclist(u_j)|}{|I|}\right)\cdot100 \]

Where:

  • \(N\) is the total number of users
  • \(reclist(u_j)\) is the set of items contained in the recommendation list of user \(j\)
  • \(I\) is the set of all available items

The set \(I\) must be specified through the 'catalog' parameter.

The recommendation list of every user (\(reclist(u_j)\)) can be truncated to its first n items with the top_n parameter, so that catalog coverage is measured considering only the highest ranked items.

With the 'k' parameter one could specify the number of users that will be used to calculate catalog coverage: k users will be randomly sampled and their recommendation lists will be used. The formula above becomes:

\[ CatalogCoverage_{sys} = \left(\frac{|\bigcup_{j=1...k}reclist(u_j)|}{|I|}\right)\cdot100 \]

Where:

  • \(k\) is the parameter specified

Note that 'k' should be less than \(N\); otherwise the recommendation lists of all users will simply be used

Check the 'Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity' paper and page 13 of the 'Comparison of group recommendation algorithms' paper for more details
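As a worked illustration of the formula above, here is a minimal sketch in plain Python (not the library's API; the function name and data are made up for this example):

```python
# Illustrative sketch of the CatalogCoverage formula: percentage of catalog
# items that appear in at least one recommendation list.
from typing import Dict, List, Optional, Set

def catalog_coverage(recs: Dict[str, List[str]], catalog: Set[str],
                     top_n: Optional[int] = None) -> float:
    recommended = set()
    for reclist in recs.values():
        # Optionally keep only the first top_n items of each list
        cut = reclist[:top_n] if top_n is not None else reclist
        recommended.update(cut)
    return (len(recommended & catalog) / len(catalog)) * 100

recs = {'u1': ['i1', 'i2'], 'u2': ['i2', 'i3']}
catalog = {'i1', 'i2', 'i3', 'i4'}
print(catalog_coverage(recs, catalog))           # 75.0
print(catalog_coverage(recs, catalog, top_n=1))  # 50.0
```

With `top_n=1` only the top-ranked item of each list counts, so the covered portion of the catalog shrinks from 3 items out of 4 to 2 out of 4.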

PARAMETER DESCRIPTION
catalog

set of item id of the catalog on which the prediction coverage must be computed

TYPE: Set[str]

top_n

it's a cutoff parameter; if specified, the Catalog Coverage will be calculated considering only the first 'n' items of every recommendation list of all users. Default is None

TYPE: int DEFAULT: None

k

number of users randomly sampled. If specified, k users will be randomly sampled across all users and only their recommendation lists will be used to compute the CatalogCoverage

TYPE: int DEFAULT: None

Source code in clayrs/evaluation/metrics/fairness_metrics.py
def __init__(self, catalog: Set[str], top_n: int = None, k: int = None):
    super().__init__(catalog)
    self.__top_n = top_n
    self.__k = k

DeltaGap(user_groups, user_profiles, original_ratings, top_n=None, pop_percentage=0.2)

Bases: GroupFairnessMetric

The Delta GAP (Group Average Popularity) metric lets you compare the average popularity "requested" by one or more groups of users with the average popularity "obtained" through the recommendations given by the recsys. It's a system-wide metric, and a result is returned for every group.

It is calculated as such:

\[ \Delta GAP = \frac{recs\_GAP - profile\_GAP}{profile\_GAP} \]

Users are split into groups based on the user_groups parameter, which contains the names of the groups as keys and the percentage of users each group must contain as values. For example:

user_groups = {'popular_users': 0.3, 'medium_popular_users': 0.2, 'low_popular_users': 0.5}

Every user will be assigned to a group based on how many popular items they have rated (in relation to the percentages specified as values in the dictionary):

  • users who rated many popular items will be placed in the first group
  • users who rated mostly niche items will be placed in one of the last groups.

In general users are grouped by \(Popularity\_ratio\) in a descending order. \(Popularity\_ratio\) for a single user \(u\) is defined as:

\[ Popularity\_ratio_u = n\_most\_popular\_items\_rated_u / n\_items\_rated_u \]

The most popular items are the first pop_percentage% of all items, ordered by popularity in descending order.

The popularity of an item is defined as the number of times it is rated in the original_ratings parameter divided by the total number of users in the original_ratings.

It can happen that no recommendations are available for a particular user of a group: in that case the user will be skipped and won't be considered in the \(\Delta GAP\) computation of its group. If no user of a group has recommendations available, a warning will be printed and the whole group won't be considered.

If the 'top_n' parameter is specified, then the \(\Delta GAP\) will be calculated considering only the first n items of every recommendation list of all users
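A minimal sketch of the computation for a single group may help (plain Python, not the library's API; the `gap` helper and all data below are made up for illustration):

```python
# GAP of a group is the mean, over its users, of the average popularity of
# the items they rated (profile GAP) or were recommended (recs GAP).
from typing import Dict, List

def gap(items_by_user: Dict[str, List[str]], pop: Dict[str, float]) -> float:
    avg_pops = [sum(pop[i] for i in items) / len(items)
                for items in items_by_user.values()]
    return sum(avg_pops) / len(avg_pops)

pop = {'i1': 0.9, 'i2': 0.5, 'i3': 0.1}       # item popularities
profile = {'u1': ['i1', 'i2'], 'u2': ['i3']}  # items rated by users
recs = {'u1': ['i1'], 'u2': ['i1', 'i2']}     # items recommended to users

profile_gap = gap(profile, pop)  # ~0.4
recs_gap = gap(recs, pop)        # ~0.8
delta_gap = (recs_gap - profile_gap) / profile_gap
print(round(delta_gap, 6))  # 1.0: recs are, on average, twice as popular as profiles
```

A positive \(\Delta GAP\) means the recommender amplifies popularity relative to what the group's profiles suggest; a negative one means it under-serves popular items.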

PARAMETER DESCRIPTION
user_groups

Dict containing group names as keys and percentage of users as value, used to split users in groups. Users with more popular items rated are grouped into the first group, users with slightly less popular items rated are grouped into the second one, etc.

TYPE: Dict[str, float]

user_profiles

one or more Ratings objects containing interactions of the profile of each user (e.g. the train set). It should be one for each split to evaluate!

TYPE: Union[list, Ratings]

original_ratings

Ratings object containing original interactions of the dataset that will be used to compute the popularity of each item (i.e. the number of times it is rated divided by the total number of users)

TYPE: Ratings

top_n

it's a cutoff parameter; if specified, the \(\Delta GAP\) will be calculated considering only the first 'n' items of every recommendation list of all users. Default is None

TYPE: int DEFAULT: None

pop_percentage

How many (in percentage) most popular items must be considered. Default is 0.2

TYPE: float DEFAULT: 0.2

Source code in clayrs/evaluation/metrics/fairness_metrics.py
def __init__(self, user_groups: Dict[str, float], user_profiles: Union[list, Ratings], original_ratings: Ratings,
             top_n: int = None, pop_percentage: float = 0.2):
    if not 0 < pop_percentage <= 1:
        raise ValueError('Incorrect percentage! Valid percentage range: 0 < percentage <= 1')

    super().__init__(user_groups)
    self._pop_by_item = get_item_popularity(original_ratings)

    if not isinstance(user_profiles, list):
        user_profiles = [user_profiles]
    self._user_profiles = user_profiles
    self.__top_n = top_n
    self._pop_percentage = pop_percentage

calculate_delta_gap(recs_gap, profile_gap) staticmethod

Compute the ratio between the recommendation gap and the user profiles gap

PARAMETER DESCRIPTION
recs_gap

recommendation gap

TYPE: float

profile_gap

user profiles gap

TYPE: float

RETURNS DESCRIPTION
score

delta gap measure

TYPE: float

Source code in clayrs/evaluation/metrics/fairness_metrics.py
@staticmethod
def calculate_delta_gap(recs_gap: float, profile_gap: float) -> float:
    """
    Compute the ratio between the recommendation gap and the user profiles gap

    Args:
        recs_gap: recommendation gap
        profile_gap: user profiles gap

    Returns:
        score: delta gap measure
    """
    result = 0
    if profile_gap != 0.0:
        result = (recs_gap - profile_gap) / profile_gap

    return result
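For quick intuition, the helper can be exercised standalone; the re-statement below mirrors the source shown above (values are made up):

```python
# Standalone re-statement of calculate_delta_gap, including its guard
# against a zero profile gap (which would cause division by zero).
def calculate_delta_gap(recs_gap: float, profile_gap: float) -> float:
    return (recs_gap - profile_gap) / profile_gap if profile_gap != 0.0 else 0

print(calculate_delta_gap(0.75, 0.25))  # 2.0: a 200% relative increase in popularity
print(calculate_delta_gap(0.5, 0.0))    # 0: guarded, no profile gap to compare against
```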

calculate_gap(group, avg_pop_by_users) staticmethod

Compute the GAP (Group Average Popularity) formula

\[ GAP = \frac{\sum_{u \in G} \frac{\sum_{i \in i_u} pop_i}{|i_u|}}{|G|} \]

Where:

  • \(G\) is the set of users
  • \(i_u\) is the set of items rated/recommended by/to user \(u\)
  • \(pop_i\) is the popularity of item \(i\)

PARAMETER DESCRIPTION
group

the set of users (user_id)

TYPE: Set[str]

avg_pop_by_users

average popularity by user

TYPE: Dict[str, object]

RETURNS DESCRIPTION
score

gap score

TYPE: float

Source code in clayrs/evaluation/metrics/fairness_metrics.py
@staticmethod
def calculate_gap(group: Set[str], avg_pop_by_users: Dict[str, object]) -> float:
    r"""
    Compute the GAP (Group Average Popularity) formula

    $$
    GAP = \frac{\sum_{u \in G} \frac{\sum_{i \in i_u} pop_i}{|i_u|}}{|G|}
    $$

    Where:

    - $G$ is the set of users
    - $i_u$ is the set of items rated/recommended by/to user $u$
    - $pop_i$ is the popularity of item i

    Args:
        group: the set of users (user_id)
        avg_pop_by_users: average popularity by user

    Returns:
        score (float): gap score
    """
    total_pop = 0
    for user in group:
        if avg_pop_by_users.get(user):
            total_pop += avg_pop_by_users[user]
    return total_pop / len(group)
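A quick standalone check of the behavior (made-up values): users missing from avg_pop_by_users contribute nothing to the numerator, but the denominator remains the full group size.

```python
# Mirrors calculate_gap: 'u3' has no average popularity available, yet the
# group size in the denominator is still 3.
group = {'u1', 'u2', 'u3'}
avg_pop_by_users = {'u1': 0.6, 'u2': 0.3}
gap_score = sum(avg_pop_by_users.get(u, 0) for u in group) / len(group)
print(round(gap_score, 2))  # 0.3
```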

GiniIndex(top_n=None)

Bases: FairnessMetric

The Gini Index metric measures inequality in recommendation lists. It's a system-wide metric, so only a single result will be returned rather than one per user. The metric is calculated as such:

\[ Gini_{sys} = \frac{\sum_i(2i - n - 1)x_i}{n\cdot\sum_i x_i} \]

Where:

  • \(n\) is the total number of distinct items that are being recommended
  • \(x_i\) is the number of times that the item \(i\) has been recommended

A perfectly equal recommender system would recommend every item the same number of times, in which case the Gini index would be equal to 0. The more unequal the recommendation distribution, the closer the Gini index is to 1

If the 'top_n' parameter is specified, then the Gini index will measure inequality considering only the first n items of every recommendation list of all users
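A hedged sketch of the formula above (not the library's implementation; it assumes the counts \(x_i\) are indexed in ascending order, as the standard Gini formula requires):

```python
# Sketch of the Gini formula: x_i are per-item recommendation counts,
# sorted in ascending order; i runs from 1 to n.
from collections import Counter
from typing import Dict, List

def gini_index(recs: Dict[str, List[str]]) -> float:
    counts = sorted(Counter(i for lst in recs.values() for i in lst).values())
    n, total = len(counts), sum(counts)
    return sum((2 * i - n - 1) * x
               for i, x in enumerate(counts, start=1)) / (n * total)

# Perfect equality: both items recommended the same number of times -> 0.0
print(gini_index({'u1': ['a', 'b'], 'u2': ['a', 'b']}))
# Skewed: 'a' recommended 4 times, 'b' once -> 0.3
print(gini_index({'u1': ['a'], 'u2': ['a'], 'u3': ['a'], 'u4': ['a', 'b']}))
```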

PARAMETER DESCRIPTION
top_n

it's a cutoff parameter; if specified, the Gini index will be calculated considering only the first 'n' items of every recommendation list of all users. Default is None

TYPE: int DEFAULT: None

Source code in clayrs/evaluation/metrics/fairness_metrics.py
def __init__(self, top_n: int = None):
    self.__top_n = top_n

GroupFairnessMetric(user_groups)

Bases: FairnessMetric

Abstract class for fairness metrics based on user groups

It has some concrete methods useful for group divisions, since every subclass needs to split users into groups.

PARAMETER DESCRIPTION
user_groups

Dict containing group names as keys and percentage of users as value, used to split users in groups. Users with more popular items rated are grouped into the first group, users with slightly less popular items rated are grouped into the second one, etc.

TYPE: Dict[str, float]

Source code in clayrs/evaluation/metrics/fairness_metrics.py
def __init__(self, user_groups: Dict[str, float]):
    self.__user_groups = user_groups

get_avg_pop_by_users(data, pop_by_items, group=None) staticmethod

Get the average popularity for each user in the data parameter.

Average popularity of a single user \(u\) is defined as:

\[ avg\_pop_u = \frac{\sum_{i \in i_u} pop_i}{|i_u|} \]

PARAMETER DESCRIPTION
data

The Ratings object that will be used to compute average popularity of each user

TYPE: Ratings

pop_by_items

dictionary mapping each item id ('label') to its popularity

TYPE: Dict

group

(optional) the set of users (user_id)

TYPE: Set[str] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, float]

Python dictionary containing as keys each user id and as values the average popularity of each user

Source code in clayrs/evaluation/metrics/fairness_metrics.py
@staticmethod
def get_avg_pop_by_users(data: Ratings, pop_by_items: Dict, group: Set[str] = None) -> Dict[str, float]:
    r"""
    Get the average popularity for each user in the `data` parameter.

    Average popularity of a single user $u$ is defined as:

    $$
    avg\_pop_u = \frac{\sum_{i \in i_u} pop_i}{|i_u|}
    $$

    Args:
        data: The `Ratings` object that will be used to compute average popularity of each user
        pop_by_items: popularity for each label ('label', 'popularity')
        group: (optional) the set of users (user_id)

    Returns:
        Python dictionary containing as keys each user id and as values the average popularity of each user
    """
    if group is None:
        group = data.unique_user_id_column
        group_int = data.unique_user_idx_column
    else:
        group_int = data.user_map.convert_seq_str2int(list(group))

    avg_pop_by_users = []

    for user_idx in group_int:
        user_interactions_rows = data.get_user_interactions(user_idx, as_indices=True)
        user_items = data.item_id_column[user_interactions_rows]

        avg_pop_by_users.append(get_avg_pop(user_items, pop_by_items))

    avg_pop_by_users = dict(zip(group, avg_pop_by_users))

    return avg_pop_by_users

split_user_in_groups(score_frame, groups, pop_items) staticmethod

Users are split into groups based on the groups parameter, which contains the names of the groups as keys and the percentage of users each group must contain as values. For example:

groups = {'popular_users': 0.3, 'medium_popular_users': 0.2, 'low_popular_users': 0.5}

Every user will be assigned to a group based on how many popular items they have rated (in relation to the percentages specified as values in the dictionary):

  • users who rated many popular items will be placed in the first group
  • users who rated mostly niche items will be placed in one of the last groups.

In general users are grouped by \(Popularity\_ratio\) in a descending order. \(Popularity\_ratio\) for a single user \(u\) is defined as:

\[ Popularity\_ratio_u = n\_most\_popular\_items\_rated_u / n\_items\_rated_u \]

The most popular items are the first pop_percentage% of all items, ordered by popularity in descending order.

The popularity of an item is defined as the number of times it is rated in the original_ratings parameter divided by the total number of users in the original_ratings.
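A hedged sketch of the grouping logic described above (the example data is made up; users are ranked by popularity ratio in descending order, then sliced at cumulative percentage cut points):

```python
# Popularity_ratio per user = popular items rated / total items rated;
# groups are formed by slicing the ranked user list at cumulative cuts.
ratings = {'u1': ['p1', 'p2'], 'u2': ['p1', 'n1'],
           'u3': ['p2', 'n1', 'n2'], 'u4': ['n1', 'n2']}
pop_items = {'p1', 'p2'}  # the most popular items

pop_ratio = {u: sum(i in pop_items for i in items) / len(items)
             for u, items in ratings.items()}
ranked = sorted(pop_ratio, key=pop_ratio.get, reverse=True)

groups = {'popular': 0.5, 'diverse': 0.5}
result, last, cum = {}, 0, 0.0
for name, pct in groups.items():
    cum += pct
    idx = round(len(ranked) * cum)  # cumulative cut index
    result[name] = set(ranked[last:idx])
    last = idx

print(sorted(result['popular']))  # ['u1', 'u2']
print(sorted(result['diverse']))  # ['u3', 'u4']
```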

PARAMETER DESCRIPTION
score_frame

the Ratings object

TYPE: Ratings

groups

each key contains the name of the group and each value contains the percentage of the specified group. If the groups don't cover the entire user collection, the rest of the users are considered in a 'default_diverse' group

TYPE: Dict[str, float]

pop_items

set of most popular item_id labels

TYPE: Set[str]

RETURNS DESCRIPTION
Dict[str, Set[str]]

A python dictionary containing as keys each group name and as values the set of user_id belonging to the particular group.

Source code in clayrs/evaluation/metrics/fairness_metrics.py
@staticmethod
def split_user_in_groups(score_frame: Ratings, groups: Dict[str, float],
                         pop_items: Set[str]) -> Dict[str, Set[str]]:
    r"""
    Users are split into groups based on the *groups* parameter, which contains the names of the groups as keys
    and the percentage of users each group must contain as values. For example:

        groups = {'popular_users': 0.3, 'medium_popular_users': 0.2, 'low_popular_users': 0.5}

    Every user will be assigned to a group based on how many popular items they have rated (in relation to the
    percentages specified as values in the dictionary):

    * users with many popular items will be inserted into the first group
    * users with niche items rated will be inserted into one of the last groups.

    In general users are grouped by $Popularity\_ratio$ in a descending order. $Popularity\_ratio$ for a
    single user $u$ is defined as:

    $$
    Popularity\_ratio_u = n\_most\_popular\_items\_rated_u / n\_items\_rated_u
    $$

    The *most popular items* are the first `pop_percentage`% items of all items ordered in a descending order by
    popularity.

    The popularity of an item is defined as the number of times it is rated in the `original_ratings` parameter
    divided by the total number of users in the `original_ratings`.

    Args:
        score_frame: the Ratings object
        groups: each key contains the name of the group and each value contains the
            percentage of the specified group. If the groups don't cover the entire user collection,
            the rest of the users are considered in a 'default_diverse' group
        pop_items: set of most popular *item_id* labels

    Returns:
        A python dictionary containing as keys each group name and as values the set of *user_id* belonging to
            the particular group.
    """
    num_of_users = len(score_frame.unique_user_id_column)
    if num_of_users < len(groups):
        raise NotEnoughUsers("You can't split in {} groups {} users! "
                             "Try reducing number of groups".format(len(groups), num_of_users))

    for percentage_chosen in groups.values():
        if not 0 < percentage_chosen <= 1:
            raise ValueError('Incorrect percentage! Valid percentage range: 0 < percentage <= 1')
    total = sum(groups.values())
    if total > 1:
        raise ValueError("Incorrect percentage! Sum of percentage is > than 1")
    elif total < 1:
        raise ValueError("Sum of percentage is < than 1! Please add another group or redistribute percentages "
                         "among already defined group to reach a total of 1!")

    pop_ratio_by_users = pop_ratio_by_user(score_frame, most_pop_items=pop_items)
    pop_ratio_by_users = sorted(pop_ratio_by_users, key=pop_ratio_by_users.get, reverse=True)

    groups_dict: Dict[str, Set[str]] = {}
    last_index = 0
    percentage = 0.0
    for group_name in groups:
        percentage += groups[group_name]
        group_index = round(num_of_users * percentage)
        if group_index == 0:
            logger.warning('Not enough rows for group {}! It will be discarded'.format(group_name))
        else:
            groups_dict[group_name] = set(pop_ratio_by_users[last_index:group_index])
            last_index = group_index
    return groups_dict

PredictionCoverage(catalog)

Bases: FairnessMetric

The Prediction Coverage metric measures, as a percentage, how many distinct items are being recommended in relation to all available items. It's a system-wide metric, so only a single result will be returned rather than one per user. The metric is calculated as such:

\[ PredictionCoverage_{sys} = \left(\frac{|I_p|}{|I|}\right)\cdot100 \]

Where:

  • \(I\) is the set of all available items
  • \(I_p\) is the set of recommended items

The set \(I\) must be specified through the 'catalog' parameter

Check the 'Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity' paper for more details

PARAMETER DESCRIPTION
catalog

set of item id of the catalog on which the prediction coverage must be computed

TYPE: Set[str]

Source code in clayrs/evaluation/metrics/fairness_metrics.py
def __init__(self, catalog: Set[str]):
    self.__catalog = set(str(item_id) for item_id in catalog)