Abstract Partitioning class

Partitioning(skip_user_error=True)

Bases: ABC

Abstract class for partitioning techniques. Each subclass must implement the split_single() method, which specifies how the data of a single user is split

PARAMETER DESCRIPTION
skip_user_error

If set to True, users whose data can't be split are skipped, and a single warning reporting the number of skipped users is logged at the end of the split process. Otherwise, a ValueError exception is raised

TYPE: bool DEFAULT: True
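
The skip-or-raise behavior described above can be illustrated with a minimal, library-independent sketch. The names `split_users` and `needs_two_rows` are hypothetical and are not part of ClayRS; plain Python lists of (user, item, rating) tuples stand in for the real data structures.

```python
import logging

logger = logging.getLogger(__name__)


def split_users(ratings_by_user, split_fn, skip_user_error=True):
    """Apply split_fn to each user's rows, skipping or re-raising on ValueError.

    Hypothetical helper: it only mirrors the documented semantics of
    `skip_user_error`, not the library's actual code path.
    """
    results = {}
    skipped = 0
    for user, rows in ratings_by_user.items():
        try:
            results[user] = split_fn(rows)
        except ValueError:
            if not skip_user_error:
                # skip_user_error=False: the error reaches the caller
                raise
            # skip_user_error=True: count the user and move on
            skipped += 1

    if skipped > 0:
        # a single warning is logged once all users have been processed
        logger.warning("%d users skipped because partitioning couldn't be performed", skipped)

    return results, skipped


def needs_two_rows(rows):
    """Hypothetical split: last interaction goes to the test set."""
    if len(rows) < 2:
        raise ValueError("at least two interactions are needed")
    return rows[:-1], rows[-1:]
```

With `skip_user_error=True`, a user with a single interaction is dropped from the result and counted; with `skip_user_error=False`, the same input raises a ValueError at the first such user.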

Source code in clayrs/recsys/partitioning.py
def __init__(self, skip_user_error: bool = True):
    self.__skip_user_error = skip_user_error

split_all(ratings_to_split, user_list=None)

Concrete method that splits, for every user in the user column of ratings_to_split, the original ratings into train set and test set. If the user_list parameter is set, splitting is performed only for the users it contains (users can be specified as strings or with their mapped integer).

The method returns two lists:

  • The first contains the train set of each split (more than one when the partitioning technique produces several splits, e.g. KFold)
  • The second contains the test set of each split (more than one when the partitioning technique produces several splits, e.g. KFold)

The two lists have the same length: the train set at position \(i\) corresponds to the test set at position \(i\)
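
The pairing between the two returned lists can be sketched as follows. This is a simplified illustration under stated assumptions: `regroup_by_split` is a hypothetical helper, and flat Python lists stand in for the uir matrices and Ratings objects that the real method stacks together.

```python
from collections import defaultdict


def regroup_by_split(per_user_splits):
    """Regroup per-user splits by split number.

    per_user_splits: one (train_list, test_list) pair per user, where
    train_list[i] and test_list[i] belong to split i of that user.
    Returns (train_sets, test_sets); position i covers all users for split i.
    """
    grouped = defaultdict(lambda: defaultdict(list))

    for user_train_list, user_test_list in per_user_splits:
        pairs = zip(user_train_list, user_test_list)
        for split_number, (single_train, single_test) in enumerate(pairs):
            # rows of every user end up in the bucket of the same split number
            grouped[split_number]["train"].extend(single_train)
            grouped[split_number]["test"].extend(single_test)

    train_sets = [grouped[split]["train"] for split in grouped]
    test_sets = [grouped[split]["test"] for split in grouped]
    return train_sets, test_sets


# two users, two splits each (e.g. what a 2-fold technique could produce)
u1 = ([["a", "b"], ["b", "c"]], [["c"], ["a"]])
u2 = ([["x"], ["y"]], [["y"], ["x"]])
train_sets, test_sets = regroup_by_split([u1, u2])
```

Here `train_sets[0]` and `test_sets[0]` both refer to the first split, `train_sets[1]` and `test_sets[1]` to the second, so iterating with `zip(train_sets, test_sets)` yields matching train/test pairs.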

PARAMETER DESCRIPTION
ratings_to_split

Ratings object which contains the interactions of the users that must be split into train set and test set

TYPE: Ratings

user_list

The set of users for which splitting will be done. If set, splitting is performed only for the users it contains; otherwise, it is performed for all users in ratings_to_split. Users can be specified by their string id or by their mapped integer

TYPE: Union[Set[int], Set[str]] DEFAULT: None

RAISES DESCRIPTION
ValueError

if skip_user_error=False in the constructor and for at least one user splitting can't be performed

Source code in clayrs/recsys/partitioning.py
def split_all(self, ratings_to_split: Ratings,
              user_list: Union[Set[int], Set[str]] = None) -> Tuple[List[Ratings], List[Ratings]]:
    """
    Concrete method that splits, for every user in the user column of `ratings_to_split`, the original ratings
    into *train set* and *test set*.
    If a `user_list` parameter is set, the method will do the splitting only for the users
    specified inside the list (Users can be specified as *strings* or with their mapped *integer*).

    The method returns two lists:

    * The first contains the train set of each split (more than one when the partitioning technique
    produces several splits, e.g. KFold)
    * The second contains the test set of each split (more than one when the partitioning technique
    produces several splits, e.g. KFold)

    The two lists have the same length: the *train set* at position $i$ corresponds to the
    *test set* at position $i$

    Args:
        ratings_to_split: `Ratings` object which contains the interactions of the users that must be split
            into *train set* and *test set*
        user_list: The Set of users for which splitting will be done. If set, splitting will be performed only
            for users inside the list. Otherwise, splitting will be performed for all users in `ratings_to_split`
            parameter. User can be specified with their string id or with their mapped integer

    Raises:
        ValueError: if `skip_user_error=False` in the constructor and for at least one user splitting
            can't be performed
    """

    # convert user list to list of int if necessary (strings are passed)
    if user_list is not None:
        all_users = np.array(list(user_list))
        if np.issubdtype(all_users.dtype, str):
            all_users = ratings_to_split.user_map.convert_seq_str2int(all_users)

        all_users = set(all_users)
    else:
        all_users = set(ratings_to_split.unique_user_idx_column)

    # {
    #   0: {'train': [u1_uir, u2_uir],
    #       'test': [u1_uir, u2_uir]},
    #
    #   1: {'train': [u1_uir, u2_uir],
    #       'test': [u1_uir, u2_uir]}
    # }
    train_test_dict = defaultdict(lambda: defaultdict(list))
    error_count = 0

    with get_progbar(all_users) as pbar:

        pbar.set_description("Performing {}".format(str(self)))
        for user_idx in pbar:
            user_ratings = ratings_to_split.get_user_interactions(user_idx)
            try:
                user_train_list, user_test_list = self.split_single(user_ratings)
                for split_number, (single_train, single_test) in enumerate(zip(user_train_list, user_test_list)):

                    train_test_dict[split_number]['train'].append(single_train)
                    train_test_dict[split_number]['test'].append(single_test)

            except ValueError as e:
                if self.skip_user_error:
                    error_count += 1
                    continue
                else:
                    raise e from None

    if error_count > 0:
        logger.warning(f"{error_count} users will be skipped because partitioning couldn't be performed\n"
                       f"Change this behavior by setting `skip_user_error` to False")

    train_list = [Ratings.from_uir(np.vstack(train_test_dict[split]['train']),
                                   ratings_to_split.user_map, ratings_to_split.item_map)
                  for split in train_test_dict]

    test_list = [Ratings.from_uir(np.vstack(train_test_dict[split]['test']),
                                  ratings_to_split.user_map, ratings_to_split.item_map)
                 for split in train_test_dict]

    return train_list, test_list

split_single(uir_user) abstractmethod

Abstract method in which each partitioning technique must specify how to split data for a single user

PARAMETER DESCRIPTION
uir_user

uir matrix containing interactions of a single user

TYPE: np.ndarray

RETURNS DESCRIPTION
List[np.ndarray]

The first list contains a uir matrix for each split constituting the train set of the user

List[np.ndarray]

The second list contains a uir matrix for each split constituting the test set of the user
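
As a rough illustration of the contract above, a concrete technique could look like the sketch below. It is a minimal sketch under stated assumptions, not the library's own implementation of any technique: `PartitioningSketch` and `HoldOutSketch` are hypothetical stand-ins, and a list of (user, item, rating) tuples replaces the np.ndarray uir matrix so the example has no dependencies.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

Row = Tuple[int, int, float]  # hypothetical (user, item, rating) row


class PartitioningSketch(ABC):
    """Stand-in for the abstract class: only split_single is modeled."""

    @abstractmethod
    def split_single(self, uir_user: List[Row]) -> Tuple[List[List[Row]], List[List[Row]]]:
        raise NotImplementedError


class HoldOutSketch(PartitioningSketch):
    """Single split: the first `train_size` fraction of rows is the train set."""

    def __init__(self, train_size: float = 0.8):
        if not 0 < train_size < 1:
            raise ValueError("train_size must be in (0, 1)")
        self.train_size = train_size

    def split_single(self, uir_user):
        cut = int(len(uir_user) * self.train_size)
        if cut == 0 or cut == len(uir_user):
            # signal with ValueError that this user can't be split
            raise ValueError("user has too few interactions to be split")
        # one element per split: this technique produces exactly one split
        return [uir_user[:cut]], [uir_user[cut:]]
```

Both returned lists contain one entry per split, so a technique producing \(k\) splits would return two lists of length \(k\). Raising ValueError for an unsplittable user matches the exception type that split_all catches when skip_user_error is enabled.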

Source code in clayrs/recsys/partitioning.py
@abc.abstractmethod
def split_single(self, uir_user: np.ndarray) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    """
    Abstract method in which each partitioning technique must specify how to split data for a single user

    Args:
        uir_user: uir matrix containing interactions of a single user

    Returns:
        The first list contains a uir matrix for each split constituting the *train set* of the user

        The second list contains a uir matrix for each split constituting the *test set* of the user
    """
    raise NotImplementedError