Skip to content

Feature Selection

Via the feature_selecter function you are able to perform feature selection on a given graph, by keeping properties that are the most important according to a given feature selection algorithm. Check the documentation of the method for more and for a usage example

feature_selector(graph, fs_algorithm_user=None, fs_algorithm_item=None, user_target_nodes=None, item_target_nodes=None, inplace=False)

Given a FullGraph, this method performs feature selection on it and returns the "reduced" graph.

You can choose to reduce only user properties (evaluate the fs_algorithm_user parameter), to reduce only item properties (evaluate the fs_algorithm_item parameter) or both (evaluate the fs_algorithm_user parameter and the fs_algorithm_item parameter). You can also choose different feature selection algorithms* for users and items.

You can also define a custom list of user and item nodes:

  • In this case only properties of those nodes will be considered during the feature selection process (instead of using properties of all users and items)

This function changes a copy of the original graph by default, but you can change this behaviour with the inplace parameter.

Examples:

# create a full graph
full_graph = rs.NXFullGraph(ratings,
                             user_contents_dir='users_codified/', # (1)
                             item_contents_dir='movies_codified/', # (2)
                             user_exo_properties={0}, # (3)
                             item_exo_properties={'dbpedia'}, # (4)
                             link_label='score')

 # perform feature selection by keeping only top 5 property labels
 # according to page rank algorithm
 fs_graph = rs.feature_selector(full_graph,
                                fs_algorithm_item=rs.TopKPageRank(k=5))
PARAMETER DESCRIPTION
graph

Original graph on which feature selection will be performed

TYPE: FullDiGraph

fs_algorithm_user

FeatureSelectionAlgorithm that will be performed on user properties. Can be different from fs_algorithm_item

TYPE: FeatureSelectionAlgorithm DEFAULT: None

fs_algorithm_item

FeatureSelectionAlgorithm that will be performed on item properties. Can be different from fs_algorithm_user

TYPE: FeatureSelectionAlgorithm DEFAULT: None

user_target_nodes

List of user nodes to consider in the feature selection process: only properties of user nodes in this list will be "reduced"

TYPE: list DEFAULT: None

item_target_nodes

List of item nodes to consider in the feature selection process: only properties of item nodes in this list will be "reduced"

TYPE: list DEFAULT: None

inplace

Boolean parameter that let you choose if changes must be performed on the original graph (inplace=True) or on its copy (inplace=False). Default is False

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
FullDiGraph

Copy of the original graph from which the less important Property nodes (the ones having edges with less important property labels) will be removed

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_fn.py
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
def feature_selector(graph: FullDiGraph,
                     fs_algorithm_user: FeatureSelectionAlgorithm = None,
                     fs_algorithm_item: FeatureSelectionAlgorithm = None,
                     user_target_nodes: Iterable[UserNode] = None,
                     item_target_nodes: Iterable[ItemNode] = None,
                     inplace: bool = False) -> FullDiGraph:
    """
    Given a FullGraph, this method performs feature selection on it and returns the "reduced" graph.

    You can choose to reduce only *user properties* (*evaluate the `fs_algorithm_user` parameter*),
    to reduce only *item properties* (*evaluate the `fs_algorithm_item` parameter*) or both (*evaluate
    the `fs_algorithm_user` parameter* and the `fs_algorithm_item` parameter*).
    You can also choose different *feature selection algorithms* for users and items.

    You can also define a custom list of user and item nodes:

    * In this case only properties of those nodes will be considered during the feature selection process (instead of
    using properties of all users and items)

    This function changes a *copy* of the original graph by default, but you can change this behaviour with the
    `inplace` parameter.

    Examples:

        ```python
        # create a full graph
        full_graph = rs.NXFullGraph(ratings,
                                     user_contents_dir='users_codified/', # (1)
                                     item_contents_dir='movies_codified/', # (2)
                                     user_exo_properties={0}, # (3)
                                     item_exo_properties={'dbpedia'}, # (4)
                                     link_label='score')

         # perform feature selection by keeping only top 5 property labels
         # according to page rank algorithm
         fs_graph = rs.feature_selector(full_graph,
                                        fs_algorithm_item=rs.TopKPageRank(k=5))
        ```

    Args:
        graph: Original graph on which feature selection will be performed
        fs_algorithm_user: FeatureSelectionAlgorithm that will be performed on user properties. Can be different from
            `fs_algorithm_item`
        fs_algorithm_item: FeatureSelectionAlgorithm that will be performed on item properties. Can be different from
            `fs_algorithm_user`
        user_target_nodes (list): List of user nodes to consider in the feature selection process: only properties
            of user nodes in this list will be "reduced"
        item_target_nodes (list): List of item nodes to consider in the feature selection process: only properties
            of item nodes in this list will be "reduced"
        inplace: Boolean parameter that let you choose if changes must be performed on the original graph
            (`inplace=True`) or on its copy (`inplace=False`). Default is False

    Returns:
        Copy of the original graph from which the less important Property nodes (the ones having edges with less
            important property labels) will be removed
    """
    if fs_algorithm_user is not None and user_target_nodes is None:
        user_target_nodes = graph.user_nodes

    if fs_algorithm_item is not None and item_target_nodes is None:
        item_target_nodes = graph.item_nodes

    property_labels_to_remove = list()
    user_fs_failed = False
    item_fs_failed = False

    if fs_algorithm_user is not None:
        logger.info("Performing Feature Selection on users")
        try:
            user_props_to_remove = fs_algorithm_user.perform(graph, list(user_target_nodes), mode='to_remove')
            property_labels_to_remove.extend(user_props_to_remove)
        except FeatureSelectionException as e:
            logger.warning(str(e) + "!\nUsers original properties will be kept")
            user_fs_failed = True

    if fs_algorithm_item is not None:
        logger.info("Performing Feature Selection on items")
        try:
            item_props_to_remove = fs_algorithm_item.perform(graph, list(item_target_nodes), mode='to_remove')
            property_labels_to_remove.extend(item_props_to_remove)
        except FeatureSelectionException as e:
            logger.warning(str(e) + "!\nItems original properties will be kept")
            item_fs_failed = True

    # in case user feature selection or item feature selection failed
    # if both failed the original graph is returned
    # if only one of them failed, the original properties (either for items or users) are retrieved
    if user_fs_failed and item_fs_failed:
        logger.warning("Since both feature selection on items and feature selection on users failed or no fs algorithm"
                       "has been defined,\nthe original graph will be returned")

    if inplace is True:
        graph_fs = _delete_property_nodes(graph, property_labels_to_remove)
    else:
        graph_copy = graph.copy()
        graph_fs = _delete_property_nodes(graph_copy, property_labels_to_remove)

    return graph_fs

Feature Selection algorithms

The following are the feature selection algorithms you can use in the fs_algorithms_user and/or in the fs_algorithm_item

TopKPageRank(k=10, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight=True, dangling=None)

Bases: TopKFeatureSelection

Computes the PageRank as FeatureSelection algorithm. Property labels of the original graph will be scored with their page rank score and only the top-k labels will be kept in the feature selected graph, while discarding the others

PARAMETER DESCRIPTION
k

Top-k property labels to keep in the feature selected graph

TYPE: int DEFAULT: 10

alpha

Damping parameter for PageRank, default=0.85.

TYPE: Any DEFAULT: 0.85

personalization

The "personalization vector" consisting of a dictionary with a key some subset of graph nodes and personalization value each of those. At least one personalization value must be non-zero. If not specfiied, a nodes personalization value will be zero. By default, a uniform distribution is used.

TYPE: Any DEFAULT: None

max_iter

Maximum number of iterations in power method eigenvalue solver.

TYPE: Any DEFAULT: 100

tol

Error tolerance used to check convergence in power method solver.

TYPE: Any DEFAULT: 1e-06

nstart

Starting value of PageRank iteration for each node.

TYPE: Any DEFAULT: None

weight

Edge data key to use as weight. If None weights are set to 1.

TYPE: bool DEFAULT: True

dangling

The outedges to be assigned to any "dangling" nodes, i.e., nodes without any outedges. The dict key is the node the outedge points to and the dict value is the weight of that outedge. By default, dangling nodes are given outedges according to the personalization vector (uniform if not specified). This must be selected to result in an irreducible transition matrix (see notes under google_matrix). It may be common to have the dangling dict to be the same as the personalization dict.

TYPE: Any DEFAULT: None

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_alg.py
204
205
206
207
208
209
210
211
212
213
214
def __init__(self, k: int = 10, alpha: Any = 0.85, personalization: Any = None, max_iter: Any = 100,
             tol: Any = 1.0e-6, nstart: Any = None, weight: bool = True, dangling: Any = None):
    super().__init__(k)

    self.alpha = alpha
    self.personalization = personalization
    self.max_iter = max_iter
    self.tol = tol
    self.nstart = nstart
    self.weight = 'weight' if weight is True else None
    self.dangling = dangling

TopKEigenVectorCentrality(k=10, max_iter=100, tol=1e-06, nstart=None, weight=False)

Bases: TopKFeatureSelection

Computes the Eigen Vector Centrality as FeatureSelection algorithm. Property labels of the original graph will be scored with their eigen vector centrality score and only the top-k labels will be kept in the feature selected graph, while discarding the others

PARAMETER DESCRIPTION
k

Top-k property labels to keep in the feature selected graph

TYPE: int DEFAULT: 10

max_iter

Maximum number of iterations in power method.

TYPE: Any DEFAULT: 100

tol

Error tolerance used to check convergence in power method iteration.

TYPE: Any DEFAULT: 1e-06

nstart

Starting value of eigenvector iteration for each node.

TYPE: Any DEFAULT: None

weight

Boolean value which tells the algorithm if weight of the edges must be considered or not. Default is True

TYPE: bool DEFAULT: False

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_alg.py
237
238
239
240
241
242
243
def __init__(self, k: int = 10, max_iter: Any = 100, tol: Any = 1.0e-6, nstart: Any = None, weight: bool = False):
    super().__init__(k)

    self.max_iter = max_iter
    self.tol = tol
    self.nstart = nstart
    self.weight = 'weight' if weight is True else None

TopKDegreeCentrality(k=10)

Bases: TopKFeatureSelection

Computes the Degree Centrality as FeatureSelection algorithm. Property labels of the original graph will be scored with their degree centrality score and only the top-k labels will be kept in the feature selected graph, while discarding the others

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_alg.py
257
258
def __init__(self, k: int = 10):
    super().__init__(k)