Feature Selection

Via the feature_selecter function you are able to perform feature selection on a given graph, by keeping properties that are the most important according to a given feature selection algorithm. Check the documentation of the method for more and for a usage example

`feature_selector(graph, fs_algorithm_user=None, fs_algorithm_item=None, user_target_nodes=None, item_target_nodes=None, inplace=False)`

Given a FullGraph, this method performs feature selection on it and returns the "reduced" graph.

You can choose to reduce only user properties (evaluate the fs_algorithm_user parameter), to reduce only item properties (evaluate the fs_algorithm_item parameter) or both (evaluate the fs_algorithm_user parameter and the fs_algorithm_item parameter). You can also choose different feature selection algorithms* for users and items.

You can also define a custom list of user and item nodes:

In this case only properties of those nodes will be considered during the feature selection process (instead of using properties of all users and items)

This function changes a copy of the original graph by default, but you can change this behaviour with the inplace parameter.

Examples:

# create a full graph
full_graph = rs.NXFullGraph(ratings,
                             user_contents_dir='users_codified/', # (1)
                             item_contents_dir='movies_codified/', # (2)
                             user_exo_properties={0}, # (3)
                             item_exo_properties={'dbpedia'}, # (4)
                             link_label='score')

 # perform feature selection by keeping only top 5 property labels
 # according to page rank algorithm
 fs_graph = rs.feature_selector(full_graph,
                                fs_algorithm_item=rs.TopKPageRank(k=5))

PARAMETER	DESCRIPTION
`graph`	Original graph on which feature selection will be performed TYPE: `FullDiGraph`
`fs_algorithm_user`	FeatureSelectionAlgorithm that will be performed on user properties. Can be different from `fs_algorithm_item` TYPE: `FeatureSelectionAlgorithm` DEFAULT: `None`
`fs_algorithm_item`	FeatureSelectionAlgorithm that will be performed on item properties. Can be different from `fs_algorithm_user` TYPE: `FeatureSelectionAlgorithm` DEFAULT: `None`
`user_target_nodes`	List of user nodes to consider in the feature selection process: only properties of user nodes in this list will be "reduced" TYPE: `list` DEFAULT: `None`
`item_target_nodes`	List of item nodes to consider in the feature selection process: only properties of item nodes in this list will be "reduced" TYPE: `list` DEFAULT: `None`
`inplace`	Boolean parameter that let you choose if changes must be performed on the original graph (`inplace=True`) or on its copy (`inplace=False`). Default is False TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`FullDiGraph`	Copy of the original graph from which the less important Property nodes (the ones having edges with less important property labels) will be removed

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_fn.py

def feature_selector(graph: FullDiGraph,
                     fs_algorithm_user: FeatureSelectionAlgorithm = None,
                     fs_algorithm_item: FeatureSelectionAlgorithm = None,
                     user_target_nodes: Iterable[UserNode] = None,
                     item_target_nodes: Iterable[ItemNode] = None,
                     inplace: bool = False) -> FullDiGraph:
    """
    Given a FullGraph, this method performs feature selection on it and returns the "reduced" graph.

    You can choose to reduce only *user properties* (*evaluate the `fs_algorithm_user` parameter*),
    to reduce only *item properties* (*evaluate the `fs_algorithm_item` parameter*) or both (*evaluate
    the `fs_algorithm_user` parameter* and the `fs_algorithm_item` parameter*).
    You can also choose different *feature selection algorithms* for users and items.

    You can also define a custom list of user and item nodes:

    * In this case only properties of those nodes will be considered during the feature selection process (instead of
    using properties of all users and items)

    This function changes a *copy* of the original graph by default, but you can change this behaviour with the
    `inplace` parameter.

    Examples:

        ```python
        # create a full graph
        full_graph = rs.NXFullGraph(ratings,
                                     user_contents_dir='users_codified/', # (1)
                                     item_contents_dir='movies_codified/', # (2)
                                     user_exo_properties={0}, # (3)
                                     item_exo_properties={'dbpedia'}, # (4)
                                     link_label='score')

         # perform feature selection by keeping only top 5 property labels
         # according to page rank algorithm
         fs_graph = rs.feature_selector(full_graph,
                                        fs_algorithm_item=rs.TopKPageRank(k=5))
        ```

    Args:
        graph: Original graph on which feature selection will be performed
        fs_algorithm_user: FeatureSelectionAlgorithm that will be performed on user properties. Can be different from
            `fs_algorithm_item`
        fs_algorithm_item: FeatureSelectionAlgorithm that will be performed on item properties. Can be different from
            `fs_algorithm_user`
        user_target_nodes (list): List of user nodes to consider in the feature selection process: only properties
            of user nodes in this list will be "reduced"
        item_target_nodes (list): List of item nodes to consider in the feature selection process: only properties
            of item nodes in this list will be "reduced"
        inplace: Boolean parameter that let you choose if changes must be performed on the original graph
            (`inplace=True`) or on its copy (`inplace=False`). Default is False

    Returns:
        Copy of the original graph from which the less important Property nodes (the ones having edges with less
            important property labels) will be removed
    """
    if fs_algorithm_user is not None and user_target_nodes is None:
        user_target_nodes = graph.user_nodes

    if fs_algorithm_item is not None and item_target_nodes is None:
        item_target_nodes = graph.item_nodes

    property_labels_to_remove = list()
    user_fs_failed = False
    item_fs_failed = False

    if fs_algorithm_user is not None:
        logger.info("Performing Feature Selection on users")
        try:
            user_props_to_remove = fs_algorithm_user.perform(graph, list(user_target_nodes), mode='to_remove')
            property_labels_to_remove.extend(user_props_to_remove)
        except FeatureSelectionException as e:
            logger.warning(str(e) + "!\nUsers original properties will be kept")
            user_fs_failed = True

    if fs_algorithm_item is not None:
        logger.info("Performing Feature Selection on items")
        try:
            item_props_to_remove = fs_algorithm_item.perform(graph, list(item_target_nodes), mode='to_remove')
            property_labels_to_remove.extend(item_props_to_remove)
        except FeatureSelectionException as e:
            logger.warning(str(e) + "!\nItems original properties will be kept")
            item_fs_failed = True

    # in case user feature selection or item feature selection failed
    # if both failed the original graph is returned
    # if only one of them failed, the original properties (either for items or users) are retrieved
    if user_fs_failed and item_fs_failed:
        logger.warning("Since both feature selection on items and feature selection on users failed or no fs algorithm"
                       "has been defined,\nthe original graph will be returned")

    if inplace is True:
        graph_fs = _delete_property_nodes(graph, property_labels_to_remove)
    else:
        graph_copy = graph.copy()
        graph_fs = _delete_property_nodes(graph_copy, property_labels_to_remove)

    return graph_fs

Feature Selection algorithms

The following are the feature selection algorithms you can use in the fs_algorithms_user and/or in the fs_algorithm_item

`TopKPageRank(k=10, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight=True, dangling=None)`

Bases: TopKFeatureSelection

Computes the PageRank as FeatureSelection algorithm. Property labels of the original graph will be scored with their page rank score and only the top-k labels will be kept in the feature selected graph, while discarding the others

PARAMETER	DESCRIPTION
`k`	Top-k property labels to keep in the feature selected graph TYPE: `int` DEFAULT: `10`
`alpha`	Damping parameter for PageRank, default=0.85. TYPE: `Any` DEFAULT: `0.85`
`personalization`	The "personalization vector" consisting of a dictionary with a key some subset of graph nodes and personalization value each of those. At least one personalization value must be non-zero. If not specfiied, a nodes personalization value will be zero. By default, a uniform distribution is used. TYPE: `Any` DEFAULT: `None`
`max_iter`	Maximum number of iterations in power method eigenvalue solver. TYPE: `Any` DEFAULT: `100`
`tol`	Error tolerance used to check convergence in power method solver. TYPE: `Any` DEFAULT: `1e-06`
`nstart`	Starting value of PageRank iteration for each node. TYPE: `Any` DEFAULT: `None`
`weight`	Edge data key to use as weight. If None weights are set to 1. TYPE: `bool` DEFAULT: `True`
`dangling`	The outedges to be assigned to any "dangling" nodes, i.e., nodes without any outedges. The dict key is the node the outedge points to and the dict value is the weight of that outedge. By default, dangling nodes are given outedges according to the personalization vector (uniform if not specified). This must be selected to result in an irreducible transition matrix (see notes under google_matrix). It may be common to have the dangling dict to be the same as the personalization dict. TYPE: `Any` DEFAULT: `None`

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_alg.py

def __init__(self, k: int = 10, alpha: Any = 0.85, personalization: Any = None, max_iter: Any = 100,
             tol: Any = 1.0e-6, nstart: Any = None, weight: bool = True, dangling: Any = None):
    super().__init__(k)

    self.alpha = alpha
    self.personalization = personalization
    self.max_iter = max_iter
    self.tol = tol
    self.nstart = nstart
    self.weight = 'weight' if weight is True else None
    self.dangling = dangling

`TopKEigenVectorCentrality(k=10, max_iter=100, tol=1e-06, nstart=None, weight=False)`

Bases: TopKFeatureSelection

Computes the Eigen Vector Centrality as FeatureSelection algorithm. Property labels of the original graph will be scored with their eigen vector centrality score and only the top-k labels will be kept in the feature selected graph, while discarding the others

PARAMETER	DESCRIPTION
`k`	Top-k property labels to keep in the feature selected graph TYPE: `int` DEFAULT: `10`
`max_iter`	Maximum number of iterations in power method. TYPE: `Any` DEFAULT: `100`
`tol`	Error tolerance used to check convergence in power method iteration. TYPE: `Any` DEFAULT: `1e-06`
`nstart`	Starting value of eigenvector iteration for each node. TYPE: `Any` DEFAULT: `None`
`weight`	Boolean value which tells the algorithm if weight of the edges must be considered or not. Default is True TYPE: `bool` DEFAULT: `False`

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_alg.py

def __init__(self, k: int = 10, max_iter: Any = 100, tol: Any = 1.0e-6, nstart: Any = None, weight: bool = False):
    super().__init__(k)

    self.max_iter = max_iter
    self.tol = tol
    self.nstart = nstart
    self.weight = 'weight' if weight is True else None

`TopKDegreeCentrality(k=10)`

Bases: TopKFeatureSelection

Computes the Degree Centrality as FeatureSelection algorithm. Property labels of the original graph will be scored with their degree centrality score and only the top-k labels will be kept in the feature selected graph, while discarding the others

Source code in clayrs/recsys/graphs/feature_selection/feature_selection_alg.py

def __init__(self, k: int = 10):
    super().__init__(k)