Paired statistical tests

PairedTest

Bases: StatisticalTest

perform(df_list)

Method which performs the chosen paired statistical test.

Since it's a paired test, the final result is a pandas DataFrame which contains learning schemas compared in pairs. For example, if you call the perform() method passing a list containing three different DataFrames, one for each learning schema to compare:

```python
# Ttest as example since it's a Paired Test
Ttest().perform([user_df1, user_df2, user_df3])
```

You will obtain a DataFrame comparing all different combinations:

  • (system1, system2)
  • (system1, system3)
  • (system2, system3)

The first value of each cell is the statistic, the second is the p-value
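
A minimal end-to-end sketch, assuming `Ttest` can be constructed with defaults and that each frame has the default `user_id` column plus one column per metric (the metric name and values here are made up for illustration):

```python
import pandas as pd
from clayrs.evaluation.statistical_test import Ttest  # import path per the source file below

# hypothetical per-user metric frames, one per learning schema
user_df1 = pd.DataFrame({"user_id": ["u1", "u2", "u3"],
                         "NDCG": [0.50, 0.70, 0.60]})
user_df2 = pd.DataFrame({"user_id": ["u1", "u2", "u3"],
                         "NDCG": [0.45, 0.80, 0.65]})

result = Ttest().perform([user_df1, user_df2])

# one row per system pair; for every metric shared by the two frames there
# is a (metric, "statistic") and a (metric, "pvalue") column
print(result[("NDCG", "pvalue")])
```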

PARAMETER DESCRIPTION
df_list

List containing DataFrames with several metrics to compare, preferably metrics computed for each user. One DataFrame corresponds to one learning schema

TYPE: List[pd.DataFrame]

RETURNS DESCRIPTION
pd.DataFrame

A pandas DataFrame in which each pair of learning schemas is compared. The first value of each cell is the statistic, the second is the p-value

Source code in clayrs/evaluation/statistical_test.py
def perform(self, df_list: List[pd.DataFrame]) -> pd.DataFrame:
    """
    Method which performs the chosen paired statistical test.

    Since it's a paired test, the final result is a pandas DataFrame which contains learning
    schemas compared in pairs.
    For example, if you call the `perform()` method passing a list containing three different DataFrames,
    one for each learning schema to compare:

    ```python
    # Ttest as example since it's a Paired Test
    Ttest().perform([user_df1, user_df2, user_df3])
    ```

    You will obtain a DataFrame comparing all different combinations:

    * (system1, system2)
    * (system1, system3)
    * (system2, system3)

    The first value of each cell is the ***statistic***, the second is the ***p-value***

    Args:
        df_list: List containing DataFrames with several metrics to compare, preferably metrics computed for each
            user. One DataFrame corresponds to one learning schema

    Returns:
        A pandas DataFrame in which each pair of learning schemas is compared. The first value of each
            cell is the ***statistic***, the second is the ***p-value***
    """

    user_id_column = self.user_id_column
    if user_id_column is None:
        user_id_column = ["user_id" for _ in range(len(df_list))]
    elif len(user_id_column) == 1:
        user_id_column = [self.user_id_column[0] for _ in range(len(df_list))]
    elif len(user_id_column) != len(df_list):
        raise ValueError("You must either specify a single user_id column for all dataframes or a different "
                         "user_id column for each DataFrame passed!")

    # if the user id column is the index we consider it as an additional column
    df_list_reset = []
    for user_id_col, df in zip(user_id_column, df_list):
        if user_id_col not in df.columns:
            if user_id_col == df.index.name:
                df = df.reset_index()
            else:
                raise KeyError(f"Column {user_id_col} is neither among the columns nor the index!")

        df_list_reset.append(df)

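    # data maps each (system_i, system_j) pair to {metric: {"statistic": ..., "pvalue": ...}}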
    data = defaultdict(lambda: defaultdict(dict))

    # this will contain metrics that are shared by at least two systems
    global_metrics = set()

    n_system_evaluated = 1
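    # pop the first frame each iteration and compare it against every remaining
    # frame, so each unordered pair of systems is tested exactly once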
    while len(df_list_reset) != 0:
        df1 = df_list_reset.pop(0)
        user_id_col1 = user_id_column.pop(0)
        for i, (user_id_col2, other_df) in enumerate(zip(user_id_column, df_list_reset),
                                                     start=n_system_evaluated + 1):

            common_metrics = [column for column in df1.columns
                              if column != user_id_col1 and column in other_df.columns]

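            # _common_users keeps only users present in both frames; the merged
            # result exposes each shared metric as "{metric}_x" / "{metric}_y"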
            common_rows = self._common_users(df1, user_id_col1, other_df, user_id_col2, common_metrics)

            for metric in common_metrics:

                global_metrics.add(metric)
                # drop nan values since otherwise test may behave unexpectedly
                metric_rows = common_rows[[f"{metric}_x", f"{metric}_y"]].dropna()

                score_system1 = metric_rows[f"{metric}_x"]
                score_system2 = metric_rows[f"{metric}_y"]

                statistic, pvalue = self._perform_test(score_system1, score_system2)

                data[(f"system_{n_system_evaluated}", f"system_{i}")][str(metric)]["statistic"] = statistic
                data[(f"system_{n_system_evaluated}", f"system_{i}")][str(metric)]["pvalue"] = pvalue

        n_system_evaluated += 1

    # index will contain (sys1, sys2), (sys1, sys3), ... tuples
    # formatted_data will be in the form
    # {(NDCG, "statistic"): [..., ..., ..., ...], (NDCG, "pvalue"): [..., ..., ..., ...],
    #  (Precision, "statistic"): [..., ..., ..., ...], (Precision, "pvalue"): [..., ..., ..., ...],
    #  ...}
    index = []
    formatted_data = defaultdict(list)
    for df_index_row, df_data_row in data.items():
        index.append(df_index_row)

        for metric_column in global_metrics:
            stat, pval = np.nan, np.nan

            stat_pval_dict = df_data_row.get(metric_column)
            if stat_pval_dict is not None:
                stat = stat_pval_dict["statistic"]
                pval = stat_pval_dict["pvalue"]

            formatted_data[(metric_column, "statistic")].append(stat)
            formatted_data[(metric_column, "pvalue")].append(pval)

    res = pd.DataFrame(formatted_data, index=index)
    res.index.rename("sys_pair", inplace=True)

    return res

Ttest

Bases: PairedTest

Calculate the T-test for the means of two independent samples of scores.

This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. By default, the test assumes that the populations have identical variances.
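
The wording above matches `scipy.stats.ttest_ind`, so presumably that is the underlying routine; a minimal sketch of the computation under that assumption:

```python
from scipy import stats

# hypothetical per-user scores for two learning schemas
score_system1 = [0.50, 0.70, 0.60, 0.80]
score_system2 = [0.45, 0.80, 0.65, 0.75]

# two-sided test assuming equal population variances (scipy's default)
statistic, pvalue = stats.ttest_ind(score_system1, score_system2)
```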

Wilcoxon

Bases: PairedTest

Compute the Wilcoxon rank-sum statistic for two samples.

The Wilcoxon rank-sum test tests the null hypothesis that two sets of measurements are drawn from the same distribution. The alternative hypothesis is that values in one sample are more likely to be larger than the values in the other sample.
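
Here too the description matches SciPy, in this case `scipy.stats.ranksums`; a minimal sketch under that assumption:

```python
from scipy import stats

# hypothetical per-user scores for two learning schemas
statistic, pvalue = stats.ranksums([0.50, 0.70, 0.60, 0.80],
                                   [0.45, 0.80, 0.65, 0.75])
```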