TfIdf

SkLearnTfIdf(max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Bases: TfIdfTechnique

Class that produces a sparse vector for each content representing the tf-idf scores of its terms using SkLearn.

Please refer to the sklearn documentation for more information about how it is computed.

PARAMETER DESCRIPTION
max_df

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.

TYPE: Union[float, int] DEFAULT: 1.0

min_df

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float in range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.

TYPE: Union[float, int] DEFAULT: 1

max_features

If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.

TYPE: int DEFAULT: None

vocabulary

Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.

TYPE: Union[Mapping, Iterable] DEFAULT: None

binary

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set use_idf to False and norm to None to get 0/1 outputs; see the sketch after this parameter list.)

TYPE: bool DEFAULT: False

dtype

Precision of the tf-idf scores

TYPE: Callable DEFAULT: np.float64

norm

Each output row will have unit norm, either:

  • 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
  • 'l1': Sum of absolute values of vector elements is 1. See sklearn's preprocessing.normalize.

TYPE: str DEFAULT: 'l2'

use_idf

Enable inverse-document-frequency reweighting. If False, idf(t) = 1.

TYPE: bool DEFAULT: True

smooth_idf

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

TYPE: bool DEFAULT: True

sublinear_tf

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

TYPE: bool DEFAULT: False
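
As a concrete illustration of how several of these parameters interact, here is a minimal sketch that calls sklearn's TfidfVectorizer directly (the class this technique wraps); the toy corpus is invented for the example.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Default behavior: smoothed idf reweighting and l2-normalized rows,
# so each document vector has unit Euclidean norm
vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True)
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

# Pure 0/1 indicator vectors: binary tf, no idf reweighting, no normalization
binary_vectorizer = TfidfVectorizer(binary=True, use_idf=False, norm=None)
print(binary_vectorizer.fit_transform(corpus).toarray())  # only 0s and 1s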

Source code in clayrs/content_analyzer/field_content_production_techniques/tf_idf.py
def __init__(self, max_df: Union[float, int] = 1.0, min_df: Union[float, int] = 1, max_features: int = None,
             vocabulary: Union[Mapping, Iterable] = None, binary: bool = False, dtype: Callable = np.float64,
             norm: str = 'l2', use_idf: bool = True, smooth_idf: bool = True, sublinear_tf: bool = False):

    super().__init__()
    self._sk_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, max_features=max_features,
                                          vocabulary=vocabulary, binary=binary, dtype=dtype,
                                          norm=norm, use_idf=use_idf, smooth_idf=smooth_idf,
                                          sublinear_tf=sublinear_tf)
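
A minimal usage sketch follows; the parameter values are arbitrary, and the import path clayrs.content_analyzer is assumed to expose the technique as in the library's examples.

import clayrs.content_analyzer as ca  # assumed import path

# Keep only the 1000 most frequent terms, drop terms appearing in more than
# 80% of the documents, and dampen raw counts with sublinear tf scaling
tfidf_technique = ca.SkLearnTfIdf(
    max_features=1000,
    max_df=0.8,
    sublinear_tf=True
)

# The technique is then passed to the Content Analyzer's field configuration
# to produce a sparse tf-idf representation of the chosen field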

WhooshTfIdf()

Bases: TfIdfTechnique

Class that produces a sparse vector for each content representing the tf-idf scores of its terms using Whoosh.

The tf-idf computation formula is:

\[ tf\mbox{-}idf = (1 + \log_{10}(tf)) \cdot \log_{10}(idf) \]
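
To make the formula concrete, here is a small sketch that evaluates it for made-up counts; the idf argument is assumed here to be N/df (total number of documents over the number of documents containing the term), which is an assumption and not part of the formula above.

import math

# Made-up counts, for illustration only
tf = 3    # occurrences of the term in the document
df = 4    # number of documents containing the term
N = 100   # total number of documents in the corpus

idf = N / df                                    # assumed definition of the idf argument
score = (1 + math.log10(tf)) * math.log10(idf)  # formula stated above
print(round(score, 4))                          # 2.0649
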
Source code in clayrs/content_analyzer/field_content_production_techniques/tf_idf.py
def __init__(self):
    super().__init__()