TfIdf
SkLearnTfIdf(max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Bases: TfIdfTechnique
Class that produces a sparse vector for each content representing the tf-idf scores of its terms using SkLearn.
Refer to the scikit-learn documentation for more information about how the scores are computed.
PARAMETER | DESCRIPTION |
---|---|
max_df | When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None. |
min_df | When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None. |
max_features | If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None. |
vocabulary | Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. |
binary | If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set use_idf to False and norm to None to get 0/1 outputs.) |
dtype | Precision of the tf-idf scores. |
norm | Each output row will have unit norm, either 'l2' (the sum of squares of the vector elements is 1) or 'l1' (the sum of absolute values of the vector elements is 1). |
use_idf | Enable inverse-document-frequency reweighting. If False, idf(t) = 1. |
smooth_idf | Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions. |
sublinear_tf | Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf). |
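As a rough illustration of how these parameters interact, the sketch below re-implements the vocabulary pruning and the weighting in plain Python. It assumes the formulas documented by scikit-learn, in particular idf(t) = ln((1 + n) / (1 + df(t))) + 1 when smooth_idf is True and idf(t) = ln(n / df(t)) + 1 otherwise. The helper names build_vocabulary and tfidf_row are hypothetical and are not part of ClayRS or scikit-learn; the real class produces sparse vectors, while this sketch returns plain lists.

```python
import math
from collections import Counter


def build_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Hypothetical helper: prune terms by document frequency.

    Returns the sorted list of kept terms and the document-frequency counter.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    total_tf = Counter(term for doc in docs for term in doc)
    # Floats are proportions of documents, integers are absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    terms = [t for t in df if low <= df[t] <= high]
    if max_features is not None:
        # Keep the terms with the highest term frequency across the corpus.
        terms = sorted(terms, key=lambda t: (-total_tf[t], t))[:max_features]
    return sorted(terms), df


def tfidf_row(doc, terms, df, n_docs, binary=False, use_idf=True,
              smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Hypothetical helper: dense tf-idf scores of one tokenized document."""
    counts = Counter(doc)
    row = []
    for term in terms:
        tf = float(counts[term])
        if binary and tf > 0:
            tf = 1.0  # only the tf part becomes binary
        if sublinear_tf and tf > 0:
            tf = 1.0 + math.log(tf)
        if use_idf:
            extra = 1 if smooth_idf else 0
            idf = math.log((n_docs + extra) / (df[term] + extra)) + 1.0
        else:
            idf = 1.0
        row.append(tf * idf)
    if norm == "l2":
        length = math.sqrt(sum(v * v for v in row))
    elif norm == "l1":
        length = sum(abs(v) for v in row)
    else:
        length = 1.0  # norm=None leaves the row unnormalized
    return [v / length for v in row] if length else row


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran", "ran"]]
# "the" appears in every document and is dropped by max_df=0.9;
# "dog" and "ran" appear in one document each and are dropped by min_df=2.
terms, df = build_vocabulary(docs, min_df=2, max_df=0.9)
row = tfidf_row(docs[0], terms, df, n_docs=len(docs))
print(terms, row)
```

With binary=True, use_idf=False and norm=None the same helper yields plain 0/1 indicators, which mirrors the note in the binary row of the table.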
Source code in clayrs/content_analyzer/field_content_production_techniques/tf_idf.py
WhooshTfIdf()
Bases: TfIdfTechnique
Class that produces a sparse vector for each content representing the tf-idf scores of its terms using Whoosh.
The tf-idf computation formula is:
Source code in clayrs/content_analyzer/field_content_production_techniques/tf_idf.py