TfIdf

`SkLearnTfIdf(max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`

Bases: TfIdfTechnique

Class that produces a sparse vector for each content representing the tf-idf scores of its terms using SkLearn.

Refer to the scikit-learn `TfidfVectorizer` documentation for details on how the scores are computed.

PARAMETER	DESCRIPTION
`max_df`	When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). A float in the range [0.0, 1.0] is interpreted as a proportion of documents; an integer is interpreted as an absolute count. This parameter is ignored if vocabulary is not None. TYPE: `Union[float, int]` DEFAULT: `1.0`
`min_df`	When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. A float in the range [0.0, 1.0] is interpreted as a proportion of documents; an integer is interpreted as an absolute count. This parameter is ignored if vocabulary is not None. TYPE: `Union[float, int]` DEFAULT: `1`
`max_features`	If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None. TYPE: `int` DEFAULT: `None`
`vocabulary`	Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. TYPE: `Union[Mapping, Iterable]` DEFAULT: `None`
`binary`	If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set `use_idf=False` and `norm=None` to get 0/1 outputs.) TYPE: `bool` DEFAULT: `False`
`dtype`	Precision of the tf-idf scores. TYPE: `Callable` DEFAULT: `np.float64`
`norm`	Each output row will have unit norm, either: 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied. 'l1': Sum of absolute values of vector elements is 1. See `sklearn.preprocessing.normalize`. TYPE: `str` DEFAULT: `'l2'`
`use_idf`	Enable inverse-document-frequency reweighting. If False, idf(t) = 1. TYPE: `bool` DEFAULT: `True`
`smooth_idf`	Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions. TYPE: `bool` DEFAULT: `True`
`sublinear_tf`	Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf). TYPE: `bool` DEFAULT: `False`
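To make the effect of `use_idf`, `smooth_idf`, `sublinear_tf` and `norm` more concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for its tf-idf transformer. It is an illustration only, not the code ClayRS or scikit-learn runs: the helper name `tfidf_rows` and the toy corpus are hypothetical, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def tfidf_rows(docs, smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Return one {term: weight} dict per tokenized document (illustrative helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    rows = []
    for doc in docs:
        row = {}
        for term, tf in Counter(doc).items():
            if sublinear_tf:
                tf = 1 + math.log(tf)  # replace tf with 1 + log(tf)
            if smooth_idf:
                # Add one to numerator and denominator, as if one extra
                # document contained every term exactly once.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            else:
                idf = math.log(n_docs / df[term]) + 1
            row[term] = tf * idf

        # Scale the row to unit norm.
        if norm == "l2":
            length = math.sqrt(sum(w * w for w in row.values()))
        elif norm == "l1":
            length = sum(abs(w) for w in row.values())
        else:
            length = 1.0
        rows.append({t: w / length for t, w in row.items()} if length else row)
    return rows


# Hypothetical toy corpus of two tokenized documents.
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
rows = tfidf_rows(docs)
```

With the default `'l2'` norm every row has unit length, and in the first document `cat` (present in one document) receives a larger weight than `the` (present in both).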

Source code in clayrs/content_analyzer/field_content_production_techniques/tf_idf.py

def __init__(self, max_df: Union[float, int] = 1.0, min_df: Union[float, int] = 1, max_features: int = None,
             vocabulary: Union[Mapping, Iterable] = None, binary: bool = False, dtype: Callable = np.float64,
             norm: str = 'l2', use_idf: bool = True, smooth_idf: bool = True, sublinear_tf: bool = False):

    super().__init__()
    self._sk_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, max_features=max_features,
                                          vocabulary=vocabulary, binary=binary, dtype=dtype,
                                          norm=norm, use_idf=use_idf, smooth_idf=smooth_idf,
                                          sublinear_tf=sublinear_tf)
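The `max_df` and `min_df` arguments forwarded to the vectorizer above prune the vocabulary by document frequency. The following sketch illustrates that filtering in plain Python, under the assumption that documents are already tokenized; `prune_vocabulary` is a hypothetical helper written for this example and is not part of ClayRS or scikit-learn.

```python
from collections import Counter


def prune_vocabulary(docs, max_df=1.0, min_df=1):
    """Keep the terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    # Floats are proportions of the corpus, integers are absolute counts.
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    low = min_df * n_docs if isinstance(min_df, float) else min_df

    return sorted(term for term, count in df.items() if low <= count <= high)


# Hypothetical toy corpus: "the" appears in all three documents,
# "cat" in two, "dog" and "bird" in one each.
docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "bird"]]

# Drop terms found in more than 90% of documents or in fewer than 2 documents.
kept = prune_vocabulary(docs, max_df=0.9, min_df=2)
```

Here only `cat` survives: `the` exceeds the 90% ceiling, while `dog` and `bird` fall below the two-document floor. With the defaults (`max_df=1.0`, `min_df=1`) nothing is removed.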

`WhooshTfIdf()`

Bases: TfIdfTechnique

Class that produces a sparse vector for each content representing the tf-idf scores of its terms using Whoosh.

The tf-idf computation formula is:

\[ tf \mbox{-} idf = (1 + \log_{10}(tf)) \cdot \log_{10}(idf) \]
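As a worked illustration of the formula, the sketch below evaluates it for a single term. The page does not define `idf`, so the code assumes it is the ratio between the number of documents in the corpus and the number of documents containing the term; that definition, the function name `whoosh_style_tfidf` and the sample numbers are assumptions made for this example only.

```python
import math


def whoosh_style_tfidf(tf, n_docs, doc_freq):
    """Evaluate (1 + log10(tf)) * log10(idf) for one term (illustrative helper).

    Assumption: idf is the ratio n_docs / doc_freq, which is not stated
    in the documentation above. tf and doc_freq must be at least 1.
    """
    idf = n_docs / doc_freq
    return (1 + math.log10(tf)) * math.log10(idf)


# A term occurring 10 times in a document and present in 10 of 100 documents:
# (1 + log10(10)) * log10(100 / 10) = (1 + 1) * 1 = 2
score = whoosh_style_tfidf(tf=10, n_docs=100, doc_freq=10)
```

Under this assumption, a term that appears in every document has `idf = 1` and therefore a score of zero, regardless of how often it occurs.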

Source code in clayrs/content_analyzer/field_content_production_techniques/tf_idf.py

def __init__(self):
    super().__init__()