Document Embeddings

The following shows how to obtain embeddings at document granularity:

from clayrs import content_analyzer as ca

# obtain document embeddings by training an LDA model
# on the corpus of contents to be represented
ca.DocumentEmbeddingTechnique(embedding_source=ca.GensimLDA())

DocumentEmbeddingTechnique(embedding_source)

Bases: StandardEmbeddingTechnique

Class that uses a document-granularity embedding source to produce document embeddings.

PARAMETER DESCRIPTION
embedding_source

Any DocumentEmbedding model

TYPE: Union[DocumentEmbeddingLoader, DocumentEmbeddingLearner, str]

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
def __init__(self, embedding_source: Union[DocumentEmbeddingLoader, DocumentEmbeddingLearner, str]):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, DocumentEmbeddingLoader)
    super().__init__(embedding_source)

Document Embedding models

GensimLatentSemanticAnalysis(reference=None, auto_save=True, **kwargs)

Bases: GensimDocumentEmbeddingLearner

Class that implements Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), using the Gensim library.

If a pre-trained local LSA model must be loaded, put its path in the reference parameter. Otherwise, an LSA model will be trained from scratch on the preprocessed corpus of the contents to be represented.

If you'd like to save the model once trained, set the path in the reference parameter and set auto_save=True. If reference is None, the trained model won't be saved and will only be used to produce content representations in the current run.

Additional parameters for the model itself can be passed as keyword arguments; check the Gensim documentation to see what else can be customized.

PARAMETER DESCRIPTION
reference

Path of the model to load, or path where the trained model will be saved if auto_save=True. If None, the trained model won't be saved and will only be used to produce content representations in the current run.

TYPE: str DEFAULT: None

auto_save

If True, the model will be saved to the path specified in the reference parameter.

TYPE: bool DEFAULT: True
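
The interaction between reference and auto_save described above can be summarized as a small decision table. The sketch below is purely illustrative: plan_model_lifecycle and its model_exists flag are hypothetical names, not part of ClayRS, and the assumption that an existing model at reference is loaded rather than retrained is inferred from the description above rather than taken from the library source.

```python
from typing import Dict, Optional


def plan_model_lifecycle(reference: Optional[str],
                         auto_save: bool,
                         model_exists: bool) -> Dict[str, bool]:
    """Hypothetical helper (not ClayRS API) mirroring the documented semantics.

    model_exists stands in for "a trained model is already stored at reference".
    """
    # A model can only be loaded when a path is given and something is stored there
    # (assumption: an existing model is loaded instead of being retrained).
    load = reference is not None and model_exists
    # A freshly trained model is saved only if there is a path and auto_save is on.
    save = not load and reference is not None and auto_save
    return {"load": load, "train": not load, "save": save}


# reference=None: train from scratch, nothing is persisted
print(plan_model_lifecycle(None, auto_save=True, model_exists=False))
# reference set, no stored model yet: train, then save to that path
print(plan_model_lifecycle("my_lsa.model", auto_save=True, model_exists=False))
```

With reference=None the auto_save flag has no effect, which is why the default auto_save=True is harmless when no path is supplied.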

Source code in clayrs/content_analyzer/embeddings/embedding_learner/latent_semantic_analysis.py
def __init__(self, reference: str = None, auto_save: bool = True, **kwargs):
    super().__init__(reference, auto_save, ".model", **kwargs)

GensimLDA(reference=None, auto_save=True, **kwargs)

Bases: GensimDocumentEmbeddingLearner

Class that implements Latent Dirichlet Allocation (LDA) using the Gensim library.

If a pre-trained local LDA model must be loaded, put its path in the reference parameter. Otherwise, an LDA model will be trained from scratch on the preprocessed corpus of the contents to be represented.

If you'd like to save the model once trained, set the path in the reference parameter and set auto_save=True. If reference is None, the trained model won't be saved and will only be used to produce content representations in the current run.

Additional parameters for the model itself can be passed as keyword arguments; check the Gensim documentation to see what else can be customized.

PARAMETER DESCRIPTION
reference

Path of the model to load, or path where the trained model will be saved if auto_save=True. If None, the trained model won't be saved and will only be used to produce content representations in the current run.

TYPE: str DEFAULT: None

auto_save

If True, the model will be saved to the path specified in the reference parameter.

TYPE: bool DEFAULT: True
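
To give an intuition of what a document embedding from a topic model looks like: Gensim's LDA model describes a document as a sparse list of (topic_id, probability) pairs, omitting low-probability topics. A fixed-length vector can be obtained by filling the missing topics with zeros. The sketch below does this in plain Python; to_dense and the sample values are made up for illustration, and whether ClayRS performs this exact step internally is an assumption.

```python
from typing import List, Tuple


def to_dense(sparse_topics: List[Tuple[int, float]], num_topics: int) -> List[float]:
    """Illustrative helper: expand sparse (topic_id, weight) pairs to a dense vector."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_topics:
        dense[topic_id] = weight
    return dense


# made-up topic distribution for one document under a 5-topic model
doc_topics = [(0, 0.7), (3, 0.3)]
embedding = to_dense(doc_topics, num_topics=5)
print(embedding)  # [0.7, 0.0, 0.0, 0.3, 0.0]
```

Gensim ships an equivalent utility, gensim.matutils.sparse2full, which returns a NumPy array instead of a list.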

Source code in clayrs/content_analyzer/embeddings/embedding_learner/lda.py
def __init__(self, reference: str = None, auto_save: bool = True, **kwargs):
    super().__init__(reference, auto_save, ".model", **kwargs)