
Contextualized Embeddings

The following techniques let you obtain embeddings at a finer granularity from models that can also return embeddings at a coarser granularity (e.g. word embeddings from a model that can also return sentence embeddings).

For now, only models working at the sentence and token level are implemented.

from clayrs import content_analyzer as ca

# obtain word (token) embeddings from a model which natively works
# at sentence granularity
ca.Sentence2WordEmbedding(embedding_source=ca.BertTransformers('bert-base-uncased'))
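
As a more complete sketch, the technique can be plugged into the usual Content Analyzer workflow; the raw source, the field name and the ItemAnalyzerConfig/FieldConfig/ContentAnalyzer calls below are assumptions about your setup, not something defined on this page.

from clayrs import content_analyzer as ca

# word-level technique built on top of a sentence-level BERT source
technique = ca.Sentence2WordEmbedding(
    embedding_source=ca.BertTransformers('bert-base-uncased')
)

# hypothetical item source and field: adjust 'items.json', 'item_id'
# and 'plot' to your own dataset
config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items.json'),
    id='item_id',
    output_directory='items_codified'
)
config.add_single_config('plot', ca.FieldConfig(technique))

ca.ContentAnalyzer(config).fit()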

Sentence2WordEmbedding(embedding_source)

Bases: DecombiningInWordsEmbeddingTechnique

Class that makes use of a sentence granularity embedding source to produce an embedding matrix with word granularity

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
def __init__(self, embedding_source: Union[SentenceEmbeddingLoader, SentenceEmbeddingLearner]):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, SentenceEmbeddingLoader)
    super().__init__(embedding_source)

produce_single_repr(field_data)

Produces a single matrix where each row is the embedding representation of a token of the sentence, while the columns correspond to the hidden dimensions of the chosen model

PARAMETER DESCRIPTION
field_data

textual data to complexly represent

TYPE: Union[List[str], str]

RETURNS DESCRIPTION
EmbeddingField

Embedding for each token of the sentence

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
def produce_single_repr(self, field_data: Union[List[str], str]) -> EmbeddingField:
    """
    Produces a single matrix where each row is the embedding representation of each token of the sentence,
    while the columns are the hidden dimension of the chosen model

    Args:
        field_data: textual data to complexly represent

    Returns:
        Embedding for each token of the sentence

    """
    field_data = check_not_tokenized(field_data)
    embedding_source: Union[SentenceEmbeddingLoader, SentenceEmbeddingLearner] = self.embedding_source
    words_embeddings = embedding_source.get_embedding_token(field_data)
    return EmbeddingField(words_embeddings)
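
As a quick illustration of the method above, the following minimal sketch calls produce_single_repr directly on a plain sentence; the returned EmbeddingField wraps a matrix with one row per token and one column per hidden dimension of the chosen model.

from clayrs import content_analyzer as ca

technique = ca.Sentence2WordEmbedding(
    embedding_source=ca.BertTransformers('bert-base-uncased')
)

# one row per token of the sentence, one column per hidden dimension
# of 'bert-base-uncased'
token_embeddings = technique.produce_single_repr('the quick brown fox')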

Models able to return both sentence and token embeddings

BertTransformers(model_name='bert-base-uncased', vec_strategy=CatStrategy(1), pooling_strategy=Centroid())

Bases: Transformers

Class that produces sentence/token embeddings using any BERT model from Hugging Face.

PARAMETER DESCRIPTION
model_name

Name of the embeddings model to download or path where the model is stored locally

TYPE: str DEFAULT: 'bert-base-uncased'

vec_strategy

Strategy which will be used to combine each output layer to obtain a single one

TYPE: VectorStrategy DEFAULT: CatStrategy(1)

pooling_strategy

Strategy which will be used to combine the embedding representation of each token into a single one, representing the embedding of the whole sentence

TYPE: CombiningTechnique DEFAULT: Centroid()

Source code in clayrs/content_analyzer/embeddings/embedding_loader/transformer.py
def __init__(self, model_name: str = 'bert-base-uncased',
             vec_strategy: VectorStrategy = CatStrategy(1),
             pooling_strategy: CombiningTechnique = Centroid()):
    super().__init__(model_name, vec_strategy, pooling_strategy)
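
The defaults can be overridden. The sketch below concatenates the last 4 hidden layers instead of keeping only the last one; the choice of 4 layers is purely illustrative, not a recommendation from this page.

from clayrs import content_analyzer as ca

# concatenate the last 4 hidden layers of BERT (CatStrategy(1), the
# default, keeps only the last layer); sentence embeddings are still
# obtained by averaging token embeddings via Centroid()
bert_source = ca.BertTransformers(
    model_name='bert-base-uncased',
    vec_strategy=ca.CatStrategy(4),
    pooling_strategy=ca.Centroid()
)

# the same source can also feed Sentence2WordEmbedding to obtain
# token-level embeddings, as shown at the top of this page
ca.Sentence2WordEmbedding(embedding_source=bert_source)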

T5Transformers(model_name='t5-small', vec_strategy=CatStrategy(1), pooling_strategy=Centroid())

Bases: Transformers

Class that produces sentence/token embeddings using any T5 model from Hugging Face.

PARAMETER DESCRIPTION
model_name

Name of the embeddings model to download or path where the model is stored locally

TYPE: str DEFAULT: 't5-small'

vec_strategy

Strategy which will be used to combine each output layer to obtain a single one

TYPE: VectorStrategy DEFAULT: CatStrategy(1)

pooling_strategy

Strategy which will be used to combine the embedding representation of each token into a single one, representing the embedding of the whole sentence

TYPE: CombiningTechnique DEFAULT: Centroid()

Source code in clayrs/content_analyzer/embeddings/embedding_loader/transformer.py
def __init__(self, model_name: str = 't5-small',
             vec_strategy: VectorStrategy = CatStrategy(1),
             pooling_strategy: CombiningTechnique = Centroid()):
    super().__init__(model_name, vec_strategy, pooling_strategy)
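
T5Transformers can be used anywhere BertTransformers appears above; a minimal sketch, assuming the default 't5-small' checkpoint is adequate for your data:

from clayrs import content_analyzer as ca

# token-level embeddings decombined from a T5 sentence-level source
ca.Sentence2WordEmbedding(embedding_source=ca.T5Transformers('t5-small'))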