Contextualized Embeddings
The following shows how to obtain embeddings of finer granularity from models that are also able to return embeddings of coarser granularity (e.g. word embeddings from a model that also returns sentence embeddings).
For now, only models working at the sentence and token level are implemented.
from clayrs import content_analyzer as ca
# obtain word (token) embeddings from a model that also
# returns sentence embeddings
ca.Sentence2WordEmbedding(embedding_source=ca.BertTransformers('bert-base-uncased'))
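For context, the technique is passed to the content analyzer like any other field production technique. The sketch below assumes the usual ClayRS workflow (`JSONFile`, `ItemAnalyzerConfig`, `FieldConfig`, `ContentAnalyzer`); the file name, id and field name are placeholders for your own dataset.

```python
from clayrs import content_analyzer as ca

# Sketch only: 'items_info.json', 'item_id' and 'plot' are placeholders
config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='item_id',
    output_directory='items_codified/'
)

# Represent the 'plot' field with token-level contextualized embeddings
# obtained from a BERT source that also works at sentence granularity
config.add_single_config(
    'plot',
    ca.FieldConfig(
        ca.Sentence2WordEmbedding(
            embedding_source=ca.BertTransformers('bert-base-uncased')
        )
    )
)

ca.ContentAnalyzer(config).fit()
```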
Sentence2WordEmbedding(embedding_source)
Bases: DecombiningInWordsEmbeddingTechnique
Class that makes use of a sentence granularity embedding source to produce an embedding matrix with word granularity.
Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
produce_single_repr(field_data)
Produces a single matrix where each row is the embedding representation of a token of the sentence, and each column corresponds to a dimension of the chosen model's hidden state.
| PARAMETER | DESCRIPTION |
|---|---|
| `field_data` | Textual data to complexly represent |
| RETURNS | DESCRIPTION |
|---|---|
| `EmbeddingField` | Embedding for each token of the sentence |
Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
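As a rough illustration of the returned shape, the sketch below assumes the method can be called directly and that the resulting `EmbeddingField` exposes its matrix via a `.value` attribute; in practice the framework invokes this method internally for each field.

```python
from clayrs import content_analyzer as ca

technique = ca.Sentence2WordEmbedding(
    embedding_source=ca.BertTransformers('bert-base-uncased')
)

# Assumed direct call for illustration: one row per token of the sentence,
# one column per hidden dimension (768 for bert-base-uncased)
emb_field = technique.produce_single_repr("the fellowship of the ring")
print(emb_field.value.shape)  # e.g. (n_tokens, 768); .value is assumed here
```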
Models able to return sentence and token embeddings
BertTransformers(model_name='bert-base-uncased', vec_strategy=CatStrategy(1), pooling_strategy=Centroid())
Bases: Transformers
Class that produces sentence/token embeddings using any BERT model from Hugging Face.
| PARAMETER | DESCRIPTION |
|---|---|
| `model_name` | Name of the embeddings model to download or path where the model is stored locally |
| `vec_strategy` | Strategy which will be used to combine each output layer to obtain a single one |
| `pooling_strategy` | Strategy which will be used to combine the embedding representation of each token into a single one, representing the embedding of the whole sentence |
Source code in clayrs/content_analyzer/embeddings/embedding_loader/transformer.py
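A sketch of constructing the BERT source with the default strategies from the signature spelled out; whether `CatStrategy` and `Centroid` are exposed under the `ca` namespace is an assumption here.

```python
from clayrs import content_analyzer as ca

# Sketch: BertTransformers with its defaults made explicit.
# CatStrategy(1) combines the output layers (here, the last one);
# Centroid() averages token embeddings when a whole-sentence vector is needed.
bert_source = ca.BertTransformers(
    model_name='bert-base-uncased',
    vec_strategy=ca.CatStrategy(1),    # assumed exposed via `ca`
    pooling_strategy=ca.Centroid()     # assumed exposed via `ca`
)

# Use it as the sentence-granularity source for token-level embeddings
ca.Sentence2WordEmbedding(embedding_source=bert_source)
```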
T5Transformers(model_name='t5-small', vec_strategy=CatStrategy(1), pooling_strategy=Centroid())
Bases: Transformers
Class that produces sentence/token embeddings using any T5 model from Hugging Face.
| PARAMETER | DESCRIPTION |
|---|---|
| `model_name` | Name of the embeddings model to download or path where the model is stored locally |
| `vec_strategy` | Strategy which will be used to combine each output layer to obtain a single one |
| `pooling_strategy` | Strategy which will be used to combine the embedding representation of each token into a single one, representing the embedding of the whole sentence |
Source code in clayrs/content_analyzer/embeddings/embedding_loader/transformer.py
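The same pattern works with the T5 source ('t5-small' is the default from the signature above); this is a sketch, not a prescribed configuration.

```python
from clayrs import content_analyzer as ca

# Sketch: token-level embeddings from a T5 source instead of BERT
ca.Sentence2WordEmbedding(
    embedding_source=ca.T5Transformers('t5-small')
)
```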