Combine Embeddings

The classes below let you obtain embeddings of coarser granularity from models that return embeddings of finer granularity (e.g. sentence embeddings from a model that returns word embeddings).

from clayrs import content_analyzer as ca

# obtain sentence embeddings by combining token embeddings
# with a centroid technique
ca.Word2SentenceEmbedding(embedding_source=ca.Gensim('glove-twitter-50'),
                          combining_technique=ca.Centroid())
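To make the idea concrete, here is a minimal NumPy sketch of combining word-level vectors into a single sentence-level vector. It is not ClayRS code: the `toy_vectors` lookup table and the `sentence_embedding` helper are hypothetical stand-ins for a real embedding source, and whitespace tokenisation is an assumption made only for brevity.

```python
import numpy as np

# Hypothetical toy lookup table standing in for a real word embedding model
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 4.0, 0.0]),
}


def sentence_embedding(sentence, vectors, combine):
    """Stack one vector per token into a matrix, then reduce it to one vector."""
    tokens = sentence.lower().split()  # naive whitespace tokenisation (assumption)
    matrix = np.vstack([vectors[token] for token in tokens])  # shape: (n_tokens, dim)
    return combine(matrix)


# Centroid-style combination: mean over the token axis
centroid = sentence_embedding("Good movie", toy_vectors,
                              lambda m: np.nanmean(m, axis=0))
print(centroid)  # [2. 2. 1.]
```

Any function that maps a `(n_tokens, dim)` matrix to a `(dim,)` vector can play the role of `combine` here, which is the job the combining techniques documented further down perform.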

`Word2SentenceEmbedding(embedding_source, combining_technique)`

Bases: CombiningSentenceEmbeddingTechnique

Class that makes use of a word granularity embedding source to produce sentence embeddings

PARAMETER DESCRIPTION

embedding_source

Any WordEmbedding model

TYPE: Union[WordEmbeddingLoader, WordEmbeddingLearner, str]

combining_technique

Technique used to combine embeddings of finer granularity (word-level) to obtain embeddings of coarser granularity (sentence-level)

TYPE: CombiningTechnique

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py

def __init__(self, embedding_source: Union[WordEmbeddingLoader, WordEmbeddingLearner, str],
             combining_technique: CombiningTechnique):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, WordEmbeddingLoader)
    super().__init__(embedding_source, combining_technique)

`Word2DocEmbedding(embedding_source, combining_technique)`

Bases: CombiningDocumentEmbeddingTechnique

Class that makes use of a word granularity embedding source to produce embeddings of document granularity

PARAMETER DESCRIPTION

embedding_source

Any WordEmbedding model

TYPE: Union[WordEmbeddingLoader, WordEmbeddingLearner, str]

combining_technique

Technique used to combine embeddings of finer granularity (word-level) to obtain embeddings of coarser granularity (doc-level)

TYPE: CombiningTechnique

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py

def __init__(self, embedding_source: Union[WordEmbeddingLoader, WordEmbeddingLearner, str],
             combining_technique: CombiningTechnique):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, WordEmbeddingLoader)
    super().__init__(embedding_source, combining_technique)

`Sentence2DocEmbedding(embedding_source, combining_technique)`

Bases: CombiningDocumentEmbeddingTechnique

Class that makes use of a sentence granularity embedding source to produce embeddings of document granularity

PARAMETER DESCRIPTION

embedding_source

Any SentenceEmbedding model

TYPE: Union[SentenceEmbeddingLoader, SentenceEmbeddingLearner, str]

combining_technique

Technique used to combine embeddings of finer granularity (sentence-level) to obtain embeddings of coarser granularity (doc-level)

TYPE: CombiningTechnique

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py

def __init__(self, embedding_source: Union[SentenceEmbeddingLoader, SentenceEmbeddingLearner, str],
             combining_technique: CombiningTechnique):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, SentenceEmbeddingLoader)
    super().__init__(embedding_source, combining_technique)

Combining Techniques

`Centroid`

Bases: CombiningTechnique

This class computes the centroid vector of a matrix.

`combine(embedding_matrix)`

Calculates the centroid of the input matrix

PARAMETER DESCRIPTION

embedding_matrix

Two-dimensional NumPy array whose rows are words and whose columns are hidden dimensions; its centroid will be calculated

TYPE: np.ndarray

RETURNS	DESCRIPTION
`np.ndarray`	Centroid vector of the input matrix

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py

def combine(self, embedding_matrix: np.ndarray) -> np.ndarray:
    """
    Calculates the centroid of the input matrix

    Args:
        embedding_matrix: two-dimensional NumPy array whose rows are words and whose columns are
            hidden dimensions; its centroid will be calculated

    Returns:
        Centroid vector of the input matrix
    """
    return np.nanmean(embedding_matrix, axis=0)
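Because the centroid is computed with `np.nanmean`, NaN entries are skipped rather than propagated. The sketch below illustrates this on a small hand-made matrix; the values are invented for illustration, and the all-NaN row is only an assumed example of how a missing embedding might look.

```python
import numpy as np

# Three "word" rows, two hidden dimensions; the last row is entirely NaN
embedding_matrix = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [np.nan, np.nan],
])

# Column-wise mean that ignores NaN entries
centroid = np.nanmean(embedding_matrix, axis=0)

# A plain mean, by contrast, is poisoned by the NaN row
plain_mean = np.mean(embedding_matrix, axis=0)

print(centroid)    # [2. 4.]
print(plain_mean)  # [nan nan]
```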

`Sum`

Bases: CombiningTechnique

This class computes the sum vector of a matrix.

`combine(embedding_matrix)`

Calculates the sum vector of the input matrix

PARAMETER DESCRIPTION

embedding_matrix

Two-dimensional NumPy array whose rows are words and whose columns are hidden dimensions; its sum vector will be calculated

TYPE: np.ndarray

RETURNS	DESCRIPTION
`np.ndarray`	Sum vector of the input matrix

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py

def combine(self, embedding_matrix: np.ndarray) -> np.ndarray:
    """
    Calculates the sum vector of the input matrix

    Args:
        embedding_matrix: two-dimensional NumPy array whose rows are words and whose columns are
            hidden dimensions; its sum vector will be calculated

    Returns:
        Sum vector of the input matrix
    """
    return np.sum(embedding_matrix, axis=0)
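A short NumPy sketch of the same reduction on invented data. It also shows two properties that follow directly from the use of `np.sum`: the result scales with the number of rows (it equals the centroid multiplied by the row count), and a single NaN entry propagates to the output.

```python
import numpy as np

embedding_matrix = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [5.0, 1.0],
])

# Column-wise sum over the word axis
summed = np.sum(embedding_matrix, axis=0)

# For a NaN-free matrix, the sum is the centroid times the number of rows
centroid = np.nanmean(embedding_matrix, axis=0)
n_rows = embedding_matrix.shape[0]

# Unlike np.nanmean, np.sum does not skip NaN entries
with_nan = embedding_matrix.copy()
with_nan[0, 0] = np.nan
summed_with_nan = np.sum(with_nan, axis=0)

print(summed)           # [9. 9.]
print(summed_with_nan)  # [nan  9.]
```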

`SingleToken(token_index)`

Bases: CombiningTechnique

Class which takes a specific row as representative of the whole matrix

PARAMETER DESCRIPTION

token_index

Index of the matrix row to take

TYPE: int

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py

def __init__(self, token_index: int):
    self.token_index = token_index
    super().__init__()

`combine(embedding_matrix)`

Takes the row with index `token_index` (set in the constructor) from the input `embedding_matrix`

PARAMETER DESCRIPTION

embedding_matrix

Two-dimensional NumPy array whose rows are words and whose columns are hidden dimensions, from which the single token will be extracted

TYPE: np.ndarray

RETURNS	DESCRIPTION
`np.ndarray`	Single row as representative of the whole matrix

RAISES	DESCRIPTION
`IndexError`	Exception raised when `token_index` (set in the constructor) is out of bounds for the input matrix

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py

def combine(self, embedding_matrix: np.ndarray) -> np.ndarray:
    """
    Takes the row with index `token_index` (set in the constructor) from the input `embedding_matrix`

    Args:
        embedding_matrix: two-dimensional NumPy array whose rows are words and whose columns are
            hidden dimensions, from which the single token will be extracted

    Returns:
        Single row as representative of the whole matrix

    Raises:
        IndexError: Exception raised when `token_index` (set in the constructor) is out of bounds for the input
            matrix
    """
    try:
        sentence_embedding = embedding_matrix[self.token_index]
    except IndexError:
        # shape[0] is the number of rows (embeddings); shape[1] would be the hidden dimension
        raise IndexError(f'The embedding matrix has {embedding_matrix.shape[0]} '
                         f'embeddings but you tried to take the one at index {self.token_index}')
    return sentence_embedding
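The row selection can be illustrated with plain NumPy indexing. The `take_row` function below is a hypothetical standalone helper written for this example, not part of ClayRS; it assumes ordinary NumPy index semantics, so negative indices count from the end of the matrix.

```python
import numpy as np


def take_row(embedding_matrix, token_index):
    """Return one row of the matrix as representative of the whole matrix."""
    try:
        return embedding_matrix[token_index]
    except IndexError:
        # shape[0] is the number of rows, i.e. the number of embeddings
        raise IndexError(
            f"The embedding matrix has {embedding_matrix.shape[0]} embeddings "
            f"but index {token_index} was requested"
        ) from None


matrix = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

first = take_row(matrix, 0)   # first row
last = take_row(matrix, -1)   # last row, via negative indexing

try:
    take_row(matrix, 3)       # only indices 0..2 (or -3..-1) are valid
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(first, last, out_of_bounds)
```

Taking a single row discards every other row, so this technique only makes sense when one position is known to carry the information of interest.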