Combine Embeddings

With the following classes you can obtain embeddings of coarser granularity from models that return embeddings of finer granularity (e.g. sentence embeddings from a model that returns word embeddings):

from clayrs import content_analyzer as ca

# obtain sentence embeddings combining token embeddings with a 
# centroid technique
ca.Word2SentenceEmbedding(embedding_source=ca.Gensim('glove-twitter-50'),
                          combining_technique=ca.Centroid())

Word2SentenceEmbedding(embedding_source, combining_technique)

Bases: CombiningSentenceEmbeddingTechnique

Class that makes use of a word granularity embedding source to produce sentence embeddings

PARAMETER DESCRIPTION
embedding_source

Any WordEmbedding model

TYPE: Union[WordEmbeddingLoader, WordEmbeddingLearner, str]

combining_technique

Technique used to combine embeddings of finer granularity (word-level) to obtain embeddings of coarser granularity (sentence-level)

TYPE: CombiningTechnique

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
def __init__(self, embedding_source: Union[WordEmbeddingLoader, WordEmbeddingLearner, str],
             combining_technique: CombiningTechnique):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, WordEmbeddingLoader)
    super().__init__(embedding_source, combining_technique)

Word2DocEmbedding(embedding_source, combining_technique)

Bases: CombiningDocumentEmbeddingTechnique

Class that makes use of a word granularity embedding source to produce embeddings of document granularity

PARAMETER DESCRIPTION
embedding_source

Any WordEmbedding model

TYPE: Union[WordEmbeddingLoader, WordEmbeddingLearner, str]

combining_technique

Technique used to combine embeddings of finer granularity (word-level) to obtain embeddings of coarser granularity (doc-level)

TYPE: CombiningTechnique

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
def __init__(self, embedding_source: Union[WordEmbeddingLoader, WordEmbeddingLearner, str],
             combining_technique: CombiningTechnique):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, WordEmbeddingLoader)
    super().__init__(embedding_source, combining_technique)
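
To illustrate what combining word-level embeddings into a document-level embedding means, here is a minimal NumPy sketch. It does not use ClayRS itself: `toy_vectors` and `word_to_doc_embedding` are hypothetical names invented for this example, and the whitespace tokenization is an assumption, not the library's preprocessing.

```python
import numpy as np

# Hypothetical toy word vectors (3 dimensions each); a real word embedding
# source would supply these instead.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 4.0, 0.0]),
    "plot": np.array([2.0, 2.0, 1.0]),
}


def word_to_doc_embedding(text, combine):
    """Stack the vectors of the known words in `text`, then combine them.

    Assumes at least one word of `text` is present in `toy_vectors`.
    """
    tokens = text.lower().split()
    rows = [toy_vectors[t] for t in tokens if t in toy_vectors]
    embedding_matrix = np.vstack(rows)  # shape: (n_words, n_dimensions)
    return combine(embedding_matrix)


# Combine with a centroid (column-wise mean), as the Centroid technique does
doc_vector = word_to_doc_embedding("Good movie good plot",
                                   combine=lambda m: np.nanmean(m, axis=0))
print(doc_vector)
```

Whatever the number of words in the text, the result is a single vector with as many components as the word vectors have dimensions.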

Sentence2DocEmbedding(embedding_source, combining_technique)

Bases: CombiningDocumentEmbeddingTechnique

Class that makes use of a sentence granularity embedding source to produce embeddings of document granularity

PARAMETER DESCRIPTION
embedding_source

Any SentenceEmbedding model

TYPE: Union[SentenceEmbeddingLoader, SentenceEmbeddingLearner, str]

combining_technique

Technique used to combine embeddings of finer granularity (sentence-level) to obtain embeddings of coarser granularity (doc-level)

TYPE: CombiningTechnique

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/embedding_technique.py
def __init__(self, embedding_source: Union[SentenceEmbeddingLoader, SentenceEmbeddingLearner, str],
             combining_technique: CombiningTechnique):
    # if isinstance(embedding_source, str):
    #     embedding_source = self.from_str_to_embedding_source(embedding_source, SentenceEmbeddingLoader)
    super().__init__(embedding_source, combining_technique)
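
The same idea applies one level up: a matrix with one row per sentence is reduced to a single document vector. The sketch below uses plain NumPy on a made-up `sentence_matrix` (an assumption for illustration, not output of a real sentence embedding model) and applies the three reductions documented on this page.

```python
import numpy as np

# Hypothetical sentence embeddings for a three-sentence document:
# one row per sentence, 4 dimensions each.
sentence_matrix = np.array([
    [0.5, 1.0, 0.0, 2.0],
    [1.5, 0.0, 1.0, 0.0],
    [1.0, 2.0, 2.0, 1.0],
])

doc_centroid = np.nanmean(sentence_matrix, axis=0)  # column-wise mean
doc_sum = np.sum(sentence_matrix, axis=0)           # column-wise sum
doc_first = sentence_matrix[0]                      # first sentence only

print(doc_centroid)  # [1. 1. 1. 1.]
print(doc_sum)       # [3. 3. 3. 3.]
print(doc_first)     # [0.5 1.  0.  2. ]
```

Each reduction returns a vector of shape `(4,)`, so the document embedding always has the dimensionality of the sentence embeddings.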

Combining Techniques

Centroid

Bases: CombiningTechnique

This class computes the centroid vector of a matrix.

combine(embedding_matrix)

Calculates the centroid of the input matrix

PARAMETER DESCRIPTION
embedding_matrix

Two-dimensional NumPy array whose rows are word embeddings and whose columns are the hidden dimensions; its centroid will be calculated

TYPE: np.ndarray

RETURNS DESCRIPTION
np.ndarray

Centroid vector of the input matrix

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py
def combine(self, embedding_matrix: np.ndarray) -> np.ndarray:
    """
    Calculates the centroid of the input matrix

    Args:
        embedding_matrix: two-dimensional NumPy array whose rows are word embeddings and whose columns
            are the hidden dimensions; its centroid will be calculated

    Returns:
        Centroid vector of the input matrix
    """
    return np.nanmean(embedding_matrix, axis=0)
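
Because the centroid is computed with `np.nanmean`, NaN entries are ignored rather than propagated. The sketch below shows the difference from a plain mean on a small hand-made matrix. Whether ClayRS ever produces NaN rows (for instance for words without a vector) is an assumption made only for this illustration.

```python
import numpy as np

embedding_matrix = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [np.nan, np.nan, np.nan],  # assumed: a row with no usable values
])

# Same call as in Centroid.combine: NaN entries are skipped per column
centroid = np.nanmean(embedding_matrix, axis=0)

# A plain mean lets a single NaN poison every column it appears in
plain_mean = np.mean(embedding_matrix, axis=0)

print(centroid)    # [2. 3. 4.]
print(plain_mean)  # [nan nan nan]
```

Note that if a column contains only NaN values, `np.nanmean` returns NaN for that column and emits a `RuntimeWarning`.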

Sum

Bases: CombiningTechnique

This class computes the sum vector of a matrix.

combine(embedding_matrix)

Calculates the sum vector of the input matrix

PARAMETER DESCRIPTION
embedding_matrix

Two-dimensional NumPy array whose rows are word embeddings and whose columns are the hidden dimensions; its sum vector will be calculated

TYPE: np.ndarray

RETURNS DESCRIPTION
np.ndarray

Sum vector of the input matrix

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py
def combine(self, embedding_matrix: np.ndarray) -> np.ndarray:
    """
    Calculates the sum vector of the input matrix

    Args:
        embedding_matrix: two-dimensional NumPy array whose rows are word embeddings and whose columns
            are the hidden dimensions; its sum vector will be calculated

    Returns:
        Sum vector of the input matrix
    """
    return np.sum(embedding_matrix, axis=0)
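
A short NumPy sketch of how the sum vector relates to the centroid, using a small hand-made matrix (the values are arbitrary). It also shows that `np.sum`, unlike `np.nanmean`, propagates NaN values.

```python
import numpy as np

embedding_matrix = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])

# Same call as in Sum.combine
sum_vector = np.sum(embedding_matrix, axis=0)
centroid = np.nanmean(embedding_matrix, axis=0)

print(sum_vector)  # [ 9. 12.]
print(centroid)    # [3. 4.]

# Without NaN values, the sum is the centroid scaled by the number of rows
n_rows = embedding_matrix.shape[0]
print(np.allclose(sum_vector, centroid * n_rows))  # True

# np.sum does not skip NaN values: one NaN makes the whole column NaN
with_nan = np.array([[1.0, 2.0], [np.nan, 4.0]])
sum_with_nan = np.sum(with_nan, axis=0)
print(sum_with_nan)  # [nan  6.]
```

Since the sum grows with the number of rows, longer texts yield vectors of larger magnitude, whereas the centroid is independent of text length.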

SingleToken(token_index)

Bases: CombiningTechnique

Class which takes a specific row as representative of the whole matrix

PARAMETER DESCRIPTION
token_index

index of the row of the matrix to take

TYPE: int

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py
def __init__(self, token_index: int):
    self.token_index = token_index
    super().__init__()

combine(embedding_matrix)

Takes the row with index token_index (set in the constructor) from the input embedding_matrix

PARAMETER DESCRIPTION
embedding_matrix

Two-dimensional NumPy array whose rows are word embeddings and whose columns are the hidden dimensions, from which the single token will be extracted

TYPE: np.ndarray

RETURNS DESCRIPTION
np.ndarray

Single row as representative of the whole matrix

RAISES DESCRIPTION
IndexError

Exception raised when token_index (set in the constructor) is out of bounds for the input matrix

Source code in clayrs/content_analyzer/field_content_production_techniques/embedding_technique/combining_technique.py
def combine(self, embedding_matrix: np.ndarray) -> np.ndarray:
    """
    Takes the row with index `token_index` (set in the constructor) from the input `embedding_matrix`

    Args:
        embedding_matrix: two-dimensional NumPy array whose rows are word embeddings and whose columns
            are the hidden dimensions, from which the single token will be extracted

    Returns:
        Single row as representative of the whole matrix

    Raises:
        IndexError: Exception raised when `token_index` (set in the constructor) is out of bounds for the input
            matrix
    """
    try:
        sentence_embedding = embedding_matrix[self.token_index]
    except IndexError:
        # rows are the embeddings, so their count is shape[0] (shape[1] is the hidden dimension)
        raise IndexError(f'The embedding matrix has {embedding_matrix.shape[0]} '
                         f'embeddings but you tried to take the {self.token_index+1}th')
    return sentence_embedding
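
The row selection above is ordinary NumPy indexing, so it can be sketched without ClayRS. The matrix below is hand-made for illustration. Since the index is used directly as `embedding_matrix[self.token_index]`, a negative index counts from the end, as usual in NumPy.

```python
import numpy as np

# Three embeddings (rows) with two hidden dimensions (columns)
embedding_matrix = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

first_row = embedding_matrix[0]   # what a token index of 0 would select
last_row = embedding_matrix[-1]   # a negative index counts from the end

print(first_row)  # [0.1 0.2]
print(last_row)   # [0.5 0.6]

# An index past the last row raises IndexError
try:
    embedding_matrix[3]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(out_of_bounds)  # True
```

The number of available rows is `embedding_matrix.shape[0]` (3 here), while `embedding_matrix.shape[1]` (2 here) is the size of each embedding.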