Index interface

IndexInterface(directory)

Bases: TextInterface

Abstract class that takes care of serializing and deserializing text in an indexed structure using the Whoosh library

PARAMETER DESCRIPTION
directory

Path of the directory where the content will be serialized

TYPE: str

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def __init__(self, directory: str):
    super().__init__(directory)
    self.__doc = None  # document that is currently being created and will be added to the index
    self.__writer = None  # index writer
    self.__doc_index = 0  # current position the document will have in the index once it is serialized
    self.__schema_changed = False  # true if the schema has been changed, false otherwise

schema_type property abstractmethod

Whoosh uses a Schema that defines, for each field of the content, how to store the data. In the case of this project, every field will have the same structure and will share the same field type. This method returns said field type.

get_field(field_name, content_id)

Uses a search index to retrieve the content corresponding to the content_id (if it is a string) or in the corresponding position (if it is an integer), and returns the data in the field corresponding to the field_name

PARAMETER DESCRIPTION
field_name

name of the field from which the data will be retrieved

TYPE: str

content_id

either the position or Id of the content that contains the specified field

TYPE: Union[str, int]

RETURNS DESCRIPTION
str

Data contained in the field of the content

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def get_field(self, field_name: str, content_id: Union[str, int]) -> str:
    """
    Uses a search index to retrieve the content corresponding to the content_id (if it is a string) or in the
    corresponding position (if it is an integer), and returns the data in the field corresponding to the field_name

    Args:
        field_name (str): name of the field from which the data will be retrieved
        content_id (Union[str, int]): either the position or Id of the content that contains the specified field

    Returns:
        Data contained in the field of the content
    """
    ix = open_dir(self.directory)
    with ix.searcher() as searcher:
        if isinstance(content_id, str):
            query = Term("content_id", content_id)
            result = searcher.search(query)
            result = result[0][field_name]
        elif isinstance(content_id, int):
            result = searcher.reader().stored_fields(content_id)[field_name]
        return result
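The dual lookup (by string id or by integer position) can be illustrated with a plain-Python stand-in for the stored index; the movie ids and fields below are hypothetical examples, not data from the library:

```python
from typing import Union

# Toy stand-in for the stored documents of an index: the position in the
# list corresponds to the document's position in the index
toy_index = [
    {"content_id": "tt0114709", "Plot": "a cowboy doll is profoundly threatened"},
    {"content_id": "tt0113497", "Plot": "siblings discover a magical board game"},
]

def get_field_toy(field_name: str, content_id: Union[str, int]) -> str:
    # A string content_id is resolved by matching the "content_id" field,
    # while an integer is used directly as a position in the index
    if isinstance(content_id, str):
        doc = next(d for d in toy_index if d["content_id"] == content_id)
    else:
        doc = toy_index[content_id]
    return doc[field_name]
```

With this toy data, `get_field_toy("Plot", "tt0113497")` and `get_field_toy("Plot", 1)` retrieve the same content.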

get_tf_idf(field_name, content_id)

Calculates the tf-idf for the words contained in the field of the content whose id is content_id (if it is a string) or in the given position (if it is an integer).

The tf-idf computation formula is:

\[ tf \mbox{-} idf = (1 + \log_{10}(tf)) \cdot \log_{10}\left(\frac{N}{df}\right) \]

where \(tf\) is the frequency of the term in the field, \(N\) is the number of documents in the index and \(df\) is the number of documents containing the term.
PARAMETER DESCRIPTION
field_name

Name of the field containing the words for which the tf-idf will be calculated

TYPE: str

content_id

either the position or Id of the content that contains the specified field

TYPE: Union[str, int]

RETURNS DESCRIPTION
words_bag

Dictionary whose keys are the words contained in the field, and the corresponding values are the tf-idf values

TYPE: Dict[str, float]

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def get_tf_idf(self, field_name: str, content_id: Union[str, int]) -> Dict[str, float]:
    r"""
    Calculates the tf-idf for the words contained in the field of the content whose id
    is content_id (if it is a string) or in the given position (if it is an integer).

    The tf-idf computation formula is:

    $$
    tf \mbox{-} idf = (1 + log10(tf)) * log10(idf)
    $$

    Args:
        field_name: Name of the field containing the words for which calculate the tf-idf
        content_id: either the position or Id of the content that contains the specified field

    Returns:
        words_bag: Dictionary whose keys are the words contained in the field, and the
            corresponding values are the tf-idf values
    """
    ix = open_dir(self.directory)
    words_bag = {}
    with ix.searcher() as searcher:
        if isinstance(content_id, str):
            query = Term("content_id", content_id)
            doc_num = searcher.search(query).docnum(0)
        elif isinstance(content_id, int):
            doc_num = content_id

        # if the document has the field == "" (length == 0) then the bag of word is empty
        if len(searcher.ixreader.stored_fields(doc_num)[field_name]) > 0:
            # retrieves the frequency vector (used for tf)
            list_with_freq = [term_with_freq for term_with_freq
                              in searcher.vector(doc_num, field_name).items_as("frequency")]
            for term, freq in list_with_freq:
                tf = 1 + math.log10(freq)
                idf = math.log10(searcher.doc_count()/searcher.doc_frequency(field_name, term))
                words_bag[term] = tf*idf
    return words_bag
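The loop above only needs three statistics from the index: the term frequencies of the field, the per-term document frequencies, and the total document count. A minimal sketch with made-up numbers for a toy 4-document index (no Whoosh required):

```python
import math

# Illustrative stand-in for the statistics get_tf_idf reads from Whoosh:
# term frequencies of one document's field, per-term document frequencies,
# and the total number of documents in a toy 4-document index
field_term_freqs = {"comedy": 2, "romance": 1}
doc_frequency = {"comedy": 2, "romance": 4}
doc_count = 4

words_bag = {}
for term, freq in field_term_freqs.items():
    tf = 1 + math.log10(freq)                         # 1 + log10(term frequency)
    idf = math.log10(doc_count / doc_frequency[term])  # log10(N / df)
    words_bag[term] = tf * idf

# "romance" appears in every document, so its idf (and hence tf-idf) is 0;
# "comedy" scores (1 + log10(2)) * log10(2) ≈ 0.392
```

Note how a term occurring in every document is scored 0: it carries no discriminative information.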

init_writing(delete_old=False)

Creates the index locally (in the directory passed to the constructor) and initializes the index writer. If an index already exists in the directory, what happens depends on the delete_old argument

PARAMETER DESCRIPTION
delete_old

if True, the index that was in the same directory is destroyed and replaced; if False, the index is simply opened

TYPE: bool DEFAULT: False

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def init_writing(self, delete_old: bool = False):
    """
    Creates the index locally (in the directory passed in the constructor) and initializes the index writer.
    If an index already exists in the directory, what happens depend on the attribute delete_old passed as argument

    Args:
        delete_old (bool): if True, the index that was in the same directory is destroyed and replaced;
            if False, the index is simply opened
    """
    if os.path.exists(self.directory):
        if delete_old:
            self.delete()
            os.mkdir(self.directory)
            ix = create_in(self.directory, Schema())
            self.__writer = ix.writer()
        else:
            ix = open_dir(self.directory)
            self.__writer = ix.writer()
            self.__doc_index = self.__writer.reader().doc_count()
    else:
        os.mkdir(self.directory)
        ix = create_in(self.directory, Schema())
        self.__writer = ix.writer()

new_content()

The new content is a document that will be indexed. In this case the document is a dictionary with the name of the field as key and the data inside the field as value

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def new_content(self):
    """
    The new content is a document that will be indexed. In this case the document is a dictionary with
    the name of the field as key and the data inside the field as value
    """
    self.__doc = {}

new_field(field_name, field_data)

Adds a new field to the document that is being created. Since the index Schema is generated dynamically, if the field name is not in the Schema already it is added to it

PARAMETER DESCRIPTION
field_name

Name of the new field

TYPE: str

field_data

Data to put into the field

TYPE: object

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def new_field(self, field_name: str, field_data: object):
    """
    Adds a new field to the document that is being created. Since the index Schema is generated dynamically, if
    the field name is not in the Schema already it is added to it

    Args:
        field_name (str): Name of the new field
        field_data (object): Data to put into the field
    """
    if field_name not in open_dir(self.directory).schema.names():
        self.__writer.add_field(field_name, self.schema_type)
        self.__schema_changed = True
    self.__doc[field_name] = field_data

query(string_query, results_number, mask_list=None, candidate_list=None, classic_similarity=True)

Queries the index in order to retrieve specific contents, using a query expressed as a string

PARAMETER DESCRIPTION
string_query

query expressed as a string

TYPE: str

results_number

number of results the searcher will return for the query

TYPE: int

mask_list

list of content_ids of items to ignore in the search process

TYPE: list DEFAULT: None

candidate_list

list of content_ids of items to consider in the search process; if it is not None, only items in the list will be considered

TYPE: list DEFAULT: None

classic_similarity

if True, classic tf idf is used for scoring, otherwise BM25F is used

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
results

the final results dictionary containing the results found by the search index for the query. The dictionary will be in the following form:

{content_id: {"item": item_dictionary, "score": item_score}, ...}

content_id is the id of the corresponding item. item_dictionary is the dictionary of the item, with field names as keys and field contents as values, so it will be in the following form:

{"Plot": "this is the plot", "Genre": "this is the Genre"}

item_dictionary will not contain the content_id, since it is already used as the key of the outer dictionary. item_score is the score the index searcher assigned to the item for the query

TYPE: dict

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def query(self, string_query: str, results_number: int, mask_list: list = None,
          candidate_list: list = None, classic_similarity: bool = True) -> dict:
    """
    Uses a search index to query the index in order to retrieve specific contents using a query expressed in string
    form

    Args:
        string_query: query expressed as a string
        results_number: number of results the searcher will return for the query
        mask_list: list of content_ids of items to ignore in the search process
        candidate_list: list of content_ids of items to consider in the search process,
            if it is not None only items in the list will be considered
        classic_similarity: if True, classic tf idf is used for scoring, otherwise BM25F is used

    Returns:
        results: the final results dictionary containing the results found from the search index for the
            query. The dictionary will be in the following form:

                {content_id: {"item": item_dictionary, "score": item_score}, ...}

            content_id is the content_id for the corresponding item
            item_dictionary is the dictionary of the item containing the fields as keys and the contents as values.
            So it will be in the following form:

                {"Plot": "this is the plot", "Genre": "this is the Genre"}

            The item_dictionary will not contain the content_id since it is already defined and used as key of the
            external dictionary
            items_score is the score given to the item for the query by the index searcher
    """
    ix = open_dir(self.directory)
    with ix.searcher(weighting=TF_IDF if classic_similarity else BM25F) as searcher:
        candidate_query_list = None
        mask_query_list = None

        # the mask list contains the content_id for the items to ignore in the searching process
        # from the mask list a mask query is created and it will be used by the searcher
        if mask_list is not None:
            mask_query_list = []
            for document in mask_list:
                mask_query_list.append(Term("content_id", document))
            mask_query_list = Or(mask_query_list)

        # the candidate list contains the content_id for the items to consider in the searching process
        # from the candidate list a candidate query is created and it will be used by the searcher
        if candidate_list is not None:
            candidate_query_list = []
            for candidate in candidate_list:
                candidate_query_list.append(Term("content_id", candidate))
            candidate_query_list = Or(candidate_query_list)

        schema = ix.schema
        parser = QueryParser("content_id", schema=schema, group=OrGroup)
        # regular expression to match the possible field styles
        # examples: "content_id" or "Genre#2" or "Genre#2#custom_id"
        parser.add_plugin(FieldsPlugin(r'(?P<text>[\w-]+(\#[\w-]+(\#[\w-]+)?)?|[*]):'))
        query = parser.parse(string_query)
        score_docs = \
            searcher.search(query, limit=results_number, filter=candidate_query_list, mask=mask_query_list)

        # creation of the results dictionary, This phase is necessary because the Hit objects returned by the
        # searcher as results need the reader inside the search index in order to return information
        # so it would be impossible to access a field or the score of the item from outside this method
        # because of that this dictionary containing the most important infos is created
        results = {}
        for hit in score_docs:
            hit_dict = dict(hit)
            content_id = hit_dict.pop("content_id")
            results[content_id] = {}
            results[content_id]["item"] = hit_dict
            results[content_id]["score"] = hit.score
        return results
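Since the returned dictionary is detached from the Whoosh searcher, it can be consumed with ordinary dictionary operations. A sketch using a hypothetical results dictionary in the shape described above (the ids, fields and scores are invented):

```python
# Hypothetical results dictionary, in the shape returned by query()
results = {
    "tt0114709": {"item": {"Plot": "a cowboy doll ...", "Genre": "Animation"},
                  "score": 4.2},
    "tt0113497": {"item": {"Plot": "siblings discover ...", "Genre": "Fantasy"},
                  "score": 2.7},
}

# Rank content ids by descending score and read a field of the best hit
ranking = sorted(results, key=lambda cid: results[cid]["score"], reverse=True)
best_genre = results[ranking[0]]["item"]["Genre"]   # "Animation"
```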

serialize_content()

Serializes the content in the index. If the schema changed, the writer commits the schema changes before adding the document to the index. Once the document is indexed, it is deleted from the IndexInterface and its position in the index is returned

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def serialize_content(self) -> int:
    """
    Serializes the content in the index. If the schema changed, the writer will commit the changes to the schema
    before adding the document to the index. Once the document is indexed, it can be deleted from the IndexInterface
    and the document position in the index is returned
    """
    if self.__schema_changed:
        self.__writer.commit()
        self.__writer = open_dir(self.directory).writer()
        self.__schema_changed = False
    self.__writer.add_document(**self.__doc)
    del self.__doc
    self.__doc_index += 1
    return self.__doc_index - 1

stop_writing()

Stops the index writer and commits the operations

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def stop_writing(self):
    """
    Stops the index writer and commits the operations
    """
    self.__writer.commit()
    del self.__writer
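The methods above form a write protocol: init_writing, then for each content new_content / new_field / serialize_content, and finally stop_writing. A minimal in-memory sketch of that protocol (a stand-in that mimics the call sequence, not the real Whoosh-backed class):

```python
class InMemoryIndexSketch:
    """Minimal in-memory stand-in mimicking the IndexInterface write protocol."""

    def __init__(self):
        self._docs = []    # serialized documents, in index order
        self._doc = None   # document currently being built

    def init_writing(self, delete_old: bool = False):
        if delete_old:
            self._docs = []   # mirrors destroying and recreating the index

    def new_content(self):
        self._doc = {}

    def new_field(self, field_name: str, field_data: object):
        self._doc[field_name] = field_data

    def serialize_content(self) -> int:
        self._docs.append(self._doc)
        self._doc = None
        return len(self._docs) - 1   # position of the document in the index

    def stop_writing(self):
        pass  # the real class commits the Whoosh writer here

# Typical call sequence, one document per content (hypothetical data)
index = InMemoryIndexSketch()
index.init_writing(delete_old=True)
index.new_content()
index.new_field("content_id", "tt0114709")
index.new_field("Plot", "a cowboy doll is profoundly threatened")
pos = index.serialize_content()   # 0 for the first document
index.stop_writing()
```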

KeywordIndex(directory)

Bases: IndexInterface

This class implements the schema_type property with the KEYWORD field type, which splits the indexed text into a list of tokens. A frequency vector is also stored so that the tf computation is possible. The commas option is set to True to handle "content_id" field data containing white spaces

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def __init__(self, directory: str):
    super().__init__(directory)

SearchIndex(directory)

Bases: IndexInterface

This class implements the schema_type property with the TEXT field type. By using a SimpleAnalyzer for the field, the data is kept as close to the original as possible

Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
def __init__(self, directory: str):
    super().__init__(directory)
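Assuming, as the descriptions above suggest, that KeywordIndex uses a KEYWORD field with commas=True and SearchIndex a TEXT field with Whoosh's SimpleAnalyzer (regex word tokens, lowercased), the difference in tokenization can be sketched in plain Python, without Whoosh:

```python
import re

def keyword_tokens(text: str):
    # KEYWORD-style field with commas=True: each comma-separated entry is one
    # token, so multi-word entries such as "Science Fiction" stay intact
    return [t.strip() for t in text.split(",") if t.strip()]

def simple_text_tokens(text: str):
    # SimpleAnalyzer-style TEXT field: lowercase word tokens
    return re.findall(r"\w+", text.lower())

keyword_tokens("Comedy, Science Fiction")      # ['Comedy', 'Science Fiction']
simple_text_tokens("Comedy, Science Fiction")  # ['comedy', 'science', 'fiction']
```

This is why KeywordIndex suits list-like fields (genres, tags), while SearchIndex suits free text meant for full-text search.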