Index interface
IndexInterface(directory)
Bases: TextInterface
Abstract class that takes care of serializing and deserializing text in an indexed structure using the Whoosh library
PARAMETER | DESCRIPTION |
---|---|
directory | Path of the directory where the content will be serialized |
Source code in clayrs/content_analyzer/memory_interfaces/text_interface.py
schema_type (property, abstractmethod)
Whoosh uses a Schema that defines, for each field of the content, how to store the data. In the case of this project, every field will have the same structure and will share the same field type. This method returns said field type.
get_field(field_name, content_id)
Uses a search index to retrieve the content identified by content_id (if it is a string) or located at that position in the index (if it is an integer), and returns the data stored in the field named field_name.
PARAMETER | DESCRIPTION |
---|---|
field_name | Name of the field from which the data will be retrieved |
content_id | Either the position or the id of the content that contains the specified field |
RETURNS | DESCRIPTION |
---|---|
str | Data contained in the field of the content |
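The string-versus-integer lookup described for get_field can be illustrated without Whoosh. The sketch below is a plain-Python stand-in, not the library's implementation: `documents` and `get_field_sketch` are hypothetical names, and a list of dictionaries plays the role of the index.

```python
from typing import Dict, List, Union

# Hypothetical stand-in for an index: position in the list = position in the index.
documents: List[Dict[str, str]] = [
    {"content_id": "i1", "Plot": "a board game comes alive"},
    {"content_id": "i2", "Plot": "toys come to life"},
]


def get_field_sketch(field_name: str, content_id: Union[str, int]) -> str:
    """Return the data stored in field_name for the requested content."""
    if isinstance(content_id, str):
        # String: look the content up by its id.
        doc = next(d for d in documents if d["content_id"] == content_id)
    else:
        # Integer: treat the value as a position.
        doc = documents[content_id]
    return doc[field_name]
```

Calling `get_field_sketch("Plot", "i2")` and `get_field_sketch("Plot", 1)` reach the same document, once by id and once by position.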
get_tf_idf(field_name, content_id)
Calculates the tf-idf for the words contained in the field of the content whose id is content_id (if it is a string) or in the given position (if it is an integer).
The tf-idf computation formula is:
PARAMETER | DESCRIPTION |
---|---|
field_name | Name of the field containing the words for which to calculate the tf-idf |
content_id | Either the position or the id of the content that contains the specified field |
RETURNS | DESCRIPTION |
---|---|
words_bag | Dictionary whose keys are the words contained in the field, and whose values are the corresponding tf-idf values |
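A minimal sketch of how a words_bag of this shape could be computed. It assumes a log-scaled weighting, (1 + log10(tf)) * log10(N / df), where tf is the frequency of the term in the field, N the number of documents and df the number of documents containing the term. That weighting is an assumption made for illustration and may not match the formula this method applies; `tf_idf_sketch` and its arguments are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_sketch(doc_tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Map each word of doc_tokens to an (assumed) log-scaled tf-idf value.

    doc_tokens is expected to be one of the documents in corpus, so every
    term appears in at least one document.
    """
    n_docs = len(corpus)
    words_bag = {}
    for term, freq in Counter(doc_tokens).items():
        # Number of documents in the corpus that contain the term.
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        tf = 1 + math.log10(freq)
        idf = math.log10(n_docs / doc_freq)
        words_bag[term] = tf * idf
    return words_bag


corpus = [["toy", "story", "toy"], ["space", "story"]]
words_bag = tf_idf_sketch(corpus[0], corpus)
```

Under this weighting a word that occurs in every document, such as "story" here, gets a value of zero, while a word confined to one document is weighted up.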
init_writing(delete_old=False)
Creates the index locally (in the directory passed to the constructor) and initializes the index writer. If an index already exists in the directory, what happens depends on the delete_old argument.
PARAMETER | DESCRIPTION |
---|---|
delete_old | If True, the index that was in the same directory is destroyed and replaced; if False, the index is simply opened |
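The effect of delete_old can be pictured at the directory level. The sketch below only mimics that behavior with the standard library and does not touch Whoosh; `prepare_directory_sketch` is a hypothetical helper, and the file named "segment" is a placeholder for whatever an index would store.

```python
import os
import shutil
import tempfile


def prepare_directory_sketch(directory: str, delete_old: bool = False) -> bool:
    """Return True if a fresh directory was created, False if an existing one was reused."""
    if os.path.exists(directory):
        if not delete_old:
            # Existing content is kept and simply reopened.
            return False
        # Existing content is destroyed before being replaced.
        shutil.rmtree(directory)
    os.makedirs(directory)
    return True


base = tempfile.mkdtemp()
index_dir = os.path.join(base, "index")
first = prepare_directory_sketch(index_dir)            # nothing there yet: created
open(os.path.join(index_dir, "segment"), "w").close()  # pretend something was indexed
reopened = prepare_directory_sketch(index_dir)         # delete_old=False: content kept
kept = os.listdir(index_dir)
replaced = prepare_directory_sketch(index_dir, delete_old=True)  # wiped and recreated
left = os.listdir(index_dir)
shutil.rmtree(base)  # clean up the temporary directory
```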
new_content()
Creates the new content, that is, the document that will be indexed. The document is a dictionary that maps the name of each field to the data inside that field.
new_field(field_name, field_data)
Adds a new field to the document that is being created. Because the index Schema is generated dynamically, a field name that is not yet in the Schema is added to it.
PARAMETER | DESCRIPTION |
---|---|
field_name | Name of the new field |
field_data | Data to put into the field |
query(string_query, results_number, mask_list=None, candidate_list=None, classic_similarity=True)
Queries the search index to retrieve specific contents, using a query expressed in string form.
PARAMETER | DESCRIPTION |
---|---|
string_query | Query expressed as a string |
results_number | Number of results the searcher will return for the query |
mask_list | List of content_ids of items to ignore in the search process |
candidate_list | List of content_ids of items to consider in the search process; if it is not None, only items in the list are considered |
classic_similarity | If True, classic tf-idf is used for scoring; otherwise BM25F is used |
RETURNS | DESCRIPTION |
---|---|
results | Dictionary of the results found by the search index for the query, keyed by content_id. Each entry holds the item_dictionary and the items_score of one item. content_id is the content_id of the corresponding item. item_dictionary is the dictionary of the item, with the fields as keys and their contents as values; it does not contain the content_id, since that is already used as the key of the outer dictionary. items_score is the score given to the item for the query by the index searcher |
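To make the roles of mask_list, candidate_list and results_number concrete, here is a plain-Python sketch that filters a fixed list of already scored hits and arranges them by content_id. It is not the library's code: `hits` and `build_results_sketch` are hypothetical, the scores are made up, and the inner key names "item" and "score" are assumptions chosen for this example.

```python
from typing import Dict, List, Optional

# Hypothetical scored hits, as (content_id, item_dictionary, score) tuples,
# already sorted from best to worst score.
hits = [
    ("i1", {"Plot": "toys come to life", "Genre": "animation"}, 2.5),
    ("i2", {"Plot": "a board game comes alive", "Genre": "adventure"}, 1.7),
    ("i3", {"Plot": "a trip to space", "Genre": "adventure"}, 0.9),
]


def build_results_sketch(results_number: int,
                         mask_list: Optional[List[str]] = None,
                         candidate_list: Optional[List[str]] = None) -> Dict[str, dict]:
    """Filter the hits and arrange them in a dictionary keyed by content_id."""
    results: Dict[str, dict] = {}
    for content_id, item_dictionary, score in hits:
        if len(results) >= results_number:
            break  # enough results collected
        if mask_list is not None and content_id in mask_list:
            continue  # masked items are ignored
        if candidate_list is not None and content_id not in candidate_list:
            continue  # when a candidate list is given, only its items count
        # "item" and "score" are assumed key names, used only in this sketch.
        results[content_id] = {"item": item_dictionary, "score": score}
    return results
```

For instance, `build_results_sketch(2, mask_list=["i1"])` skips the best hit and returns the entries for "i2" and "i3".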
serialize_content()
Serializes the content in the index. If the schema changed, the writer commits the changes to the schema before adding the document to the index. Once the document is indexed, it can be deleted from the IndexInterface; the position of the document in the index is returned.
stop_writing()
Stops the index writer and commits the operations
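Read together, the writing methods suggest a call order: init_writing, then new_content and new_field for each document, then serialize_content, and finally stop_writing. The sketch below mirrors that order with an in-memory stand-in. `ToyIndexSketch` is a hypothetical class written for this example, it does not use Whoosh, and the call order is inferred from the method descriptions rather than confirmed by the source.

```python
from typing import Dict, List, Optional, Set


class ToyIndexSketch:
    """In-memory stand-in that mirrors the order of the writing calls."""

    def __init__(self) -> None:
        self.schema: Set[str] = set()  # grows as new field names appear
        self.documents: List[Dict[str, str]] = []
        self._doc: Optional[Dict[str, str]] = None
        self.writing = False

    def init_writing(self) -> None:
        self.writing = True

    def new_content(self) -> None:
        # Start an empty document: field name -> field data.
        self._doc = {}

    def new_field(self, field_name: str, field_data: str) -> None:
        # A field name that is not in the schema yet is added to it.
        self.schema.add(field_name)
        self._doc[field_name] = field_data

    def serialize_content(self) -> int:
        # Store the document, drop the working copy, return its position.
        self.documents.append(self._doc)
        self._doc = None
        return len(self.documents) - 1

    def stop_writing(self) -> None:
        self.writing = False


index = ToyIndexSketch()
index.init_writing()
index.new_content()
index.new_field("content_id", "i1")
index.new_field("Plot", "toys come to life")
position = index.serialize_content()
index.stop_writing()
```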
KeywordIndex(directory)
Bases: IndexInterface
This class implements schema_type with the KeyWord field type, which is useful for splitting the indexed text into a list of tokens. The Frequency vector is also added so that the tf calculation is possible. Commas is set to true to handle "content_id" field data that contains white spaces.
SearchIndex(directory)
Bases: IndexInterface
This class implements schema_type with the Text field type. By using a SimpleAnalyzer for the field, the data is kept as close to the original as possible.