Raw Source Wrappers

CSVFile(file_path, separator=',', has_header=True, encoding='utf-8-sig')

Bases: RawInformationSource

Wrapper for a CSV file. This class can read a CSV file in which entries are separated by a given separator (, by default). It can therefore also read a TSV file, for example, by specifying separator='\t'.

A CSV file typically has a header: in this case, each entry can be referenced by its column header. If the CSV file has no header, simply specify has_header=False: each entry can then be referenced by a string representing its positional index (e.g. '0' for the entry in the first position, '1' for the entry in the second position, etc.).

You can iterate over the whole content of the raw source with a simple for loop: each row will be returned as a dictionary whose keys are the column headers (or strings representing the positional indices, if the file has no header) and whose values are the entries.

Examples:

Consider the following CSV file with header

movie_id,movie_title,release_year
1,Jumanji,1995
2,Toy Story,1995

>>> file = CSVFile(csv_path)
>>> print(list(file))
[{'movie_id': '1', 'movie_title': 'Jumanji', 'release_year': '1995'},
{'movie_id': '2', 'movie_title': 'Toy Story', 'release_year': '1995'}]

Consider the following TSV file with no header

1   Jumanji 1995
2   Toy Story   1995

>>> file = CSVFile(tsv_path, separator='\t', has_header=False)
>>> print(list(file))
[{'0': '1', '1': 'Jumanji', '2': '1995'},
{'0': '2', '1': 'Toy Story', '2': '1995'}]
PARAMETER DESCRIPTION
file_path

Path of the CSV file

TYPE: str

separator

Character which separates the entries. The default is a comma (,); to read from a TSV file, set this parameter to \t

TYPE: str DEFAULT: ','

has_header

Boolean value which specifies whether the file has a header. Default is True

TYPE: bool DEFAULT: True

encoding

Define the type of encoding of data stored in the source (example: "utf-8")

TYPE: str DEFAULT: 'utf-8-sig'
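
The header and no-header behaviour described above can be sketched with the standard csv module. This is only an illustration under stated assumptions, not the ClayRS implementation: iter_delimited_rows is a hypothetical helper, and it reads from an in-memory buffer instead of a file path.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_delimited_rows(handle: TextIO,
                        separator: str = ',',
                        has_header: bool = True) -> Iterator[Dict[str, str]]:
    """Yield each row as a dict (hypothetical helper, not part of ClayRS)."""
    if has_header:
        # DictReader takes the first line as the header and uses it for the keys
        for row in csv.DictReader(handle, delimiter=separator):
            yield dict(row)
    else:
        # Without a header, the keys are the positional indices as strings
        for row in csv.reader(handle, delimiter=separator):
            yield {str(index): value for index, value in enumerate(row)}


# Usage on in-memory data mirroring the two examples above
csv_data = io.StringIO("movie_id,movie_title,release_year\n1,Jumanji,1995\n")
tsv_data = io.StringIO("1\tJumanji\t1995\n")

header_rows = list(iter_delimited_rows(csv_data))
positional_rows = list(iter_delimited_rows(tsv_data, separator='\t', has_header=False))
```

Every value stays a string in both cases, which matches the outputs shown in the examples.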

Source code in clayrs/content_analyzer/raw_information_source.py
def __init__(self, file_path: str, separator: str = ',', has_header: bool = True, encoding: str = "utf-8-sig"):
    super().__init__(file_path, encoding)
    self.__has_header = has_header
    self.__separator = separator

representative_name: str property

Method which returns a meaningful name for the raw source.

In this case it's simply the file name + its extension

RETURNS DESCRIPTION
str

The representative name for the raw source
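
As a rough illustration of "file name + its extension", the name can be derived from a path with os.path.basename. The function name and the sample path below are hypothetical, and this sketch may differ from how the property is actually implemented.

```python
import os


def representative_name_for(file_path: str) -> str:
    """Return the last path component, i.e. the file name with its extension.

    Hypothetical stand-in for the property, written as a plain function.
    """
    return os.path.basename(file_path)


# 'datasets/movies_info.csv' is a made-up path used only for illustration
name = representative_name_for('datasets/movies_info.csv')
```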

DATFile(file_path, encoding='utf-8')

Bases: RawInformationSource

Wrapper for a DAT file. This class can read a DAT file in which entries are separated by the :: string. Since a DAT file has no header, each entry can be referenced by a string representing its positional index (e.g. '0' for the entry in the first position, '1' for the entry in the second position, etc.).

You can iterate over the whole content of the raw source with a simple for loop: each row will be returned as a dictionary where keys are strings representing the positional indices, values are the entries

Examples:

Consider the following DAT file

10::worker::75011
11::without occupation::76112

>>> file = DATFile(dat_path)
>>> print(list(file))
[{'0': '10', '1': 'worker', '2': '75011'},
{'0': '11', '1': 'without occupation', '2': '76112'}]
PARAMETER DESCRIPTION
file_path

path of the dat file

TYPE: str

encoding

define the type of encoding of data stored in the source (example: "utf-8")

TYPE: str DEFAULT: 'utf-8'
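
The positional-index mapping for :: separated lines can be sketched as follows. This is a minimal sketch, not the ClayRS code: iter_dat_rows is a hypothetical helper, and skipping blank lines is an assumption made here for convenience.

```python
import io
from typing import Dict, Iterator, TextIO


def iter_dat_rows(handle: TextIO, separator: str = '::') -> Iterator[Dict[str, str]]:
    """Yield each line as a dict keyed by positional index (hypothetical helper)."""
    for line in handle:
        line = line.rstrip('\r\n')
        if not line:
            # Assumption of this sketch: blank lines are skipped
            continue
        fields = line.split(separator)
        yield {str(index): field for index, field in enumerate(fields)}


# Usage on in-memory data mirroring the DAT example above
dat_data = io.StringIO("10::worker::75011\n11::without occupation::76112\n")
dat_rows = list(iter_dat_rows(dat_data))
```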

Source code in clayrs/content_analyzer/raw_information_source.py
def __init__(self, file_path: str, encoding: str = "utf-8"):
    super().__init__(file_path, encoding)

representative_name: str property

Method which returns a meaningful name for the raw source.

In this case it's simply the file name + its extension

RETURNS DESCRIPTION
str

The representative name for the raw source

JSONFile(file_path, encoding='utf-8')

Bases: RawInformationSource

Wrapper for a JSON file. This class is able to read from a JSON file where each "row" is a dictionary-like object inside a list

You can iterate over the whole content of the raw source with a simple for loop: each row will be returned as a dictionary

Examples:

Consider the following JSON file

[{"Title":"Jumanji","Year":"1995"},
 {"Title":"Toy Story","Year":"1995"}]

>>> file = JSONFile(json_path)
>>> print(list(file))
[{'Title': 'Jumanji', 'Year': '1995'},
 {'Title': 'Toy Story', 'Year': '1995'}]
PARAMETER DESCRIPTION
file_path

path of the JSON file

TYPE: str

encoding

define the type of encoding of data stored in the source (example: "utf-8")

TYPE: str DEFAULT: 'utf-8'
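
Reading a JSON list of dictionary-like objects can be sketched with the standard json module. The helper iter_json_rows is hypothetical, and raising ValueError when the top-level value is not a list is an assumption of this sketch rather than documented ClayRS behaviour.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_json_rows(handle: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield each object of a top-level JSON list (hypothetical helper)."""
    data = json.load(handle)
    if not isinstance(data, list):
        # Assumption of this sketch: anything other than a list is rejected
        raise ValueError("expected a JSON list of objects")
    for row in data:
        yield row


# Usage on in-memory data mirroring the JSON example above
json_data = io.StringIO('[{"Title":"Jumanji","Year":"1995"}, {"Title":"Toy Story","Year":"1995"}]')
json_rows = list(iter_json_rows(json_data))
```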

Source code in clayrs/content_analyzer/raw_information_source.py
def __init__(self, file_path: str, encoding: str = "utf-8"):
    super().__init__(file_path, encoding)

representative_name: str property

Method which returns a meaningful name for the raw source.

In this case it's simply the file name + its extension

RETURNS DESCRIPTION
str

The representative name for the raw source

SQLDatabase(host, username, password, database_name, table_name, encoding='utf-8')

Bases: RawInformationSource

Wrapper for a SQL database.

You can iterate over the whole content of the raw source with a simple for loop: each row will be returned as a dictionary whose keys are the column names of the table and whose values are the entries.

Examples:

Consider the following SQL table for the database 'movies' on localhost

+----------+-------------+--------------+
| Movie ID | Movie Title | Release Year |
+----------+-------------+--------------+
|        1 | Jumanji     |         1995 |
|        2 | Toy Story   |         1995 |
+----------+-------------+--------------+

>>> file = SQLDatabase(host='127.0.0.1', username='root', password='root',
...                    database_name='movies', table_name='movies_table')
>>> print(list(file))
[{'Movie ID': '1', 'Movie Title': 'Jumanji', 'Release Year': '1995'},
{'Movie ID': '2', 'Movie Title': 'Toy Story', 'Release Year': '1995'}]
PARAMETER DESCRIPTION
host

Host IP address of the SQL server

TYPE: str

username

username for the access

TYPE: str

password

password for the access

TYPE: str

database_name

Name of the database

TYPE: str

table_name

name of the database table where data is stored

TYPE: str

encoding

Define the type of encoding of data stored in the source (example: "utf-8")

TYPE: str DEFAULT: 'utf-8'
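
Turning table rows into dictionaries keyed by column name can be sketched with any DB-API cursor. The sketch below uses an in-memory sqlite3 database as a stand-in so that it runs without a MySQL server; it is not the ClayRS implementation. The helper iter_table_rows is hypothetical, and converting every value to a string is an assumption made to match the example output above.

```python
import sqlite3
from typing import Dict, Iterator


def iter_table_rows(conn: sqlite3.Connection, table_name: str) -> Iterator[Dict[str, str]]:
    """Yield each table row as a dict keyed by column name (hypothetical helper)."""
    cursor = conn.cursor()
    # Table names cannot be bound as query parameters, so quote the identifier
    quoted_name = '"' + table_name.replace('"', '""') + '"'
    cursor.execute(f'SELECT * FROM {quoted_name}')
    # cursor.description holds one tuple per column; the name is its first item
    columns = [column[0] for column in cursor.description]
    for record in cursor:
        # Assumption of this sketch: values are stringified
        yield {name: str(value) for name, value in zip(columns, record)}


# Usage with sqlite3 standing in for the MySQL server of the example above
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies_table ("Movie ID" INTEGER, "Movie Title" TEXT, "Release Year" INTEGER)')
conn.executemany('INSERT INTO movies_table VALUES (?, ?, ?)',
                 [(1, 'Jumanji', 1995), (2, 'Toy Story', 1995)])
sql_rows = list(iter_table_rows(conn, 'movies_table'))
```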

Source code in clayrs/content_analyzer/raw_information_source.py
def __init__(self, host: str,
             username: str,
             password: str,
             database_name: str,
             table_name: str,
             encoding: str = "utf-8"):
    super().__init__('', encoding)
    self.__host: str = host
    self.__username: str = username
    self.__password: str = password
    self.__database_name: str = database_name
    self.__table_name: str = table_name

    conn = mysql.connector.connect(host=self.__host,
                                   user=self.__username,
                                   password=self.__password,
                                   charset=self.encoding)
    cursor = conn.cursor()
    query = """USE """ + self.__database_name + """;"""
    cursor.execute(query)
    conn.commit()
    self.__conn = conn

representative_name: str property

Method which returns a meaningful name for the raw source.

In this case it's the host name followed by the table name

RETURNS DESCRIPTION
str

The representative name for the raw source