Welcome to the PyVertica documentation!

pyvertica is a package that contains shared logic for connecting and writing to a Vertica database.

Installation

pyvertica can be installed by executing pip install pyvertica.

Note

When using the BaseImporter, do not forget to create the batch history table. An SQL example can be found in the class documentation.

Command-line usage

vertica_batch_import

Tool to import a CSV-like file directly into Vertica.

usage: vertica_batch_import [-h] [--commit]
                            [--partial-commit-after PARTIAL_COMMIT_AFTER]
                            [--log {debug,info,warning,error,critical}]
                            [--truncate-table] [--delimiter DELIMITER]
                            [--enclosed-by ENCLOSED_BY] [--skip SKIP]
                            [--null NULL]
                            [--record-terminator RECORD_TERMINATOR]
                            dsn table_name file_path

Vertica batch importer

positional arguments:
  dsn                   ODBC data source name
  table_name            name of table (including schema, eg: staging.my_table)
  file_path             absolute path to the file to import

optional arguments:
  -h, --help            show this help message and exit
  --commit              commit after import (without this flag, a dry-run
                        is performed)
  --partial-commit-after PARTIAL_COMMIT_AFTER
                        partial commit after num of lines (default: 1000000)
  --log {debug,info,warning,error,critical}
                        loglevel of loghandler (default: info)
  --truncate-table      truncate table before import
  --delimiter DELIMITER
                        delimiter to split columns (default: ;)
  --enclosed-by ENCLOSED_BY
                        the quote character (default: ")
  --skip SKIP           number of lines to skip (default: 0)
  --null NULL           represents a null value (default: empty string)
  --record-terminator RECORD_TERMINATOR
                        specifies the end of a record (default: newline)
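
For example, a dry-run import could look like this (a sketch; the DSN, table name and file path are placeholders):

vertica_batch_import --truncate-table --delimiter ',' VerticaDWH staging.my_table /tmp/my_table.csv

Add --commit to actually commit the import.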

vertica_migrate

Tool to migrate data from one Vertica cluster to another.

usage: vertica_migrate [-h] [--commit]
                       [--log-level {debug,info,warning,error,critical}]
                       [--skip-ddls] [--clever-ddls] [--skip-data]
                       [--even-not-empty] [--limit LIMIT] [--truncate]
                       [--source-not-reconnect] [--target-not-reconnect]
                       [--config-path CONFIG_PATH]
                       source target [objects [objects ...]]

Vertica Migrator

positional arguments:
  source                ODBC data source name
  target                ODBC data source name
  objects               List of objects (schemas or tables) to migrate

optional arguments:
  -h, --help            show this help message and exit
  --commit              commit DDLs and copy data (without this flag, a
                        dry-run is performed)
  --log-level {debug,info,warning,error,critical}
                        loglevel of loghandler (default: info)
  --skip-ddls           Do not copy the DDLs over.
  --clever-ddls         If, when copying a DDL, an object with the same name
                        already exists, skip the copy.
  --skip-data           Do not copy the data over.
  --even-not-empty      Do not stop if the target DB is not empty.
  --limit LIMIT         Limit the number of rows to copy over, per table.
  --truncate            Truncate destination tables before copying data over.
  --source-not-reconnect
                        Do not try to avoid load balancer by reconnecting.
  --target-not-reconnect
                        Do not try to avoid load balancer by reconnecting.
  --config-path CONFIG_PATH
                        Absolute path to a config file (useful for storing
                        credentials).

To avoid exposing passwords on the command-line, they must be passed via a config file (--config-path). Example:

[vertica_migrate]
target_pwd=targetpassword
source_pwd=sourcepassword
log_level=warning

Any command-line argument that accepts a string value (eg: --log-level warning) can also be set in the config file (eg: log_level=warning). The following extra options are available in the config-file:

target_user
Username of the target Vertica database.
target_pwd
Password of the target Vertica database.
target_host
Hostname of the target Vertica database.
source_user
Username of the source Vertica database.
source_pwd
Password of the source Vertica database.
source_host
Hostname of the source Vertica database.
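
A full invocation could then look like this (a sketch; the DSN names and object are placeholders):

vertica_migrate --commit --config-path /etc/vertica_migrate.ini SourceDSN TargetDSN staging.my_table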

Usage within Python code

Creating a PYODBC connection to Vertica

pyvertica.connection.connection_details(con)

Given a connection object, return information about it.

Parameters:con – An instance of pyodbc.Connection.
Returns:

A dict with the following keys / values:

host
Connected node IP address
user
Connected username
db
Connected database name
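
For example (a minimal sketch; 'TestDSN' is a placeholder DSN name):

from pyvertica.connection import connection_details, get_connection

connection = get_connection(dsn='TestDSN')

details = connection_details(connection)
print(details['host'])  # IP address of the connected node
print(details['user'])  # connected username
print(details['db'])    # connected database name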
pyvertica.connection.get_connection(reconnect=True, **kwargs)

Get pyodbc connection for the given dsn.

Usage example:

from pyvertica.connection import get_connection


connection = get_connection(dsn='TestDSN')
cursor = connection.cursor()

The connection will be made in two steps (with the assumption that you are connecting via a load-balancer). The first step is connecting to the load-balancer and selecting a random node address. Then it will connect to that specific node and return this connection instance. This is done to avoid routing all the data through the load-balancer.

Note

Depending on the given keyword arguments, you need to have an odbc.ini file on your system.

Parameters:
  • reconnect – A bool indicating if pyvertica should directly reconnect to a random node to bypass a load-balancer. Default: True. Optional.
  • kwargs – Keyword arguments for creating the ODBC connection (eg: dsn).

Returns:

An instance of pyodbc.Connection.

Writing multiple lines in a batch

class pyvertica.batch.VerticaBatch(table_name, odbc_kwargs={}, truncate_table=False, reconnect=True, analyze_constraints=True, column_list=[], copy_options={}, connection=None)

Object for writing multiple records to Vertica in a batch.

Usage example:

from pyvertica.batch import VerticaBatch

batch = VerticaBatch(
    odbc_kwargs={'dsn': 'VerticaDWH'},
    table_name='schema.my_table',
    truncate_table=True,
    column_list=['column_1', 'column_2'],
    copy_options={
        'DELIMITER': ',',
    }
)

row_list = [
    ['row_1_val_1', 'row_1_val_2'],
    ['row_2_val_1', 'row_2_val_2'],
    ...
]

for column_data_list in row_list:
    batch.insert_list(column_data_list)

error_bool, error_file_obj = batch.get_errors()

if error_bool:
    print(error_file_obj.read())

batch.commit()

Note

It is also possible to call commit() multiple times (for example after every 50000 records). Please note that after the first insert and after calling commit(), the output of get_errors() will reflect the new series of inserts and thus not contain the “old” inserts.
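
A minimal sketch of such periodic commits, reusing batch and row_list from the example above:

COMMIT_EVERY = 50000

for i, column_data_list in enumerate(row_list, start=1):
    batch.insert_list(column_data_list)
    if i % COMMIT_EVERY == 0:
        batch.commit()  # also resets what get_errors() will report

batch.commit()  # commit the remaining records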

Note

Creating a new batch object will not create a lock on the target table. This will happen only after the first insert.

Note

Although the batch object is automagically reusable, after a commit() the locks are released until the next insert.

Parameters:
  • table_name – A str representing the table name (including the schema) to write to. Example: 'staging.my_table'.
  • odbc_kwargs

    A dict containing the ODBC connection keyword arguments. E.g.:

    {
        'dsn': 'TestDSN',
    }
    
  • truncate_table – A bool indicating if the table needs truncating before first insert. Default: False. Optional.
  • reconnect – A bool passed to the connection object to decide if pyvertica should directly reconnect to a random node to bypass a load balancer.
  • analyze_constraints – A bool indicating if an ANALYZE_CONSTRAINTS statement should be executed when getting errors. Default: True. Optional.
  • column_list – A list containing the columns that will be written. Optional.
  • copy_options – A dict containing the keys to override. For a list of existing keys and their defaults, see copy_options_dict. Optional.
  • connection – A pyodbc.Connection to use instead of opening a new connection. If this parameter is supplied, odbc_kwargs may not be supplied. Default: None. Optional.
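
For example, the connection argument makes it possible to reuse an already opened connection (a sketch; the DSN and table name are placeholders):

from pyvertica.connection import get_connection
from pyvertica.batch import VerticaBatch

connection = get_connection(dsn='TestDSN')

# When passing a connection, odbc_kwargs may not be supplied.
batch = VerticaBatch(
    table_name='staging.my_table',
    connection=connection,
)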
commit()

Commit the current transaction.

copy_options_dict = {'REJECTEDFILE': True, 'REJECTMAX': 0, 'DELIMITER': ';', 'NO COMMIT': True, 'ENCLOSED BY': '"', 'SKIP': 0, 'NULL': '', 'RECORD TERMINATOR': '\x01'}

Default copy options for SQL query.

Note

By default REJECTEDFILE is set to __debug__, which is True, unless you’ve set the PYTHONOPTIMIZE environment variable.

get_batch_count()

Return number (int) of inserted items since last commit.

Warning

When using insert_raw() this value represents the number of raw str objects inserted, not the number of lines!

Returns:An int.
get_cursor()

Return a cursor to the database.

This is useful when you want to add extra data within the same transaction of the batch import.

Returns:Instance of pyodbc.Cursor.
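
For example (a sketch; the meta.import_log table is hypothetical):

cursor = batch.get_cursor()

# Runs within the same transaction as the batch import, so it is
# committed or rolled back together with the inserted rows.
cursor.execute(
    "INSERT INTO meta.import_log (table_name, imported_at) "
    "VALUES (?, CURRENT_TIMESTAMP)",
    'staging.my_table',
)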
get_errors()

Get errors that were raised since the last commit.

This will check constraint errors and data rejected by the database. Please note that the rejected-data file is removed after calling this method. Therefore, it is not possible to call this method more than once per batch!

Note

Since this is checking the constraints as well, it is assumed that all constraints were met before starting the batch. Otherwise, these errors will show up within this method.

Returns:A tuple whose first item is an int representing the number of errors. The second item is a file-like object containing the error data in plain text. Since this is an instance of tempfile.TemporaryFile, it will be removed automatically.

Note

The file-like object can be empty, when REJECTEDFILE is set to False.

get_total_count()

Return total number (int) of inserted items.

Warning

When using insert_raw() this value represents the number of raw str objects inserted, not the number of lines!

Returns:An int.
insert_line(*args, **kwargs)

Insert a str containing all the values.

This is useful when inserting lines directly from a CSV file for example.

Note

When you have a loghandler with DEBUG level, every query will be logged. For performance reasons, this log statement is only executed when __debug__ equals True (which is the default case). For better performance, you should invoke the Python interpreter with the -O argument or set the environment variable PYTHONOPTIMIZE to a non-empty value.

Example:

batch.insert_line('"value_1";"value_2"')
Parameters:line_str – A str representing the line to insert. Make sure the str is formatted according to copy_options_dict. Example: '"value1";"value2";"value3"'.
insert_list(value_list)

Insert a list of values (instead of a str representing a line).

Example:

batch.insert_list(['value_1', 'value_2'])
Parameters:value_list – A list. Each item should represent a column value.
insert_raw(*args, **kwargs)

Insert a raw str.

A raw str does not have to be a complete row, but can be a part of a row or even multiple rows. This is useful when you have a file that is already in a format readable by Vertica.
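
A minimal sketch of streaming a pre-formatted file in chunks (the path is a placeholder; the file must already match the configured copy options):

with open('/path/to/preformatted_file.csv') as file_obj:
    # Chunks do not need to align with row boundaries.
    for chunk in iter(lambda: file_obj.read(64 * 1024), ''):
        batch.insert_raw(chunk)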

rollback()

Rollback the current transaction.

Base importer class

class pyvertica.importer.BaseImporter(reader_obj, schema_name, batch_source_path, odbc_kwargs={}, **kwargs)

Base class for importing data into Vertica.

Note, before using this base importer, make sure you have created the history table for batch imports:

CREATE TABLE meta.batch_history (
    batch_source_name VARCHAR(255),
    batch_source_type_name VARCHAR(255),
    batch_source_path VARCHAR(255),
    batch_import_timestamp TIMESTAMP
)

Usage example:

class AdGroupPerformanceReportImporter(BaseImporter):
    table_name = 'adwords_ad_group_performance'
    batch_source_name = 'adwords_api'
    batch_source_type_name = 'ad_group_performance_report'

    mapping_list = (
        {
            'field_name': 'AccountCurrencyCode',
            'db_field_name': 'account_currency_code',
            'db_data_type': 'VARCHAR(10)',
        },
        {
            'field_name': 'AccountDescriptiveName',
            'db_field_name': 'account_descriptive_name',
            'db_data_type': 'VARCHAR(512)',
        },
        {
            'field_name': 'AccountTimeZoneId',
            'db_field_name': 'account_time_zone_id',
            'db_data_type': 'VARCHAR(100)',
        },
        ...
    )

iterable_object = [
    {
        'AccountCurrencyCode': 'EUR',
        'AccountDescriptiveName': 'Test account description',
        'AccountTimeZoneId': '(GMT+01:00) Amsterdam,'
    },
    ...
]

report_importer = AdGroupPerformanceReportImporter(
    iterable_object,
    dsn='VerticaTST',
    schema_name='test',
    batch_source_path='ADGROUP_PERFORMANCE_REPORT.1234.20120521',
)
report_importer.start_import()

In the example above, we are importing a list of dicts. More likely, this iterable_object would be a reader class for your data-source, which is iterable and returns a dict with the expected fields for every record.

Parameters:
  • reader_obj

    An object that is iterable and returns for every line the data as a dict.

    Note

    Passing a list of dict objects will work as well.

  • odbc_kwargs

    A dict containing the ODBC connection keyword arguments. E.g.:

    {
        'dsn': 'TestDSN',
    }
    
  • schema_name – Name of the DB schema to use.
  • batch_source_path – A str describing the path to the source. This can be a file path when importing from a file, or an identifier when the source is an API. This should be unique for every import!
  • kwargs – Optional extra keyword arguments, will be stored as self._kwargs.
batch_history_table = 'meta.batch_history'

Name of the database table containing the batch history (including the schema name) (str). The structure of this table is:

batch_source_name VARCHAR(255)
batch_source_type_name VARCHAR(255)
batch_source_path VARCHAR(255)
batch_import_timestamp TIMESTAMP
batch_source_name = ''

The name of the source which the data is retrieved from. E.g.: for AdWords, this could be something like 'adwords_api'.

batch_source_type_name = ''

The type of data that is imported from the source. E.g.: for the AdWords API, this could be something like ADGROUP_PERFORMANCE_REPORT.

extra_fields = ({'db_data_type': 'VARCHAR(255)', 'field_name': 'batch_source_name'}, {'db_data_type': 'VARCHAR(255)', 'field_name': 'batch_source_path'}, {'db_data_type': 'TIMESTAMP', 'field_name': 'batch_import_timestamp'})

A tuple of dict objects to prepend fields to the data.

This makes it possible to add extra data to the database, for example data related to the import (source name, source identifier, import timestamp, ...).

Each dict must contain the following keys:

field_name
A str representing the DB field name.
db_data_type
A str representing the field type in the database, eg: 'varchar(10)'.

Then, for every field, you should define a method within your class which is named following this template: get_extra_{field_name}_data. This method will be called for every imported record with a dict containing the row data.

Warning

Make sure there is no collision between these fields and the fields defined in mapping_list.
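
For example, a subclass could prepend a hypothetical import_host field (a sketch; all names are illustrative):

import socket

from pyvertica.importer import BaseImporter


class MyReportImporter(BaseImporter):
    table_name = 'my_report'
    batch_source_name = 'my_source'
    batch_source_type_name = 'my_report_type'

    extra_fields = BaseImporter.extra_fields + (
        {'field_name': 'import_host', 'db_data_type': 'VARCHAR(255)'},
    )

    def get_extra_import_host_data(self, row_data_dict):
        # Called for every imported record with the row data.
        return socket.gethostname()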

classmethod get_batch_source_path_exists(batch_source_path, odbc_kwargs={})

Check if the batch source-path exists in the database.

Parameters:
  • batch_source_path – A str representing the batch source-path to check.
  • odbc_kwargs – A dict containing the ODBC connection keyword arguments. Optional.

Returns:

True if it already exists, else False.

get_extra_batch_import_timestamp_data(row_data_dict)

Return batch import timestamp.

Parameters:row_data_dict – A dict containing the row-data.
Returns:A str in ISO 8601 format, with a space as separator.
get_extra_batch_source_name_data(row_data_dict)

Return batch source name.

Parameters:row_data_dict – A dict containing the row-data.
Returns:The value set in batch_source_name.
get_extra_batch_source_path_data(row_data_dict)

Return batch source path.

Parameters:row_data_dict – A dict containing the row-data.
Returns:A str containing the batch-source path (this is given as the batch_source_path argument on constructing this class).
classmethod get_last_imported_batch_source_path(odbc_kwargs)

Return the last imported batch source-path.

Parameters:odbc_kwargs

A dict containing the ODBC connection keyword arguments. E.g.:

{
    'dsn': 'TestDSN',
}
Returns:A str representing the last imported batch source-path.
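
For example (a sketch, reusing the importer class from the usage example above):

last_path = AdGroupPerformanceReportImporter.get_last_imported_batch_source_path(
    {'dsn': 'VerticaTST'},
)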
get_sql_create_table_statement()

Return SQL statement for creating the DB table.

This will first use the fields specified in extra_fields, followed by the fields in mapping_list.

Returns:A str representing the SQL statement.
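
For example, the generated statement could be used to bootstrap the table (a sketch, reusing report_importer from the usage example above):

from pyvertica.connection import get_connection

cursor = get_connection(dsn='VerticaTST').cursor()
cursor.execute(report_importer.get_sql_create_table_statement())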
mapping_list = ()

A tuple of dict objects to map record columns to db columns.

The fields specified in this list might be a subset of the available fields in the record. Only specify the fields you want to store in the DB.

Each dict must contain the following keys:

field_name
A str representing the field name as it is in the dict containing the data (as returned by the reader_obj).
db_data_type
A str representing the field type in the database, eg: 'varchar(10)'.

Optionally, each dict can contain the following keys:

db_field_name
A str representing the field name within the database. This only needs to be set when it does not match the source field name.
start_import()

Start the import.

This will import all the data from the reader_obj argument (given when constructing BaseImporter). In case there are no errors, it will commit the import at the end.

Raises:BatchImportError when there are errors during the import. Errors are logged to the logger object.
Raises:AlreadyImportedError when an import with the same batch-source path already exists. Before starting your import, you can test this by calling get_batch_source_path_exists().
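
A sketch of guarding an import against both exceptions, reusing report_importer from the usage example above:

from pyvertica.importer import AlreadyImportedError, BatchImportError

try:
    report_importer.start_import()
except AlreadyImportedError:
    # this batch_source_path was already imported; nothing to do
    pass
except BatchImportError:
    # errors were logged to the logger object
    raise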
table_name = ''

Name of the database table (excluding the schema) (str).

Exceptions

exception pyvertica.importer.BatchImportError

Exception raised when the batch import returns errors.

exception pyvertica.importer.AlreadyImportedError

Exception raised when trying to import the same source twice.

exception pyvertica.migrate.VerticaMigratorError

Error specific to the pyvertica.migrate module.