Python API Reference

data_diff.connect(db_conf: Union[str, dict], thread_count: Optional[int] = 1) → Database

Connect to a database using the given database configuration.

Configuration can be given either as a URI string, or as a dict of {option: value}.

thread_count determines the max number of worker threads per database, if relevant. None means no limit.

Parameters
  • db_conf (str | dict) – The configuration for the database to connect. URI or dict.

  • thread_count (int, optional) – Size of the threadpool. Ignored by cloud databases. (default: 1)

Note: For non-cloud databases, a low thread-pool size may be a performance bottleneck.

Supported drivers:
  • postgresql
  • mysql
  • oracle
  • snowflake
  • bigquery
  • redshift
  • presto
  • databricks
  • trino
  • clickhouse
  • vertica
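
For illustration, here are the two equivalent configuration forms for a hypothetical PostgreSQL database (the host, credentials, and dict key names below are assumptions for the sketch, not values from this reference):

```python
# Hypothetical PostgreSQL configuration, written both ways.
# The connect() calls are commented out because they need a live database.

uri_conf = "postgresql://user:password@localhost:5432/analytics"

dict_conf = {
    "driver": "postgresql",   # one of the supported drivers listed above
    "user": "user",
    "password": "password",
    "host": "localhost",
    "port": 5432,
    "dbname": "analytics",
}

# from data_diff import connect
# db = connect(uri_conf, thread_count=4)       # URI form, 4 worker threads
# db = connect(dict_conf, thread_count=None)   # dict form, no thread limit
```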

class data_diff.TableDiffer(bisection_factor: int = 32, bisection_threshold: Number = 16384, threaded: bool = True, max_threadpool_size: Optional[int] = 1, debug: bool = False, stats: Dict[Any, Any] = <factory>)

Finds the diff between two SQL tables

The algorithm uses hashing to quickly check if the tables are different, and then applies a bisection search recursively to find the differences efficiently.

Works best for comparing tables that are mostly the same, with minor discrepancies.
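
To make the algorithm concrete, here is a toy in-memory sketch of the hash-and-bisect idea, with plain Python lists of (key, value) rows standing in for database segments and an XOR-of-MD5 standing in for the real checksum queries (all names here are illustrative, not the library's internals):

```python
import hashlib

def checksum(rows):
    """Order-independent checksum of a list of (key, value) rows."""
    h = 0
    for row in rows:
        h ^= int.from_bytes(hashlib.md5(repr(row).encode()).digest()[:8], "big")
    return h

def diff_segments(rows1, rows2, bisection_factor=4, bisection_threshold=8):
    """Yield ('-', row) / ('+', row) differences between two row lists."""
    if checksum(rows1) == checksum(rows2):
        return  # checksums match: assume the segments are identical
    keys = sorted({r[0] for r in rows1} | {r[0] for r in rows2})
    if len(keys) <= bisection_threshold:
        # Small enough: "download" both sides and compare locally
        s1, s2 = set(rows1), set(rows2)
        for row in sorted(s1 - s2):
            yield ("-", row)
        for row in sorted(s2 - s1):
            yield ("+", row)
        return
    # Otherwise, split the key space into bisection_factor pieces and recurse
    step = max(1, len(keys) // bisection_factor)
    bounds = keys[::step] + [keys[-1] + 1]
    for lo, hi in zip(bounds, bounds[1:]):
        seg1 = [r for r in rows1 if lo <= r[0] < hi]
        seg2 = [r for r in rows2 if lo <= r[0] < hi]
        yield from diff_segments(seg1, seg2, bisection_factor, bisection_threshold)
```

With the real default bisection_threshold of 16384, identical segments are dismissed by a single checksum comparison per segment, which is what makes this efficient for mostly-identical tables.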

Parameters
  • bisection_factor (int) – Into how many segments to bisect per iteration.

  • bisection_threshold (Number) – Row-count threshold below which we stop bisecting and compare locally.

  • threaded (bool) – Enable/disable threaded diffing. Needed to take advantage of database threads.

  • max_threadpool_size (int) – Maximum size of each threadpool. None means auto. Only relevant when threaded is True. There may be many pools, so the actual number of threads can be much higher.

diff_tables(table1: TableSegment, table2: TableSegment) → Iterator[Tuple[str, tuple]]

Diff the given tables.

Parameters
  • table1 (TableSegment) – The “before” table to compare, i.e. the source table

  • table2 (TableSegment) – The “after” table to compare, i.e. the target table

Returns

An iterator that yields pair-tuples representing the diff. Each item is either:

  • (‘-’, columns) – the row appears in table1 but not in table2

  • (‘+’, columns) – the row appears in table2 but not in table1

where columns is a tuple of values for the involved columns, i.e. (id, …extra).
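
For example, consuming the iterator might look like this (the diff tuples below are hand-written stand-ins for what diff_tables() would yield; note that an updated row shows up as a ‘-’/‘+’ pair with the same key):

```python
def split_diff(diff):
    """Partition ('-'/'+', columns) pairs into rows missing from table2
    and rows missing from table1."""
    removed, added = [], []
    for sign, columns in diff:
        if sign == "-":
            removed.append(columns)
        elif sign == "+":
            added.append(columns)
        else:
            raise ValueError(f"unexpected diff sign: {sign!r}")
    return removed, added

# Hand-written sample shaped like diff_tables() output: row id=3 was updated
sample_diff = [
    ("-", ("3", "2021-01-01")),
    ("+", ("3", "2021-01-02")),
]
removed, added = split_diff(sample_diff)
```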

__init__(bisection_factor: int = 32, bisection_threshold: Number = 16384, threaded: bool = True, max_threadpool_size: Optional[int] = 1, debug: bool = False, stats: Dict[Any, Any] = <factory>) → None
class data_diff.TableSegment(database: Database = <object object>, table_path: Tuple[str, ...] = <object object>, key_column: str = <object object>, update_column: Optional[str] = None, extra_columns: Tuple[str, ...] = (), min_key: Optional[Union[int, str, bytes, ArithUUID, ArithAlphanumeric]] = None, max_key: Optional[Union[int, str, bytes, ArithUUID, ArithAlphanumeric]] = None, min_update: Optional[datetime] = None, max_update: Optional[datetime] = None, where: Optional[str] = None, case_sensitive: bool = True, _schema: Optional[CaseAwareMapping] = None)

Signifies a segment of rows (and selected columns) within a table

Parameters
  • database (Database) – Database instance. See connect()

  • table_path (DbPath) – Path to table in form of a tuple. e.g. (‘my_dataset’, ‘table_name’)

  • key_column (str) – Name of the key column, which uniquely identifies each row (usually id)

  • update_column (str, optional) – Name of updated column, which signals that rows changed (usually updated_at or last_update)

  • extra_columns (Tuple[str, ...], optional) – Extra columns to compare

  • min_key (DbKey, optional) – Lowest key_column value, used to restrict the segment

  • max_key (DbKey, optional) – Highest key_column value, used to restrict the segment

  • min_update (DbTime, optional) – Lowest update_column value, used to restrict the segment

  • max_update (DbTime, optional) – Highest update_column value, used to restrict the segment

  • where (str, optional) – An additional ‘where’ expression to restrict the search space.

  • case_sensitive (bool) – If false, the case of column names will adjust according to the schema. Default is true.

get_values() → list

Download all the relevant values of the segment from the database

choose_checkpoints(count: int) → List[Union[int, str, bytes, ArithUUID, ArithAlphanumeric]]

Suggests evenly-spaced checkpoints to split by (not including start, end)

segment_by_checkpoints(checkpoints: List[Union[int, str, bytes, ArithUUID, ArithAlphanumeric]]) → List[TableSegment]

Split the current TableSegment into smaller ones, separated by the given checkpoints
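
For an integer key space, the two methods above can be sketched over raw (min_key, max_key) bounds (a simplification: the real methods operate on TableSegment instances and also handle string, UUID, and alphanumeric keys):

```python
def choose_checkpoints(min_key: int, max_key: int, count: int):
    """Pick `count` evenly-spaced interior checkpoints in [min_key, max_key),
    excluding the endpoints themselves."""
    step = (max_key - min_key) / (count + 1)
    return [min_key + int(step * i) for i in range(1, count + 1)]

def segment_by_checkpoints(min_key: int, max_key: int, checkpoints):
    """Split [min_key, max_key) into half-open sub-ranges at the checkpoints."""
    bounds = [min_key] + list(checkpoints) + [max_key]
    return list(zip(bounds, bounds[1:]))
```

E.g. choose_checkpoints(0, 100, 3) gives [25, 50, 75], and segmenting by those checkpoints produces the four ranges (0, 25), (25, 50), (50, 75), (75, 100).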

new(**kwargs) → TableSegment

Using new() creates a copy of the instance using ‘replace()’

count() → int

Count how many rows are in the segment, in one pass.

count_and_checksum() → Tuple[int, int]

Count and checksum the rows in the segment, in one pass.

__init__(database: Database = <object object>, table_path: Tuple[str, ...] = <object object>, key_column: str = <object object>, update_column: Optional[str] = None, extra_columns: Tuple[str, ...] = (), min_key: Optional[Union[int, str, bytes, ArithUUID, ArithAlphanumeric]] = None, max_key: Optional[Union[int, str, bytes, ArithUUID, ArithAlphanumeric]] = None, min_update: Optional[datetime] = None, max_update: Optional[datetime] = None, where: Optional[str] = None, case_sensitive: bool = True, _schema: Optional[CaseAwareMapping] = None) → None
class data_diff.databases.database_types.AbstractDatabase
abstract quote(s: str)

Quote SQL name (implementation specific)

abstract to_string(s: str) → str

Provide SQL for casting a column to string

abstract concat(l: List[str]) → str

Provide SQL for concatenating a list of columns into a string

abstract is_distinct_from(a: str, b: str) → str

Provide SQL for a comparison where NULL = NULL is true

abstract timestamp_value(t: datetime) → str

Provide SQL for the given timestamp value

abstract md5_to_int(s: str) → str

Provide SQL for computing md5 and returning an int

abstract offset_limit(offset: Optional[int] = None, limit: Optional[int] = None)

Provide SQL fragment for limit and offset inside a select

abstract select_table_schema(path: Tuple[str, ...]) → str

Provide SQL for selecting the table schema as (name, type, date_prec, num_prec)

abstract query_table_schema(path: Tuple[str, ...]) → Dict[str, tuple]

Query the table at ‘path’ for its schema, and return {column: tuple}, where the tuple is (table_name, col_name, type_repr, datetime_precision?, numeric_precision?, numeric_scale?)

abstract parse_table_name(name: str) → Tuple[str, ...]

Parse the given table name into a DbPath

abstract close()

Close connection(s) to the database instance. Querying will stop functioning.

abstract normalize_timestamp(value: str, coltype: TemporalType) → str

Creates an SQL expression that converts ‘value’ to a normalized timestamp.

The returned expression must accept any SQL datetime/timestamp, and return a string.

Date format: YYYY-MM-DD HH:mm:SS.FFFFFF

Precision of dates should be rounded up/down according to coltype.rounds
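As a reference point, a plain-Python version of this normalization might look like the following (truncation shown; a database whose coltype.rounds is true would round instead — the function name and arguments are illustrative):

```python
from datetime import datetime

def normalize_timestamp(dt: datetime, precision: int) -> str:
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS.FFFFFF', keeping only
    `precision` sub-second digits and zero-padding back to 6 places."""
    base = dt.strftime("%Y-%m-%d %H:%M:%S")
    micro = f"{dt.microsecond:06d}"   # always 6 digits
    kept = micro[:precision]          # truncate to the column's precision
    return f"{base}.{kept.ljust(6, '0')}"
```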

abstract normalize_number(value: str, coltype: FractionalType) → str

Creates an SQL expression that converts ‘value’ to a normalized number.

The returned expression must accept any SQL int/numeric/float, and return a string.

Floats/Decimals are expected in the format “I.P”, where I is the integer part of the number (as many digits as necessary, and at least one digit, e.g. 0), and P is the fractional digits, the number of which is specified by coltype.precision. Trailing zeroes may be necessary. If P is 0, the dot is omitted.

Note: We use ‘precision’ differently than most databases. For decimals, it’s the same as numeric_scale; for floats, which use binary precision, it can be calculated as log10(2**numeric_precision).
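
A plain-Python sketch of the target “I.P” format (the half-even rounding mode here is an assumption; the actual SQL expression and rounding depend on the database):

```python
from decimal import ROUND_HALF_EVEN, Decimal

def normalize_number(value, precision: int) -> str:
    """Render a number as 'I.P' with exactly `precision` fractional digits;
    when precision is 0, the dot is omitted entirely."""
    d = Decimal(str(value))
    if precision == 0:
        return str(d.quantize(Decimal(1), rounding=ROUND_HALF_EVEN))
    exponent = Decimal(1).scaleb(-precision)  # e.g. Decimal('0.001') for precision=3
    return str(d.quantize(exponent, rounding=ROUND_HALF_EVEN))
```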

abstract normalize_uuid(value: str, coltype: ColType_UUID) → str

Creates an SQL expression that converts ‘value’ to a normalized UUID.

i.e. it just makes sure there is no trailing whitespace.

normalize_value_by_type(value: str, coltype: ColType) → str

Creates an SQL expression that converts ‘value’ to a normalized representation.

The returned expression must accept any SQL value, and return a string.

The default implementation dispatches to a method according to coltype:

TemporalType    -> normalize_timestamp()
FractionalType  -> normalize_number()
*else*          -> to_string()

(`Integer` falls in the *else* category)
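
In Python terms the dispatch is just an isinstance chain; this sketch uses stand-in column-type classes and string-tagging normalizers (none of these are the library's actual classes):

```python
class TemporalType: ...
class FractionalType: ...
class IntegerType: ...

def normalize_value_by_type(value: str, coltype) -> str:
    """Dispatch to a normalizer by column type (the stand-in implementations
    just tag the value so the routing is visible)."""
    if isinstance(coltype, TemporalType):
        return f"normalize_timestamp({value})"
    if isinstance(coltype, FractionalType):
        return f"normalize_number({value})"
    return f"to_string({value})"  # Integer and everything else
```
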
data_diff.DbKey

alias of Union[int, str, bytes, ArithUUID, ArithAlphanumeric]

data_diff.DbTime = <class 'datetime.datetime'>


data_diff.DbPath

alias of Tuple[str, …]