Introduction
Data-diff is a command-line tool and Python library to efficiently diff rows across two different databases.
- Verifies data across many different databases (e.g. PostgreSQL -> Snowflake)
- Outputs a detailed diff of the rows
- Simple CLI/API for creating monitoring and alerts
- Verifies 25M+ rows in <10s, and 1B+ rows in ~5min
- Works for tables with tens of billions of rows
For more information, see our README.
How to install
Requires Python 3.7+ with pip.
pip install data-diff
Or, when you need extras such as mysql and postgresql:
pip install "data-diff[mysql,postgresql]"
How to use from Python
# Optional: Set logging to display the progress of the diff
import logging
logging.basicConfig(level=logging.INFO)
from data_diff import connect_to_table, diff_tables
table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")
for sign, columns in diff_tables(table1, table2):
    print(sign, columns)
# Example output:
+ ('4775622148347', '2022-06-05 16:57:32.000000')
- ('4775622312187', '2022-06-05 16:57:32.000000')
- ('4777375432955', '2022-06-07 16:57:36.000000')
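Because each diff item is a (sign, columns) pair, the result is easy to post-process, for example to count differences before raising an alert. The snippet below is a minimal sketch of that idea: `summarize_diff` is a hypothetical helper, not part of data-diff, and `sample_diff` is a hard-coded stand-in for the iterator returned by `diff_tables`, built from the example output above so the snippet runs without a database.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_diff(diff: Iterable[Tuple[str, tuple]]) -> Counter:
    """Count diff rows per sign ('+' or '-').

    Hypothetical helper for illustration; not part of data-diff.
    """
    counts = Counter()
    for sign, _columns in diff:
        counts[sign] += 1
    return counts


# Stand-in for diff_tables(table1, table2), using the example output above.
sample_diff = [
    ("+", ("4775622148347", "2022-06-05 16:57:32.000000")),
    ("-", ("4775622312187", "2022-06-05 16:57:32.000000")),
    ("-", ("4777375432955", "2022-06-07 16:57:36.000000")),
]

summary = summarize_diff(sample_diff)
total = sum(summary.values())
print(f"'+' rows: {summary['+']}, '-' rows: {summary['-']}, total: {total}")
```

In real use, the iterator from `diff_tables(table1, table2)` could be passed in place of `sample_diff`, and a monitoring job could alert whenever the total is non-zero.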
Resources
- Source code (git): https://github.com/datafold/data-diff
- API Reference
- Tutorials