Data-diff is a command-line tool and Python library to efficiently diff rows across two different databases.
⇄ Verifies across many different databases (e.g. PostgreSQL -> Snowflake)!
🔍 Outputs diff of rows in detail
🚨 Simple CLI/API to create monitoring and alerts
🔥 Verify 25M+ rows in <10s, and 1B+ rows in ~5min.
♾️ Works for tables with 10s of billions of rows
For more information, see our README.
How to install
Requires Python 3.7+ with pip.
pip install data-diff
or when you need extras like mysql and postgresql:
pip install "data-diff[mysql,postgresql]"
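After installing, a quick way to confirm the package is importable from the interpreter you plan to use is sketched below. It relies only on the standard library; the helper name `is_installed` is made up for this illustration and is not part of data-diff.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the import system."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


# "data_diff" is the import name used in the Python example in this document.
print("data_diff importable:", is_installed("data_diff"))
```

If this prints `False`, the package was most likely installed into a different Python environment than the one running the check.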
How to use from Python
# Optional: Set logging to display the progress of the diff
import logging
logging.basicConfig(level=logging.INFO)

from data_diff import connect_to_table, diff_tables

table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")

for sign, columns in diff_tables(table1, table2):
    print(sign, columns)

# Example output:
# + ('4775622148347', '2022-06-05 16:57:32.000000')
# - ('4775622312187', '2022-06-05 16:57:32.000000')
# - ('4777375432955', '2022-06-07 16:57:36.000000')
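Since the diff is yielded as (sign, columns) pairs, it can be convenient to collect the rows by sign before reporting or alerting on them. The following is a minimal sketch of that idea, not part of the data-diff API: `group_by_sign` is a hypothetical helper, and the sample rows are copied from the example output above so the snippet runs without a database connection.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, ...]


def group_by_sign(diff: Iterable[Tuple[str, Row]]) -> Dict[str, List[Row]]:
    """Collect diff rows into lists keyed by their sign ('+' or '-')."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for sign, columns in diff:
        grouped[sign].append(columns)
    return dict(grouped)


# Stand-in for the pairs yielded by diff_tables(table1, table2);
# in real use, pass that iterator instead of this list.
sample_diff = [
    ("+", ("4775622148347", "2022-06-05 16:57:32.000000")),
    ("-", ("4775622312187", "2022-06-05 16:57:32.000000")),
    ("-", ("4777375432955", "2022-06-07 16:57:36.000000")),
]

grouped = group_by_sign(sample_diff)
print("rows marked '+':", len(grouped.get("+", [])))
print("rows marked '-':", len(grouped.get("-", [])))
```

Grouping consumes the whole diff into memory, so for very large diffs it may be preferable to process the pairs one at a time as in the loop above.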
Source code (git): https://github.com/datafold/data-diff
- API Reference