DiffKit is an application, and a framework, for comparing two tables of data, field-by-field. The tables can come from any of a number of sources, such as an RDBMS or CSV file, and DiffKit is able to mix different kinds of sources in the same diff operation. DiffKit is like the Unix diffutility, but for tables instead of lines of text.

DiffKit is able to report the diffs at both the row and field level, and allows the user to configure the comparison (what to compare, how to compare it, what to ignore). DiffKit is highly customizable with respect to the sources of tabular data, the details of the comparison, and the characteristics of the output (diff report).

DiffKit is Free Open Source Software (FOSS), licensed under the Apache License, Version 2.0. DiffKit was designed specifically with Enterprise Diff’ng in mind. It supports very high data volume operations (tens of millions of rows). It is a command line utility, driven by text configuration files, and as such is very well suited to scripting, automated operation, and embedding into Unix workflows.

The principal use of DiffKit is for regression testing of processes that create or manipulate persistent table data:

"Relational database management systems (RDBMSs) often persist mission-critical data which is updated by many applications and potentially thousands if not millions of end users. Furthermore, they implement important functionality in the form of database methods (stored procedures, stored functions, and/or triggers) and database objects (e.g. Java or C# instances). The best way to ensure the continuing quality of these assets, at least from a technical point of view, you should have a full regression test suite which you can run on a regular basis."

from Database Testing: How to Regression Test a Relational Database


DiffKit is also well suited for comparing SQL table structures, not just SQL table data. It is not uncommon in an enterprise development shop to have many different copies of, what is presumably, the same database. In this cases, it is critically important to either keep the DDL in synch across all of these copies, or to at least understand how they differ. DiffKit is able to give you perfect visibility into the differences between the definition of database objects.



  • High performance, for large datasets (+10MM rows).

  • Very low memory overhead, even on very large datasets.

  • High quality-- comprehensive embedded regression test suite for the application/framework.

  • Java run everywhere (tm) — Linux, Solaris, OS X, Windows, etc.

  • Cross database-- Oracle, MySQL, DB2, and any JDBC datasource.

  • Command-line driven; no GUI needed; can run in headless environments.

  • XML configuration file driven.

  • Free Open Source Software.

  • Apache License, Version 2.0.

  • Clean Object Oriented Design make extension easy.

  • Easily embeddable as a Java library (jar).