eBay has open-sourced DQSolution, a data quality solution for distributed data systems at any scale, in both streaming and batch contexts. You can fork it from https://ebay.github.io/DQSolution/. DQSolution creates a unified process for defining and constructing data quality measurement pipelines across multiple data systems. It provides:
- Accuracy Measurement - Accuracy of a data asset compared to a verifiable source
- Data Profiling - Statistical analysis and assessment of data values within a data asset for consistency, uniqueness and logic
- Anomaly Detection - Pre-built algorithms for identifying events that do not conform to an expected pattern in a data asset
- Visualization - Dashboards that can report the state of data quality
- Real Time - The data quality checks can be executed in real-time to detect issues faster
- Extensible - The solution can work with multiple data systems
- Scalable - The solution is designed to work on large volumes of data. It currently runs on ~1.2 PB of data
- Self-Service - The solution provides a simple user interface to define new data assets and rules. It also allows users to visualize the data quality dashboards and personalize their view of the dashboards.
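To make the first two items concrete, here is a minimal sketch of what an accuracy measurement and a basic column profile could look like. It is not DQSolution's API: the function names `accuracy` and `profile`, the tuple-based row format, and the sample data are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical helpers for illustration; these are NOT part of DQSolution.

def accuracy(source_rows, target_rows):
    """Fraction of source rows that also appear in the target.

    Rows are compared as whole tuples, and duplicates are matched
    one-for-one (multiset matching).
    """
    if not source_rows:
        return 1.0
    remaining = Counter(target_rows)
    matched = 0
    for row in source_rows:
        if remaining[row] > 0:
            remaining[row] -= 1
            matched += 1
    return matched / len(source_rows)

def profile(values):
    """Basic profile of one column: total, null, and distinct counts."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }

# Made-up sample data: the target differs from the source in one row.
source = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
target = [(1, "a"), (2, "b"), (3, "x"), (4, "d")]

print(accuracy(source, target))                   # 0.75
print(profile([r[1] for r in target] + [None]))   # count 5, nulls 1, distinct 4
```

A real deployment would run checks like these on a distributed engine rather than in-memory lists, but the measured quantities are the same idea.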
GitHub: https://ebay.github.io/DQSolution/ Please fork! Thanks!
Contact us: lzhixing at ebay dot com