Data Pipeline 2 - Data Transformation Toolkit for Java Released


  1. Data Pipeline makes it easy to convert, process, and transform data in Java applications. The new release includes:

     * Open source (GPL) and commercial licenses
     * New data readers, writers, and transformers
     * Java 5 support

     Features:

     * Readers: CSV, fixed-width, Excel, database, weblogs, custom
     * Writers: CSV, fixed-width, Excel, database, PDF, Word, XML, custom
     * Operations: validate, filter, sort, lookup, remove duplicates, convert, throttle, calculate, custom, and more
     * Run-time expression language for filters, validation, and calculated fields

     A typical scenario might be to:

     1. read a CSV file
     2. remove duplicate records
     3. add a calculated field
     4. remove unused columns
     5. save to a database

     -- code snippet ---------------------------------------------
     DataReader reader = new CSVReader(new File("credit-balance.csv"))
         .setFieldNamesInFirstRow(true);

     // Use only the "Rating" and "CreditLimit" fields in the duplicate test
     reader = new RemoveDuplicatesReader(reader,
         new FieldList("Rating", "CreditLimit"));

     // Add the "AvailableCredit" field; remove the "CreditLimit" and "Balance" fields
     reader = new TransformingReader(reader)
         .add(new SetCalculatedField("AvailableCredit",
             "parseDouble(CreditLimit) - parseDouble(Balance)"))
         .add(new ExcludeFields("CreditLimit", "Balance"));

     DataWriter writer = new JdbcWriter(getJdbcConnection(), "dp_credit_balance")
         .setAutoCloseConnection(true);

     JobTemplate.DEFAULT.transfer(reader, writer);
     -- code snippet ---------------------------------------------

     We look forward to hearing your feedback.

     Downloads:
     Examples:
     Getting Started:
     Forums:

     Dele Taylor
     North Concepts Inc.
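To make the scenario concrete, here is a plain-Java sketch of steps 2-4 (duplicate removal on selected fields, a calculated field, and column removal) using only the standard library. This is an illustration of the transformation logic, not the Data Pipeline API; the class name `PipelineSketch` and the `Map`-per-row representation are assumptions for the example.

```java
import java.util.*;

// Plain-Java sketch: dedupe on selected fields, add a calculated
// field, then drop the source columns (mirrors steps 2-4 above).
public class PipelineSketch {
    public static List<Map<String, String>> transform(List<Map<String, String>> rows) {
        Set<String> seen = new HashSet<>();
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            // Duplicate test on the "Rating" and "CreditLimit" fields only
            String key = row.get("Rating") + "|" + row.get("CreditLimit");
            if (!seen.add(key)) continue; // skip duplicates

            Map<String, String> copy = new LinkedHashMap<>(row);
            // Calculated field: AvailableCredit = CreditLimit - Balance
            double available = Double.parseDouble(row.get("CreditLimit"))
                             - Double.parseDouble(row.get("Balance"));
            copy.put("AvailableCredit", String.valueOf(available));
            // Remove the now-unused source columns
            copy.remove("CreditLimit");
            copy.remove("Balance");
            out.add(copy);
        }
        return out;
    }
}
```

In the library itself these steps are chained as decorating readers, so records stream through one at a time rather than being collected into a list as the sketch does.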

    Threaded Messages (2)

  2. Smooks

    How does it compare with Smooks? Memory usage, stream manipulation, performance, supported data types?
  3. Smooks

    Hi Eric,

    Thanks for the inquiry. We're currently working on a formal Smooks comparison (including performance and memory usage); we'll post an article here when it's ready. Let me address a couple of parts of your questions in the meantime.

    The data types Data Pipeline currently supports are blob, boolean, byte, char, date, datetime, double, float, int, long, short, string, time, and undefined (any other Java object). We list them in the FieldType enum.

    Here's a list of some of the stream manipulations we support:

    * Sorting (single or multi-field, with collation)
    * External (disk-based) sorting for large data sets
    * Filters (programmatic or using our run-time expression language)
    * Data validation (programmatic or using our run-time expression language)
    * Calculated fields (programmatic or using our run-time expression language)
    * Field copying
    * Field renaming
    * Duplicate records
    * Remove duplicate records (using selected fields or the entire record)
    * Field removal (black list/exclusion or white list/inclusion)
    * Field selection and arrangement
    * Field conversion and formatting (see BasicFieldTransformer for a better idea)
    * Field aggregation (minimum, maximum, average, sum, and count)
    * Lookups (from a database, another data reader, or custom)
    * Throttling and metering

    I hope this gives you a better idea about Data Pipeline.

    Cheers,
    Dele
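For readers unfamiliar with the external (disk-based) sorting mentioned above, here is a minimal plain-Java sketch of the general technique: sort fixed-size chunks in memory, spill each sorted run to a temporary file, then k-way merge the runs with a priority queue. This is standard-library Java illustrating the concept, not the Data Pipeline implementation; the class and method names are assumptions for the example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// External merge sort: sorts a text file line-by-line using bounded
// memory by spilling sorted runs to disk and merging them.
public class ExternalSort {
    public static void sort(Path in, Path out, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = r.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) { runs.add(spill(chunk)); chunk.clear(); }
            }
            if (!chunk.isEmpty()) runs.add(spill(chunk));
        }
        merge(runs, out);
    }

    // Sort one chunk in memory and write it to a temp file (a "run").
    private static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path tmp = Files.createTempFile("run", ".txt");
        Files.write(tmp, chunk);
        return tmp;
    }

    // K-way merge: a min-heap keyed on each run's current head line.
    private static void merge(List<Path> runs, Path out) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing(x -> x.head));
        for (Path p : runs) {
            Run run = new Run(Files.newBufferedReader(p));
            if (run.head != null) heap.add(run);
        }
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (!heap.isEmpty()) {
                Run run = heap.poll();
                w.write(run.head);
                w.newLine();
                run.head = run.reader.readLine(); // advance this run
                if (run.head != null) heap.add(run); else run.reader.close();
            }
        }
    }

    private static final class Run {
        final BufferedReader reader;
        String head;
        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }
}
```

The key property is that memory use is bounded by the chunk size plus one buffered line per run, which is what lets a sort handle data sets far larger than the heap.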