News: Reading L10n Comma-Separated Values

  1. Reading L10n Comma-Separated Values (6 messages)

    Although you might think of Comma-Separated Values (CSV) files as simple files in which each value is separated by a comma, this is far from reality. The most important part of a CSV file is its delimiter, and it is natural to assume that the delimiter, as the name suggests, is a comma. That assumption does not always hold: the CSV name is often applied to files that use other delimiters. A typical example is found in countries where the comma (,) is the decimal separator. Because an unquoted comma could then be either part of a number or a column separator, the semicolon (;) is used as the delimiter instead.

    If your application is distributed across different countries and must support Comma-Separated Values files, you need to take this into account.

    Let's see how we can resolve this problem so that we can read CSV files according to the country.

    For reading CSV files we are going to use the opencsv library. Its API contains one class for reading CSV files (CSVReader) and another for writing them (CSVWriter). By default it uses a comma as the column delimiter, but you can also specify which delimiter character to use.
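
    One way to pick the delimiter per country is sketched below using only the JDK. It relies on the heuristic described above (a comma decimal separator implies a semicolon column delimiter), which is an assumption rather than a guarantee for every locale or every file. The class and method names (CsvDelimiters, delimiterFor) are hypothetical and are not part of opencsv.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class CsvDelimiters {
    private CsvDelimiters() {}

    /**
     * Heuristic (an assumption, not a standard): locales whose decimal
     * separator is a comma conventionally use a semicolon as the CSV
     * column delimiter; all other locales fall back to a comma.
     */
    static char delimiterFor(Locale locale) {
        char decimal = DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
        return decimal == ',' ? ';' : ',';
    }
}
```

    The returned character, for example from CsvDelimiters.delimiterFor(Locale.getDefault()), could then be handed to the reader as its delimiter. Files exchanged between countries may not follow the locale of the machine reading them, so a user-visible override is probably still worth having.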

    Threaded Messages (6)

  2. Oh please...

    First off, CSV isn't a lot of things.

    It's not standardized, and it's not simple. (Actually it is, but not the way people think.)

    It's also not new.

    However, there is a de facto implementation of CSV, one that actually "gets it right".

    And that implementation is Excel.

    The VAST majority of small CSV files are coming from Excel because folks don't want to read Excel spreadsheets, and CSV is easier.

    It's certainly easier than Excel, but (as highlighted by this article) it has pitfalls as well.

    Well, the truth is that it doesn't really have pitfalls. Naive implementations do.

    There's nothing wrong with having commas in CSV values. That's why CSV has quoting. There's nothing wrong with having quotes in CSV values. That's why CSV has escaping. With quotes and escaping, you can put pretty much anything in a CSV file, including newlines.

    Which brings us once again to naive implementations. You can't assume that each row is delimited by a newline.

    Rows in CSV ARE delimited by newlines, but not every newline is necessarily a row delimiter. A newline could be inside a quoted value.

    So, if your CSV reader is little more than readline followed by split(","), you're in for some pain.

    But OpenCSV is actually pretty good and handles this madness for you.
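
    To make the quoting point concrete, here is a minimal sketch of a quote-aware parser in plain Java. It assumes Excel-style conventions (double quote as the text qualifier, a doubled quote as the escape) and skips blank lines; it is an illustration only, not how OpenCSV is implemented, and the name MiniCsvParser is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniCsvParser {
    private MiniCsvParser() {}

    /** Parses CSV text into rows, honoring quoted fields, doubled quotes, and embedded newlines. */
    static List<List<String>> parse(String text, char delimiter) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false; // true once the current row has any content

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // delimiters and newlines are literal here
                }
            } else if (c == '"') {
                inQuotes = true;
                pending = true;
            } else if (c == delimiter) {
                row.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (c == '\n' || c == '\r') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++; // treat CRLF as a single line break
                }
                if (pending) { // blank lines are skipped
                    row.add(field.toString());
                    field.setLength(0);
                    rows.add(row);
                    row = new ArrayList<>();
                    pending = false;
                }
            } else {
                field.append(c);
                pending = true;
            }
        }
        if (pending) { // last row without a trailing line break
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }
}
```

    Given input with a quoted field that spans two lines, MiniCsvParser.parse(text, ',') keeps that field in one row, whereas splitting the same text on newlines first would cut the row in half.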

     

  3. opencsv

    "But OpenCSV is actually pretty good and handles this madness for you."

    That's why he uses opencsv:

    "For reading CSV files we are going to use openCSV library."

     

  4. opencsv

    Also, Apache Commons Lang has a pretty good utility for parsing delimited text. Notice I don't say "comma delimited text". It has lots of options for delimiters, quote characters, ignoring empty values, trimming leading and trailing whitespace, etc. The one thing it doesn't handle is line delimiters: it accepts a single record at a time, so you'd have to split the input into records on your own.

     

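    // StrTokenizer lives in org.apache.commons.lang.text (Commons Lang 2.x)
    // or org.apache.commons.lang3.text (Commons Lang 3.x).
    import org.apache.commons.lang3.text.StrTokenizer;

    // Prototyped instance for general CSV parsing
    StrTokenizer tokenizer = StrTokenizer.getCSVInstance();
    tokenizer.reset(input); // input holds a single CSV record
    // There are iterator-style methods as well.
    String[] tokens = tokenizer.getTokenArray();
    // Reuse the same tokenizer with additional input.
    tokenizer.reset(input);
    tokens = tokenizer.getTokenArray();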

     

     

  5. In the Nordics, CSV is separated by semicolon.

    If you're producing CSV solely for Excel, you can put the following text on the first line to specify the separator as semicolon: sep=;

    I believe sep=, works as well.
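
    A rough sketch of producing such a file in plain Java follows. It assumes the output is meant only for Excel, since other CSV readers will likely treat the sep= line as an ordinary data row. The class and method names (ExcelCsvWriter, quote, write) are hypothetical.

```java
import java.util.List;

final class ExcelCsvWriter {
    private ExcelCsvWriter() {}

    /** Quotes a field when it contains the delimiter, a quote, or a line break. */
    static String quote(String value, char delimiter) {
        boolean needsQuotes = value.indexOf(delimiter) >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        // Escape embedded quotes by doubling them, then wrap the field in quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Renders rows as CSV text, preceded by the Excel-specific "sep=" hint line. */
    static String write(List<List<String>> rows, char delimiter) {
        StringBuilder out = new StringBuilder();
        out.append("sep=").append(delimiter).append("\r\n");
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    out.append(delimiter);
                }
                out.append(quote(row.get(i), delimiter));
            }
            out.append("\r\n");
        }
        return out.toString();
    }
}
```

    With ';' as the delimiter, a value such as 1,50 needs no quoting, while a value that itself contains a semicolon is wrapped in double quotes.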

     

  6. Superb CSV library

    I find SuperCSV to be quite superb when it comes to reading CSV :-)

  7. Give FlatPack a go..

    FlatPack is an Apache 2 open source project that handles CSV or fixed-width parsing.

    The delimiter can be any character, such as ';', and you can also specify the text qualifier used to wrap a value that contains the delimiter. For instance, Excel wraps a value in " if it contains a comma.

    http://flatpack.sf.net

    Usage is trivial, and columns are accessed by name (taken from the file if it has column names, or from a config file).

     

    Enjoy

    Benoit

     

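    import java.io.FileReader;
    import net.sf.flatpack.DataSet;
    import net.sf.flatpack.DefaultParserFactory;
    import net.sf.flatpack.Parser;

    // Obtain the proper parser for your needs
    Parser parser = DefaultParserFactory.getInstance().newDelimitedParser(
            new FileReader("DataFile.txt"), // text file to parse
            ',',  // delimiter
            '"'); // text qualifier

    // Obtain the DataSet
    DataSet ds = parser.parse();

    while (ds.next()) { // loop through the file
        String value = ds.getString("mycolumnName"); // read a column by name
    }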