News: Commons IO 1.1 released

  1. Commons IO 1.1 released (26 messages)

    The Commons IO team is pleased to announce the release of commons-io-1.1.

    Commons IO is a library of utility, file filter, endian and stream classes that aim to make working with IO much more pleasant. It has no dependencies. Many of these classes probably should be in the JDK itself.

    This release fixes all open bugs, and adds various enhancements, including:
    - FilenameUtils - A static utility class for working with filenames without File objects
    - FileSystemUtils - A static utility class that allows you to get the free space on a drive
    - IOUtils/FileUtils - read and write files line by line into a List
    - WildcardFilter - A new filter that can match using wildcard file names

    This release is binary and source compatible with 1.0 according to our tests. There are some minor semantic changes caused by bug fixes which should not affect the vast majority of 1.0 users - please check the release notes for full details. To simplify the API, there has also been a deprecation - please check the release notes. We recommend all users of commons-io-1.0 upgrade to 1.1 to pick up the numerous bug fixes.

    Commons IO Website:
    http://jakarta.apache.org/commons/io/

    Release notes:
    http://jakarta.apache.org/commons/io/upgradeto1_1.html

    Download:
    http://jakarta.apache.org/site/downloads/downloads_commons-io.cgi

    Enjoy!
    The Commons-IO Team

    Threaded Messages (26)

  2. Like the content equals method that compares two files, character by freakin character, incurring a synchronization lock every time: http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/io/trunk/src/java/org/apache/commons/io/IOUtils.java?view=markup

    Commons code is unfortunately like that: it may work, but it's usually bile-worthy.
  3. I think you have no idea what BufferedInputStream and BufferedReader are doing.
    Please take a look at those classes first before posting something stupid.
  4. Yes, BufferedInputStream is synchronized.
    See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4097272
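For readers without the linked source to hand, the approach being argued about can be sketched roughly as follows. This is a minimal illustration, not the actual Commons IO code; the class name `StreamCompare` is hypothetical. Both streams are wrapped in a `BufferedInputStream`, so most reads are served from an in-memory buffer, but every single-byte `read()` call still has to acquire a lock, which is the cost the posters above disagree about.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not the Commons IO source.
final class StreamCompare {
    private StreamCompare() {}

    /** Returns true if both streams yield exactly the same sequence of bytes. */
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        // Wrap unbuffered streams so that single-byte reads hit a memory buffer.
        InputStream in1 = (a instanceof BufferedInputStream) ? a : new BufferedInputStream(a);
        InputStream in2 = (b instanceof BufferedInputStream) ? b : new BufferedInputStream(b);

        int x = in1.read();
        while (x != -1) {
            if (x != in2.read()) {
                return false; // first differing byte, or the second stream ended early
            }
            x = in1.read();
        }
        // The first stream is exhausted; the contents match only if the second is too.
        return in2.read() == -1;
    }
}
```

Calling `StreamCompare.contentEquals(in1, in2)` stops at the first differing byte, so unequal inputs are usually rejected long before either stream is fully read.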
  5. Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
  6. Easier, perhaps, but not more efficient. The MD5 hash does not magically appear-- the file must be read to compute the hash.
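To make the point concrete, here is a rough sketch of the hash-based comparison, using the JDK's `java.security.MessageDigest`. The class and method names (`Md5Sketch`, `md5`, `sameDigest`) are made up for this example. Note that each stream is read to the end before any answer is available.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
final class Md5Sketch {
    private Md5Sketch() {}

    /** Reads the whole stream and returns its 16-byte MD5 digest. */
    static byte[] md5(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n); // every byte of the input passes through the digest
        }
        return md.digest();
    }

    /** Compares two streams by digest; both are always consumed completely. */
    static boolean sameDigest(InputStream a, InputStream b) throws IOException {
        return MessageDigest.isEqual(md5(a), md5(b));
    }
}
```

Unlike a direct comparison, this cannot stop early: two inputs that differ in their first byte are still both read in full.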
  7. Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.

    .. but an MD5 is slower than a byte-by-byte comparison ;-)

    Peace,

    Cameron Purdy
    Tangosol Coherence: Also supports clustered spatial indexes.
  8. Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
    .. but an MD5 is slower than a byte-by-byte comparison ;-)

    The first time ;)

    Kirk
  9. Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
    .. but an MD5 is slower than a byte-by-byte comparison ;-) Peace, Cameron Purdy, Tangosol Coherence: Also supports clustered spatial indexes.

    That's interesting, because to compare two bytes you have to read each block into the processor cache. If you compute a hash then you can read bigger blocks (twice as big, in fact.) I guess the hash is kept in a register.

    Since fetching data from main memory takes ~200ns, whereas computing the next value of the hash probably takes 10ns (I don't know md5), it could make a difference.

    Although if the files must be read from disk then this shouldn't make a difference. But if they are in the buffer cache it might make a difference.
  10. Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.

    .. but an MD5 is slower than a byte-by-byte comparison ;-)

    That's interesting, because to compare two bytes you have to read each block into the processor cache. If you compute a hash then you can read bigger blocks (twice as big, in fact.) I guess the hash is kept in a register. Since fetching data from main memory takes ~200ns, whereas computing the next value of the hash probably takes 10ns (I don't know md5), it could make a difference. Although if the files must be read from disk then this shouldn't make a difference. But if they are in the buffer cache it might make a difference.

    A modern processor does not fetch data from main memory. Instead, the memory manager will load an entire cache line, so streaming 1 or 2 or 4 bytes at a time will make little or no difference in terms of memory access speeds.

    Further, if you can grab 2 bytes at a time to do a hash, then you can grab 2 bytes at a time to do a compare ;-) .. and I'm just suggesting that a compare is faster than the math performed within an MD5 checksum.

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Coherent Caching
  11. <quote>
    That's interesting, because to compare two bytes you have to read each block into the processor cache. If you compute a hash then you can read bigger blocks (twice as big, in fact.) I guess the hash is kept in a register.

    Since fetching data from main memory takes ~200ns, whereas computing the next value of the hash probably takes 10ns (I don't know md5), it could make a difference.
    </quote>
    How often do you calculate MD5 hashes using native calls? :)
    Byte-by-byte comparison is an easy and fast way to compare files. There is no need to read the streams completely in order to compare them. In a VERY optimistic scenario it is enough to read a single byte from each of the 2 streams to determine whether they differ. Moreover, a hash code of any kind must not be used to establish equality.

    Of course stream comparison may be optimized to use byte arrays instead of a single byte, but it's another story...

    Regards,
    Theodore Kupolov
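The byte-array optimisation mentioned above might look something like the sketch below. It is one possible shape rather than a tuned implementation; the class name `BlockCompare` and the 8 KB block size are arbitrary choices for this example. Blocks are filled completely before being compared, because a single `read(byte[])` call may return fewer bytes than requested.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only.
final class BlockCompare {
    private static final int BLOCK = 8192; // arbitrary block size

    private BlockCompare() {}

    /** Compares two streams one block at a time, stopping at the first mismatch. */
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[BLOCK];
        byte[] bufB = new byte[BLOCK];
        while (true) {
            int nA = fill(a, bufA);
            int nB = fill(b, bufB);
            if (nA != nB) {
                return false; // one stream ended before the other
            }
            for (int i = 0; i < nA; i++) {
                if (bufA[i] != bufB[i]) {
                    return false; // early exit on the first differing byte
                }
            }
            if (nA < BLOCK) {
                return true; // both streams ended at the same offset
            }
        }
    }

    /** Reads until the buffer is full or the stream ends; returns the bytes read. */
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

This keeps the early exit of the single-byte loop while replacing one method call per byte with one per block.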
  12. In a VERY optimistic scenario it is enough to read a single byte from 2 streams to determine whether they differ.

    And we all missed the obvious -- 99.999% of the time the file sizes will differ, thus proving inequality without anything more than an OS _stat() call. ;-)

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Coherent Caching
  13. Hah Hah

    Good point that man.
  14. File sizes differ

    And we all missed the obvious -- 99.999% of the time the file sizes will differ, thus proving inequality without anything more than an OS _stat() call.

    The IOUtils methods compare streams/readers, so file length comparison isn't an option.

    However for the FileUtils method this is possible, and should have already been there. Sadly it isn't, but thanks to TSS I'll add it to SVN in the next few days ;-)
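A length check in front of the content comparison could be sketched like this. It is not the change that was later committed to Commons IO; the class name `FileCompare` is hypothetical and the example uses `java.nio.file` (Java 7 or later), which did not exist when this thread was written.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class FileCompare {
    private FileCompare() {}

    /** Compares two files, rejecting on size before reading any content. */
    static boolean contentEquals(Path a, Path b) throws IOException {
        // Cheap metadata check: files of different sizes cannot be equal.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream in1 = new BufferedInputStream(Files.newInputStream(a));
             InputStream in2 = new BufferedInputStream(Files.newInputStream(b))) {
            int x;
            while ((x = in1.read()) != -1) {
                if (x != in2.read()) {
                    return false;
                }
            }
            // Guards against a file changing size between the check and the read.
            return in2.read() == -1;
        }
    }
}
```

As noted above, this shortcut only applies to files; for arbitrary streams or readers the length is not known in advance.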
  15. And we all missed the obvious -- 99.999% of the time the file sizes will differ, thus proving inequality without anything more than an OS _stat() call. ;-) Peace, Cameron Purdy
    And if the answer is less obvious i.e. file sizes are identical, you can perhaps have fun adapting the ideas in section 1.2 over here: http://www.springeronline.com/sgw/cda/pageitems/document/cda_downloaddocument/0,11996,0-0-45-153143-0,00.pdf
  16. Excuse me but...

    IMHO, comparing two files byte by byte is necessary only until the first difference is found... so the worst case (the two files are equal) involves reading as many bytes as computing an MD5 does, but if the two files are different, you'll end up comparing possibly very, very few bytes... am I missing something here? Thanks for the interesting discussion.
  17. Efficiency

    Thanks for the input. We will have a look at that method to see if it can be improved. It wasn't changed in this release, and nobody reported an issue with it until now ;-)

    Please bear in mind that Commons IO is intended to make simple uses of IO much easier. If outright maximum performance is what you care about then you should always write your own code. However, at this point I think a quick reminder is in order - premature optimisation is A Bad Thing!
  18. Efficiency

    premature optimisation is A Bad Thing!

    Could you be so kind as to explain? I don't disagree, but am curious what your motivation is.
  19. premature optimisation is A Bad Thing!
    Could you be so kind as to explain? I don't disagree, but am curious what your motivation is.

    Commons IO, to a certain degree, offers you a choice between more readable/understandable code (commons-io), and the fastest possible most optimised code (code it yourself). Sometimes you need that outright performance, but most of the time you don't (but you may think you do...)

    Now, I suspect that most readers of TSS are very good coders and are perfectly capable of coding up their own version of a library like commons-io. Chances are that some will also find performance tweaks as they do so.

    That's fine! If you, as a developer, want to code the classes yourself then that's great. But there is necessarily a maintenance overhead in doing so. So maybe you should question whether it's actually necessary when someone has helpfully written a library for you.

    The original poster had made the point that he felt he could code a faster version of a particular method, and he's probably right ;-) My comment about optimisation is merely to say, don't worry about optimising this piece of code UNLESS your performance analysis of your application shows that it's a real performance bottleneck (i.e. not just one you think will be a bottleneck).

    Or to paraphrase - 'Premature optimization is the root of all evil'. (see Google)

    PS. If anyone does have better/faster code for any commons-io method then please submit a bug report. We really want to know!
  20. Optimisation vs Readability

    ...My comment about optimisation is merely to say, don't worry about optimising this piece of code UNLESS your performance analysis of your application shows that it's a real performance bottleneck (i.e. not just one you think will be a bottleneck). Or to paraphrase - 'Premature optimization is the root of all evil'. (see Google)...

    Bravo, and I totally agree on this point. Generally, you
    shouldn't need to optimise your code unless some "observation" points to a performance problem.

    However, the following should always be considered:

    1) there is no excuse not to code optimally during the development cycle in the first place.
    2) toolkit and framework code, of all things,
    should be coded for performance and lightest footprint
    possible. As Commons is a toolkit, it should be
    coded with this in mind.
    3) the usual time/space/cost trade-off applies, as
    always...

    Generally though, it looks like the Commons-IO team
    have considered these points carefully.

    As for the MD5 debate, I side (peacefully) with the
    byte-byte comparison is quicker than hashing argument.

    Thanks,

    Gary
  21. How about renaming a file?

    As simple as it sounds, I haven't found a direct API that allows you to do this!? Unless one has to manually copy the contents to the NewFile and then Delete the OldFile.
  22. How about renaming a file?

    Ain't that related to OS-specific functionality? Same as for file permissions.
  23. Renaming

    As simple as it sounds, I haven't found a direct API that allows you to do this!? Unless one has to manually copy the contents to the NewFile and then Delete the OldFile.

    File.renameTo normally does the trick if they are both on the same drive.

    Commons IO would like to add a forceMove method at some point in the future to handle cases such as moving from one drive to another.
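As a rough idea of what such a move could involve, the sketch below tries `File.renameTo` first and falls back to copy-then-delete when the rename is refused, for example across drives. The class and method names (`MoveSketch`, `moveWithFallback`) are invented for this example and are not the Commons IO API; overwriting an existing destination in the fallback path is an assumption made here, not something stated in the thread.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not a Commons IO method.
final class MoveSketch {
    private MoveSketch() {}

    /** Moves src to dest, falling back to copy-and-delete if a rename fails. */
    static void moveWithFallback(File src, File dest) throws IOException {
        // renameTo reports failure only through its boolean result.
        if (src.renameTo(dest)) {
            return;
        }
        // Fallback: copy the bytes, then remove the original.
        // Assumption: an existing destination is overwritten.
        Files.copy(src.toPath(), dest.toPath(), StandardCopyOption.REPLACE_EXISTING);
        if (!src.delete()) {
            throw new IOException("Copied to " + dest + " but could not delete " + src);
        }
    }
}
```

On Java 7 and later, `java.nio.file.Files.move` handles much of this itself, including moves between file stores when an atomic move is not requested.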
  24. Renaming

    Yes, File.renameTo works most of the time unless you have some OS-level permission that prevents you from doing so. This brings up the question of how you are able to do a forceMove if SecurityManager.checkWrite denies access.
  25. cool, thanks

    The other day I was unit testing a "save file" operation and couldn't figure out how to convert a File object to a byte array without using a multi-part object. Looks like this is an oldie in the commons IO api but now that I know about it, thanks. That helps.
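The post does not say which Commons IO method is meant. For comparison, on Java 7 and later the same conversion can be done with the JDK alone; the sketch below uses a made-up class name, `FileBytes`, and is suitable for files small enough to fit comfortably in memory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper for illustration only.
final class FileBytes {
    private FileBytes() {}

    /** Reads the entire file into a byte array. */
    static byte[] toByteArray(File file) throws IOException {
        return Files.readAllBytes(file.toPath());
    }
}
```

In a unit test for a "save file" operation, the returned array can be compared against the expected bytes with `java.util.Arrays.equals`.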
  26. Commons IO 1.1 released

    Maybe I'm missing something here, but just because the MD5 hash of a file equals the MD5 hash of another file doesn't necessarily mean that the contents are the same.


    Adam
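One way to reconcile this with the earlier "the first time" remark is to treat a digest as a quick rejection test only. In the sketch below, which assumes the digests were computed and stored earlier, differing digests prove the contents differ, while matching digests are confirmed with a full comparison. The class and method names (`DigestThenCompare`, `sameContent`) are hypothetical.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class DigestThenCompare {
    private DigestThenCompare() {}

    /**
     * Compares two contents using previously computed digests as a filter.
     * digestA and digestB are assumed to be digests of a and b respectively.
     */
    static boolean sameContent(byte[] a, byte[] digestA, byte[] b, byte[] digestB) {
        // Different digests prove the contents differ.
        if (!MessageDigest.isEqual(digestA, digestB)) {
            return false;
        }
        // Equal digests are only strong evidence; confirm byte by byte.
        return Arrays.equals(a, b);
    }
}
```

The confirmation step costs a full read only in the case where the digests already agree, which is also the case where a direct comparison could not have exited early.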
  27. Commons IO 1.1 released

    Maybe I'm missing something here, but just because the MD5 hash of a file equals the MD5 hash of another file doesn't necessarily mean that the contents are the same. Adam

    Adam, it is generally accepted that the MD5 hashing algorithm
    is relatively collision free; although it has been
    recently (a few years ago) proven that there are
    certain cases where it is possible (with similar inputs)
    to cause a collision to occur (Google it for history on
    this, but http://en.wikipedia.org/wiki/MD4 is a good one).

    MD5 was brought in to replace the weaker MD4. It's gotta be
    better than CRC32 though, right?

    Hope that helps.

    - Gary