The Commons IO team is pleased to announce the release of commons-io-1.1.
Commons IO is a library of utility, file filter, endian and stream classes that aim to make working with IO much more pleasant. It has no dependencies. Many of these classes probably should be in the JDK itself.
This release fixes all open bugs, and adds various enhancements, including:
- FilenameUtils - A static utility class for working with filenames without File objects
- FileSystemUtils - A static utility class that allows you to get the free space on a drive
- IOUtils/FileUtils - read and write files line by line into a List
- WildcardFilter - A new filter that can match using wildcard file names
This release is binary and source compatible with 1.0 according to our tests. There are some minor semantic changes caused by bug fixes which should not affect the vast majority of 1.0 users - please check the release notes for full details. To simplify the API, there has also been a deprecation - please check the release notes. We recommend all users of commons-io-1.0 upgrade to 1.1 to pick up the numerous bug fixes.
Commons IO Website:
http://jakarta.apache.org/commons/io/
Release notes:
http://jakarta.apache.org/commons/io/upgradeto1_1.html
Download:
http://jakarta.apache.org/site/downloads/downloads_commons-io.cgi
Enjoy!
The Commons-IO Team
-
Commons IO 1.1 released (26 messages)
- Posted by: Stephen Colebourne
- Posted on: October 11 2005 05:26 EDT
-
This library could be much more efficient
- Posted by: John Shuster
- Posted on: October 11 2005 06:19 EDT
- in response to Stephen Colebourne
Like the content equals method that compares two files, character by freakin character, incurring a synchronization lock every time: http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/io/trunk/src/java/org/apache/commons/io/IOUtils.java?view=markup
Commons code is unfortunately like that, it may work but it's usually Bile-worthy. -
This library could be much more efficient
- Posted by: Fulop Levente
- Posted on: October 11 2005 06:36 EDT
- in response to John Shuster
I think you have no idea what BufferedInputStream and BufferedReader are doing.
Please take a look at those classes first before posting something stupid. -
This library could be much more efficient
- Posted by: arnaud masson
- Posted on: October 11 2005 07:08 EDT
- in response to Fulop Levente
Yes, BufferedInputStream is synchronized.
See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4097272 -
This library could be much more efficient
- Posted by: Steve Garcia
- Posted on: October 11 2005 13:59 EDT
- in response to arnaud masson
Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute. -
Re: This library could be much more efficient
- Posted by: Steve Perl
- Posted on: October 11 2005 14:31 EDT
- in response to Steve Garcia
Easier, perhaps, but not more efficient. The MD5 hash does not magically appear: both files must be read in full to compute it. -
This library could be much more efficient
- Posted by: Cameron Purdy
- Posted on: October 11 2005 14:33 EDT
- in response to Steve Garcia
Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
.. but an MD5 is slower than a byte-by-byte comparison ;-)
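For reference, here is a minimal sketch of the hash-based comparison being debated, using only JDK classes. The class and method names (Md5Compare, md5, sameDigest) are hypothetical and are not part of Commons IO. Note that each file is still read from start to finish, and that matching digests only make equal content very likely rather than certain.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not a Commons IO class. */
final class Md5Compare {

    /** Streams the whole file through MD5, so every byte is read once. */
    static byte[] md5(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        return digest.digest();
    }

    /** True when both files produce the same 16-byte MD5 digest. */
    static boolean sameDigest(Path a, Path b) throws IOException {
        return MessageDigest.isEqual(md5(a), md5(b));
    }
}
```

The digest pays off mainly when it can be stored and reused, for example when one file is compared against many others over time.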
Peace,
Cameron Purdy
Tangosol Coherence: Also supports clustered spatial indexes. -
This library could be much more efficient
- Posted by: Kirk Pepperdine
- Posted on: October 11 2005 15:44 EDT
- in response to Cameron Purdy
Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
.. but an MD5 is slower than a byte-by-byte comparison ;-)
The first time ;)
Kirk -
This library could be much more efficient
- Posted by: Guglielmo Lichtner
- Posted on: October 11 2005 19:23 EDT
- in response to Cameron Purdy
Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
.. but an MD5 is slower than a byte-by-byte comparison ;-) Peace, Cameron Purdy. Tangosol Coherence: Also supports clustered spatial indexes.
That's interesting, because to compare two bytes you have to read each block into the processor cache. If you compute a hash then you can read bigger blocks (twice as big, in fact.) I guess the hash is kept in a register.
Since fetching data from main memory takes ~200ns, whereas computing the next value of the hash probably takes 10ns (I don't know md5), it could make a difference.
Although if the files must be read from disk then this shouldn't make a difference. But if they are in the buffer cache it might make a difference. -
This library could be much more efficient
- Posted by: Cameron Purdy
- Posted on: October 11 2005 22:07 EDT
- in response to Guglielmo Lichtner
Comparing contents of files is easy - just take an MD5 hash of the file contents and compare the resultant hash value. Easier to write and quicker to execute.
.. but an MD5 is slower than a byte-by-byte comparison ;-)
That's interesting, because to compare two bytes you have to read each block into the processor cache. If you compute a hash then you can read bigger blocks (twice as big, in fact.) I guess the hash is kept in a register. Since fetching data from main memory takes ~200ns, whereas computing the next value of the hash probably takes 10ns (I don't know md5), it could make a difference. Although if the files must be read from disk then this shouldn't make a difference. But if they are in the buffer cache it might make a difference.
A modern processor does not fetch data from main memory one byte at a time. Instead, the memory manager will load an entire cache line, so streaming 1 or 2 or 4 bytes at a time will make little or no difference in terms of memory access speeds.
Further, if you can grab 2 bytes at a time to do a hash, then you can grab 2 bytes at a time to do a compare ;-) .. and I'm just suggesting that a compare is faster than the math performed within an MD5 checksum.
Peace,
Cameron Purdy
Tangosol Coherence: Clustered Coherent Caching -
This library could be much more efficient
- Posted by: Fyodor Kupolov
- Posted on: October 12 2005 15:33 EDT
- in response to Guglielmo Lichtner
<quote>
That's interesting, because to compare two bytes you have to read each block into the processor cache. If you compute a hash then you can read bigger blocks (twice as big, in fact.) I guess the hash is kept in a register.
Since fetching data from main memory takes ~200ns, whereas computing the next value of the hash probably takes 10ns (I don't know md5), it could make a difference.
</quote>
How often do you calculate MD5 hashes using native calls? :)
Byte-by-byte comparison is an easy and fast way to compare files. There is no need to read the streams in full to compare them. In a VERY optimistic scenario it is enough to read a single byte from each of the 2 streams to determine whether they differ. Moreover, no kind of hash code should be relied on to prove that two files are equal.
Of course stream comparison may be optimized to use byte arrays instead of a single byte, but that's another story...
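As an illustration of the byte-array variant mentioned above, here is a minimal sketch that compares two streams one buffer at a time and stops at the first difference. The class name StreamCompare and the 8192-byte buffer size are assumptions made for the example; this is not the Commons IO implementation.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not a Commons IO class. */
final class StreamCompare {

    private static final int BUFFER_SIZE = 8192;

    /** Compares two streams chunk by chunk, returning at the first mismatch. */
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[BUFFER_SIZE];
        byte[] bufB = new byte[BUFFER_SIZE];
        while (true) {
            int nA = fill(a, bufA);
            int nB = fill(b, bufB);
            if (nA != nB) {
                return false; // one stream ended before the other
            }
            for (int i = 0; i < nA; i++) {
                if (bufA[i] != bufB[i]) {
                    return false;
                }
            }
            if (nA < BUFFER_SIZE) {
                return true; // both streams reached end of input together
            }
        }
    }

    /** Reads until the buffer is full or the stream ends; returns bytes stored. */
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

The fill helper matters because a single read call may return fewer bytes than requested, so the two buffers have to be filled to the same level before they can be compared position by position.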
Regards,
Theodore Kupolov -
This library could be much more efficient
- Posted by: Cameron Purdy
- Posted on: October 12 2005 16:02 EDT
- in response to Fyodor Kupolov
In a VERY optimistic scenario it is enough to read a single byte from 2 streams to determine whether they differ.
And we all missed the obvious -- 99.999% of the time the file sizes will differ, thus proving inequality without anything more than an OS _stat() call. ;-)
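A minimal sketch of that shortcut, assuming both inputs are regular files: check the lengths first and only open the files when the lengths match. FileCompare is a hypothetical name, and the simple buffered per-byte loop is kept only for brevity.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not a Commons IO class. */
final class FileCompare {

    /** Rejects files of different length before reading any content. */
    static boolean contentEquals(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false; // answered from file metadata alone
        }
        try (InputStream inA = new BufferedInputStream(new FileInputStream(a));
             InputStream inB = new BufferedInputStream(new FileInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false;
                }
            }
            return inB.read() == -1; // both files must end at the same point
        }
    }
}
```

This only works for File arguments. A plain InputStream has no reliable length to consult, so a stream-based comparison cannot take the same shortcut.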
Peace,
Cameron Purdy
Tangosol Coherence: Clustered Coherent Caching -
Hah Hah
- Posted by: John Shuster
- Posted on: October 12 2005 18:04 EDT
- in response to Cameron Purdy
Good point that man. -
File sizes differ
- Posted by: Stephen Colebourne
- Posted on: October 13 2005 08:02 EDT
- in response to Cameron Purdy
And we all missed the obvious -- 99.999% of the time the file sizes will differ, thus proving inequality without anything more than an OS _stat() call.
The IOUtils methods compare streams/readers, so file length comparison isn't an option.
However, for the FileUtils method this is possible, and should have already been there. Sadly it isn't, but thanks to TSS I'll add it to SVN in the next few days ;-) -
This library could be much more efficient
- Posted by: Alain Rogister
- Posted on: October 13 2005 10:32 EDT
- in response to Cameron Purdy
And we all missed the obvious -- 99.999% of the time the file sizes will differ, thus proving inequality without anything more than an OS _stat() call. ;-) Peace, Cameron Purdy
And if the answer is less obvious, i.e. the file sizes are identical, you can perhaps have fun adapting the ideas in section 1.2 over here: http://www.springeronline.com/sgw/cda/pageitems/document/cda_downloaddocument/0,11996,0-0-45-153143-0,00.pdf -
Excuse me but...
- Posted by: John Shuster
- Posted on: October 14 2005 03:28 EDT
- in response to Alain Rogister
IMHO, comparing two files byte by byte is necessary only until the first difference is found... so the worst case (the two files are equal) involves reading as many bytes as computing an MD5 does, but if the two files are different, you'll end up comparing possibly very, very few bytes... am I missing something here? Thanks for the interesting discussion.. -
Efficiency
- Posted by: Stephen Colebourne
- Posted on: October 11 2005 07:21 EDT
- in response to John Shuster
Thanks for the input. We will have a look at that method to see if it can be improved. It wasn't changed in this release, and nobody reported an issue with it until now ;-)
Please bear in mind that Commons IO is intended to make simple uses of IO much easier. If outright maximum performance is what you care about then you should always write your own code. However, at this point I think a quick reminder is in order - premature optimisation is A Bad Thing! -
Efficiency
- Posted by: Dennis Bekkering
- Posted on: October 11 2005 18:36 EDT
- in response to Stephen Colebourne
premature optimisation is A Bad Thing!
Could you be so kind as to explain? I don't disagree, but I am curious what your motivation is. -
Optimisation vs Readability
- Posted by: Stephen Colebourne
- Posted on: October 11 2005 19:14 EDT
- in response to Dennis Bekkering
premature optimisation is A Bad Thing!
Could you be so kind as to explain? I don't disagree, but I am curious what your motivation is.
Commons IO, to a certain degree, offers you a choice between more readable/understandable code (commons-io), and the fastest possible, most optimised code (code it yourself). Sometimes you need that outright performance, but most of the time you don't (but you may think you do...)
Now, I suspect that most readers of TSS are very good coders and are perfectly capable of coding up their own version of a library like commons-io. Chances are that some will also find performance tweaks as they do so.
That's fine! If you, as a developer, want to code the classes yourself then that's great. But there is necessarily a maintenance overhead in doing so. So, maybe you should question if it's actually necessary to do so when someone has helpfully written a library for you.
The original poster had made the point that he felt he could code a faster version of a particular method (he's probably right ;-)). My comment about optimisation is merely to say, don't worry about optimising this piece of code UNLESS your performance analysis of your application shows that it's a real performance bottleneck (i.e. not just one you think will be a bottleneck).
Or to paraphrase - 'Premature optimization is the root of all evil'. (see Google)
PS. If anyone does have better/faster code for any commons-io method then please submit a bug report. We really want to know! -
Optimisation vs Readability
- Posted by: Gary Watson
- Posted on: October 12 2005 14:35 EDT
- in response to Stephen Colebourne
...My comment about optimisation is merely to say, don't worry about optimising this piece of code UNLESS your performance analysis of your application shows that it's a real performance bottleneck (i.e. not just one you think will be a bottleneck). Or to paraphrase - 'Premature optimization is the root of all evil'. (see Google)...
Bravo, and I totally agree on this point. Generally, you shouldn't need to optimise your code unless some "observation" points to a performance problem.
However, the following should always be considered:
1) there is no excuse not to code optimally during the development cycle in the first place.
2) toolkit and framework code, of all things, should be coded for performance and the lightest footprint possible. As Commons is a toolkit, it should be coded with this in mind.
3) the usual time/space/cost trade-off applies, as always...
Generally though, it looks like the Commons-IO team have considered these points carefully.
As for the MD5 debate, I side (peacefully) with the argument that a byte-by-byte comparison is quicker than hashing.
Thanks,
Gary -
How about renaming a file?
- Posted by: Emad Benjamin
- Posted on: October 11 2005 08:52 EDT
- in response to Stephen Colebourne
As simple as it sounds, I haven't found a direct API that allows you to do this!? Unless one has to manually copy the contents to the NewFile and then Delete the OldFile. -
How about renaming a file?
- Posted by: Julien Delfosse
- Posted on: October 11 2005 08:55 EDT
- in response to Emad Benjamin
Isn't that related to OS-specific functionality, as with file permissions? -
Renaming
- Posted by: Stephen Colebourne
- Posted on: October 11 2005 09:14 EDT
- in response to Emad Benjamin
As simple as it sounds, I haven't found a direct API that allows you to do this!? Unless one has to manually copy the contents to the NewFile and then Delete the OldFile.
File.renameTo normally does the trick if they are both on the same drive.
Commons IO would like to add a forceMove method at some point in the future to handle cases such as moving from one drive to another. -
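As a rough sketch of the idea, a move can try File.renameTo first and fall back to copy-then-delete when the rename is refused, for example across drives. MoveSketch is a hypothetical name and is not the planned Commons IO method. The fallback uses java.nio.file.Files.copy, which was added to the JDK in a later release than the one current at the time of this thread, and overwriting an existing target is an assumption made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not a Commons IO class. */
final class MoveSketch {

    /** Renames when possible; otherwise copies the content and deletes the source. */
    static void move(File source, File target) throws IOException {
        if (source.renameTo(target)) {
            return; // cheap path: typically succeeds on the same drive
        }
        // Fallback for cases where the rename is refused.
        Files.copy(source.toPath(), target.toPath(), StandardCopyOption.REPLACE_EXISTING);
        if (!source.delete()) {
            throw new IOException("Copied to " + target + " but could not delete " + source);
        }
    }
}
```

Neither path can work around missing permissions: if the process is not allowed to write the target or delete the source, the move fails with an exception.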
Renaming
- Posted by: Emad Benjamin
- Posted on: October 11 2005 11:47 EDT
- in response to Stephen Colebourne
Yes, File.renameTo works most of the time, unless some OS-level permission prevents you from doing so. This brings up the question of how you are able to do a forceMove if SecurityManager.checkWrite denies access? -
cool, thanks
- Posted by: Michael Boyd
- Posted on: October 11 2005 11:26 EDT
- in response to Stephen Colebourne
The other day I was unit testing a "save file" operation and couldn't figure out how to convert a File object to a byte array without using a multi-part object. Looks like this is an oldie in the Commons IO API, but now that I know about it, thanks. That helps. -
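The post does not say which Commons IO method is meant, so the following is only a JDK-only sketch of the general technique: stream the file into a ByteArrayOutputStream and take its contents. FileBytes and the 4096-byte buffer are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not a Commons IO class. */
final class FileBytes {

    /** Reads the entire file into memory and returns its bytes. */
    static byte[] toByteArray(File file) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

In a unit test the returned array can then be checked with java.util.Arrays.equals against the bytes that were expected to be saved.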
Commons IO 1.1 released
- Posted by: Adam Clarricoates
- Posted on: October 12 2005 13:56 EDT
- in response to Stephen Colebourne
Maybe I'm missing something here, but just because the MD5 hash of a file equals the MD5 hash of another file doesn't necessarily mean that the contents are the same.
Adam -
Commons IO 1.1 released
- Posted by: Gary Watson
- Posted on: October 12 2005 14:44 EDT
- in response to Adam Clarricoates
Maybe I'm missing something here, but just because the MD5 hash of a file equals the MD5 hash of another file doesn't necessarily mean that the contents are the same. Adam
Adam, it is generally accepted that the MD5 hashing algorithm is relatively collision free, although it has been recently (a few years ago) proven that there are certain cases where it is possible (with similar inputs) to cause a collision to occur (Google it for history on this, but http://en.wikipedia.org/wiki/MD4 is a good one).
MD5 was brought in to replace the weaker MD4. It's gotta be better than CRC32 though, right?
Hope that helps.
- Gary