Home

News: Java Best Practices – Vector vs ArrayList vs HashSet

  1. Continuing our series of articles concerning proposed practices while working with the Java programming language, we are going to perform a performance comparison between the three probably most used Collection implementation classes. To make things more realistic we are going to test against a multi–threading environment so as to discuss and demonstrate how to utilize Vector, ArrayList and/or HashSet for high performance applications.


    Read more:

    Java Code Geeks: Java Best Practices – Vector vs ArrayList vs HashSet

    Threaded Messages (18)

  2. The analysis you've provided is in no way shocking when you consider the implementation details of the varying Collection types.  That said, your blog posting is an extremely good explanation of why/when to use the varying Collection types for junior Java developers.

    However, it would have been interesting for comparison, if you had also included the concurrent Collections.

  3. This test is a very important as I use ArrayLists most of the time

    did not know why sets overcome the vectors and arraylists in removing elements

    http://www.thejavacode.com

  4. This test is a very important as I use ArrayLists most of the time

    did not know why sets overcome the vectors and arraylists in removing elements

    http://www.thejavacode.com

    The reason is that on ArrayList and Vecrtor, the remove() method will result in a System.arraycopy() call if any element except the last element is removed (the last element being the element with index: size - 1).  Removing the first element means the entire rest of the array is copied which is an O(n) operation.  Since the test removes all the elements in the List, the full test becomes O(n^2) (slow.)

    HashSet remove does not do any such array copies so it's remove is O(1) or constant time.  For the full test it then is O(n) (fast.).  If the tests were rewritten to remove the last element from the ArrayList and Vector, you would likely similar performance to the HashSet.

  5. What is the point of introducing multiple threads into this test?  Since all the interactions are put through a synchronized bottleneck, I doubt your results are much different than they would be if you just used a single thread. Most likely, the synchronization overhead simply dillutes the results.  I don't know that this is really 'realistic' either.  The vast majority of the time collections are used, they need not be synchronized.

    Multithreading might be interesting if you were testing these against concurrent collections.

    Also, if you use the constructor for each type that takes an integer, the performance for your add() tests will be different.  I would expect that the HashSet would perform much better.  Additionally, the performance of HashSet is affected by how unique the hashcodes are for the objects you add.  None of your hashcodes collide in this test.  Lastly, the size of the collections is going to make a big difference in the relative performance.

    In any event, this kind of thing has been done to death.  Here's a paper from 2008 with a much more in-depth analysis: http://scrtchpad.wordpress.com/2008/10/21/java-collections-performance-benchmark/

  6. We selected to use multiple threads to all the tests since multi-threading is a must when performance is concerned. And when we say performance we refer to the overall performance of an application not just the portion where you may have to use a collection implementation class. Thus we selected to simulate a multi-threading environment since we presume developing a high performance application. That is 'realistic' for us.

    We could use the constructor for each type that initializes each collection size, and i agree that this would provide slightly different performance results (HashSet would benefit the most surely) but our intention was to perform a relative performance comparison between the aforementioned collection implementation classes and not to provide tips on how to finetune each one of them so as to perform better.

    Lastly as clearly stated in all test cases, for the comparison to be fare we assume unique, not NULL elements and we do not mind the ordering of elements in the collections. This, to our opinion, is the most common case scenario.

     

  7. We selected to use multiple threads to all the tests since multi-threading is a must when performance is concerned. And when we say performance we refer to the overall performance of an application not just the portion where you may have to use a collection implementation class. Thus we selected to simulate a multi-threading environment since we presume developing a high performance application. That is 'realistic' for us.

    If you are developing high-performance parallel applications, you should be looking at the concurrent collections and other approaches.  I also don't think you understand my point.  Three threads adding 100 items each through a synchonization bottleneck is likely to perform at roughly the same speed as a single thread adding 300 items.  If you don't believe that, you should test it.

    We could use the constructor for each type that initializes each collection size, and i agree that this would provide slightly different performance results (HashSet would benefit the most surely) but our intention was to perform a relative performance comparison between the aforementioned collection implementation classes and not to provide tips on how to finetune each one of them so as to perform better.

    The problem is that you are using these in a naive way so all you are showing is how slow they can be.  See my response to Muhammed above.  You are removing items from the lists in the slowest possible way.  Is your intention to compare worst case scenarios?

    Lastly as clearly stated in all test cases, for the comparison to be fare we assume unique, not NULL elements and we do not mind the ordering of elements in the collections. This, to our opinion, is the most common case scenario.

    You can have unique objects without having unique hashcodes and if you don't care about the ordering, why not test removing items from the end of the lists?

  8. We selected to use multiple threads to all the tests since multi-threading is a must when performance is concerned. And when we say performance we refer to the overall performance of an application not just the portion where you may have to use a collection implementation class. Thus we selected to simulate a multi-threading environment since we presume developing a high performance application. That is 'realistic' for us.

    Just one more note on this.  Those developing high-performance concurrent applications should seek to avoid the use of synchronized wrappers around collections and synchronized blocks around iterators.  There's no simple answer as to what the best alternative would be because it depends on the problem.  But, in general, synchronization is the enemy of performance.

    If you want to talk about a specific problem and how to make it perform well in a multi-threaded application, that would be interesting.

  9. This is pointless[ Go to top ]

    This analysis is pointless. List and Set in java have 2 different semantics and so of course 2 different behaviors and then performances under the proposed use cases. It makes much more sense to compare data structures with the same semantic (i.e. ArrayList and LinkedList). What I mean to say is that In my software I could seamlessly replace an ArrayList with a LinkedList (even based on performance considerations) without changing its behavior. I cannot do the same replacing an ArrayList with an HashSet because a Set doesn't keep the insertion order of its items and doesn't allow to store the same item more than once.

    To resume it is pointless to compare 2 data strctures not having common functionalities.

    Adding elements is not a common functionality between List and Set: try to add the same item twice to a List and do the same with a Set. Will you obtain the same result? If the answer is no, it isn't a common functionality.

    Iterating over elements is not a common functionality between List and Set: try to add 10 items to a List and do the same with a Set. Will you obtain the same sequence of items while iterating over them? If the answer is no, it isn't a common functionality.

    The fact that 2 functionalities have the same name doesn't mean that they are in common provided that they have 2 different behaviors.

  10. This is pointless[ Go to top ]

    As clearly stated in our test cases, to make a fare comparison we presume that no NULL elements are allowed, that all elements are unique and that we do not mind the ordering of elements withing each collection implementation class. It is obvious that when someone has specific requirements, such as to maintain the insertion order in the collection or to be able to insert the same element multiple times, then the appropriate collection implementation class must be used.

  11. Not a valid comparison[ Go to top ]

    While you may have stated your conditions for the testing, I would have to agree with Mario. HashSet or any set implementation should not be in the testing simply because HashSet has a different sementic compared to Vectors and ArrayLists. 

    If however, you had compared the performance of Vectors, ArrayLists, and LinkedLists in a multi-threaded environment, it would have more sense versus including sets in your test.

  12. This is pointless[ Go to top ]

    I believe what Mario is saying is that you are comparing Apples to Oranges.

    In your post, you compare Sets with List implementations; doing so, you are testing the implementation of classes with totally different contracts. If you compared different Set (like Hash and Tree) implementations, that's ok. If you compare different List (like ArrayList and  Linked) implementations, that's ok too. If you compare Lists with Sets, then I don't know what you are testing.

    Someone trained in programing will first choose the interface corresponding to his need. If order and indexed (numerical) access are important, then a List is appropriate. If locating the presence of an element is important, then you need a Set.  From your requirements (storing and retrieving objects), I would bet that a Set should almost always win (if hash codes are good). About the test cases:

    - TC1 is ok

    - TC2 and TC3 do not test your requirements.

    - TC4 is plain weird: you treat a List as a queue (ok, that's a way of looking at things). As for the Set test, I have no clue what you are testing. I would have expected the removal of a random element as order is uncalled for. What you do is unclear and seems artificial. 

    I think it's a good start (it actually reminds me of a programing assignment), but these tests require a bit more thought. and in its current state, they do not show anything interesting. I've got the following questions.

    - Why do you not test specialised data structures if you want to consider parallelism? If I want to write a high performance multi-core app, I would probably need to use complex strategies to perform efficient synchronisation.

    -Why do you use the iterator's remove() method? It's the first time I hear someone considering that it's a standard way of removing objects.

  13. This is pointless[ Go to top ]

    I believe what Mario is saying is that you are comparing Apples to Oranges.

     

    In your post, you compare Sets with List implementations; doing so, you are testing the implementation of classes with totally different contracts. If you compared different Set (like Hash and Tree) implementations, that's ok. If you compare different List (like ArrayList and  Linked) implementations, that's ok too. If you compare Lists with Sets, then I don't know what you are testing.

    <!-- @page { margin: 0.79in } P { margin-bottom: 0.08in } -->

    We are testing different Collection implementation classes against a specific case scenario, which we believe is the most commonly used. I totally agree that List and Set have different semantics and contracts, nevertheless this does not prohibit them to be all candidates for implementing a certain case scenario.

    Someone trained in programing will first choose the interface corresponding to his need. If order and indexed (numerical) access are important, then a List is appropriate. If locating the presence of an element is important, then you need a Set.  From your requirements (storing and retrieving objects), I would bet that a Set should almost always win (if hash codes are good).

    Which one would you choose if your case scenario dictated that order and indexed access nor locating the presence of an element are important? If you had to implement a case scenario where multiple threads provided unique not null objects that you should asynchronously process which one would you favor and use? Yes someone trained in programing should first choose the interface corresponding to his needs (the case scenario that must be implemented). According to our specific test case scenario all collection implementation classes can be utilized. You say that you would "bet" that a Set should always win, our article demonstrates that a Set does not outperform Lists under all test cases, furthermore we show you a way of utilizing a Set so as to achieve maximum performance results for the combined (adding and removing elements concurrently) operation.

    Now for your questions :

    Why do you not test specialised data structures if you want to consider parallelism? If I want to write a high performance multi-core app, I would probably need to use complex strategies to perform efficient synchronisation.

    You do not have to use specialized data structures in order to consider parallelism. In the vast majority of cases plain is better and way more efficient. Test case #4 in particular demonstrates an "efficient synchronization" approach where with the use of two distinct HashSets we synchronize on different objects for the addition and retraction procedures so as to achieve maximum performance results.

    Why do you use the iterator's remove() method? It's the first time I hear someone considering that it's a standard way of removing objects.

    We never stated that using iterator's remove() method is the "standard" way of removing objects. We clearly state that we use this approach in order for the comparison to be fare. Furthermore we comment on the performance results when using the "standard" remove(0) operation for Lists.

  14. This is pointless[ Go to top ]

    Which one would you choose if your case scenario dictated that order and indexed access nor locating the presence of an element are important? If you had to implement a case scenario where multiple threads provided unique not null objects that you should asynchronously process which one would you favor and use?

    I develop in Java from about 12 years and I have never met a similar scenario in my working life. Did somebody experience a similar use-case?

    More in detail EITHER I can accept the same item more than once in my Collection OR I can't. This constraint must be enforced by the underlying Collection not by the piece of code that use it. In other words what I know is if I can accept or not the same instance in my collection more than once, not if I will ever feed it with the same item more than once. That is true especially in a multithreaded application as the one you proposed.

    Is it clearer now?

  15. This is pointless[ Go to top ]

    You do not have to use specialized data structures in order to consider parallelism. In the vast majority of cases plain is better and way more efficient. Test case #4 in particular demonstrates an "efficient synchronization" approach where with the use of two distinct HashSets we synchronize on different objects for the addition and retraction procedures so as to achieve maximum performance results.

    The implementation that you refer to here is not efficient in any standard use of the term I highly doubt that it produces maximum performance results.  Are you attempting to implement a work queue?  That would be an interesting thing to test.  You should at the very least test it against all of the implementations of Queue in the JDK.

    http://download.oracle.com/javase/6/docs/api/java/util/Queue.html

    You could then implement Queue using your approach and let other propose their own implementations.

  16. I have to agree with Stephane. In doing your tests, it seems you did not conisder the costs associated with the computation of hash codes. If you look through the code for generating  a hash code for a String object, you would notice that there is substantial overhead in computing the hash for string objects. It would be interesting if you actually compared against a HashSet of integers for example whereby the computation for the hash code is rather straightforward.

    In addition, you have to understand the implementation of a HashSet actually uses a HashMap internally. And in addition, understanding the manner in which a hash collection works, you would understand that there are buckets with the respective values to be stored. In addition, the more elements you store, the more often the buckets have to be grown based on the threshold that was set for the map. On top of this, System.arrayCopy is implemented in native code, whereas the movement and reallocation of buckets is actually done in Java code. That in addition brings about some overhead. 

    I guess it's coz of all these factors that most people are bringing up the point that including sets in your test is not appropriate

  17. So, after James' and others' posts, should we say: Java Best Practices in a multi-thread environment - Vector, ArrayList, HashSet: Noooooo  and java.util.concurrent: YESSSSSS?

  18. So, after James' and others' posts, should we say: Java Best Practices in a multi-thread environment - Vector, ArrayList, HashSet: Noooooo  and java.util.concurrent: YESSSSSS?

    It's not quite that simple.  The concurrent collections minimize thread contention (roughly: blocking) but add significant overhead.  Synchronization in Java is pretty fast.  So there's a tradeoff, you can minimize thread contention by adding execution overhead or you can minimize execution overhead by adding contention.

    Which is better?  The answer (as usual) is that it depends.  If you don't have many threads and you have a small number of short bottlenecks, you are probably going to have better throughput with synchronization.  If you have a large number of threads that must share resources regularly, a concurrent collection may pay for it's overhead.

    What's really challenging is that if you don't know where your code is ultimately going to be deployed, it's very hard to know what the best choice will be.  This is the reason that the more people know about multithreaded the more they tend to conclude that imperative languages (specifically the thread model) are not adequate for the (expected) massively multi-threaded future.

  19. I'm happy to see that you updated case 2.  I'm not sure what changed in test 4 though.