I have a requirement to read a huge XML file (100+ MB), parse it with StAX record by record (the record boundaries are defined by the business logic) into an intermediate data structure (a HashMap), and then write the contents of that structure to a file.
What would be the optimum solution to this?
Should I use an array of HashMaps as the intermediate structure, with one thread parsing a record from the XML and putting it into the structure while another thread reads from the structure and writes to the file? The constraint is that the method doing this work may return only once all the data in the XML has been written to the file. The method is invoked from a web application, and I cannot move it to the background for the time being.
Further, should I use memory mapped files (java.nio) when reading the XML file and writing to the output file?
Is there a way I can monitor the memory usage before invoking this method, midway through the method and at the end of the method?
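For the memory question, one simple option is to sample the JVM heap through `java.lang.Runtime` before, during and after the call. The sketch below is only an illustration: the class name `MemorySnapshot` and its methods are made up for this example, and the figures are approximate because they depend on when the garbage collector last ran.

```java
final class MemorySnapshot {
    /** Heap currently in use by the JVM, in bytes (approximate; depends on GC timing). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary suitable for a log statement. The label is free text. */
    static String describe(String label) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        return String.format("%s: used=%d MB, committed=%d MB, max=%d MB",
                label, usedHeapBytes() / mb, rt.totalMemory() / mb, rt.maxMemory() / mb);
    }

    public static void main(String[] args) {
        System.out.println(describe("before"));
        byte[] work = new byte[16 * 1024 * 1024]; // stand-in for the real parsing work
        System.out.println(describe("midway"));
        work = null;                              // make the buffer eligible for collection
        System.out.println(describe("after"));
    }
}
```

Calling `describe(...)` at the start of the method, every N records, and just before it returns gives a rough picture of how the heap behaves. A profiler or JMX console will give more reliable numbers.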
With all that reading from and writing to disk, the way you store the data in memory probably isn't going to be significant. I'm not sure why you would use HashMaps (though you could). I would be inclined to use a FIFO, and pause the StAX parser when/if the FIFO depth reaches a certain value.
I haven't used the 'nio' classes, so I can't comment on whether they would help in this situation. I'm not aware that any of the usual XML libraries use them, which suggests to me that they may not make a significant difference for this application.
The map is necessary due to the way the XML is structured and what information needs to be extracted from it. Let's just say that there is no proper demarcation in the XML that corresponds to a line in the output CSV file.
As of now, I am going ahead with a producer-consumer approach, where the parser puts records (HashMaps) into a blocking queue and the consumer (the output file writer) takes them from the queue and writes them to the file.
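A minimal sketch of that producer-consumer arrangement, under assumptions that are not from the real project: records are taken to be delimited by a single element (passed as `recordElement`), each child element is a flat text field, and the output is naive CSV without quoting. The class and method names are hypothetical. The bounded queue makes the parser wait when the writer falls behind, and `convert` returns only after the writer has finished.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

final class XmlToCsvPipeline {
    // Sentinel ("poison pill") telling the writer that no more records will arrive.
    private static final Map<String, String> END = new HashMap<>();

    /** Parses on the calling thread, writes on a second thread, returns when all output is written. */
    static void convert(Reader xml, Writer out, String recordElement,
                        List<String> columns, int queueCapacity) throws Exception {
        BlockingQueue<Map<String, String>> queue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Void> task = () -> { drain(queue, out, columns); return null; };
            Future<Void> writer = pool.submit(task);

            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            Map<String, String> record = null;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (r.getLocalName().equals(recordElement)) {
                        record = new HashMap<>();
                    } else if (record != null) {
                        // Assumes text-only fields; getElementText() fails on nested elements.
                        String name = r.getLocalName();
                        record.put(name, r.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && r.getLocalName().equals(recordElement)) {
                    put(queue, record, writer);
                    record = null;
                }
            }
            put(queue, END, writer);
            writer.get(); // wait for the writer; rethrows its failure, if any
        } finally {
            pool.shutdownNow(); // also interrupts the writer if parsing failed
        }
    }

    // Blocks while the queue is full, but gives up if the writer thread has already died.
    private static void put(BlockingQueue<Map<String, String>> queue, Map<String, String> item,
                            Future<?> writer) throws Exception {
        while (!queue.offer(item, 100, TimeUnit.MILLISECONDS)) {
            if (writer.isDone()) {
                writer.get(); // throws ExecutionException carrying the writer's error
            }
        }
    }

    // Consumer: one CSV line per record, columns in the given order, no escaping.
    private static void drain(BlockingQueue<Map<String, String>> queue, Writer out,
                              List<String> columns) throws Exception {
        for (Map<String, String> rec = queue.take(); rec != END; rec = queue.take()) {
            StringJoiner line = new StringJoiner(",");
            for (String column : columns) {
                line.add(rec.getOrDefault(column, ""));
            }
            out.write(line + "\n");
        }
        out.flush();
    }
}
```

Real record boundaries driven by business logic would replace the `recordElement` checks. The queue capacity is the knob that caps how many records sit in memory at once.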
Is there any reason you can't transform it to CSV with a stylesheet, or a series of stylesheets? Just a thought.
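To make the stylesheet idea concrete, here is a rough sketch using the JDK's `javax.xml.transform` API. The stylesheet, the `record`/`id`/`name` element names and the class `XsltToCsv` are invented for illustration and would have to match the real document. One caveat: XSLT 1.0 processors generally load the whole input into memory before transforming, so this may not suit a 100+ MB file as well as a streaming parser does.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltToCsv {
    // Hypothetical stylesheet: one CSV line per <record>, columns "id" and "name".
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"//record\">"
        + "<xsl:value-of select=\"id\"/><xsl:text>,</xsl:text>"
        + "<xsl:value-of select=\"name\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML input and writes plain-text CSV to the given writer. */
    static void transform(Reader xml, Writer csv) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.transform(new StreamSource(xml), new StreamResult(csv));
    }
}
```

If one output line has to be assembled from elements scattered across the document, the XPath expressions get more involved, but the Java side stays the same.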