I've been tasked with designing a J2EE application that is supposed to perform the following functions...
2. Clients submit very large document collection files, ranging from roughly 100MB to 2.5GB and possibly even larger.
3. Server-side code takes this document collection and splits it into constituent document files. The individual documents range from a few KB to several MB. As the document collection is split, parsing is done to capture identifying key information from within the documents.
3. The documents are stored on magneto-optical disks.
4. The identifying key information for each document and the information necessary to retrieve the document from magneto-optical disk (disk number, path) are stored in an RDBMS.
5. Clients submit requests for documents by key.
6. Each day, a process is run that retrieves documents from magneto-optical disks and stores them in a temporary directory.
7. Clients view and/or print the documents from the temporary directory after they are retrieved. Documents are never altered after initial submission.
8. The documents in the temporary directory are deleted after a period of time.
My initial idea is to have a thick client submit the document collections to a session bean, which uses a helper class or resource adapter to split the documents. I'm not sure what the best way is to transfer such large files to EJB components, or if this is even a viable way to do things. I suppose it would be possible to have the document collections split into individual documents in the client, and transfer each document to the EJB components.
For client viewing of the documents, I'm thinking of constructing a web application that allows users to request a document and see the status of pending requests that they have made. Once a document is available, they can select it and view it. Can you use file I/O (OutputStream?) in session beans to transfer the documents to JSPs for viewing, or do you have to use resource adapters to do this?
You can try using a zip stream to transfer data from the client to the server. Especially if it's a text file, the compression ratio you'd get will be really good, something like 80-95% depending on the data.
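To make the zip stream idea concrete, here is a minimal sketch using the `java.util.zip` GZIP streams from the JDK. It assumes a reasonably recent JDK (try-with-resources), and the class name `GzipTransferSketch` and the in-memory byte arrays are just stand-ins for illustration; in a real transfer the `GZIPOutputStream` would wrap whatever stream goes to the server, and the data would come from a file rather than an array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical class name, for illustration only.
class GzipTransferSketch {

    // Copies in to out through a small buffer, so the whole file
    // never has to sit in memory at once.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Client side: push the raw bytes through a GZIPOutputStream.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            copy(new ByteArrayInputStream(raw), gz);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Server side: read the same bytes back through a GZIPInputStream.
    static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            copy(gz, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up, highly repetitive text; real documents will compress differently.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            text.append("INVOICE 000123 CUSTOMER ACME TOTAL 42.00\n");
        }
        byte[] raw = text.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes compressed");
    }
}
```

The actual ratio depends entirely on your data, so it's worth measuring on a real collection file before counting on it.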
As for splitting, I don't know what kind of data you have. Can it be split at any location? If so, you can use a paging policy and split the files with a well-defined page size. This would make locating any segment much faster, and the segments would be easy to store in the DB too.
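Here's a rough sketch of the arithmetic behind a fixed page size, assuming the file really can be cut at arbitrary byte positions (if it has to be cut on document boundaries, this doesn't apply directly). The class name `PagingSketch` and the 64MB page size are made up for the example.

```java
// Hypothetical helper, for illustration only.
class PagingSketch {

    // Number of pages needed to cover totalBytes (the last page may be short).
    static long pageCount(long totalBytes, int pageSize) {
        return (totalBytes + pageSize - 1) / pageSize;
    }

    // Byte offset in the original file where a given page starts.
    static long pageOffset(long pageIndex, int pageSize) {
        return pageIndex * pageSize;
    }

    // Length of a given page; only the last page can be shorter than pageSize.
    static int pageLength(long totalBytes, int pageSize, long pageIndex) {
        long remaining = totalBytes - pageOffset(pageIndex, pageSize);
        return (int) Math.min(pageSize, remaining);
    }

    // Which page holds a given byte offset: this is what makes lookup cheap.
    static long pageContaining(long byteOffset, int pageSize) {
        return byteOffset / pageSize;
    }

    public static void main(String[] args) {
        int pageSize = 64 * 1024 * 1024;              // assumed page size
        long totalBytes = 2_684_354_560L + 1;         // 2.5GB plus one byte
        long pages = pageCount(totalBytes, pageSize);
        System.out.println("pages: " + pages);
        System.out.println("last page length: " + pageLength(totalBytes, pageSize, pages - 1));
    }
}
```

With a fixed page size, the page number alone is enough to find a segment, so the DB only needs to store the page index and where that page lives.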
Hope this helps.
You might as well break the stream up at the client itself; if the network fails, then at least part of the file has already been uploaded.
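Something along these lines, as a sketch only: the client sends numbered chunks and remembers where it stopped, so a retry picks up from the failed chunk instead of starting over. The `ChunkSink` interface is hypothetical and stands in for whatever transport you end up with (a servlet POST, a remote EJB call, etc.). For files this large you'd read each chunk from disk rather than hold a `byte[]` of the whole thing; the array just keeps the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical class name, for illustration only.
class ChunkedUploadSketch {

    // Hypothetical transport: one call per chunk.
    interface ChunkSink {
        void send(int index, byte[] chunk) throws IOException;
    }

    // Sends chunks starting at firstChunk and returns the index of the next
    // chunk still to be sent. A return value equal to the total chunk count
    // means the upload finished; anything smaller is the place to resume from.
    static int upload(byte[] data, int chunkSize, int firstChunk, ChunkSink sink) {
        int total = (data.length + chunkSize - 1) / chunkSize;
        for (int i = firstChunk; i < total; i++) {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, data.length);
            try {
                sink.send(i, Arrays.copyOfRange(data, from, to));
            } catch (IOException e) {
                return i; // this chunk failed; earlier chunks are already on the server
            }
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        ByteArrayOutputStream server = new ByteArrayOutputStream();
        boolean[] failOnce = {true};
        // Simulated sink that drops the connection once, on chunk 1.
        ChunkSink sink = (index, chunk) -> {
            if (index == 1 && failOnce[0]) {
                failOnce[0] = false;
                throw new IOException("simulated network failure");
            }
            server.write(chunk, 0, chunk.length);
        };
        int next = upload(data, 4, 0, sink);   // stops at the failed chunk
        next = upload(data, 4, next, sink);    // resumes from there
        System.out.println("complete: " + Arrays.equals(data, server.toByteArray()));
    }
}
```

The server side would need to track which chunks it has received and stitch them together in order, which this sketch leaves out.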
Your requirements suggest a document management system. Don't waste time and money; buy one off the shelf, e.g. Documentum.