parsing a text file in java

Performance and scalability: parsing a text file in java

  1. parsing a text file in java (18 messages)

    Hi All,
    A very happy new year to you all!

    I have to parse a text file that is about 20 MB in size. Can I use Java for this, or should I use shell scripting?

    I also want to know the maximum size of file that can be parsed with Java. For example, can I parse a text file larger than 20 MB? Would that hurt performance?

    I hope the scenario I am presenting is clear.

    So guys, can you throw some light on this and please let me know the options that would give me optimum performance when parsing a text file larger than 20 MB?

    Thanks a ton in advance.

    KSNP.

    Threaded Messages (18)

  2. what kind of parsing?

    What kind of parsing do you plan to do? Do you plan to use regex?
    Check out http://www.tbray.org/ongoing/When/200x/2004/08/22/PJre for useful information on Java regex and Perl regex.

    Also, garbage collection is something to keep in mind. If you produce hundreds of thousands of tokens from the file you're parsing, then you will need to tune the VM for maximum performance.

    If you're using Unix tools like grep, sed, and awk, you get multi-processor stream parsing for free, because each tool in a pipeline runs as its own process while the output of one is piped to the next.

    It all depends on the complexity of the data you need to parse out and the memory footprint that will present. If you just need to pick out a few lines that match a regex, then use the simplest tool for the job.
  3. Thanks Jason

    Hi Jason,
    Thanks for your quick response. The parsing I am talking about is this:
    there are some text files stored in a folder. Each file could be between 20 and 30 MB, and each line has data delimited by '|'.
    For example, I could have data like

    abc|def|geh|nagendra|jason...

    and there could be around 70,000 lines like this in the file. I have to take each line and get the values of abc, def, geh.

    In this scenario, could you please suggest the optimum approach? As you say, if we use Java we will have to tune the JVM for maximum performance.
    Your valuable inputs could really help me in the design of this approach.

    thanks Jason.

    Cheers.
    Nagendra
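
    The task described in this post can be sketched roughly as follows. This is only a minimal illustration, assuming the files are UTF-8 and that only the leading three fields of each line are needed; the class and method names (PipeFields, firstFields, countCompleteLines) are made up for the example. It scans for '|' with indexOf instead of a regex and reads one line at a time, so memory use does not grow with file size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class PipeFields {

    // Returns up to `count` leading fields of a '|'-delimited line.
    static List<String> firstFields(String line, int count) {
        List<String> fields = new ArrayList<>(count);
        int start = 0;
        while (fields.size() < count) {
            int bar = line.indexOf('|', start);
            if (bar < 0) {
                fields.add(line.substring(start)); // last field on the line
                break;
            }
            fields.add(line.substring(start, bar));
            start = bar + 1;
        }
        return fields;
    }

    // Streams the input one line at a time and counts the lines that have
    // at least three fields; a real job would use the field values here.
    static long countCompleteLines(BufferedReader reader) throws IOException {
        long complete = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (firstFields(line, 3).size() == 3) {
                complete++;
            }
        }
        return complete;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to parse (hypothetical input).
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println("Lines with three fields: " + countCompleteLines(reader));
        }
    }
}
```

    Stopping after the third field avoids creating tokens for the rest of the line, which keeps the garbage collector's workload down.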
  4. For example i could have data like abc|def|geh|nagendra|jason...and there could be around 70,000 lines like this in that file. I have to take each line and get the values of abc, def, geh.

    After you get the values of abc, def, etc., what are you going to do with them?

    Are you going to store them in a relational database?

    If so, you may want to consider one of the ETL (extract, transform, load) tools available for the database you are going to use. I am pretty sure that would be the fastest approach available.

    On the other hand, since JVMs have shipped with JIT compilers for years, a bulk process written in Java for the task you described should perform OK, although the garbage collector will probably work hard to free the memory for all the tokens you are going to create.


    Jose Ramon Huerga

    http://www.terra.es/personal/jrhuerga
  5. Hi Jose,
    Thanks for the response.
    I have to load the data into the Oracle RDBMS after some content validations. We thought of using SQL*Loader to do this. So our approach would be

    Step:- 1
    -----------
    Parse the file using Java and check for all the mandatory data using regular expressions. If the data is valid, go to step 2.

    Step:- 2
    -----------
    Use SQL*Loader to load the data into the database.


    I will go through the link that you have given and I will come back to you if required.

    Cheers
    KSNP.
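
    Step 1 above could look something like the sketch below. The validation rule is an assumption made for illustration (the first three '|'-delimited fields must be non-empty), and the names MandatoryFieldFilter, hasMandatoryFields and copyValidLines are hypothetical. Valid lines are copied to a separate output, which could then be given to the loader in step 2.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Pattern;

final class MandatoryFieldFilter {

    // Assumed rule: the first three '|'-delimited fields must be non-empty.
    // Compiled once and reused for every line.
    private static final Pattern MANDATORY =
            Pattern.compile("[^|]+\\|[^|]+\\|[^|]+(\\|.*)?");

    static boolean hasMandatoryFields(String line) {
        return MANDATORY.matcher(line).matches();
    }

    // Copies valid lines to `out` and returns the number of rejected lines.
    static int copyValidLines(BufferedReader in, PrintWriter out) throws IOException {
        int rejected = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (hasMandatoryFields(line)) {
                out.println(line);
            } else {
                rejected++;
            }
        }
        out.flush();
        return rejected;
    }
}
```

    Compiling the pattern once, rather than calling String.matches on every line, avoids recompiling the regex 70,000 times per file.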
  6. Hi, I want to parse a text file in Java and store the data in an Oracle database. Can you please help me with that?
  7. Parsing library for java

    Have a look at the open source library http://jsapar.tigris.org. It provides a lot of functionality for parsing CSV and flat files into Java objects.
  8. Hi, I want to parse a text file in Java and store the data in an Oracle database. Can you please help me with that?
    I assume you want to extract table data from a text or HTML file? Use this command in biterscripting:

    script SS_WebPageToCSV.txt page("http://www.somesite.com/path/to/html page")

    or

    script SS_WebPageToCSV.txt page("/path/to/local text file")

    I use this script a lot to extract stock market data. Once the data is in CSV or TSV format, you can upload it to Oracle.
  9. I have similar data in a text file, like
    2112\\ABC
    |3000|35|3001||3002|EOR!2000|3000|40..........

    Could you please help me with this scenario?

    Thanks
  10. Parse 'C' file by Java

    Hello people, I have been trying to read a file using Java and scan it for all occurrences of a certain pattern, returning each match with its line number. For example, I read a C source file as input to my Java program, which searches for all the function calls made in the C file. How would I do that? Thanks, Ashu
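
    One rough way to approach this question is a line-by-line regex scan. The sketch below is a heuristic only, not a C parser: it treats any identifier followed by '(' as a call, so it also reports function definitions, declarations, macro invocations, and matches inside comments or string literals. The class name CallFinder and the keyword list are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CallFinder {

    // An identifier followed by optional whitespace and an opening parenthesis.
    private static final Pattern CALL =
            Pattern.compile("\\b([A-Za-z_]\\w*)\\s*\\(");

    // C keywords that can precede '(' without being a function call.
    private static final Set<String> KEYWORDS = new HashSet<>(
            Arrays.asList("if", "for", "while", "switch", "return", "sizeof"));

    // Returns one "lineNumber:name" entry per match, in file order.
    static List<String> findCalls(BufferedReader reader) throws IOException {
        List<String> hits = new ArrayList<>();
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            Matcher m = CALL.matcher(line);
            while (m.find()) {
                String name = m.group(1);
                if (!KEYWORDS.contains(name)) {
                    hits.add(lineNumber + ":" + name);
                }
            }
        }
        return hits;
    }
}
```

    For accurate results on real C code, a proper C parser or a tool such as ctags or cscope would be a better fit than a regex.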
  11. parsing a text file in java

    I would never use Java for this; I would do it in shell with sed and awk. It would run at blazing speed. I did similar work long ago with files of up to 1 GB (you read that right, 1 GB :-). It just blazes compared to any Java implementation.

    -Sanjay
  12. thanks Ganesh

    Hi Ganesh,
    Thanks for your response. I will have a final discussion with my folks, and I am sure we will go with the Unix approach.


    thanks and Regards
    Nagendra
  13. i've done this in java

    A year or so ago I had to work with CSV files averaging 200-300 MB and load the data into a database (some logic had to be checked, so it had to be Java). What I did was read every line and split it with a regex ( String.split("") ). It wasn't slow, and you don't consume all the memory because it works line by line.

    My advice is to run a small test.

    (Javolution may help)
  14. parsing a text file in java


    import org.apache.commons.lang.RandomStringUtils;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.PrintStream;
    import java.util.Random;

    // Writes 700,000 lines of 10 random alphanumeric fields separated by '|'.
    public class GenerateFile {

        public static void main(String[] args) throws Exception {
            File file = new File("hugefile.txt");
            Random random = new Random(10); // fixed seed for repeatable field lengths

            // try-with-resources closes the stream even if writing fails
            try (PrintStream ps = new PrintStream(new FileOutputStream(file))) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 700000; i++) {
                    for (int j = 0; j < 10; j++) {
                        // each field is 3 to 12 characters long
                        sb.append(RandomStringUtils.random(3 + random.nextInt(10), true, true));
                        if (j < 9) {
                            sb.append("|");
                        }
                    }
                    ps.println(sb.toString());
                    sb.setLength(0); // reuse the builder for the next line
                }
            }
        }
    }



    import org.apache.commons.lang.StringUtils;
    import org.apache.commons.lang.time.StopWatch;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    // Reads hugefile.txt line by line, splits each line on '|', and times the run.
    public class TokenizeFile {

        public static void main(String[] args) throws Exception {
            File file = new File("hugefile.txt");
            long totalLinesProcessed = 0L;
            StopWatch stopWatch = new StopWatch();

            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(new FileReader(file))) {
                stopWatch.start();
                String line;
                while ((line = br.readLine()) != null) {
                    totalLinesProcessed++;
                    // tokens are discarded; only the cost of splitting is measured
                    StringUtils.split(line, "|");
                }
                stopWatch.stop();
            }

            System.out.println("Total lines processed = " + totalLinesProcessed
                    + " Time taken = " + stopWatch.getTime() + " ms");
        }
    }

    If you run the two programs above, you will generate and then process a 55 MB file.

    A sample run results:
    Total lines processed = 700000 Time taken = 4457 ms (1.6 GHz, 512 MB RAM)

    So Java is not a bad option if a run time under 5 seconds is acceptable.
  15. parsing a text file in java

    Hi Kishore, Jason and Ganesh,
    Thanks for your valuable time and inputs. You guys have definitely shown me the way to go about this.

    I appreciate your timely help, and I will be posting some more questions in the near future. Yesterday, when I had a discussion with my folks, they were talking about SQL*Loader. Could any of you throw some light on the SQL*Loader approach?


    Thanks and Regards
    Nagendra
  16. Can anybody help me convert an HTML file into a CSV or Excel file using Java? I have no idea how to proceed. Are there any third-party APIs available? Please guide me. Thanks in advance, Vivek
  17. Hi All, I am trying to parse a text file with pipe-delimited values, e.g. data1|data2|data3|... When I use a Scanner and set the delimiter with scanner.useDelimiter("|"), it does not treat the pipe as a delimiter. The same approach works with #-delimited values. Why is that, and is there any way to parse these values? I need to insert them into a database after parsing them. Any help is highly appreciated; I need this urgently. Thank you so much. Lakshmi.
  18. scanner.useDelimiter takes a regular expression as parameter and the | character has a special meaning (logical OR between expression parts). In order to use the | character as delimiter you have to escape it like this:

    scanner.useDelimiter("\\|")

    /J

  19. The editor can strip backslash characters from a post. To be explicit: the '|' character must be preceded by a double backslash inside the quoted string.

    /J
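
    For reference, the escaping advice in the last two posts can be tried with a small sketch like this one. The class name PipeScanner is made up for the example. It uses Pattern.quote to turn the '|' into a literal delimiter; writing the delimiter as a double backslash followed by '|' in the string literal should behave the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class PipeScanner {

    // Splits the input on literal '|' characters using a Scanner.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        // Pattern.quote makes the '|' a literal instead of regex alternation.
        try (Scanner scanner = new Scanner(input).useDelimiter(Pattern.quote("|"))) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("data1|data2|data3")); // prints [data1, data2, data3]
    }
}
```

    The '#' delimiter mentioned in the question works unescaped because '#' has no special meaning in a Java regular expression by default.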