Looking for the fastest open source library to parse CSV

  1. May 16, 2015 #1
    I would like to read and process huge CSV files with millions of records, collected from communications carriers' networks.

    Here is my process, simplified:
    1) Fetch the CSV file over FTP.
    2) Parse the CSV file and apply my own logic, such as combining records and removing duplicates.
    3) Store the parsed data in a MySQL database.
    4) Run the analysis against the MySQL database.
    5) Generate reports in Excel, PowerPoint, and Word formats.
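
    A minimal sketch of step 2, using only the Java standard library. It streams the input one line at a time and keeps the first record seen for each key, so memory grows with the number of unique keys rather than with file size. The class name `CsvDedupSketch`, the choice of column 0 as the key, and the "first record wins" rule are assumptions for illustration, not the poster's actual business logic. The hand-written `parseLine` is only a stand-in for a real parser such as uniVocity-parsers, and it assumes quoted fields never contain line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of any library named in the thread.
class CsvDedupSketch {

    // Splits one CSV line into fields, honoring double-quoted fields and "" escapes.
    // Assumption: a record never spans more than one line.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Streams records from the source and keeps the first record seen per key.
    // Lines that are empty or too short to contain the key column are skipped.
    static Map<String, List<String>> dedupe(Reader source, int keyColumn) throws IOException {
        Map<String, List<String>> unique = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                List<String> record = parseLine(line);
                if (keyColumn < record.size()) {
                    unique.putIfAbsent(record.get(keyColumn), record);
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: the third line duplicates the first.
        String sample = "id1,cellA,10\nid2,\"cell,B\",20\nid1,cellA,10\n";
        Map<String, List<String>> result = dedupe(new StringReader(sample), 0);
        System.out.println(result.size() + " unique records");
        System.out.println(result.get("id2"));
    }
}
```

    For a real file, the `StringReader` would be replaced by a reader over the downloaded file, and the surviving records would then be written to MySQL in batches (step 3).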

    We currently use the JavaCSV library, but it is not fast enough for my project. The fastest library I have found so far is uniVocity-parsers.

    Do you have any other suggestions? Thanks.
  3. May 16, 2015 #2


    Is there any reason you can't do the entire process through Excel?
  4. May 16, 2015 #3
    Thanks for your quick reply.
    Sorry, but step 2 has to run my own business logic in code. That logic is complex and interacts with other business modules in my system.
  5. May 16, 2015 #4


    The maximum number of rows in MS Excel is 1,048,576. OpenOffice and LibreOffice have the same limit. LibreOffice tends to crash if you try to graph very large datasets. Excel handles that better but is still rather slow. I don't know how OpenOffice or other spreadsheets respond to very large datasets.
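
    As a rough illustration of that limit, the sketch below counts the lines of a CSV source in a single streaming pass and compares the count with 1,048,576. The class and method names are hypothetical, and it assumes one record per line (no line breaks inside quoted fields), so treat the result as an estimate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for illustration only.
class SpreadsheetLimitCheck {

    // Per-sheet row limit quoted in the post above.
    static final long MAX_SHEET_ROWS = 1_048_576L;

    // Counts lines without holding the file in memory.
    // Assumption: each line is exactly one record.
    static long countRows(Reader source) throws IOException {
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) {
                rows++;
            }
        }
        return rows;
    }

    // True when the given number of rows fits in a single sheet.
    static boolean fitsInOneSheet(long rows) {
        return rows <= MAX_SHEET_ROWS;
    }

    public static void main(String[] args) throws IOException {
        // Made-up three-line sample; a real check would read the actual file.
        long rows = countRows(new StringReader("a,1\nb,2\nc,3\n"));
        System.out.println(rows + " rows, fits in one sheet: " + fitsInOneSheet(rows));
    }
}
```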

    Once you get to tens or hundreds of thousands of rows, spreadsheets tend to be suboptimal for any kind of data analysis. I have heard that R is good for that sort of thing, but I have never learned to use it properly, so I can't give a useful report on it.

  6. May 16, 2015 #5


