Looking for the fastest open source library to parse CSV


Discussion Overview

The discussion revolves around finding the fastest open source library for parsing large CSV files, specifically for processing millions of records collected from communications carriers' networks. Participants explore various approaches to reading, parsing, and analyzing CSV data, as well as alternatives to the current library being used.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant outlines a multi-step process for handling CSV files, including reading via FTP, custom parsing, storing in MySQL, and generating reports, while expressing dissatisfaction with the current library, JavaCSV, and suggesting uniVocity-parsers as a potential alternative.
  • Another participant questions the necessity of using custom programming logic, suggesting that the entire process could be handled through Excel.
  • A follow-up response reiterates the need for custom business logic in the parsing step, indicating that the complexity of the logic is a barrier to using Excel.
  • One participant points out the limitations of Excel and LibreOffice when dealing with large datasets, noting performance issues and suggesting that tools like "R" could be better suited for data analysis, although they admit to not being proficient in it.
  • Another participant recommends MySQL's LOAD DATA INFILE command as a more efficient way to import CSV data, suggesting that SQL could then be used for data-manipulation tasks such as removing duplicates.
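The LOAD DATA approach mentioned in the last point might look roughly like this; the file path, table names, and columns are hypothetical, and the exact clauses would depend on the CSV dialect:

```sql
-- Bulk-load the raw CSV into a staging table (path and table are hypothetical)
LOAD DATA INFILE '/tmp/records.csv'
INTO TABLE raw_records
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

-- Then deduplicate in SQL, e.g. keep only distinct rows
CREATE TABLE records AS
SELECT DISTINCT * FROM raw_records;
```

LOAD DATA bypasses row-by-row inserts, which is typically much faster than inserting parsed rows one at a time from application code.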

Areas of Agreement / Disagreement

Participants express differing views on the best approach to handle large CSV files, with no consensus on a single solution. Some advocate for using specialized libraries, while others suggest leveraging existing tools like Excel or MySQL commands.

Contextual Notes

Limitations include the maximum row count in Excel and LibreOffice, which may hinder analysis of large datasets. The discussion also highlights the need for custom business logic, which complicates the use of simpler tools.

Jerry Sivan
I would like to read and process huge CSV files with millions of records, which we collect from the communications carriers' network. Here is my logic as a simplified procedure:
1) Read the CSV file with FTP protocol
2) Parse the CSV file with my own logic, such as combination, duplicates deletion and so on.
3) Store the parsed data into the MySQL database.
4) Do analysis based on the MySQL database.
5) Generate reports as Excel, PPT, and Word documents.

Currently we are using the JavaCSV library, but it's not good enough for my project. The fastest library I could find recently is uniVocity-parsers. Do you have any other suggestions? Thanks.
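Step 2 above (custom parsing with duplicate removal) can be sketched with only the Java standard library. This is a minimal illustration, not the JavaCSV or uniVocity-parsers API: the naive comma split does not handle quoted fields (that is precisely what a real CSV parser library handles), and the class name and key column are made up for the example.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    // Streams CSV lines, keeping only the first occurrence of each record key.
    // Processing line by line avoids loading millions of records into memory at once.
    static List<String[]> parseAndDedup(BufferedReader in) throws Exception {
        Set<String> seen = new LinkedHashSet<>();
        List<String[]> out = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            // Naive split: a real CSV parser would handle quoting and escaping.
            String[] fields = line.split(",", -1);
            String key = fields[0]; // assume column 0 identifies a record
            if (seen.add(key)) {    // add() returns false for duplicates
                out.add(fields);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String csv = "id,value\n1,a\n2,b\n1,c\n";
        List<String[]> rows = parseAndDedup(new BufferedReader(new StringReader(csv)));
        System.out.println(rows.size()); // header row + two unique ids -> 3
    }
}
```

A library parser would replace the split() call, but the streaming-plus-set structure is the part that keeps memory bounded when the business logic (combination, dedup) runs per record.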
 
Is there any reason you can't do the entire process through Excel?
 
russ_watters said:
Is there any reason you can't do the entire process through Excel?
Thanks for your quick reply.
Sorry, but in step 2 I need to program my own business logic, which is complex and interacts with other business modules in my system.
 
The maximum number of rows you can have in MS Excel is 1,048,576. This is also the limit in OpenOffice/LibreOffice. LibreOffice tends to crash if you try to graph very large datasets; Excel handles that better but is still rather slow. I don't know how OpenOffice or other spreadsheets respond to very large datasets.

Once you get to tens or hundreds of thousands of rows, spreadsheets tend to be suboptimal for any kind of data analysis. I have heard that "R" is good for that sort of thing, but I have never learned to use it properly, so I can't give a useful report on it.

BoB
 
