Looking for the fastest open source library to parse CSV

In summary, the poster wants to process large CSV files with millions of records collected from communications carriers' networks. They currently use JavaCSV but are looking for a faster alternative; the quickest library they have found is uniVocity-parsers. They also need to apply their own business logic, which rules out doing everything in Excel, and they plan to load the parsed data into MySQL for analysis and reporting.
  • #1
Jerry Sivan
I would like to read and process huge CSV files with millions of records, which we collected from the communications carriers' network. Here is my logic as a simplified procedure:
1) Read the CSV file with FTP protocol
2) Parse the CSV file with my own logic, such as combination, duplicates deletion and so on.
3) Store the parsed data into the MySQL database.
4) Do analysis based on the MySQL database.
5) Generate reports as Excel, PPT, and Word documents.

Currently we are using the JavaCSV library, but it isn't fast enough for my project. The fastest library I have found so far is uniVocity-parsers. Do you have any other suggestions? Thanks. (A rough sketch of steps 1–3 follows below.)
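For concreteness, here is a minimal sketch of what steps 1–3 could look like using uniVocity-parsers' streaming API, Apache Commons Net for the FTP read, and plain JDBC batch inserts into MySQL. The host, credentials, file path, table layout, and the skip-duplicate-record-IDs rule are all placeholders standing in for the real business logic:

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import org.apache.commons.net.ftp.FTPClient;

    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.HashSet;
    import java.util.Set;

    public class CdrImport {
        public static void main(String[] args) throws Exception {
            FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example.com");            // placeholder host
            ftp.login("user", "password");             // placeholder credentials
            ftp.enterLocalPassiveMode();

            CsvParserSettings settings = new CsvParserSettings();
            settings.setHeaderExtractionEnabled(true); // skip the header row
            CsvParser parser = new CsvParser(settings);

            Set<String> seenIds = new HashSet<>();     // stand-in for the real dedup logic

            try (InputStream in = ftp.retrieveFileStream("/cdr/records.csv"); // placeholder path
                 Connection db = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/cdr", "user", "password");
                 PreparedStatement insert = db.prepareStatement(
                         "INSERT INTO records (id, caller, callee, duration) VALUES (?, ?, ?, ?)")) {

                parser.beginParsing(new InputStreamReader(in, StandardCharsets.UTF_8));
                String[] row;
                int batched = 0;
                while ((row = parser.parseNext()) != null) {
                    if (!seenIds.add(row[0])) {
                        continue;                      // duplicate record ID, skip it
                    }
                    insert.setString(1, row[0]);       // placeholder column layout
                    insert.setString(2, row[1]);
                    insert.setString(3, row[2]);
                    insert.setString(4, row[3]);
                    insert.addBatch();
                    if (++batched % 10_000 == 0) {
                        insert.executeBatch();         // flush every 10k rows
                    }
                }
                insert.executeBatch();                 // flush the remainder
            } finally {
                ftp.completePendingCommand();
                ftp.logout();
                ftp.disconnect();
            }
        }
    }

Because parseNext() streams one row at a time, memory use stays flat no matter how large the file is, and batching the inserts avoids a MySQL round trip per record.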
 
  • #2
Is there any reason you can't do the entire process through Excel?
 
  • #3
russ_watters said:
Is there any reason you can't do the entire process through Excel?
Thanks for your quick reply.
Sorry, but in step 2 I need to apply my own business logic, which is complex and interacts with other business modules in my system.
 
  • #4
The maximum number of rows in MS Excel is 1,048,576, which is also the limit in OpenOffice/LibreOffice. LibreOffice tends to get crashy if you try to graph very large datasets; Excel handles that better but is still rather slow. I don't know how OpenOffice or other spreadsheets respond to very large datasets.

Once you get into the tens to hundreds of thousands of rows, spreadsheets tend to be suboptimal for any kind of data analysis. I have heard that R is good for that sort of thing, but I have never learned to use it properly, so I can't give a useful report on it.

BoB
 

1. What is the fastest open source library for parsing CSV files?

There is no single fastest library for every workload, but in widely cited Java benchmarks (such as uniVocity's csv-parsers-comparison project) uniVocity-parsers is consistently among the fastest open source CSV parsers. Apache Commons CSV is a popular and well-maintained alternative, though it generally trades some raw speed for a simpler API.

2. How does Apache Commons CSV compare to other open source libraries for parsing CSV?

Apache Commons CSV is fast enough for most workloads, but it is not the fastest option available. Benchmarks have typically shown specialized parsers such as uniVocity-parsers outperforming it on large files. Its main strengths are stability, thorough documentation, and a clean API rather than raw throughput.

3. Is Apache Commons CSV easy to use?

Yes. Apache Commons CSV is known for its straightforward, user-friendly API, and it has comprehensive documentation and an active community for support.

4. Can Apache Commons CSV handle large CSV files?

Yes, Apache Commons CSV is designed to handle large CSV files efficiently. It uses a streaming approach, which means it does not load the entire file into memory at once, making it suitable for large datasets.
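For example, a minimal read loop with Commons CSV (assuming version 1.9+ for the builder API; the file name and the "id" column are placeholders) pulls one record at a time instead of loading the whole file:

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            CSVFormat format = CSVFormat.DEFAULT.builder()
                    .setHeader()                  // take column names from the first row
                    .setSkipHeaderRecord(true)
                    .build();
            // try-with-resources closes the file; records are read lazily
            try (Reader reader = Files.newBufferedReader(Paths.get("data.csv"));
                 CSVParser parser = new CSVParser(reader, format)) {
                for (CSVRecord record : parser) { // each iteration pulls the next row on demand
                    System.out.println(record.get("id")); // "id" is a placeholder column
                }
            }
        }
    }

The parser implements Iterable<CSVRecord>, so the for-each loop streams rows from the underlying Reader rather than buffering the file in memory.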

5. Is Apache Commons CSV compatible with different programming languages?

No. Apache Commons CSV is a Java library, so it runs on the JVM and is used from Java or other JVM languages such as Kotlin and Scala; other ecosystems have their own CSV tools (for example, Python's built-in csv module). It does, however, ship with predefined formats such as CSVFormat.EXCEL, CSVFormat.TDF, and CSVFormat.MYSQL, and lets you configure delimiters, quoting, and headers for custom dialects.
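As an illustration, a custom dialect (again assuming commons-csv 1.9+ for the builder API; the semicolon delimiter is just an example) is configured through CSVFormat:

    import org.apache.commons.csv.CSVFormat;

    public class Dialects {
        public static void main(String[] args) {
            // Custom dialect: semicolon-delimited, double-quoted fields (example values).
            CSVFormat semicolons = CSVFormat.DEFAULT.builder()
                    .setDelimiter(';')
                    .setQuote('"')
                    .build();

            // Predefined dialects ship with the library:
            CSVFormat tsv = CSVFormat.TDF;         // tab-separated values
            CSVFormat mysqlDump = CSVFormat.MYSQL; // MySQL LOAD DATA / SELECT INTO format

            System.out.println(semicolons + " / " + tsv + " / " + mysqlDump);
        }
    }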
