Should I rewrite my modules in order to implement json/pickle?

  • Context: Python 
  • Thread starter: Eclair_de_XII
  • Tags: Modules

Discussion Overview

The discussion considers whether existing Python modules should be rewritten to use JSON or pickle for data persistence. Participants explore the implications of such a decision, including efficiency, readability, and the structure of data storage.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Conceptual clarification

Main Points Raised

  • One participant expresses reluctance to rewrite a large module, questioning whether loading objects from data-persistence files would be faster or slower than from normal text files.
  • Another participant suggests that loading time differences would likely be negligible and emphasizes the importance of having a separate data layer for easier refactoring.
  • It is noted that CSV and TXT files are more human-readable, which can aid in debugging and understanding data storage.
  • A participant outlines various factors influencing the choice of file format, including data complexity, update frequency, and access needs.
  • One contributor clarifies the structural differences between .txt, .csv, and .json file types, emphasizing that the choice should depend on portability and data structure.
  • A later reply indicates that the participant has stored auxiliary data in .csv files, suggesting a potential advantage in rewriting the code to utilize structured formats.

Areas of Agreement / Disagreement

Participants do not reach a consensus on whether rewriting the modules is advisable. Multiple competing views remain regarding the benefits and drawbacks of using JSON or pickle versus traditional text files.

Contextual Notes

Participants highlight various considerations such as the complexity of data, the need for human readability, and the potential for introducing bugs when reinventing data structures. There is also mention of the importance of understanding MIME types and their implications for data handling.

Who May Find This Useful

This discussion may be useful for programmers considering data storage options in Python, particularly those weighing the trade-offs between different file formats and data structures.

Eclair_de_XII
TL;DR
Basically, I wrote a bunch of code that web-scraped data off the internet into some objects, and stored the objects by writing them into .csv and .txt files. Is there any advantage to storing them in .json or .pkl files instead? I am vaguely aware that the code to store objects in .json or .pkl is much less complicated than the code I wrote for .csv and .txt. But, for example, is loading objects from .json or .pkl files faster?
I'm figuratively beating myself up for not knowing about these modules when I wrote my own. On one hand, it feels like a giant hassle to rewrite them just to use some library module. One of these modules is over a thousand lines long, which might be inconsequential to professional coders but is certainly an accomplishment for an amateur. Rewriting that module alone makes me dread a task that might improve how efficiently it runs; a silly concern, I should think, considering Python programs generally run slower than programs written in compiled languages such as C, C++, or C#. On the other hand, it feels like good practice to get into the habit of using efficient data-storage techniques.

I am aware that I've made it clear that I definitely do not want to go through with this task, and I am aware that no one is forcing me to do so except for myself. But my question remains: Would loading objects from data-persistence files be faster or slower than loading them from normal text files?
 
It wouldn’t be materially different in loading time. JSON might be a good choice if you choose a sensible structure because you could use the data in another program without having to work out how to read it.

Writing a 1,000-line module that you dread refactoring is a very bad idea. Reading and writing data should have been in a separate ‘data layer’ module that could be refactored easily.

Study abstraction and data models so you do this better next time.
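The 'data layer' idea above can be sketched like this (a minimal example; the module and function names are hypothetical, and it assumes the scraped records are JSON-serializable):

```python
# data_layer.py -- hypothetical module isolating all persistence code.
# The rest of the program calls save_records()/load_records() and never
# touches file formats directly, so switching CSV -> JSON -> pickle later
# means editing only this one file.
import json
from pathlib import Path

def save_records(records, path):
    """Write a list of dicts to a JSON file."""
    Path(path).write_text(json.dumps(records, indent=2))

def load_records(path):
    """Read the list of dicts back."""
    return json.loads(Path(path).read_text())

records = [{"symbol": "AAPL", "price": 172.5}]
save_records(records, "scraped.json")
assert load_records("scraped.json") == records
```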
 
Eclair_de_XII said:
Would loading objects from data-persistence files be faster or slower than loading them from normal text files?
Probably not enough to make it worth the effort to rewrite your module to use them.

There are also other considerations besides speed. CSV and TXT files are easier for humans to read (particularly CSV if the humans have a spreadsheet program handy that they can load the CSV file into). Often being able to read your data files independently of your program is very helpful for debugging. It's also helpful for others to understand what data your program is storing.
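For illustration (a minimal sketch with made-up data), the readability difference is easy to see by writing the same rows both ways:

```python
import csv
import pickle

rows = [{"name": "proton", "mass_mev": 938.272},
        {"name": "electron", "mass_mev": 0.511}]

# CSV: opens in any spreadsheet, diffable, inspectable in a text editor.
with open("particles.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "mass_mev"])
    writer.writeheader()
    writer.writerows(rows)

# Pickle: round-trips arbitrary Python objects, but the file is binary.
with open("particles.pkl", "wb") as f:
    pickle.dump(rows, f)

print(open("particles.csv").read())   # human-readable text
with open("particles.pkl", "rb") as f:
    print(pickle.load(f) == rows)     # True: exact round trip
```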
 
This is the perennial programmer's dilemma. Do I use one file format or another? Do I use a flat file or a database? Do I use key/value, JSON, CSV, or …?

The answer usually depends on many factors:

- how complex the data is (JSON favors structured data) vs. ease of access (CSV favors easy access via a spreadsheet app or a Python program);

- whether the data needs to be updated periodically (favors a database) vs. sorting the file and inserting the data vs. appending the data to the front or back;

- whether some portion of the data needs to be selectively accessed (favors a database) vs. reading the whole file to find the data;

- how the data will be used later on: by yourself vs. others vs. web accessible…

This is why programmers dream, go mad, become managers, and then gods of data wrangling, forge empires and move the world…. I could go on but then I’d be making a molehill into a mountain fortress.
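The selective-access point above is where a database shines: instead of parsing a whole file, you query only the rows you need. A minimal sketch with the standard-library sqlite3 module (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a filename for a persistent DB
conn.execute("CREATE TABLE quotes (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO quotes VALUES (?, ?)",
                 [("AAPL", 172.5), ("MSFT", 310.2), ("GOOG", 135.7)])

# Selective access: the database finds the row; no full-file scan needed.
row = conn.execute("SELECT price FROM quotes WHERE symbol = ?",
                   ("MSFT",)).fetchone()
print(row[0])  # 310.2
```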
 
I don't really understand the question, and it seems to me that you do not fully understand what these MIME types represent.

.txt, .csv, or .json are all text files. What differentiates them is how the content is structured.
  • .txt is the general case and the content is a succession of characters without any specified structure;
  • .csv is text that is structured in a tabular form like a spreadsheet;
  • .json is text that represents structured data, like arrays and objects.
The only reason to use one versus another is portability. Any program knows what to expect with .csv and .json files, can quickly make sense of the info recorded, and could present a nice spreadsheet or make the data available as an array or object to easily search through it.

On the other hand, a .txt file could be anything, so a program can only present it as a succession of characters and let the user deal with it. But if the way you structured your data is known only to you and used only by you, it doesn't really matter. It is just that you might have reinvented the wheel for a structure that already exists, and introduced bugs that were resolved long ago in programs/modules that have been tried and tested for a very long time.
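To make the structural difference concrete, here is the same made-up record expressed in each of the three formats:

```python
import csv
import json

data = [{"name": "Mercury", "radius_km": 2440},
        {"name": "Venus", "radius_km": 6052}]

# .txt -- free-form: only your own code knows how to parse this back.
with open("planets.txt", "w") as f:
    for d in data:
        f.write(f"{d['name']} has radius {d['radius_km']} km\n")

# .csv -- tabular: any spreadsheet or csv reader understands it.
with open("planets.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["name", "radius_km"])
    w.writeheader()
    w.writerows(data)

# .json -- structured: arrays and objects survive the round trip.
with open("planets.json", "w") as f:
    json.dump(data, f, indent=2)

assert json.load(open("planets.json")) == data
```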

If you have stored objects in a tabular form, maybe it would be advantageous to rewrite your code. You might even find that a lot of what you have written will be replaced with a lot less code. But there is also the great advice: «If it works, don't mess with it.»
 
jack action said:
tabular form
Yes, most of the auxiliary data I have stored in .csv files are pd.DataFrame, pd.Series, and dict objects.
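Since that data is already tabular, the rewrite may be small. For the plain dict objects, a minimal sketch with only the standard library (the field names are made up; pandas objects have the analogous DataFrame.to_csv, to_json, and to_pickle methods):

```python
import json
import pickle

aux = {"last_run": "2021-05-01", "pages_scraped": 42}

# JSON: human-readable and portable, but limited to JSON-compatible types.
with open("aux.json", "w") as f:
    json.dump(aux, f, indent=2)

# Pickle: handles almost any Python object, but binary and Python-only.
with open("aux.pkl", "wb") as f:
    pickle.dump(aux, f)

assert json.load(open("aux.json")) == aux
with open("aux.pkl", "rb") as f:
    assert pickle.load(f) == aux
```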
 
