Should I rewrite my modules in order to implement json/pickle?

  • Context: Python 
  • Thread starter: Eclair_de_XII
  • Tags: Modules
SUMMARY

The discussion centers on the decision to rewrite existing Python modules to implement JSON or pickle for data persistence. While the participant expresses reluctance due to the complexity of their current 1,000-line module, they acknowledge the potential benefits of using structured data formats like JSON for efficiency and interoperability. Key considerations include the readability of CSV and TXT files for debugging and the complexity of the data being handled. Ultimately, the consensus suggests that unless significant performance improvements are expected, the effort to refactor may not be justified.

PREREQUISITES
  • Understanding of Python data serialization formats: JSON and pickle
  • Familiarity with data structures: arrays, objects, and tabular formats
  • Knowledge of file formats: .txt, .csv, and their use cases
  • Basic principles of software design: abstraction and data models
NEXT STEPS
  • Research the performance differences between JSON and pickle for data persistence in Python
  • Learn about Python's built-in libraries for handling CSV and JSON files
  • Explore best practices for structuring data in Python applications
  • Investigate the use of data layer modules to improve code maintainability
USEFUL FOR

Python developers, software engineers, and data analysts looking to optimize data storage and improve code efficiency through better data handling practices.

Eclair_de_XII
TL;DR
Basically, I wrote a bunch of code that web-scraped data off the internet into some objects, and stored those objects by writing them into .csv and .txt files. Is there any advantage to storing them in .json or .pkl files instead of the file types I mentioned? I am vaguely aware that the code to store the objects in the former is much less complicated than the code needed to store them in the latter. But, for example, is loading objects from .json or .pkl files faster?
I'm figuratively beating myself up for not knowing about these modules when I went to write my own. On one hand, I feel it would be a giant hassle to rewrite them just to implement some module. One of these modules is over a thousand lines long, which might be inconsequential to professional coders, but is certainly an accomplishment for an amateur. Rewriting that module alone, just to improve how efficiently it runs, is a task I don't look forward to; a silly concern, I should think, considering Python programs generally run slower than programs written in compiled languages such as C, C++, or C#. On the other hand, I feel it would be good practice to get into the habit of using efficient data-storage techniques.

I know I've made it clear that I definitely do not want to go through with this task, and that no one is forcing me to do so except myself. But my question remains: would loading objects from data-persistence files be faster or slower than loading them from normal text files?
 
It wouldn’t be materially different in loading time. JSON might be a good choice if you choose a sensible structure because you could use the data in another program without having to work out how to read it.
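For instance, here is a minimal sketch of what a sensible structure might look like for scraped records (the field names and URLs are invented for the example):

```python
import json

# Hypothetical records; any other program or language can read this file
# back without needing to know anything about the code that wrote it.
records = [
    {"url": "https://example.com/a", "title": "A", "price": 9.99},
    {"url": "https://example.com/b", "title": "B", "price": 4.50},
]

with open("scraped.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

with open("scraped.json", encoding="utf-8") as f:
    loaded = json.load(f)

assert loaded == records  # round-trips without any custom parsing code
```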

Writing a 1,000-line module that you dread refactoring is a very bad idea. Reading and writing data should have been in a separate ‘data layer’ module that could be refactored easily.

Study abstraction and data models so you do this better next time.
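As a rough illustration of the ‘data layer’ idea (the module and function names here are made up for the sketch):

```python
# datalayer.py -- hypothetical data-layer module. The rest of the program
# calls save_records()/load_records() and never touches file formats
# directly, so swapping CSV for JSON (or a database) only changes this file.
import json
from pathlib import Path


def save_records(records, path):
    """Persist records; the on-disk format is an implementation detail."""
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_records(path):
    """Load records back, regardless of how they were stored."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```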
 
Eclair_de_XII said:
Would loading objects from data-persistence files be faster or slower than loading them from normal text files?
Probably not enough to make it worth the effort to rewrite your module to use them.
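If you want to check for yourself, here is a minimal timing sketch using only the standard library (the data shape is invented; real scraped objects may behave differently):

```python
import json
import pickle
import timeit

# Invented sample data: 10,000 small records standing in for scraped objects.
records = [{"name": f"item{i}", "value": i, "tags": ["a", "b"]}
           for i in range(10_000)]

json_blob = json.dumps(records)
pickle_blob = pickle.dumps(records)

# Time deserialization only, since loading speed is what was asked about.
t_json = timeit.timeit(lambda: json.loads(json_blob), number=100)
t_pickle = timeit.timeit(lambda: pickle.loads(pickle_blob), number=100)

print(f"json:   {t_json:.3f} s for 100 loads")
print(f"pickle: {t_pickle:.3f} s for 100 loads")
```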

There are also other considerations besides speed. CSV and TXT files are easier for humans to read (particularly CSV if the humans have a spreadsheet program handy that they can load the CSV file into). Often being able to read your data files independently of your program is very helpful for debugging. It's also helpful for others to understand what data your program is storing.
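As an illustration of that readability, here is a short sketch with the standard csv module (the file name and columns are hypothetical); the resulting file opens directly in any spreadsheet program:

```python
import csv

rows = [
    {"title": "A", "price": "9.99"},
    {"title": "B", "price": "4.50"},
]

# Write a plain CSV file that a human (or spreadsheet) can inspect as-is.
with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Reading it back for debugging is just as direct.
with open("scraped.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)  # {'title': 'A', 'price': '9.99'}, ...
```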
 
This is always the programmer's dilemma. Do I use one file format or another? Do I use a flat file or a database? Do I use key/value, JSON, CSV, or …?

The answer usually depends on many factors:

- how complex the data is: JSON favors structured, nested data, while CSV favors easy access via a spreadsheet app or a Python program;

- whether the data needs to be updated periodically: a database handles updates in place, whereas with a flat file you are left sorting the file and inserting the data, or appending it to the front or back…;

- whether some portion of the data needs to be selectively accessed: this favors a database over reading the whole file to find the data (see the sketch after this list);

- how the data will be used later on: by yourself, by others, or over the web…
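To make the selective-access point concrete, here is a minimal sketch with the standard-library sqlite3 module (the table and columns are invented for the example):

```python
import sqlite3

con = sqlite3.connect("scraped.db")
con.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price REAL)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [("A", 9.99), ("B", 4.50)])
con.commit()

# Only the matching rows come back; no need to scan the whole file.
for row in con.execute("SELECT title, price FROM items WHERE price < 5"):
    print(row)  # ('B', 4.5)
con.close()
```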

This is why programmers dream, go mad, become managers, and then gods of data wrangling, forge empires and move the world…. I could go on but then I’d be making a molehill into a mountain fortress.
 
I don't really understand the question, and it seems to me that you do not fully understand what these file types represent.

.txt, .csv, or .json are all text files. What differentiates them is how the content is structured.
  • .txt is the general case and the content is a succession of characters without any specified structure;
  • .csv is text that is structured in a tabular form like a spreadsheet;
  • .json is text that represents structured data, like arrays and objects.
The only reason to use one versus another is portability. Any program knows what to expect from .csv and .json files, can quickly make sense of the information recorded, and can present a nice spreadsheet or expose the data as an array or object that is easy to search through.
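As an illustration, here is the same pair of invented records rendered in each structured format:

```python
import csv
import io
import json

records = [{"title": "A", "price": 9.99}, {"title": "B", "price": 4.50}]

# CSV: tabular, one row per record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
# title,price
# A,9.99
# B,4.5

# JSON: structure and basic types are preserved.
print(json.dumps(records))
# [{"title": "A", "price": 9.99}, {"title": "B", "price": 4.5}]
```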

On the other hand, a .txt file could be anything, so a program can only present it as a succession of characters and let the user deal with it. But if the way you structured your data is known only to you and used only by you, it doesn't really matter. It is just that you might have reinvented the wheel for a structure that already exists, and introduced bugs that have long since been resolved in modules that have been tried and tested for a very long time.

If you have stored objects in a tabular form, it may well be advantageous to rewrite your code. You might even find that a lot of what you have written can be replaced by a lot less code. But there is also the great advice: «If it works, don't mess with it.»
 
jack action said:
tabular form
Yes, most of the auxiliary data I have stored in .csv files are pd.DataFrame, pd.Series, and dict objects.
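If so, pandas can already round-trip those objects in either format with one-liners; a minimal sketch (the file names and contents are hypothetical):

```python
import json

import pandas as pd

# Hypothetical stand-ins for the stored objects.
df = pd.DataFrame({"title": ["A", "B"], "price": [9.99, 4.50]})
meta = {"source": "example.com", "pages": 3}

# DataFrames round-trip through either format:
df.to_csv("aux.csv", index=False)
df_from_csv = pd.read_csv("aux.csv")

df.to_json("aux.json", orient="records")
df_from_json = pd.read_json("aux.json", orient="records")

# Plain dicts go straight to JSON:
with open("meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f)
```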
 
