MySQL: Best way to warehouse older data

  • Thread starter Thread starter cpscdave
  • Start date Start date
  • Tags Tags
    Data Mysql
Click For Summary

Discussion Overview

The discussion revolves around strategies for warehousing older data in a MySQL database, particularly focusing on the efficiency of data transfer and deletion processes. Participants explore various methods, including scripting approaches and database partitioning, while also addressing performance concerns and SQL command differences.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes a Python script designed to transfer older data to a new database and delete the original records, but notes performance issues with larger tables.
  • Another participant suggests using RANGE partitioning instead of creating a new table, mentioning that it could simplify future data management and potentially improve performance.
  • Concerns are raised about the efficiency of the current method, with suggestions to add an index on the timestamp column to enhance performance.
  • A participant proposes using a .sql file for executing SQL commands directly, rather than relying on a Python script, to avoid connection issues and ensure transactional integrity.
  • Several participants discuss the differences between DELETE, DROP, and TRUNCATE commands in SQL, highlighting performance implications and rollback capabilities.
  • Questions arise regarding the differences in behavior between DELETE and TRUNCATE, particularly in terms of table state and constraints.
  • Participants express uncertainty about the definitions and behaviors of SQL commands across different database systems, particularly between MySQL and SQL Server.

Areas of Agreement / Disagreement

Participants express a variety of opinions on the best approach to warehousing data, with no clear consensus on the most efficient method. There is also ongoing debate regarding the implications of using different SQL commands, with some participants agreeing on the general principles but differing on specifics.

Contextual Notes

Limitations include potential performance issues with large datasets, the need for proper indexing, and the implications of using different SQL commands. There is also uncertainty regarding the behavior of SQL commands in different database systems.

Who May Find This Useful

Database administrators, data engineers, and developers working with MySQL or similar database systems may find this discussion relevant, particularly those dealing with large datasets and data warehousing strategies.

cpscdave
Messages
402
Reaction score
120
Hi everyone,
I have a MySQL DB with a couple of large tables (a couple hundred million rows). Each row has a timestamp associated with it. Since this is just historical data, and not needed we want to copy older data into another database, and then delete the copied records.

I've written a python script that accomplishes this by running a series of queries:

Code:
CREATE DATABASE IF NOT EXISTS `olddata`;
CREATE TABLE IF NOT EXISTS `oldTable`;
INSERT INTO `olddata`.oldeTable` SELECT * FROM `currData`.`currTable` WHERE timestamp < someValue;
DELETE FROM `currData`.`currTable` WHERE timestamp < someValue;

What is the most efficent way to run these queries?
I've tried running it so that we timestamp is the ultimate goal value, but the query takes a long time to run, and can timeout on some of the larger tables,
So I've changed the script to try and do the same thing but looping through the oldest value + 1 month repeatedly till we hit the goal.
Or is there a third way that would be better??

Thanks,
-Dave
 
Technology news on Phys.org
cpscdave said:
So I've changed the script to try and do the same thing but looping through the oldest value + 1 month repeatedly till we hit the goal.
I'm sure there is a better way, but this is what I would do too :smile:
 
What you want to do seems to be a good candidate for partitioning - namely RANGE partitioning - with an ALTER TABLE statement. You don't need to create another table that you won't use, you basically separate your actual table into sub-tables. If you ever need to delete the older data, it will require a simple TRUNCATE PARTITION statement.

But I don't know if it will be better performance-wise compared to your method as the manual says «In effect, different portions of a table are stored as separate tables in different locations», so it looks to be the same, but mySQL might be better optimized for it and it will be a lot easier for future addition of data.

And if you still want to go your way, you should have an index on your timestamp column, if you don't have one already.
 
Last edited:
  • Like
Likes   Reactions: stoomart
jack action said:
And if still want to go your way, you should have an index on your timestamp column, if you don't have one already.

Thanks for the feedbacks! I had thought about pre-adding the index, but been having some problems with the connection timing out. I just got it to work by running the query locally on the machine, vs from a remote computer.
Adding the index has drastically improved the performance of the script, to the point that it may not even be needed to do the warehousing.

The joys of supporting a DB schema that I didn't create. Get to find all these little foibles down the road :)
 
Does this really need to be all that fast? It seems like this is going to be the type of thing you do only once, then have a cron perform daily maintenance to keep it right. I would absolutely not do it in python because then you are reliant on the connection. Use a .sql file and give that directly to mysql. From python, you can then call the mysql client with this file as the parameter (which will allow you to change the "someValue" parameter programatically.

Wait! I see some very scary sql there, here is the corrected version:
Code:
BEGIN;
CREATE DATABASE IF NOT EXISTS `olddata`;
CREATE TABLE IF NOT EXISTS `oldTable`;
INSERT INTO `olddata`.oldeTable` SELECT * FROM `currData`.`currTable` WHERE timestamp < someValue;
DELETE FROM `currData`.`currTable` WHERE timestamp < someValue;
COMMIT;

You have multiple queries in your file, and if mysql craps out halfway through, your database will be in a corrupted state so always wrap it in a transaction.
 
Last edited:
  • Like
Likes   Reactions: stoomart, WWGD and jack action
Hi, hope not to derail or distract the post, with a quasi-related followup; the answer may provide info towards speeding up the OPs query .

Are there, in SQL Server ( or other) significant differences (resource- and otherwise? ) between Delete, Drop, Truncate ? I believe (SQL Server -wise) is more of a DDL (Data Definition ) statement, which will delete associated data objects ( triggers, integrity constraints, indexes, etc ), so, I guess, the table would not exist in the original script, while Truncate and Delete are more DML (Table will still exist, albeit without data in it)?
 
WWGD said:
Are there, in SQL Server ( or other) significant differences (resource- and otherwise? ) between Delete, Drop, Truncate ?
  • DELETE deletes rows (DML);
  • DROP TABLE deletes a table (DDL);
  • TRUNCATE TABLE is equivalent to DROP TABLE followed by CREATE TABLE to recreate the table without any data (DDL).
So TRUNCATE TABLE is much faster than DELETE all rows with large tables, but it's really not the same thing.
 
  • Like
Likes   Reactions: WWGD
jack action said:
  • DELETE deletes rows (DML);
  • DROP TABLE deletes a table (DDL);
  • TRUNCATE TABLE is equivalent to DROP TABLE followed by CREATE TABLE to recreate the table without any data (DDL).
So TRUNCATE TABLE is much faster than DELETE all rows with large tables, but it's really not the same thing.
Thanks, I have done some reading, but sources seem to contradict each other. What would then be the difference results-wise between DELETE and TRUNCATE? Don't we end up in both cases with an empty table? And I guess DROP TABLE does not allow for 'Where' clauses? Also, do you know ifMySQL and SQL Server coincide in these definitions? EDIT: And I guess we would require to drop "Child tables" before dropping Parent Tables?EDIT2: There also seems to be a difference of opinion on whether "Drops" can be rolled back.
 
Last edited:
WWGD said:
What would then be the difference results-wise between DELETE and TRUNCATE?
From the MySQL documentation:
Although TRUNCATE TABLE is similar to DELETE, it is classified as a DDL statement rather than a DML statement. It differs from DELETE in the following ways:
  • Truncate operations drop and re-create the table, which is much faster than deleting rows one by one, particularly for large tables.
  • Truncate operations cause an implicit commit, and so cannot be rolled back. See Section 13.3.3, “Statements That Cause an Implicit Commit”.
  • Truncation operations cannot be performed if the session holds an active table lock.
  • TRUNCATE TABLE fails for an InnoDB table or NDB table if there are any FOREIGN KEY constraints from other tables that reference the table. Foreign key constraints between columns of the same table are permitted.
  • Truncation operations do not return a meaningful value for the number of deleted rows. The usual result is “0 rows affected,” which should be interpreted as “no information.”
  • As long as the table format file tbl_name.frm is valid, the table can be re-created as an empty table with TRUNCATE TABLE, even if the data or index files have become corrupted.
  • Any AUTO_INCREMENT value is reset to its start value. This is true even for MyISAM and InnoDB, which normally do not reuse sequence values.
  • When used with partitioned tables, TRUNCATE TABLE preserves the partitioning; that is, the data and index files are dropped and re-created, while the partition definitions (.par) file is unaffected.

    Note
    As of MySQL 5.7.6, partition definition (.par) files are no longer created. Instead, partition definitions are stored in the internal data dictionary.
  • The TRUNCATE TABLE statement does not invoke ON DELETE triggers.
WWGD said:
Don't we end up in both cases with an empty table?
If you specify deleting all rows, yes. (Like with «DELETE FROM table_name WHERE 1=1»).
WWGD said:
And I guess DROP TABLE does not allow for 'Where' clauses?
All you can do is delete a table by specifying its name («DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name [, tbl_name] ...»).
WWGD said:
Also, do you know ifMySQL and SQL Server coincide in these definitions?
I'm not familiar with SQL Server.
WWGD said:
And I guess we would require to drop "Child tables" before dropping Parent Tables?
The concept of "Child" and "Parent" tables doesn't exist in MySQL, so you really need to specify their names. But I think if you have an appropriate foreign key constraint, it won't drop the table, but I'm not sure.
WWGD said:
There also seems to be a difference of opinion on whether "Drops" can be rolled back.
According to MySQL documentation:
Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines.
 
  • Like
Likes   Reactions: WWGD
  • #10
jack action said:
From the MySQL documentation:

If you specify deleting all rows, yes. (Like with «DELETE FROM table_name WHERE 1=1»).

All you can do is delete a table by specifying its name («DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name [, tbl_name] ...»).

I'm not familiar with SQL Server.

The concept of "Child" and "Parent" tables doesn't exist in MySQL, so you really need to specify their names. But I think if you have an appropriate foreign key constraint, it won't drop the table, but I'm not sure.

According to MySQL documentation:
Thanks, Jack.
 

Similar threads

  • · Replies 7 ·
Replies
7
Views
5K