Speeding Up SQL Query: Grouping & Joining


Discussion Overview

The discussion revolves around optimizing a T-SQL query in MS SQL Server that is slow due to multiple inner joins across several tables. Participants explore various strategies for improving query performance, including grouping tables, using nested queries, indexing, and alternative data handling methods.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Exploratory

Main Points Raised

  • One participant suggests grouping four category tables into a single table to speed up the query before joining it with quantitative data.
  • Another participant expresses caution, noting that performance improvements can be unpredictable and recommends conducting timing tests.
  • A participant reports a minor improvement in runtime by changing query syntax from "=" to "IN".
  • There is mention of negligible runtime improvement when using a nested query, leading to speculation about other potential issues in the query.
  • Some participants propose using SQL Server Profiler or examining execution plans to diagnose performance issues.
  • Indexing is suggested as a method to improve join operations, although one participant notes they cannot alter the indexed views they are working with.
  • Another participant recommends denormalizing data into fewer tables and considering parallel processing to handle large data volumes more efficiently.
  • Concerns are raised about the inefficiency of joining large fact tables directly, with suggestions to aggregate and filter them individually before joining (a sketch of this pattern appears after this list).
  • One participant shares their experience with pre-filtered views, noting that they did not yield significant runtime improvements.
  • There is a discussion about the feasibility of loading large datasets into memory for processing outside of SQL, with mixed opinions on performance implications.
  • Some participants express skepticism about the ability to optimize queries beyond what the SQL optimizer can achieve.
  • Concerns are raised about the database design, particularly regarding the repetition of keys across multiple tables, which may complicate optimization efforts.
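For illustration, here is a minimal T-SQL sketch of the aggregate-and-filter-before-joining pattern mentioned above; all table and column names (FactSales, FactReturns, ItemID, Qty) are hypothetical and not taken from the thread:

    -- Reduce each large fact table to one row per key first, then join
    -- the small aggregates instead of joining the raw fact tables.
    WITH SalesAgg AS (
        SELECT ItemID, SUM(Qty) AS TotalSold
        FROM FactSales
        WHERE SaleDate >= '20240101'      -- filter before joining
        GROUP BY ItemID
    ),
    ReturnsAgg AS (
        SELECT ItemID, SUM(Qty) AS TotalReturned
        FROM FactReturns
        WHERE ReturnDate >= '20240101'
        GROUP BY ItemID
    )
    SELECT s.ItemID, s.TotalSold, r.TotalReturned
    FROM SalesAgg s
    JOIN ReturnsAgg r ON r.ItemID = s.ItemID;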

Areas of Agreement / Disagreement

Participants express a range of views on the best strategies for optimizing the query, with no clear consensus on the most effective approach. Some agree on the potential benefits of indexing and denormalization, while others emphasize the limitations imposed by permissions and database design.

Contextual Notes

Participants note limitations related to permissions for altering views and tables, which may hinder optimization efforts. There are also discussions about the implications of database design on query performance.

  • #31
BWV said:
and the query, which is not important to the question, just makes calculations based on classification and filtering of the category data.
Actually I think the query is of VITAL importance to the question. Can you post it here (you can change the table and column names to something anonymous if necessary).
 
  • #32
PeterDonis said:
I'm not sure what you mean by a "professional database program", but the C libraries that the Python wrappers use for all of the common database engines are the same libraries that the command line clients for those engines use, and those libraries are provided by the database engine developers. So the Python wrappers should be running the same C code "under the hood" that the database engine developers use.
I can think of two scenarios:
1) Python uses the same libraries for all data manipulation and never touches the data otherwise. Then it is doing the same thing and has the same speed. I would not call this having Python "doing the heavy lifting".
2) Python uses libraries to retrieve data and then manipulates the data (doing joins, simple calculations, etc.) external to the libraries. In that case, you must be careful and do it right, because Python can be up to 45,000 times slower than optimized C even in very simple, repetitive calculations.

EDIT: I would not consider the timing of the optimized C code valid unless I could see the assembler language because it might have optimized the entire simple calculation away to just return the answer. The unoptimized C runs 45 times faster than the Python.
 
  • #33
The purpose of the query is simply to gather and format data to be imported into another third-party software app, so there is no advantage to using Python or some other software. Also, the database is MS SQL Server running in Azure on a lot more horsepower than my local machine. The query is not that complicated: 7 tables with some formatting via CASE statements so the other program can read it properly. The big constraint is that it needs live data, and the software is constantly updated throughout the day; otherwise I could create tables and columnstore indexes to warehouse it in a better format. I did manage to cut the run time by nearly 50% (11 min to just over 6 min) by creating a simple view with the full query on a slightly wider range of dates.
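As a rough illustration of the approach BWV describes (not the actual code; every name here is hypothetical), the pattern is to put the full query, over a slightly wider date window, behind a view and then select the exact range from it:

    -- Hypothetical sketch: the full query, padded to a wider date
    -- window, wrapped in a view.
    CREATE VIEW dbo.vw_ExportFeed
    AS
    SELECT t.TradeDate,
           t.ItemID,
           CASE WHEN t.Qty < 0 THEN 'SELL' ELSE 'BUY' END AS Side  -- formatting for the downstream app
    FROM dbo.SomeBaseView t
    WHERE t.TradeDate >= DATEADD(DAY, -40, CAST(GETDATE() AS date));
    GO

    -- Consumers then select just the dates they need:
    SELECT *
    FROM dbo.vw_ExportFeed
    WHERE TradeDate >= DATEADD(DAY, -30, CAST(GETDATE() AS date));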
 
  • #34
FactChecker said:
I can think of two scenarios
You're missing a third one:

(3) Python uses the C libraries to execute SQL statements (and any other operations that the database library provides). The results are then read into Python data structures and used for operations that cannot be expressed as SQL statements (or any other operations that the database library provides).

This is the approach actually taken by most programs that do this sort of thing.
 
  • #35
I will just say that if speed is the issue, I would not look to Python to solve it. The only language that I know is slower is MATLAB (if used incorrectly).
 
  • #36
BWV said:
the purpose of the query is to simply gather and format data to be imported into another third party software app, so no advantage to using Python or some other software.
But "gathering data" could be really slow without proper indexing, so there COULD be a big advantage in doing the indexing elsewhere if you can't do it in SQL Server, however

BWV said:
Also the database is MS SQLServer running in Azure
Ah, then transferring perhaps 10GB of unindexed data out each time is probably not an option (unless perhaps to a computational instance in the same Azure cluster).
 
  • #37
If the table design from the third-party app is the culprit, then you could copy the data onto your machine - in an optimized way - and update your copy before each query.

But I seriously doubt this is your problem. I really would like to see your query (telling us it is a "simple one with joins" is not enough for us to help you further). The bolded text below is a particular red flag for me:

BWV said:
and the query, which is not important to the question, just makes calculations based on classification and filtering of the category data
A query that makes calculations - as opposed to just gathering basic data - can take a lot of time.

Show us your query.
 
  • #38
We need to see the SQL. You don't say how large the rows are, but, frankly, 10^7 rows on modern hardware is typically no big deal. You also don't say how large the other tables are. 10^7 rows of 100 bytes each is 1 GB, and modern drives push hundreds of MB per second, so a scan of that much data shouldn't take long. Not minutes, certainly.

Posting the output of the EXPLAIN will tell volumes. In fact, we can't really give, and you shouldn't really accept, any advice without the EXPLAIN. We don't know what it's doing.

You also don't say how often the query runs, how important it is to a workflow, etc. If it's for a weekly report then who cares? Meanwhile, if it's crushing the DB by voiding all of the cache memory with worthless data, then it has impact even beyond itself.

Simply, as with all performance questions, the answer is "it depends".

So, for best results, post the SQL query itself, and post the result of the EXPLAIN on it. Then, with some more context on how it's run and what the expectations are, perhaps better guidance can be given.
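For SQL Server specifically, EXPLAIN is the MySQL/MariaDB term; the equivalent is the estimated or actual execution plan. A minimal way to capture one from a query window (the SELECT below is just a stand-in for the real slow query):

    -- Estimated plan: compiles the statement without executing it.
    SET SHOWPLAN_XML ON;
    GO
    SELECT COUNT(*) FROM dbo.SomeLargeTable;   -- stand-in for the slow query
    GO
    SET SHOWPLAN_XML OFF;
    GO

    -- Or execute the query and collect per-statement I/O and timing:
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    SELECT COUNT(*) FROM dbo.SomeLargeTable;   -- stand-in for the slow query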
 
  • #39
Just out of curiosity, is anyone aware that Microsoft SQL Server Management Studio has a fantastic query speed optimizer? It saved me many times. I can't claim to understand the inner performance of SQL Server, but SSMS had a 100% success rate in refactoring my complex stored procedures and queries. I don't recall how to access the feature, but it was pretty easy to find 10 years ago and I assume it's still there. I'm talking about the free version that anyone can download and run.
 
  • #40
The only one I'm familiar with is running a trace; not sure that's the latest.
 
  • #41
The only way to speed up a SQL query is to build an index table using the keys you are querying with in your SQL query. This allows the query to use the index table to get its results.
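As a concrete (and entirely hypothetical) example of that advice, an index on the join/filter keys lets the engine seek rather than scan:

    -- Names are illustrative only. INCLUDE columns make the index
    -- "covering", avoiding lookups back into the base table.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate, Qty);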
 
  • #42
With T-SQL, the dev environment and tools are one's best friend. I have even had SQL Studio suggest and automate the addition of needed indexes. Learn to use that tool to the max is my best advice. Of course, I had admin rights to the server. A lot of work environments force SQL programmers to work practically disconnected. I hated working like that and literally changed jobs to escape it. The sysadmins needed safety and control, I know, but I found them generally unhelpful. To use the tools fully, get a copy of the database and install it locally if you can, or get admin rights so the dev environment can help you. People try to segregate programming from sysadmin on databases, which has a purpose sometimes but often leaves the programmer without enough information.
 
  • #43
jedishrfu said:
The only way to speed up a SQL query is to build an index table using the keys you are querying with in your SQL query. This allows the query to use the index table to get its results.
I cannot agree that there is only one way. There might be a number of ways. I wrote T-SQL code for many years. For example, instead of querying a raw table, you can write presorted views which are already compiled and optimized, and then query against those. And you can sometimes use completely different language semantics to accomplish the goal. Often there are multiple approaches to the same problem. Indexes are one tool, and building an index is not free either. Everything has a cost.
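In SQL Server terms, a "presorted, precompiled" view is an indexed view: with SCHEMABINDING plus a unique clustered index, the view's result set is materialized and maintained by the engine. A minimal sketch with hypothetical names follows; note that indexed views carry several restrictions (for example, COUNT_BIG(*) is required alongside GROUP BY, and aggregated columns must be non-nullable expressions):

    CREATE VIEW dbo.vw_DailyTotals
    WITH SCHEMABINDING
    AS
    SELECT ItemID,
           SaleDate,
           SUM(ISNULL(Qty, 0)) AS TotalQty,
           COUNT_BIG(*)        AS RowCnt   -- required for indexed views with GROUP BY
    FROM dbo.FactSales
    GROUP BY ItemID, SaleDate;
    GO

    -- The unique clustered index is what makes the view materialized.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_DailyTotals
        ON dbo.vw_DailyTotals (ItemID, SaleDate);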
 
  • #44
So, I have some SQL code that I do not own, but it is open source, and it is slower than I think it should be. There are probably 400,000 rows, and selecting rows containing a particular substring takes about a minute. The underlying technology is MariaDB. These queries cause the CPU to jump to 100% on one core.

Obviously, the best thing to do would be to use multiple cores, but that seems not to be an option. Faster hardware helps, but not if this is CPU-limited. Otherwise, I need to reduce the data - probably only 100-160K of these rows are of interest, although extracting them is non-trivial.

Am I missing something?
 
  • #45
It sounds like the query needs to parse a string for each row, and that may be why you're CPU-bound.
 
  • #46
Vanadium 50 said:
So, I have some SQL code that I do not own, but it is open source, and it is slower than I think it should be. There are probably 400,000 rows, and selecting rows containing a particular substring takes about a minute. The underlying technology is MariaDB. These queries cause the CPU to jump to 100% on one core.

Obviously, the best thing to do would be to use multiple cores, but that seems not to be an option. Faster hardware helps, but not if this is CPU-limited. Otherwise, I need to reduce the data - probably only 100-160K of these rows are of interest, although extracting them is non-trivial.

Am I missing something?
Can you make this a new question, please? THIS thread is about T-SQL, which is a Microsoft technology, whereas this interesting question concerns a Linux database using an entirely different DB engine and query language. This should get its own thread IMO.
 
  • #47
If one decides based on the title, then it's for any SQL from any database.
 
  • #48
jedishrfu said:
If one decides based on the title, then it's for any SQL from any database.
The question itself specifies T-SQL, and all my previous answers are specific to that.
 
  • #49
Never mind. I already have frustration. I don't need to get any more here.
 
  • #50
Vanadium 50 said:
selecting rows containing a particular substring takes about a minute.
Is the substring always in the same place? Create an index on just that part of the string. Is it always the same substring? Add a computed field to flag that the record contains the string.
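Since the slow queries in question run on MariaDB, here are hypothetical sketches of both of Dale's suggestions in that dialect (the table and column names are invented):

    -- (a) Substring always at a fixed place at the start of the column:
    --     a prefix index covers just that leading part of the string,
    --     so WHERE message LIKE 'MARKER%' can use it.
    CREATE INDEX ix_log_prefix ON log_table (message(20));

    -- (b) Always the same substring anywhere in the column: precompute
    --     a flag as a stored generated column and index that instead.
    ALTER TABLE log_table
        ADD COLUMN has_marker TINYINT AS (message LIKE '%MARKER%') STORED,
        ADD INDEX ix_has_marker (has_marker);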
 
  • #51
Why not ignore the views and query the tables they are formed from? A view is not a real table, so the view has to be created by its own query, and you are doing that several times and then joining the results. Query the source of these views directly, so no join is required. However, are the views indexed? The index should be stored. But the index on the table of origin should be stored as well, so querying that should be faster, as no join would be required.

Views are usually created to give office staff easy access to just the data that their level of expertise or position in a company allows, or to avoid errors when a query needing an extra column or two has to be entered repeatedly.
 
  • #52
DrJohn said:
Why not ignore the views and query the table they are formed from?
We are going round in circles:
BWV said:
The tables are actually views on a third party app, they are indexed, but I have no ability to alter them
The OP was fully addressed some time ago.
 
