Speeding Up SQL Query: Grouping & Joining

BWV · Aug 11, 2022

I have a T-SQL query (MS SQL Server) that runs slowly due to the six inner joins I need to get the data I want. The seven tables of data I am working with consist of four of category data and three of quantitative - and the query, which is not important to the question, just makes calculations based on classification and filtering of the category data

question is - the four category tables are all the same key, they just have different information I need on them. So do you all think I could materially speed up the query by first grouping the four category tables into a single one with only the data I want then joining that to the quant data?

Not sure if I am permissioned to create a table, so may have to use a nested query to do this

FactChecker · Aug 11, 2022

I think it should help a lot. That being said, computer speed issues can be very tricky and what I think will be faster can be slower in fact. You should conduct a timing test. If you do, I would be interested. If the result is so significant that no timing is needed, that is enough to know.

BWV · Aug 11, 2022

Will give it a try. FWIW one minor tweak I made after some research was replacing = with IN in the WHERE and some CASE WHEN statements and it cut a 10:54 runtime by a few seconds to 10:47

BWV · Aug 11, 2022

Negligible difference - 10:47 down to 10:41 using a nested query. the nested query on its own runs too fast for mgmt studio to time, so maybe be the case statements in SELECT or some other issue

WWGD · Aug 11, 2022

Have you tried doing a trace (Profiler, IIRC) ? Maybe translating to Relational Algebra?Execution Plan
Activity Monitor?

jedishrfu · Aug 11, 2022

Have you built indexes for the tables in question? that is one way to speed up a join operation.

https://support.microsoft.com/en-us...formance-0a8e2aa6-735c-4c3a-9dda-38c6c4f1a0ce

BWV · Aug 11, 2022

Thanks,

Have not messed with profiler

The tables are actually views on a third party app, they are indexed, but I have no ability to alter them

pbuk · Aug 11, 2022

I think I'd probably denormalize the data into 3 or fewer tables and query against that. If it is practical to do this each time you want to run the report then you have a solution, if not then you have an idea of the minimum time you are aiming at with optimizations.

Also, if the slow times are driven by the volume of data rather than the complexity of calculations think about how you can parallelize the process - or even run serially in smaller batches to make more use of faster cache levels.

jack action · Aug 11, 2022

Have you tried the EXPLAIN command?

FactChecker · Aug 11, 2022

BWV said:

The tables are actually views on a third party app, they are indexed, but I have no ability to alter them

Are you saying that you don't have permission to join (inner, outer, left, or right as appropriate) the four tables together with the common key and save them before the query? In that case, I wouldn't be surprised if there was no timing improvement if the time includes the join.
I thought that you were talking about joining them once to get a larger table that could be saved and used in many queries later. In that case, I would expect the time required for a single query to be much less than the time required for four separate queries.

BWV · Aug 11, 2022

Appreciate all the responses, several good avenues to explore for a hack programmer

Ultimately the thinking in my OP comes down to misunderstanding that the JOIN operator is associative so nesting the joins should have no effect on run time

Also my nested query of categorical data took literally 0 time to run, so that can’t be the issue

The quant data tables are very large, in one to many with the cat data and many to many with one another, so that likely is the source of the slow run time

pbuk · Aug 11, 2022

BWV said:

The quant data tables are very large, in one to many with the cat data and many to many with one another, so that likely is the source of the slow run time

No doubt. That is going to be particularly slow if the common fields on the many to many are not all indexed, so that's the first thing to try on a snapshot before denormalizing.

pasmith · Aug 12, 2022

BWV said:

Appreciate all the responses, several good avenues to explore for a hack programmer

Ultimately the thinking in my OP comes down to misunderstanding that the JOIN operator is associative so nesting the joins should have no effect on run time

Also my nested query of categorical data took literally 0 time to run, so that can’t be the issue

The quant data tables are very large, in one to many with the cat data and many to many with one another, so that likely is the source of the slow run time

It sounds like you are describing fact tables in a star schema, and you never want to join those to each other: Firstly this is very slow, and secondly it might result in a single row of a fact table being repeated multiple times in the joined table, which will make all of your aggregations inaccurate.

Instead you should aggregate and filter the fact tables individually, and then outer join the results with the category tables on the common category keys.

BWV · Aug 15, 2022

Tried some pre-filtered views on the large (~10^7 records) data tables, thought by pre-filtering them with a trailing 180 date window it would speed up some of the stuff I run, but the runtime is essentially unchanged. Now messing around with trying to create an indexed view with CREATE UNIQUE CLUSTERED INDEX, but running into permission issues, so will likely just give up. You can't schemabind any of the views in this DB to create an index so need to go to the main tables, but run into permissioning issues as its a 3rd party app and don't want to mess with the core data tables too much

pbuk · Aug 15, 2022

BWV said:

Tried some pre-filtered views on the large (~10^7 records) data tables

Doesn't sound that large, have you tried just loading the lot into memory and using your programming tool of choice?

BWV · Aug 15, 2022

pbuk said:

Doesn't sound that large, have you tried just loading the lot into memory and using your programming tool of choice?

it is manageable, but reducing the current 10-20 minute runtime would improve workflows

pbuk · Aug 15, 2022

I don't understand, are you saying you are currently loading 10^7 records into some suitable data structure in e.g Python and it is taking 10-20 minutes for Python to run the computations?

BWV · Aug 15, 2022

No just talking about the query time in SSMS

jack action · Aug 15, 2022

I wish I could at least see the query. Seeing the table definitions would be a plus.

What you are trying to do - rewriting your query to eliminate the JOINs - is most likely pointless. The program is always using the most optimized way to solve your query. Thinking you could do it better than the program is most likely not going to happen.

You have to make sure that the query is written such that it works well with the built-in optimizer AND that your data is well stored (most simple data type and useful indexing).

pbuk · Aug 15, 2022

BWV said:

No just talking about the query time in SSMS

Yes, so I am saying avoid the query altogether, just send the tables over and let Python or whatever do the heavy lifting.

FactChecker · Aug 15, 2022

If you can not do a join once and save the result for later use multiple times, I think there is very little improvement that you can make. If you do not have permission to create and save a joined table, you should consult the data administrator to see what he will suggest and allow. There might be serious issues with keeping a separate, joined table up to date. There also might be storage space issues. It sounds to me as though the database might be poorly designed, with the same key repeated many times in separate, single-data-column tables. If so, you might be out of luck. One reason for designing the database that way might be to have unique rules of access for each data type.

FactChecker · Aug 15, 2022

pbuk said:

Yes, so I am saying avoid the query altogether, just send the tables over and let Python or whatever do the heavy lifting.

If speed is the issue, I would have serious doubts about Python. It is relatively slow. A professional database program should be orders of magnitude faster.

PeterDonis · Aug 15, 2022

FactChecker said:

If speed is the issue, I would have serious doubts about Python.

Python has third-party libraries for all of the popular database engines that use C extension modules to do all the heavy lifting. There isn't really a speed impact on DB operations in Python if you use these.

FactChecker · Aug 15, 2022

PeterDonis said:

Python has third-party libraries for all of the popular database engines that use C extension modules to do all the heavy lifting. There isn't really a speed impact on DB operations in Python if you use these.

There is no reason that it would be faster than the professional database program if it uses the same algorithms in the same language. The process of pulling the data out for any Python manipulations would add time and might be slow.

pbuk · Aug 15, 2022

FactChecker said:

If speed is the issue, I would have serious doubts about Python. It is relatively slow. A professional database program should be orders of magnitude faster.

No, no, and no.

I can't blame you for repeating the often quoted myth that "Python is slow", because it is often repeated, but it is also a myth.

And "professional database programs" are only fast when (i) doing things that "professional database programs" are good at, and (ii) they are properly structured for the task at hand. In this case it seems that neither criterion is fulfilled.

FactChecker · Aug 15, 2022

pbuk said:

No, no, and no.

I can't blame you for repeating the often quoted myth that "Python is slow", because it is often repeated, but it is also a myth.

It's the result of timing tests.

pbuk · Aug 15, 2022

FactChecker said:

It's the result of timing tests.

What timing tests are relevant to the problem here, which is not even known?

FactChecker · Aug 15, 2022

pbuk said:

What timing tests are relevant to the problem here, which is not even known?

Of course, I am talking in general. In general, Python is a fairly slow language.
If Python can do everything through calls to more efficient subroutines and never has to bring the data into the Python environment, then fine. But in that case, it should be doing the same thing that the original database program is doing. I wouldn't call that having Python "do the heavy lifting".

PeterDonis · Aug 15, 2022

FactChecker said:

There is no reason that it would be faster than the professional database program if it uses the same algorithms in the same language.

I'm not sure what you mean by a "professional database program", but the C libraries that the Python wrappers use for all of the common database engines are the same libraries that the command line clients for those engines use, and those libraries are provided by the database engine developers. So the Python wrappers should be running the same C code "under the hood" that the database engine developers use.

pbuk · Aug 15, 2022

FactChecker said:

Of course, I am talking in general. In general, Python is a fairly slow language.
If Python can do everything through calls to more efficient subroutines and never has to bring the data into the Python environment, then fine. But in that case, it should be doing the same thing that the original database program is doing. I wouldn't call that having Python "do the heavy lifting".

There seems to be a big misunderstanding here.

We have a query over 10^7 records (or is it 3x10^7 records) that is taking 10-20 minutes to run. That can only be becuase the data is not structured and indexed in an appropriate way for this particular computation. We cannot, apparently, change the structure of the database in MS SQL Server because the database is proprietory.

What we can do, however, is pull the (lets assume 10GB of) data over to a workstation running (any language that has an effective hash table implementation) so that we can structure it in a way that is appropriate for this particular computation.

The fact that Python is slowed by a small constant factor due to running a P-code virtual machine is insignificant compared to the exponential gains that can be made by structuring and indexing the data in an appropriate way.

pbuk · Aug 15, 2022

BWV said:

and the query, which is not important to the question, just makes calculations based on classification and filtering of the category data.

Actually I think the query is of VITAL importance to the question. Can you post it here (you can change the table and column names to something anonymous if necessary).

FactChecker · Aug 15, 2022

PeterDonis said:

I'm not sure what you mean by a "professional database program", but the C libraries that the Python wrappers use for all of the common database engines are the same libraries that the command line clients for those engines use, and those libraries are provided by the database engine developers. So the Python wrappers should be running the same C code "under the hood" that the database engine developers use.

I can think of two scenarios:
1) Python uses the same libraries for all data manipulation and never touches the data otherwise. Then it is doing the same thing and has the same speed. I would not call this having Python "doing the heavy lifting".
2) Python uses libraries to retrieve data and then manipulates the data (doing joins, simple calculations, etc.) external to the libraries. In that case, you must be careful and do it right because Python can be up to 45,000 times slower than optimized C even in very simple, repetitive, calculations.

EDIT: I would not consider the timing of the optimized C code valid unless I could see the assembler language because it might have optimized the entire simple calculation away to just return the answer. The unoptimized C runs 45 times faster than the Python.

BWV · Aug 15, 2022

the purpose of the query is to simply gather and format data to be imported into another third party software app, so no advantage to using Python or some other software. Also the database is MS SQLServer running in Azure on a lot more horsepower than my local machine. The query is not that complicated, 7 tables with some formatting with case statements so the other program can read it properly. The big constraint is it needs live data and software is constantly updated throughout the day, otherwise I could create tables and columnstore indexes to warehouse it in a better formatI did manage to cut the run time by nearly 50% (11 min to just over 6 min) by creating a simple view with the full query on a slightly wider range of dates

PeterDonis · Aug 15, 2022

FactChecker said:

I can think of two scenarios

You're missing a third one:

(3) Python uses the C libraries to execute SQL statements (and any other operations that the database library provides). The results are then read into Python data structures and used for operations that cannot be expressed as SQL statements (or any other operations that the database library provides).

This is the approach actually taken by most programs that do this sort of thing.

FactChecker · Aug 15, 2022

I will just say that if speed is the issue, I would not look to Python to solve it. The only language that I know is slower is MATLAB (if used incorrectly).

pbuk · Aug 15, 2022

BWV said:

the purpose of the query is to simply gather and format data to be imported into another third party software app, so no advantage to using Python or some other software.

But "gathering data" could be really slow without proper indexing, so there COULD be a big advantage in doing the indexing elsewhere if you can't do it in SQL Server, however

BWV said:

Also the database is MS SQLServer running in Azure

Ah, then transferring perhaps 10GB of unindexed data out each time is probably not an option (unless perhaps to a computational instance in the same Azure cluster).

jack action · Aug 15, 2022

If the table design from the third-party app is the culprit, then you could copy the data on your machine - in an optimized way - and update your copied database before each query.

But I seriously doubt this is your problem. I really would like to see your query (Telling us it is a "simple one with join" is not enough for us to help you further). The bold text is particularly a red flag for me:

BWV said:

and the query, which is not important to the question, just makes calculations based on classification and filtering of the category data

A query that makes calculations - as opposed to just gathering basic data - can take a lot of time.

Show us your query.

whartung · Aug 16, 2022

We need to see the SQL. You don't say how large the rows are, but, frankly, 10^7 rows on modern hardware is typically no big deal. You also don't say how large the other tables are. 10^7, 100 byte rows is 1G. Modern drives are pushing 100's of MB per second. Doesn't take that long to scan that kind of data. Not minutes, certainly.

Posting the output of the EXPLAIN will tell volumes. In fact, we can't really give, and you shouldn't really accept, any advice without the EXPLAIN. We don't know what it's doing.

You also don't say how often the query runs, how important it is to a workflow, etc. If it's for a weekly report then who cares? Meanwhile, if it's crushing the DB by voiding all of the cache memory with worthless data, then it has impact even beyond itself.

Simply, as with all performance questions , the answer is "it depends".

So, for best results, post the SQL query itself, and post the result of the EXPLAIN on it. Then, with some more context on how its run and expectations, perhaps better guidance can be given.

harborsparrow · Nov 21, 2022

Just out of curiosity, is anyone aware that Microsoft SQL Server Studio has a fantastic query speed optimizer? It saved me many times. I can't claim to understand the inner performance of Sql Server but MS SQL Studio had a 100% success rate in refactoring my complex stored procedures and queries. I don't recall how to access the feature but it was pretty easy to find 10 years ago and I assume it's still there. I'm talking about the free version that anyone can download and run.

WWGD · Nov 21, 2022

Only one I'm familiar with is running a trace. not sure that's the latest.

jedishrfu · Nov 21, 2022

The only way to speed up a sql query is to build an index table using the keys you are querying with in your sql query. This allows the query to use the index table to get its results.

harborsparrow · Nov 21, 2022

With T-SQL, the dev environment and tools are one's best friend. I have even had SQL Studio suggest and automate the addition of needed indexes. Learn to use that tool to the max, is my best advise. Of course, I had admin rights to the server. A lot of work environments force SQL programmers to work practically disconnected. I hated working like that and literally changed jobs to escape it. The sysadmins needed safety and control, I know, but I found them generally unhelpful. To use the tools fully, get a copy of the database and install it locally if you can. Or get admin rights so the dev environment can help you. People try to segregate programming from sysadmin on databases, which has a purpose sometimes but often leaves the programmer without enough information.

harborsparrow · Nov 21, 2022

jedishrfu said:

The only way to speed up a sql query is to build an index table using the keys you are querying with in your sql query. This allows the query to use the index table to get its results.

I cannot agree that there is only one way. There might be a number of ways. I wrote T-SQL code for many years. For example, instead of querying a raw rable, you can write presorted views which are already compiled and optimized, and then query against those. And you can use completely different language semantics, sometimes, to accomplish the goal. Often there can be multiple approaches to the same problem. Indexes are one tool, and building an index is not free either. Everything has a cost.

Vanadium 50 · Nov 21, 2022

So, I have some SQL code that I do not own, but is open source, and it is slower than I think it should be. There are probably 400,000 rows, and selecting rows containing a particular substring takes about a minute. The underlying technology is MariaDB. These queries cause the CPU to jump to 100% on one core.

Obviously, the best thing to do is to use multiple cores. but that seems not to be an option. Faster hardware helps, but not if this is CPU limited. Otherwise, I need to reduce the data - probably only 100-160K of these rows are of interest, although extracting them is non-trivial.

Am I missing something?

jedishrfu · Nov 21, 2022

It sounds like the query needs to parse a string for each row and that may be why you’re cpu bound.

harborsparrow · Nov 21, 2022

Vanadium 50 said:

So, I have some SQL code that I do not own, but is open source, and it is slower than I think it should be. There are probably 400,000 rows, and selecting rows containing a particular substring takes about a minute. The underlying technology is MariaDB. These queries cause the CPU to jump to 100% on one core.

Obviously, the best thing to do is to use multiple cores. but that seems not to be an option. Faster hardware helps, but not if this is CPU limited. Otherwise, I need to reduce the data - probably only 100-160K of these rows are of interest, although extracting them is non-trivial.

Am I missing something?

Can you make this a new question, please? THIS thread is about T-Sql which is a Microsoft technology, whereas this interesting question concerns a Linux database using an entirely different dB engine and query language. This should get its own thread IMO.

jedishrfu · Nov 21, 2022

If one decides based on the title then it’s for any sql from any database.

harborsparrow · Nov 21, 2022

jedishrfu said:

If one decides based on the title then it’s for any sql from any database.

The question itself specifies TSql and all my previous answers are specific to that.

Vanadium 50 · Nov 21, 2022

Nevermind. I alraedy have frustration. I don't need to get any more here.

pbuk · Nov 21, 2022

Vanadium 50 said:

selecting rows containing a particular substring takes about a minute.

Is the substring always in the same place? Create an index on just that part of the string. Is it always the same substring? Add a computed field to flag that the record contains the string.

Speeding Up SQL Query: Grouping & Joining

Similar threads

Hot Threads

Touch-typing for programmers

How to calculate Tension for a series of connected points?

Python Complaining About Python

Fortran Reading files in pre-f77 - handling end of file

Python Partial pivoting in getting the reduced row echelon form of a matrix

Recent Insights

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers

Insights Fermat's Last Theorem