SQL WHERE Clause: Repetitions & Order Explained

  • Context: Comp Sci 
  • Thread starter Thread starter Crystal037
  • Start date Start date
  • Tags Tags
    Sql
Click For Summary

Discussion Overview

The discussion revolves around the behavior of SQL queries, specifically focusing on the use of the WHERE clause and the resulting repetitions in output. Participants explore issues related to duplicate values, the effects of different join types, and the order of data retrieval in SQL queries.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expects distinct output but encounters repetitions, suggesting confusion over how the WHERE clause conditions lead to duplicates.
  • Another participant proposes that the repetitions may stem from a cartesian join, indicating that the join conditions are not properly defined.
  • A different viewpoint suggests that the query optimizer's behavior, influenced by existing indexes, affects how rows are joined and selected.
  • One participant recommends using "select distinct" to achieve the expected result but cautions that this is not a proper solution to the underlying issue.
  • Another participant explains that joining on non-unique keys can lead to multiple occurrences of the same data, complicating the output further.
  • It is noted that using commas in joins can create cross joins, resulting in every combination of rows, which can exacerbate the issue of duplicates.

Areas of Agreement / Disagreement

Participants express differing views on the causes of the repetitions and the best approaches to resolve them. There is no consensus on a single solution, as various models and explanations are presented.

Contextual Notes

Participants highlight the importance of defining join conditions clearly to avoid unintended duplicates, but the discussion remains open regarding the specific implementation and resolution of the issues raised.

Crystal037
Messages
167
Reaction score
7
Homework Statement
I am trying to understand how sql query selects multiple times the same value when I use a combination of condition in where.
I have 4 tables employee, project,department,works_on.
I have to find all project number that involve an employee with last name='Wong' either as a worker or as a manager of the department that controls the project.
Relevant Equations
select pnumber
from project,works_on,employee,department
where lname='Wong'
and ((dnum=dnumber and mgr_ssn=ssn) or (essn=ssn and pnumber=pno));
I expect output as 30,20,3,2,1 in a row but I am getting repetitions of these with 20,30 coming 3 times and 3,2 coming 18 times and 1 coming 16 times. I dont understand how combination of condition in where is leading to so many duplicates. Instead of using or if I divide the query into 2 and execute it one by one then I am getting only distinct values. Can you also explain the order by which it traverses through the table and selects tuples and display as to understand why am I getting those multiple values and why in that specific order
Screenshot (136).png
Screenshot (137).png
Screenshot (138).png
Screenshot (139).png
Screenshot (140).png
 
Physics news on Phys.org
Repitition like that is usually due to a cartesian join, where some part of the join is not defined. So each row of one table is joined to EVERY row of another table. And I think that the OR is one of the things doing that. So read up on cartesian joins to see where it is going wrong.

If your select was on more than just Pnumber, say include Pno from works_on, or Pno from works_on and Dnumber from department you should see groupings of data showing how the cartesian join is collecting repetively data on a table by table basis.

It always helps to recover date where you specify the tables in the select's join rather than just a column name. You have not mentioned which table's dnum equals which table's dnumber, for example.
 
  • Like
Likes   Reactions: Crystal037 and jim mcnamara
The query optimiser will test the number of rows it gets in lots of different joins and then use that to work out which join to do first, then collect data from that and choose which one it does next. This depends on indexes existing as well, if I remember correctly.
 
You could probably replace your first line with "select distinct pnumber" and get the result you expect. But that would be a bad fix.

The real fix you need is to rewrite your query with an appropriate JOIN syntax to eliminate the problem explained by @DrJohn . Here are links that explain how and why you should:
 
  • Like
Likes   Reactions: jim mcnamara
If you join on dnum = dnumber you get nr 5 times 9, since dnumber is 5 thrice and dnum as well. This all goes for the same person, so the join isn't uniquely linked.
 
Joining with comma's or with the equivalent cross join takes every combination of rows once. If you then specify dnum = dnumber the amount of rows where both equal 5 becomes 9 since for both tables there are 3 such rows. But you also join on pnum = pnumber and these match on 2 and 3 where again dnum = dnumber = 5. Therefore these occurences (2 and 3) are again multiplied by 2, yielding 18 times 2 and 3.

BTW this also occurs on an inner join on the same conditions, you need to take a join condition that uniquely identifies one correct row.
 
Last edited:

Similar threads

  • · Replies 17 ·
Replies
17
Views
3K
  • · Replies 7 ·
Replies
7
Views
5K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 2 ·
Replies
2
Views
4K
  • · Replies 3 ·
Replies
3
Views
3K
  • · Replies 8 ·
Replies
8
Views
1K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 3 ·
Replies
3
Views
2K