Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EXCEPT ALL returns the wrong number of records #12956

Open
vbarua opened this issue Oct 15, 2024 · 1 comment
Open

EXCEPT ALL returns the wrong number of records #12956

vbarua opened this issue Oct 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@vbarua
Copy link
Contributor

vbarua commented Oct 15, 2024

Describe the bug

According to the SQL spec, when handling EXCEPT ALL the number of copies returned of a given record is the maximum of 0 OR the number of copies in the LHS minus the RHS.

Specifically:

            b) If a set operator is specified, then the result of applying
              the set operator is a table containing the following rows:

              i) Let R be a row that is a duplicate of some row in T1 or of
                 some row in T2 or both. Let m be the number of duplicates
                 of R in T1 and let n be the number of duplicates of R in
                 T2, where m � 0 and n � 0.

            ...            

            iii) If ALL is specified, then

                 Case:

                 1) If UNION is specified, then the number of duplicates of
                   R that T contains is (m + n).

                 2) If EXCEPT is specified, then the number of duplicates of
                   R that T contains is the maximum of (m - n) and 0.

                 3) If INTERSECT is specified, then the number of duplicates
                   of R that T contains is the minimum of m and n.

DataFusion currently removes all copies of a record if it is present in the RHS.

To Reproduce

The following query

➜  ~ datafusion-cli
DataFusion CLI v42.0.0
> SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
EXCEPT ALL
SELECT * FROM VALUES ('b'), ('c');
+---------+
| column1 |
+---------+
| a       |
+---------+

returns 0 copies of ('b') and ('c') which does not match the behaviour from the spec.

Expected behavior

According to the SQL spec there should be 1 copy of ('b') and 2 copies of ('c')

Additional context

See DB Fiddle for Postgres, which showcases the expected results
https://www.db-fiddle.com/f/ja4BG5CfyEvak5ScoBwCZr/1

@jcsherin
Copy link
Contributor

jcsherin commented Oct 16, 2024

DataFusion CLI v42.0.0
> EXPLAIN SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
EXCEPT ALL
SELECT * FROM VALUES ('b'), ('b'), ('b'), ('c'), ('c');
+---------------+-----------------------------------------------------------------------------------+
| plan_type     | plan                                                                              |
+---------------+-----------------------------------------------------------------------------------+
| logical_plan  | LeftAnti Join: column1 = column1                                                  |
|               |   Values: (Utf8("a")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))...      |
|               |   Values: (Utf8("b")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))         |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                       |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(column1@0, column1@0)] |
|               |     ValuesExec                                                                    |
|               |     ValuesExec                                                                    |
|               |                                                                                   |
+---------------+-----------------------------------------------------------------------------------+

Here the query generates a left anti-join. So it will always exclude rows which match in RHS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants