Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid optimization of CASE-WHEN expressions #33751

Closed
ranma42 opened this issue May 18, 2024 · 12 comments · Fixed by #33754
Closed

Invalid optimization of CASE-WHEN expressions #33751

ranma42 opened this issue May 18, 2024 · 12 comments · Fixed by #33754

Comments

@ranma42
Copy link
Contributor

ranma42 commented May 18, 2024

The SqlExpressionSimplifyingExpressionVisitor performs an invalid optimization of CASE-WHEN expressions:

        // generic CASE statement comparison optimization:
        // (CASE
        //  WHEN condition1 THEN result1
        //  WHEN condition2 THEN result2
        //  WHEN ...
        //  WHEN conditionN THEN resultN) == result1 -> condition1
        if (sqlBinaryExpression.OperatorType == ExpressionType.Equal
            && sqlConstantComponent?.Value is not null
            && caseComponent is { Operand: null, ElseResult: null })
        {
            var matchingCaseBlock = caseComponent.WhenClauses.FirstOrDefault(wc => sqlConstantComponent.Equals(wc.Result));
            if (matchingCaseBlock != null)
            {
                return Visit(matchingCaseBlock.Test);
            }
        }

The simplification has these additional requirements that are currently not being checked:

  1. the sqlConstantComponent?.Value only appears once in the list of case results
  2. the matchingCaseBlock is disjoint from all of the previous cases
  3. the (omitted) ELSE is never reached, aka at least one of the conditions holds

An example program that showcases the bug (and can be conveniently run on https://dotnetfiddle.net/ ;) ) is:

using System;
using System.Data;
using System.Linq;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Query.SqlExpressions;

using var db = new BloggingContext();

db.Database.EnsureDeleted();
db.Database.EnsureCreated();

db.Blogs
    .Where(x => db.Cases(
        x.Url == "http://efcore/", 3,
        x.Url.StartsWith("http://"), 2,
        x.Url.EndsWith("/"), 3
    ) >= 3)
    .ToList();

db.Blogs
    .Where(x => db.Cases(
        x.Url == "http://efcore/", 3,
        x.Url.StartsWith("http://"), 2,
        x.Url.EndsWith("/"), 3
    ) == 3)
    .ToList();

db.Blogs
    .Where(x => db.Cases(
        x.Url == "http://efcore/", 3,
        x.Url.StartsWith("http://"), 2,
        x.Url.EndsWith("/"), 3
    ) == 2)
    .ToList();

public class BloggingContext : DbContext
{
    public DbSet<Blog> Blogs { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options
            .LogTo(Console.WriteLine, Microsoft.Extensions.Logging.LogLevel.Information)
            .UseSqlite($"Data Source=test.db");

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder
            .HasDbFunction(typeof(BloggingContext).GetMethod(nameof(Cases))!)
            .HasTranslation(args =>
                new CaseExpression([
                    new CaseWhenClause(args[0], args[1]),
                    new CaseWhenClause(args[2], args[3]),
                    new CaseWhenClause(args[4], args[5]),
                ])
            );
    }

    public int Cases(bool c1, int v1, bool c2, int v2, bool c3, int v3)
        => throw new NotSupportedException();
}

public class Blog
{
    public int BlogId { get; set; }
    public required string Url { get; set; }
}

The first query is correctly translated as:

SELECT "b"."BlogId", "b"."Url"
FROM "Blogs" AS "b"
WHERE CASE
      WHEN "b"."Url" = 'http://efcore/' THEN 3
      WHEN "b"."Url" LIKE 'http://%' THEN 2
      WHEN "b"."Url" LIKE '%/' THEN 3
END >= 3

The second query is simplified to

SELECT "b"."BlogId", "b"."Url"
FROM "Blogs" AS "b"
WHERE "b"."Url" = 'http://efcore/'

which misses all of the Urls that would match the 3rd-case (example: https://efcore/).

The third query is simplified to

SELECT "b"."BlogId", "b"."Url"
FROM "Blogs" AS "b"
WHERE "b"."Url" LIKE 'http://%'

which incorrectly also matches http://efcore/.

Include provider and version information

EF Core version: 8.0.5
Database provider: Microsoft.EntityFrameworkCore.Sqlite
Target framework: .NET 8.0
Operating system: Linux (/WSL)
IDE: Visual Studio Code 1.89.1

EDIT: added condition about ELSE/NULL

ranma42 added a commit to ranma42/efcore that referenced this issue May 18, 2024
@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

There is a further constraint around NULL values which cannot be noticed while filtering, but would affect projection

Assuming that Url is not nullable, the following query

SELECT "b"."BlogId", "b"."Url", CASE
      WHEN "b"."Url" = 'http://efcore/' THEN 1
      WHEN "b"."Url" LIKE 'http://%' THEN 2
      WHEN "b"."Url" LIKE '%/' THEN 3
END = 1 AS "Category"
FROM "Blogs" AS "b"

classifies all of the rows in one of the 3 categories:

  • true: "b"."Url" = 'http://efcore/',
  • false: NOT("b"."Url" = 'http://efcore/') AND ("b"."Url" LIKE 'http://%' OR "b"."Url" LIKE '%/')
  • NULL anything else

while the (incorrectly) optimized version

SELECT "b"."BlogId", "b"."Url", "b"."Url" = 'http://efcore/' AS "Category"
FROM "Blogs" AS "b"

classifies all of the rows as:

  • true: "b"."Url" = 'http://efcore/',
  • false anything else

@roji
Copy link
Member

roji commented May 18, 2024

@ranma42 all makes sense, thanks for the minimal repro and the bug analysis! /cc @maumar

the sqlConstantComponent?.Value only appears once in the list of case results
the matchingCaseBlock is disjoint from all of the previous cases

Shouldn't the optimization simply collect all WHEN blocks with the specific value, and string them together with ORs? In other words:

CASE
    WHEN x THEN 1
    WHEN y THEN 2
    WHEN z THEN 1
END = 1

... would get simplified to x OR z? I'm not sure why the requirement for matchingCaseBlock to be disjoint...

the (omitted) ELSE is never reached, aka at least one of the conditions holds

Can you provide an example for this? At least in the above example, if none of the conditions hold, the CASE/WHEN would return NULL, and the simplification should return the same rows - or am I missing something?

(though there may indeed be issues around NULL vs. false, we'd need to think about this a bit moer.

BTW an unrelated note: I'm not sure why the optimization is limited to constants only - it seems we could widen it to arbitrary SqlExpressions (though keep in mind the cost of the deep comparisons).

@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

Shouldn't the optimization simply collect all WHEN blocks with the specific value, and string them together with ORs? In other words:

CASE
    WHEN x THEN 1
    WHEN y THEN 2
    WHEN z THEN 1
END = 1

... would get simplified to x OR z? I'm not sure why the requirement for matchingCaseBlock to be disjoint...
It is not a requirement in general, but it is a requirement if we plan to only use a single condition as the current code.
Also, notice that the transformation you propose would be incorrect, as it does not take into account the fact that the third WHEN clause is only reached if the second one does not match:

CASE
      WHEN "b"."Url" = 'http://efcore/' THEN 1
      WHEN "b"."Url" LIKE 'http://%' THEN 2
      WHEN "b"."Url" LIKE '%/' THEN 1
END = 1

should not simplify to

"b"."Url" = 'http://efcore/' OR "b"."Url" LIKE '%/'

The CASE would evaluate to 2 for the string 'http://bug/', so the first expression (CASE ... END = 1) would be false.
Instead, the "simplified" expression would obviously be true for 'http://bug/'
(in fact, the second expression could be further simplified to "b"."Url" LIKE '%/')

If we want to transform it, it should get simplified to x OR (NOT(y) AND z), i.e.

"b"."Url" = 'http://efcore/' OR (NOT("b"."Url" LIKE 'http://%') AND "b"."Url" LIKE '%/')

@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

Can you provide an example for this? At least in the above example, if none of the conditions hold, the CASE/WHEN would return NULL, and the simplification should return the same rows - or am I missing something?

I think you missed the part about "which cannot be noticed while filtering, but would affect projection" ;)
The issue is not triggered when filtering (as NULL and FALSE behave the same way in a WHERE), but rather in a SELECT projection.

For a concrete example, if we simplify

CASE
    WHEN x > 0 THEN 1
    WHEN y > 0 THEN 2
END = 1

to

x > 2

the values for the x: 0, y: 0 row do not match:

x y CASE CASE = 1 x > 0
0 0 NULL NULL FALSE
1 0 1 TRUE TRUE
0 1 2 FALSE FALSE

You can try it directly as:

WITH T(x, y) AS (VALUES (0, 0),(1, 0),(0, 1))
SELECT x, y, CASE
    WHEN x > 0 THEN 1
    WHEN y > 0 THEN 2
END = 1 AS c,
x > 0 AS simple FROM T;

@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

BTW an unrelated note: I'm not sure why the optimization is limited to constants only - it seems we could widen it to arbitrary SqlExpressions (though keep in mind the cost of the deep comparisons).

I agree that it would make sense to handle a more general case, but some care should be taken to avoid introducing subtle issues (see my two previous comments).

From my point of view, instead, it makes sense to limit this optimization to constants (as it makes it easy to control its complexity), but it would be natural to extend it to other binary operators. For example, we could distribute the binop and perform constant folding. This could open some more possibilities:

  • merging consecutive WHEN cases that share the same result
  • in filters, dropping trailing FALSE/NULL cases (this is the happy path: if we end up with a single case, we can return a filter that is simply the WHEN condition)

example of generic transform

CASE
    WHEN a THEN 2
    WHEN b THEN 1
    WHEN c THEN 1
    WHEN d THEN 3
END = 1

--- distribute & constand fold -->

CASE
    WHEN a THEN FALSE
    WHEN b THEN TRUE
    WHEN c THEN TRUE
    WHEN d THEN FALSE
END

--- merge -->

CASE
    WHEN a THEN FALSE
    WHEN b OR c THEN TRUE
    WHEN d THEN FALSE
END

--- if we are in a filter -->

CASE
    WHEN a THEN FALSE
    WHEN b OR c THEN TRUE
END

example of the "happy path":

CASE
    WHEN a THEN 2
    WHEN b THEN 1
    WHEN c THEN 1
    WHEN d THEN 3
END < 3

--- distribute & constand fold -->

CASE
    WHEN a THEN TRUE
    WHEN b THEN TRUE
    WHEN c THEN TRUE
    WHEN d THEN FALSE
END

--- merge -->

CASE
    WHEN a OR b OR c THEN TRUE
    WHEN d THEN FALSE
END

--- if we are in a filter -->

a OR b OR c

@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

fun fact: while trying to write the test cases in https://github.com/ranma42/efcore/tree/fix-case-when-9 I hit another (apparently unrelated) issue around nullability 😅

EDIT: posted the issue as #33752

@roji
Copy link
Member

roji commented May 18, 2024

If we want to transform it, it should get simplified to x OR (NOT(y) AND z), i.e.

You're absolutely right, we indeed can't ignore WHENs with other result values since they may match and therefore cause a later WHEN to not apply. I don't think it's feasible for us to recognize when WHENs are truly disjoint, and I'm not sure that rewriting a CASE/WHEN to e.g. the above x OR (NOT(y) AND z) has any actual value...

We may want to consider removing this optimization, or maybe reduce its scope considerably (in case it answers some specific need we have..). Leaving to @maumar to investigate and think about it...

@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

If we want to transform it, it should get simplified to x OR (NOT(y) AND z), i.e.

You're absolutely right, we indeed can't ignore WHENs with other result values since they may match and therefore cause a later WHEN to not apply. I don't think it's feasible for us to recognize when WHENs are truly disjoint, and I'm not sure that rewriting a CASE/WHEN to e.g. the above x OR (NOT(y) AND z) has any actual value...

In general checking whether the cases are disjoint is hard, but some special cases might be easier, for example:

  • the CompareTo pattern (already handled)
  • CASE expressions with an operand CASE x WHEN x1 THEN v1 WHEN x2 THEN v2 ... END could be optimized if desired (repeated test values can be dropped, after that the test values are trivially disjoint); currently it is not optimized at all
  • CASE expressions with only = tests CASE WHEN x = x1 THEN v1 WHEN x = x2 THEN v2 ... END could be handled as a CASE with an operand

As mentioned in #33751 (comment) there are also some optimization opportunities on the value expressions (opposed to the test expressions).
(disjointedness makes cases commutative, but some simpler optimizations would be possible regardless of that)

We may want to consider removing this optimization, or maybe reduce its scope considerably (in case it answers some specific need we have..). Leaving to @maumar to investigate and think about it...

Except for the (already supported) CompareTo pattern, it is unclear to me whether there is a lot of value in trying really hard to optimize CASE expressions.
Either way, I would like to try and help improving the SqlExpressionSimplifyingExpressionVisitor (fixing the issues and/or implementing further optimizations) if there is a chance 😇

@maumar
Copy link
Contributor

maumar commented May 18, 2024

I also agree that the most valuable optimization is the compare one. I would not like to strip case just for the sake of doing it (i.e. x OR (NOT(y) AND z)). Current way we do the general-purpose simplification is wrong, causing data corruption, so we definitely should remove it. However, I do like the specific optimizations proposed by @ranma42 (constant folding, merge and prune).
They don't affect super common scenarios, but we do optimizations that don't seem very useful at first glance, because they compound on one another and sometimes a silly optimization can open up further simplifications to a real query. As long as the result is a simpler/faster query, 100% correct and the code is relatively straightforward - I think it's worth adding it.

@ranma42 a PR with some SqlExpressionSimplifyingExpressionVisitor would be most welcome. You clearly thought about this a bit, and optimizations you propose all make sense to me.

@roji
Copy link
Member

roji commented May 18, 2024

One thought... We generally don't try to go too far with optimizations which help with "badly-written" LINQ queries; there's an infinite number of ways in which a query can be expressed via inefficient LINQ, and we generally avoid adding a large amount of complexity (and also compilation time!) in trying to compensate for that. However, the same optimization which may help with badly-written LINQ queries can also help with SQL generated internally with the query pipeline; of course, when that's the case the optimization can be quite important.

So I think we should be generally pragmatic here; if we know of internally-produced SQL which is improved by an optimization, that's definitely a good reason to do it. If, however, we don't think there's such a case, and the optimization would only help with badly-written user queries, then unless the optimization is trivial/easy, IMHO we shouldn't necessarily do it.

@ranma42 I'd suggest opening an issue for the planned optimization before spending too much time in implementation, so we can discuss.

@ranma42
Copy link
Contributor Author

ranma42 commented May 18, 2024

If it makes sense to you I would

  1. open a PR based that fixes this issue by removing the incorrect optimization and extending the CompareTo one a little bit, to avoid regressions (I would start from https://github.com/ranma42/efcore/tree/fix-case-when-9 plus any improvements you suggest)
  2. open a separate issue to evaluate the tradeoffs of adding some CASE optimizations (code complexity vs query improvement opportunities); if we find some cases that are important and/or allow for easy improvements, I'll go forward and implement them; otherwise, we will at least some "documentation" (the explanations in the issue) of why those optimizations were not implemented

ranma42 added a commit to ranma42/efcore that referenced this issue May 19, 2024
ranma42 added a commit to ranma42/efcore that referenced this issue May 21, 2024
@maumar maumar added this to the 9.0.0 milestone May 21, 2024
maumar pushed a commit that referenced this issue May 21, 2024
* Add tests for `CASE WHEN END = const` optimization

* Support optimization of `CompareTo(a, b) == {-1,0,1}`

* Remove invalid optimization of `CASE WHEN ... END = const`

Fixes #33751
@maumar
Copy link
Contributor

maumar commented May 21, 2024

fixed by @ranma42

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants