Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid timezone information is ignored after Z in timestamp literals #8452

Closed
alamb opened this issue Dec 7, 2023 · 8 comments · Fixed by #9429
Closed

Invalid timezone information is ignored after Z in timestamp literals #8452

alamb opened this issue Dec 7, 2023 · 8 comments · Fixed by #9429
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Dec 7, 2023

Describe the bug

Some timestamps that should error are actally parsed, leading to confusing cases when users make a mistake

For example this timestamp is invalid '2023-12-05T21:58:10.45ZZTOP' but it is parsed as though it were '2023-12-05T21:58:10.45Z' (the trailing content is ignored)

To Reproduce

using datafusion-cli,

select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
+-------------------------------------+
| Utf8("2023-12-05T21:58:10.45ZZTOP") |
+-------------------------------------+
| 2023-12-05T21:58:10.450             |
+-------------------------------------+
1 row in set. Query took 0.001 seconds.

Expected behavior

I expect the query to error, similarly to what happens if the Z is not present

❯ select TIMESTAMP '2023-12-05T21:58:10.45 hello';
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Parser error: Invalid timezone "hello": 'hello' is not a valid timezone

This is consistent with postgres, which errors for such cases:

postgres=# select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
ERROR:  invalid input syntax for type timestamp: "2023-12-05T21:58:10.45ZZTOP"
LINE 1: select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
                         ^
postgres=# select timestamp '2023-12-05T21:58:10.45 hello';
ERROR:  invalid input syntax for type timestamp: "2023-12-05T21:58:10.45 hello"
LINE 1: select timestamp '2023-12-05T21:58:10.45 hello';
                         ^
postgres=# select timestamp '2023-12-05T21:58:10.45';
       timestamp
------------------------
 2023-12-05 21:58:10.45
(1 row)

Additional context

Kudos to @reidkaufmann for finding this downstream

This appears to be a bug in arrow: apache/arrow-rs#5182

Once it is fixed in arrow, then we can write a test for it in DataFusion and close the issue

@alamb alamb added the bug Something isn't working label Dec 7, 2023
@alamb alamb added the good first issue Good for newcomers label Dec 7, 2023
@razeghi71
Copy link
Contributor

Can I give this a shot?

@reidkaufmann
Copy link

Please do! I think it is beneficial to reject any timestamp literal that has unparsed trailing characters after the timezone ('Z' in the example: Zulu/UTC). Being strict has the benefit of surfacing problems earlier rather than later.

@razeghi71
Copy link
Contributor

razeghi71 commented Dec 8, 2023

I've made the change on arrow-rs in here: apache/arrow-rs#5189, should be released in a few weeks and then we can update the version and write the tests here.

@alamb
Copy link
Contributor Author

alamb commented Dec 8, 2023

Thank you @razeghi71

@ZhengLin-Li
Copy link
Contributor

Is this issue still opening? I am interested in adding more test cases to test if it works as expected.

@alamb
Copy link
Contributor Author

alamb commented Feb 16, 2024

Thanks @ZhengLin-Li

I double checked and the issue appears to be fixed in the latest DataFusion

Adding a test case would be most appreciated 🙏

DataFusion CLI v35.0.0
❯ select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
Arrow error: Parser error: Invalid timezone "ZZTOP": 'ZZTOP' is not a valid timezone

@ZhengLin-Li
Copy link
Contributor

Hi @alamb Thanks for your follow up.

I tried to add a test case in datafusion/sql/src/parser.rs but it turned out that the parser did not yield an error.

Expected parse error for 'select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP'', but was successful: [Statement(Query(Query { with: None, body: Select(Select { distinct: None, top: None, projection: [UnnamedExpr(TypedString { data_type: Timestamp(None, None), value: "2023-12-05T21:58:10.45ZZTOP" })], into: None, from: [], lateral_views: [], selection: None, group_by: Expressions([]), cluster_by: [], distribute_by: [], sort_by: [], having: None, named_window: [], qualify: None }), order_by: [], limit: None, limit_by: [], offset: None, fetch: None, locks: [], for_clause: None }))]

See: https://github.com/ZhengLin-Li/arrow-datafusion/actions/runs/7936448427/job/21671747144?pr=1

It seems that we only check this in cli but not in parser?

@MohamedAbdeen21
Copy link
Contributor

MohamedAbdeen21 commented Mar 2, 2024

Hi @ZhengLin-Li ,

I think you're confusing arrow's parser with DF's parser.

The job of a parser in a programming language is to validate the syntax and build an AST. TIMESTAMP '2023-12-05T21:58:10.45ZZTOP' is syntactically correct, as TIMESTAMP 'x' and TIMESTAMP '' (empty string).

This is a semantics issue which, in a typical programming language, is checked by the semantic analysis step. In the case of SQL, and as stated in the issue's expected behavior, this should be handled by the optimizer; or worst case scenario, by the query executor (think of interpreted languages that catch semantic errors in runtime).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants