Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows CI test is extremely long (~3 hours) #8696

Closed
Jefffrey opened this issue Dec 31, 2023 · 11 comments · Fixed by #8730
Closed

Windows CI test is extremely long (~3 hours) #8696

Jefffrey opened this issue Dec 31, 2023 · 11 comments · Fixed by #8730
Labels
bug Something isn't working

Comments

@Jefffrey
Copy link
Contributor

Describe the bug

Some examples:

To Reproduce

No response

Expected behavior

No response

Additional context

Not sure if it was like this in the past too? But a 3hr runtime seems odd

@Jefffrey Jefffrey added the bug Something isn't working label Dec 31, 2023
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Dec 31, 2023

From logs in https://github.com/apache/arrow-datafusion/actions/runs/7358744456/job/20032462383

     Running tests\tpcds_planning.rs (target\debug\deps\tpcds_planning-b7c04ff6bee7c82a.exe)
...
test result: ok. 190 passed; 0 failed; 8 ignored; 0 measured; 0 filtered out; finished in 5068.19s

Seems would be a good start point for investigating this, as accounts for almost half that runtime.

Skimming through the logs, can't see any other part that significantly contributes to the runtime

Edit: might be an issue with the windows runner itself, see actions/runner-images#9085

First started occurring here: https://github.com/apache/arrow-datafusion/actions/runs/7350344666

But don't think it's issue on datafusion side, considering contents of that PR

Edit2: Some other references:

Seem to suggest switching to windows-2019 could improve speed, will test

Edit3: for comparison, the CI used to take no more than 30min

@Jefffrey
Copy link
Contributor Author

CI is still taking extremely long after switching to windows-2019 runner

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

Some debugging thoughts:

Due to image change

Unlikely.

27min runtime: https://github.com/apache/arrow-datafusion/actions/runs/7347591698/job/20004278779#step:1:1

Current runner version: '2.311.0'
Operating System
  Microsoft Windows Server 2022
  10.0.20348
  Datacenter
Runner Image
  Image: windows-2022
  Version: 20231217.2.0
  Included Software: https://github.com/actions/runner-images/blob/win22/20231217.2/images/windows/Windows2022-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20231217.2
Runner Image Provisioner
  2.0.321.1

3hr 20min run: https://github.com/apache/arrow-datafusion/actions/runs/7351057731/job/20013780063#step:1:1

Current runner version: '2.311.0'
Operating System
  Microsoft Windows Server 2022
  10.0.20348
  Datacenter
Runner Image
  Image: windows-2022
  Version: 20231217.2.0
  Included Software: https://github.com/actions/runner-images/blob/win22/20231217.2/images/windows/Windows2022-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20231217.2
Runner Image Provisioner
  2.0.321.1

Runner image versions are the exact same.

Some regression in Datafusion code

Probably unlikely? I don't have a Windows machine to test to be certain that it's not only CI.

Given example run: https://github.com/apache/arrow-datafusion/actions/runs/7390088229/job/20104324232

  • 3hr 16min runtime

The tests that have extreme runtime are:

Rest of the run doesn't have any major anomalies, so these are the offending tests.

Some other runner issue

Not sure how to debug this, might need to raise another issue with runner repo?

Edit: pretty certain its not due to code change on DataFusion side, as this PR run was the first to exhibit >3hr win runtime: https://github.com/apache/arrow-datafusion/actions/runs/7349660365/job/20009991842

And it happens independently of the next commit to main which also exhibits slowness

This was referenced Jan 3, 2024
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

Actually, another thing that changed was Rust version from 1.74.1 to 1.75.0 (which also aligned exactly with the slowdown)

Testing switching back to 1.74.1 to see if this is the root cause: #8728

@comphead
Copy link
Contributor

comphead commented Jan 3, 2024

@Jefffrey thanks for creating this, do you have windows machine to run tests locally?

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

@Jefffrey thanks for creating this, do you have windows machine to run tests locally?

I don't have access to a Windows machine easily unfortunately. Can try get access to one if necessary, will see how latest CI test pans out

@comphead
Copy link
Contributor

comphead commented Jan 3, 2024

I'm also trying to find a solution in #8730

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

#8728

Windows test finished in 28min @comphead

Looks like its due to Rust 1.75.0 upgrade, will check its changelog

@comphead
Copy link
Contributor

comphead commented Jan 3, 2024

good finding, on 1.75.0 even the compilation is way slower

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

After a bunch of debugging, I found this:

Tests run fast on toolchain: nightly-2023-10-29 (rust e5cfc5547)

Tests run extremely slow on toolchain: nightly-2023-10-30 (rust 608e9682f)

I might raise an issue on rust lang repo, after checking if a similar issue hasn't already been raised

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 3, 2024

Raised issue on Rust repo: rust-lang/rust#119560

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants