
Make the kubeflow-m2m-oidc-configurator a CronJob #2667

Open · wants to merge 2 commits into master from make-the-oidc-configurator-a-cronjob
Conversation

kromanow94
Contributor

Which issue is resolved by this Pull Request:
Resolves #2646

Description of your changes:
Changing the Job to a CronJob improves the robustness of the setup in case the JWKS changes or the user accidentally overwrites the RequestAuthentication.
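For illustration, the change essentially wraps the existing configurator in a CronJob along these lines. This is a sketch assembled from the snippets discussed in the review threads below; the container name, mount path, and ConfigMap name are assumptions, not the exact manifest in this PR:

```yaml
# Sketch of the Job-to-CronJob wrapping; names match the review snippets below,
# but the container name, mount path, and ConfigMap name are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kubeflow-m2m-oidc-configurator
  namespace: istio-system
spec:
  schedule: '* * * * *'        # re-run periodically so JWKS changes are re-applied
  concurrencyPolicy: Forbid    # never start a new Job while the previous one still runs
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: kubeflow-m2m-oidc-configurator
          restartPolicy: OnFailure
          containers:
            - name: configurator                         # assumed name
              image: docker.io/curlimages/curl
              command: ['/bin/sh', '/scripts/script.sh']  # assumed mount path
              volumeMounts:
                - name: script
                  mountPath: /scripts
          volumes:
            - name: script
              configMap:
                name: kubeflow-m2m-oidc-configurator      # assumed ConfigMap name
                defaultMode: 0777
                items:
                  - key: script.sh
                    path: script.sh
```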

Checklist:

  • Tested on kind and on vcluster.


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kromanow94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kromanow94
Contributor Author

kromanow94 commented Apr 4, 2024

@juliusvonkohout or @kimwnasptd can we restart the tests? Both of them failed because of an unrelated issue:

timed out waiting for the condition on pods/kubeflow-m2m-oidc-configurator-28537425-s8kzm
timed out waiting for the condition on pods/activator-bd5fdc585-rrnqf
timed out waiting for the condition on pods/autoscaler-5655dd9df5-4knpj
timed out waiting for the condition on pods/controller-5447f77dc5-ljx5r
timed out waiting for the condition on pods/domain-mapping-757799d898-knf69
timed out waiting for the condition on pods/domainmapping-webhook-5d875ccb7d-z2qjv
timed out waiting for the condition on pods/net-istio-controller-5f89595bcb-dv7h2
timed out waiting for the condition on pods/net-istio-webhook-dc448cfc4-rws5f
timed out waiting for the condition on pods/webhook-578c5cf66f-25sf9
timed out waiting for the condition on pods/coredns-5dd5756b68-hpg77
timed out waiting for the condition on pods/coredns-5dd5756b68-vv66m
timed out waiting for the condition on pods/etcd-kind-control-plane
timed out waiting for the condition on pods/kindnet-9l886
timed out waiting for the condition on pods/kindnet-pftsz
timed out waiting for the condition on pods/kindnet-z5qpl
timed out waiting for the condition on pods/kube-apiserver-kind-control-plane
timed out waiting for the condition on pods/kube-controller-manager-kind-control-plane
timed out waiting for the condition on pods/kube-proxy-64vj7
timed out waiting for the condition on pods/kube-proxy-vk4lr
timed out waiting for the condition on pods/kube-proxy-xwm8d
timed out waiting for the condition on pods/kube-scheduler-kind-control-plane
timed out waiting for the condition on pods/local-path-provisioner-7577fdbbfb-7zv5k
timed out waiting for the condition on pods/oauth2-proxy-86d8c97455-hvjl8
timed out waiting for the condition on pods/oauth2-proxy-86d8c97455-z9vjw
Error: Process completed with exit code 1.

@juliusvonkohout
Member

@KRomanov, I restarted the tests. If they fail again we might have to increase the timeouts in this PR.
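For reference, the "timed out waiting for the condition" messages above are the usual error output of kubectl wait, so raising the timeout would look roughly like the following in the workflow scripts. The namespaces and the 600s value are assumptions, not the repository's actual CI steps:

```yaml
# Hypothetical workflow step; namespaces, selector, and timeout value are illustrative only.
- name: Wait for pods to become Ready
  run: |
    kubectl wait --for=condition=Ready pods --all -n istio-system --timeout=600s
    kubectl wait --for=condition=Ready pods --all -n knative-serving --timeout=600s
```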

Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski.kr3@roche.com>
Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski.kr3@roche.com>
@kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from b98a24d to 4abca40 on April 11, 2024 13:41
@kromanow94
Contributor Author

@juliusvonkohout this is super weird. I limited the CronJob with concurrencyPolicy: Forbid. I don't know if this should be handled by increasing the timeout or by increasing the resources for the CICD Jobs... I can also try to split the installation steps to limit how many pods are created at the same time...
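For reference, the concurrency limit mentioned above is the standard CronJob field; the history limits in this sketch are an assumption added only to show how leftover pods could also be capped, they are not part of this PR:

```yaml
spec:
  schedule: '* * * * *'
  concurrencyPolicy: Forbid       # skip the next run while the previous Job is still active
  successfulJobsHistoryLimit: 1   # assumed values, not part of this PR
  failedJobsHistoryLimit: 1
```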

@juliusvonkohout
Member

juliusvonkohout commented Apr 15, 2024

I restarted the tests. Yeah, our CICD is a bit problematic at the moment. If we can specify more resources in this public repository, yes; otherwise we have to increase the timeouts. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

@kromanow94
Contributor Author

@juliusvonkohout maybe the issue is with CICD resource sharing? If the memory and CPU are shared between multiple workflows, it may be problematic. I see one of the failing tests completed successfully. Can you restart the last test workflow?

Also, is this something I could do myself, for example with the GitHub bot with commands in a comment?

restartPolicy: OnFailure
serviceAccountName: kubeflow-m2m-oidc-configurator
containers:
- image: curlimages/curl
Member

Is this from docker.io?

Contributor Author

Correct.

Member

Probably we should then specify it as docker.io/curlimages/curl.
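That would just mean fully qualifying the image reference, for example (the container name is assumed for this snippet):

```yaml
containers:
  - name: configurator                 # container name assumed for this snippet
    image: docker.io/curlimages/curl   # explicit registry instead of relying on the docker.io default
```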

name: kubeflow-m2m-oidc-configurator
namespace: istio-system
spec:
schedule: '* * * * *'
Member

Should we not go with every 5 minutes instead of every minute?

Contributor Author

I can change it to every 5 minutes. There is also configuration for not adding more Jobs until the last one has completed, and the latest logs from the CICD workflows show that no more than one Job is created at a time.
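A five-minute schedule combined with that concurrency limit would look like this (sketch only, not the exact diff):

```yaml
spec:
  schedule: '*/5 * * * *'      # every 5 minutes instead of every minute
  concurrencyPolicy: Forbid    # no new Job until the previous one has completed
```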

defaultMode: 0777
items:
- key: script.sh
path: script.sh
Member

Are you sure that script.sh is idempotent?

Contributor Author

Huh, well, it doesn't verify whether the JWKS is already present and always performs the patch, so this could be an improvement. I think the JWKS value should also be compared and only patched if it differs.
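For illustration, an idempotent variant could compare the current and desired JWKS and only patch when they differ. A rough sketch, shipped the same way as the existing script (a ConfigMap key mounted into the pod); the ConfigMap name, RequestAuthentication name, API version, and the availability of jq in the image are all assumptions rather than what this PR actually contains:

```yaml
# Hypothetical sketch of an idempotent script.sh; names and the use of jq are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeflow-m2m-oidc-configurator
  namespace: istio-system
data:
  script.sh: |
    #!/bin/sh
    set -eu
    SA=/var/run/secrets/kubernetes.io/serviceaccount
    TOKEN="$(cat "$SA/token")"
    API=https://kubernetes.default.svc
    # RequestAuthentication to keep in sync (name assumed for this sketch).
    RA="$API/apis/security.istio.io/v1beta1/namespaces/istio-system/requestauthentications/kubeflow-m2m-oidc"

    # JWKS currently served by the in-cluster OIDC issuer (the kube-apiserver).
    desired="$(curl -sf --cacert "$SA/ca.crt" -H "Authorization: Bearer $TOKEN" "$API/openid/v1/jwks")"

    # JWKS currently configured on the RequestAuthentication (assumes jq is available in the image).
    current="$(curl -sf --cacert "$SA/ca.crt" -H "Authorization: Bearer $TOKEN" "$RA" \
      | jq -r '.spec.jwtRules[0].jwks // empty')"

    if [ "$desired" = "$current" ]; then
      echo "JWKS already up to date; nothing to patch."
      exit 0
    fi

    # Replace only the jwks field, leaving the issuer and other jwtRules settings untouched.
    curl -sf --cacert "$SA/ca.crt" -H "Authorization: Bearer $TOKEN" \
      -X PATCH -H 'Content-Type: application/json-patch+json' "$RA" \
      --data "$(jq -cn --arg jwks "$desired" \
        '[{op: "replace", path: "/spec/jwtRules/0/jwks", value: $jwks}]')"
```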

@juliusvonkohout
Member

> @juliusvonkohout maybe the issue is with CICD resource sharing? If the memory and CPU are shared between multiple workflows, it may be problematic. I see one of the failing tests completed successfully. Can you restart the last test workflow?
>
> Also, is this something I could do myself, for example with the GitHub bot with commands in a comment?

I did restart it and it failed again. In the KFP repository that was possible with /retest or /retest-failed or so. Probably something I can investigate in the coming weeks when I am less busy.

@kromanow94
Contributor Author

@juliusvonkohout maybe we could add verbosity to the logs in the CICD GH Workflows? We currently know that the pods aren't ready, but what is the actual reason? DockerHub pull rate limits? Not enough resources? A failing Pod?
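For illustration, a debug step along these lines could surface why pods never become Ready; the exact commands are an assumption, not an existing step in the workflows:

```yaml
# Hypothetical GitHub Actions step for the CI workflows.
- name: Dump cluster state on failure
  if: failure()
  run: |
    kubectl get pods -A -o wide
    kubectl describe pods -A | tail -n 300
    kubectl get events -A --sort-by=.lastTimestamp | tail -n 100
```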

@juliusvonkohout
Member

> @juliusvonkohout maybe we could add verbosity to the logs in the CICD GH Workflows? We currently know that the pods aren't ready, but what is the actual reason? DockerHub pull rate limits? Not enough resources? A failing Pod?

Yes, let's do that in a separate PR with @codablock as well.

@juliusvonkohout
Member

The tests in #2696 were successful, so I reran the test and hope that the CICD is happy now. If not, please rebase the PR against the master branch.

@juliusvonkohout
Member

Here is the successful test: https://github.com/kubeflow/manifests/actions/runs/8891109875

@juliusvonkohout
Member

So we need a rebase and step-by-step debugging with minimal changes.

@juliusvonkohout
Member

/hold

@juliusvonkohout
Member

/retest


Successfully merging this pull request may close these issues.

  • Make the oidc-issuer configurator a CronJob to ensure correct JWKS for the in-cluster self-signed OIDC Issuer