[Flaking Test][sig-network] ci-kubernetes-verify-master verify-openapi-spec #121865

Closed
rjsadow opened this issue Nov 14, 2023 · 26 comments · Fixed by #121878 or #123866
Labels
kind/flake · priority/important-soon · sig/network · triage/accepted

Comments

@rjsadow
Contributor

rjsadow commented Nov 14, 2023

Which jobs are flaking?

Verify Master - Master Blocking https://testgrid.k8s.io/sig-release-master-blocking#verify-master

Which tests are flaking?

verify.openapi-spec

Since when has it been flaking?

Inconsistently since 2 November. First failure (shown in testgrid) https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-verify-master/1719970042151440384

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#verify-master

Reason for failure (if possible)

API Server is not becoming ready during hack/make-rules/../../hack/verify-openapi-spec.sh

Anything else we need to know?

Relevant logs

+++ command: bash "hack/make-rules/../../hack/verify-openapi-spec.sh"
go version go1.21.4 linux/amd64
Downloading https://github.com/coreos/etcd/releases/download/v3.5.10/etcd-v3.5.10-linux-amd64.tar.gz succeed
etcd v3.5.10 installed. To use:
export PATH="/home/prow/go/src/k8s.io/kubernetes/third_party/etcd:${PATH}"
+++ [1113 21:08:56] Building go targets for linux/amd64
    k8s.io/kubernetes/cmd/kube-apiserver (static)
+++ [1113 21:11:10] Building go targets for linux/amd64
    k8s.io/kubernetes/cmd/kube-apiserver (static)
etcd --advertise-client-urls http://127.0.0.1:2379/ --data-dir /tmp/tmp.jwE2A0E5lK --listen-client-urls http://127.0.0.1:2379/ --log-level=warn 2> "/logs/artifacts/etcd.dd149c5e-01f1-4127-9f65-28eade1b4496.root.log.DEBUG.20231113-211118.287462" >/dev/null
Waiting for etcd to come up.
+++ [1113 21:11:18] On try 2, etcd: : {"health":"true","reason":""}
{"header":{"cluster_id":"14841639068965178418","member_id":"10276657743932975437","revision":"2","raft_term":"2"}}+++ [1113 21:11:19] Starting kube-apiserver
!!! [1113 21:11:50] Timed out waiting for apiserver:  to answer at https://127.0.0.1:8050/healthz; tried 30 waiting 1 between each
!!! [1113 21:11:50] Here are the last 10 lines from kube-apiserver (/tmp/update-openapi-spec.sh.IoKFHR/openapi-api-server.log)
!!! [1113 21:11:50] === BEGIN OF LOG ===
I1113 21:11:50.375973  288022 healthz.go:261] poststarthook/start-kubernetes-service-cidr-controller check failed: readyz
[-]poststarthook/start-kubernetes-service-cidr-controller failed: not finished
I1113 21:11:50.390824  288022 healthz.go:261] poststarthook/start-kubernetes-service-cidr-controller check failed: readyz
[-]poststarthook/start-kubernetes-service-cidr-controller failed: not finished
I1113 21:11:50.475803  288022 healthz.go:261] poststarthook/start-kubernetes-service-cidr-controller check failed: readyz
[-]poststarthook/start-kubernetes-service-cidr-controller failed: not finished
I1113 21:11:50.491708  288022 healthz.go:261] poststarthook/start-kubernetes-service-cidr-controller check failed: readyz
[-]poststarthook/start-kubernetes-service-cidr-controller failed: not finished
I1113 21:11:50.575548  288022 healthz.go:261] poststarthook/start-kubernetes-service-cidr-controller check failed: readyz
[-]poststarthook/start-kubernetes-service-cidr-controller failed: not finished
!!! [1113 21:11:50] === END OF LOG ===
+++ [1113 21:11:53] Clean up complete
+++ exit code: 1
+++ error: 1
FAILED   verify-openapi-spec.sh	179s

Relevant SIG(s)

/sig-testing

@rjsadow added the kind/flake label Nov 14, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels Nov 14, 2023
@rjsadow changed the title from "[Flaking Test][sig-testing] ci-kubernetes-verify-master" to "[Flaking Test][sig-testing] ci-kubernetes-verify-master verify-openapi-spec" Nov 14, 2023
@liggitt
Member

liggitt commented Nov 14, 2023

cc @aojea for cidr controller

@pacoxu
Member

pacoxu commented Nov 14, 2023

Inconsistently since 2 November.

#116516 and #121682 may be related.

@pacoxu
Member

pacoxu commented Nov 14, 2023

/sig auth network

@k8s-ci-robot added the sig/auth and sig/network labels and removed the needs-sig label Nov 14, 2023
@liggitt
Member

liggitt commented Nov 14, 2023

/remove-sig auth

#116516 is almost certainly related

@k8s-ci-robot removed the sig/auth label Nov 14, 2023
@aojea
Member

aojea commented Nov 14, 2023

/assign

it seems the ServiceCIDR controller takes time to start, which blocks apiserver readiness
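For context, the check failing in the log above is poststarthook/start-kubernetes-service-cidr-controller. A minimal sketch of how to watch which hook is still blocking readiness while the test apiserver starts up, assuming the address from the failure log (127.0.0.1:8050) and that the health endpoints are reachable anonymously, as the script's own /healthz polling suggests:

# Sketch only: poll the verbose readiness endpoint until the service-cidr
# post-start hook reports ok. -k skips verification of the self-signed test
# certificates; the address and port are taken from the failure log above.
while true; do
  state=$(curl -ks "https://127.0.0.1:8050/readyz?verbose" \
    | grep "poststarthook/start-kubernetes-service-cidr-controller")
  echo "${state}"
  [[ "${state}" == "[+]"* ]] && break
  sleep 1
done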

@Vyom-Yadav
Member

/reopen

Test run failed after the related PR merged: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-verify-master/1725068113919086592

@k8s-ci-robot
Contributor

@Vyom-Yadav: Reopened this issue.

In response to this:

/reopen

Test run failed after the related PR merged: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-verify-master/1725068113919086592

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Nov 16, 2023
@aojea
Member

aojea commented Nov 16, 2023

yeah
/assign

@pacoxu
Member

pacoxu commented Nov 20, 2023

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 20, 2023
@aojea
Member

aojea commented Nov 28, 2023

this is sig-network; @rjsadow can you please update the title?

@Vyom-Yadav
Member

/retitle [Flaking Test][sig-network] ci-kubernetes-verify-master verify-openapi-spec

@k8s-ci-robot changed the title from "[Flaking Test][sig-testing] ci-kubernetes-verify-master verify-openapi-spec" to "[Flaking Test][sig-network] ci-kubernetes-verify-master verify-openapi-spec" Nov 28, 2023
@pacoxu
Member

pacoxu commented Dec 25, 2023

@aojea

https://github.com/kubernetes/kubernetes/blob/b9e0714722a2d6456822a6b4b8131ab8b174ffac/hack/local-up-cluster.sh#L82C13-L82C13

${CONTROLPLANE_SUDO} "${GO_OUT}/kube-apiserver" "${authorizer_args[@]}" "${priv_arg}" ${runtime_config} \
${cloud_config_arg} \
"${advertise_address}" \
"${node_port_range}" \
--v="${LOG_LEVEL}" \
--vmodule="${LOG_SPEC}" \
--audit-policy-file="${AUDIT_POLICY_FILE}" \
--audit-log-path="${LOG_DIR}/kube-apiserver-audit.log" \
--cert-dir="${CERT_DIR}" \
--egress-selector-config-file="${EGRESS_SELECTOR_CONFIG_FILE:-}" \
--client-ca-file="${CERT_DIR}/client-ca.crt" \
--kubelet-client-certificate="${CERT_DIR}/client-kube-apiserver.crt" \
--kubelet-client-key="${CERT_DIR}/client-kube-apiserver.key" \
--service-account-key-file="${SERVICE_ACCOUNT_KEY}" \
--service-account-lookup="${SERVICE_ACCOUNT_LOOKUP}" \
--service-account-issuer="https://kubernetes.default.svc" \
--service-account-jwks-uri="https://kubernetes.default.svc/openid/v1/jwks" \
--service-account-signing-key-file="${SERVICE_ACCOUNT_KEY}" \
--enable-admission-plugins="${ENABLE_ADMISSION_PLUGINS}" \
--disable-admission-plugins="${DISABLE_ADMISSION_PLUGINS}" \
--admission-control-config-file="${ADMISSION_CONTROL_CONFIG_FILE}" \
--bind-address="${API_BIND_ADDR}" \
--secure-port="${API_SECURE_PORT}" \
--tls-cert-file="${CERT_DIR}/serving-kube-apiserver.crt" \
--tls-private-key-file="${CERT_DIR}/serving-kube-apiserver.key" \
--storage-backend="${STORAGE_BACKEND}" \
--storage-media-type="${STORAGE_MEDIA_TYPE}" \
--etcd-servers="http://${ETCD_HOST}:${ETCD_PORT}" \
--service-cluster-ip-range="${SERVICE_CLUSTER_IP_RANGE}" \
--feature-gates="${FEATURE_GATES}" \
--external-hostname="${EXTERNAL_HOSTNAME}" \
--requestheader-username-headers=X-Remote-User \
--requestheader-group-headers=X-Remote-Group \
--requestheader-extra-headers-prefix=X-Remote-Extra- \
--requestheader-client-ca-file="${CERT_DIR}/request-header-ca.crt" \
--requestheader-allowed-names=system:auth-proxy \
--proxy-client-cert-file="${CERT_DIR}/client-auth-proxy.crt" \
--proxy-client-key-file="${CERT_DIR}/client-auth-proxy.key" \
--cors-allowed-origins="${API_CORS_ALLOWED_ORIGINS}" >"${APISERVER_LOG}" 2>&1 &
APISERVER_PID=$!
# Create kubeconfigs for all components, using client certs
kube::util::write_client_kubeconfig "${CONTROLPLANE_SUDO}" "${CERT_DIR}" "${ROOT_CA_FILE}" "${API_HOST}" "${API_SECURE_PORT}" admin
${CONTROLPLANE_SUDO} chown "${USER}" "${CERT_DIR}/client-admin.key" # make readable for kubectl

hack/local-up-cluster.sh uses a 60s timeout, while here:

if ! kube::util::wait_for_url "https://${API_HOST}:${API_PORT}/healthz" "apiserver: "; then
  kube::log::error "Here are the last 10 lines from kube-apiserver (${API_LOGFILE})"
  kube::log::error "=== BEGIN OF LOG ==="
  tail -10 "${API_LOGFILE}" >&2 || :
  kube::log::error "=== END OF LOG ==="
  exit 1
fi

hack/update-openapi-spec.sh uses the default 30s timeout, and it flakes.

Previously it took more than 10 seconds to become ready, because the first attempt failed and it had to wait for the defined interval (10 seconds); now it becomes ready in less than 1 second.

  • The last 10 lines of the log are not enough to show the other details.

My local test with a 15s timeout failed several times; 30s is enough in my local tests.

Alternatively, we could just increase the timeout for CI:

diff --git a/hack/update-openapi-spec.sh b/hack/update-openapi-spec.sh
index 5d54f1acd18..f3905b6c2cf 100755
--- a/hack/update-openapi-spec.sh
+++ b/hack/update-openapi-spec.sh
@@ -24,6 +24,7 @@ set -o pipefail
 KUBE_ROOT=$(dirname "${BASH_SOURCE[0]}")/..
 DISCOVERY_ROOT_DIR="${KUBE_ROOT}/api/discovery"
 OPENAPI_ROOT_DIR="${KUBE_ROOT}/api/openapi-spec"
+WAIT_FOR_URL_API_SERVER=${WAIT_FOR_URL_API_SERVER:-60}
 source "${KUBE_ROOT}/hack/lib/init.sh"
 
 kube::util::require-jq
@@ -88,7 +89,7 @@ kube::log::status "Starting kube-apiserver"
   --service-cluster-ip-range="10.0.0.0/24" >"${API_LOGFILE}" 2>&1 &
 APISERVER_PID=$!
 
-if ! kube::util::wait_for_url "https://${API_HOST}:${API_PORT}/healthz" "apiserver: "; then
+if ! kube::util::wait_for_url "https://${API_HOST}:${API_PORT}/healthz" "apiserver: " 1 "${WAIT_FOR_URL_API_SERVER}"; then
   kube::log::error "Here are the last 10 lines from kube-apiserver (${API_LOGFILE})"

@aojea
Member

aojea commented Dec 25, 2023

having both scripts use the same timeout sounds reasonable

I'm still worried about why it keeps timing out, but for that we should first define a test that measures apiserver startup, and then decide what value is correct and enforce it.
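A rough local sketch of such a measurement, assuming nothing beyond the script name and the "On try N, ..." output format already visible in the failure log (the loop count and log paths are made up):

# Rough sketch: run the script several times and record wall-clock time per run.
# The "On try N, apiserver: ..." line in each log (the same helper that printed
# "On try 2, etcd: ..." above) shows how many 1s polls the apiserver needed.
for i in 1 2 3 4 5; do
  SECONDS=0
  hack/update-openapi-spec.sh >"/tmp/update-openapi-${i}.log" 2>&1
  echo "run ${i}: rc=$? after ${SECONDS}s"
  grep "On try" "/tmp/update-openapi-${i}.log"
done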

@pacoxu
Member

pacoxu commented Jan 5, 2024

/priority important-soon
/milestone v1.30
A backport to v1.29 may be needed.

@k8s-ci-robot added this to the v1.30 milestone Jan 5, 2024
@k8s-ci-robot added the priority/important-soon label Jan 5, 2024
@moficodes

Hello!
Release Signal shadow here.
This issue has not been updated for a long time, so I'd like to check on its status. The code freeze is starting 02:00 UTC Wednesday 6th March 2024 / 18:00 PDT Tuesday 5th March 2024 (less than a month from now), and while there is still plenty of time, we want to ensure that each PR has a chance to be merged on time.

As this issue is tagged for 1.30, is it still planned for this release?

@pacoxu
Member

pacoxu commented Feb 29, 2024

This flakes in https://testgrid.k8s.io/sig-release-master-blocking#verify-master.

Do we have any action items for v1.30?

@aojea
Member

aojea commented Feb 29, 2024

v1.30

I need to fix it; shame on me, sorry

@SubhasmitaSw
Contributor

Hello!
Release Signal shadow here.
This issue has not been updated for a long time, so we'd like to check the status. The code freeze is starting 02:00 UTC Wednesday 6th March 2024 / 18:00 PDT Tuesday 5th March 2024 (less than a week from now), and we want to ensure that each PR has a chance to be merged on time.

As this issue is tagged for 1.30, is it still planned for this release?

Please let us know the status if possible.

@aojea
Member

aojea commented Mar 1, 2024

yeah, I'd like to get it fixed; there is some race condition that makes some of the bootstrap deadlock, and I still haven't found why

@aojea
Member

aojea commented Mar 6, 2024

I didn't forget, but I haven't had time to figure out the root cause; I will try to get back to this at the end of the week

@aojea
Member

aojea commented Mar 10, 2024

I got one failure locally from an apiserver that took a long time to start.
The apiserver needs the ServiceCIDR controller to start in order to be healthy, and the controller needs to sync its informer, but checking one of the informer's requests I can see this error:

servicecidrs,user-agent:controlplane.test/v0.0.0 (linux/amd64) kubernetes/$Format,verb:GET (08-Mar-2024 23:24:37.278) (total time: 3002ms):
Trace[1391457241]: ---"About to List from storage" 0ms (23:24:37.278)
Trace[1391457241]: ["List(recursive=false) etcd3" audit-id:54359dbc-99a2-46a3-82c2-51f293cc3c87,key:/servicecidrs,resourceVersion:,resourceVersionMatch:,limit:1,continue: 3002ms (23:24:37.278)]
Trace[1391457241]: ["cacher list" audit-id:54359dbc-99a2-46a3-82c2-51f293cc3c87,type:servicecidrs.networking.k8s.io 3001ms (23:24:37.280)
Trace[1391457241]:  ---"Ready" 0ms (23:24:37.280)
Trace[1391457241]:  ---"watchCache locked acquired" 0ms (23:24:37.280)]
Trace[1391457241]: [3.002703852s] [3.002703852s] END

I0308 23:24:40.281770 1278835 httplog.go:132] "HTTP" verb="GET" URI="/apis/networking.k8s.io/v1alpha1/servicecidrs?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0" latency="3.009430023s" userAgent="controlplane.test/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="54359dbc-99a2-46a3-82c2-51f293cc3c87" srcIP="127.0.0.1:42682" apf_pl="exempt" apf_fs="exempt" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="3.002874602s" resp=504 statusStack=<

goroutine 2919 [running]:
...
/usr/local/google/home/aojea/src/kubernetes/staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go:101 +0x198
addedInfo=<
logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12\x96\x01\n\x06\n\x00\x12\x00\x1a\x00\x12\aFailure\x1a2Timeout: Too large resource version: 2, current: 1"\aTimeout*C\n\x00\x12\x00\x1a\x00"7\n\x17ResourceVersionTooLarge\x12\x1aToo large resource version\x1a\x00(\x012\x000\xf8\x03\x1a\x00"\x00"

and this keeps going for a while:

addedInfo=<
logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12\x98\x01\n\x06\n\x00\x12\x00\x1a\x00\x12\aFailure\x1a4Timeout: Too large resource version: 203, current: 1"\aTimeout*C\n\x00\x12\x00\x1a\x00"7\n\x17ResourceVersionTooLarge\x12\x1aToo large resource version\x1a\x00(\x012\x000\xf8\x03\x1a\x00"\x00"

@wojtek-t @liggitt do you know why the resource version is growing?
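For anyone reproducing this locally, a small sketch for pulling the requested vs. current revisions out of the apiserver log to see how they evolve over the run; the log path pattern comes from the failure output earlier in this issue, everything else is an assumption:

# Sketch: extract the ResourceVersionTooLarge errors and count each
# "requested, current" pair seen in the openapi test apiserver log.
grep -o "Too large resource version: [0-9]*, current: [0-9]*" \
  /tmp/update-openapi-spec.sh.*/openapi-api-server.log | sort | uniq -c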

@liggitt
Member

liggitt commented Mar 11, 2024

looks like the server returns that error in a few places, not sure which one is getting exercised:

if wc.initialRev > 0 && forceInitialEvents {
    currentStorageRV, err := wc.watcher.getCurrentStorageRV(wc.ctx)
    if err != nil {
        wc.sendError(err)
        return
    }
    if uint64(wc.initialRev) > currentStorageRV {
        wc.sendError(storage.NewTooLargeResourceVersionError(uint64(wc.initialRev), currentStorageRV, int(wait.Jitter(1*time.Second, 3).Seconds())))

for w.resourceVersion < resourceVersion {
    if w.clock.Since(startTime) >= blockTimeout {
        // Request that the client retry after 'resourceVersionTooHighRetrySeconds' seconds.
        return storage.NewTooLargeResourceVersionError(resourceVersion, w.resourceVersion, resourceVersionTooHighRetrySeconds)

// Enforce the storage.Interface guarantee that the resource version of the returned data
// "will be at least 'resourceVersion'".
if minimumRV > actualRevision {
    return storage.NewTooLargeResourceVersionError(minimumRV, actualRevision, 0)

update-openapi-spec starts with AllAlpha=true in order to enable all features / APIs ... @wojtek-t @p0lyn0mial @serathius are there any alpha features around watch / resourceVersion that could cause this?

@wojtek-t
Member

update-openapi-spec starts with AllAlpha=true in order to enable all features / APIs ... @wojtek-t @p0lyn0mial @serathius are there any alpha features around watch / resourceVersion that could cause this?

I didn't read the issue carefully, so just quickly sharing some thoughts:

  1. Streaming lists might potentially be related, but they need to be explicitly requested by the client.
    Reflector is the only place that can currently do that, and it requires an explicitly set env var:
    https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/reflector.go#L245-L249

Is it possible that this value is set? (A quick check is sketched after this list.)
If so, then streaming lists will result in calling etcd for the current RV and waiting for the watch cache to be synced with it; if progress-notify-interval is large and manual progress notify is disabled, we could face things like that.

@p0lyn0mial ^^

  2. That said, if alpha features are enabled, then manual progress notify should also be set:
    https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L1304

  3. Is it possible that we're facing the bug that Marek found two days before release, which was effectively the main reason we didn't try to promote consistent read to beta?
    "Etcd not respond to progress notification on newly created watch RPC until there is a event in a watch" etcd-io/etcd#17507

@serathius ^^
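To answer the question in item 1 above, a quick and deliberately unspecific check of the environment the apiserver/test runs in; the authoritative variable name is in the reflector.go lines linked in that item, and the grep below only assumes the name contains "WATCH_LIST":

# Assumption-laden check: is any client-go watch-list env var set here?
# The exact variable name is in the linked reflector.go code.
env | grep -i "watch_list" || echo "no watch-list env var set"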

@serathius
Contributor

serathius commented Mar 11, 2024

Based on the log from #121865 (comment)

Trace[1391457241]: ["List(recursive=false) etcd3" audit-id:54359dbc-99a2-46a3-82c2-51f293cc3c87,key:/servicecidrs,resourceVersion:,resourceVersionMatch:,limit:1,continue: 3002ms (23:24:37.278)]

I would guess this is #123674

  • 3002ms matches the 3-second timeout of waiting for the revision
  • The empty resourceVersion implies a consistent list from the cache
  • limit:1 implies a non-recursive list

I planned to fix it, but the PR was blocked on #123732.

@liggitt
Member

liggitt commented Mar 11, 2024

I would guess this is #123674

thanks a ton for digging into that

@aojea, to deflake this test, I'd set this in the update-openapi-spec.sh file (since it doesn't impact the generated openapi), with a comment to remove it once #123674 is resolved:

--feature-gates=AllAlpha=true,OpenAPIEnums=false,ConsistentListFromCache=false \
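A hedged way to confirm the workaround locally would be to apply that flag change in hack/update-openapi-spec.sh and then re-run the generate/verify pair that has been flaking; both script names appear earlier in this issue, nothing else is assumed:

# After editing the --feature-gates line in hack/update-openapi-spec.sh as
# suggested above, regenerate the spec and re-run the flaking verifier.
hack/update-openapi-spec.sh
hack/verify-openapi-spec.sh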
