
secured cluster services not able to authenticate with central service after restarting central db #9726

Open
vinspub opened this issue Feb 2, 2024 · 17 comments

vinspub commented Feb 2, 2024

Kindly help us resolve the issue below. Why are we getting the bad certificate error?

Note: The Scanner and Central certificates are not expired.

In the Central pod we are getting the following errors in the log:

tlsconfig: 2024/02/02 18:27:11.454870 tlsconfig.go:155: Info: Default TLS certificate file "/run/secrets/stackrox.io/default-tls-cert/tls.crt" does not exist. Skipping
pkg/grpc/authn: 2024/02/02 18:27:11.614534 rate_limited_logger.go:69: Warn: Cannot extract identity: could not verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "StackRox Certificate Authority")
tlsconfig: 2024/02/02 18:29:02.557744 tlsconfig.go:50: Info: Skipping additional CA directory entry "..2024_02_02_13_15_39.1377428811" because it is a directory

In the Scanner pod we are getting the following errors in the log:

2024/02/02 17:33:25 http: TLS handshake error from 10.0.x.x:59392: remote error: tls: bad certificate

error":"fetching update from URL: executing request: Get "https://central.stackrox.svc/api/extensions/scannerdefinitions?uuid=e5e73d51-8941-4831-96fe-4822153c2c70\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "StackRox Certificate Authority")"}
{"Event":"Starting an update cycle","Level":"info","Location":"updater.go:56","Time":"2024-02-02 17:39:20.675431"}

vinspub commented Feb 2, 2024

More details for your analysis,

Stackrox version: 4.1.0
Central DB restarted via ArgoCD
After restarting the Central DB, the PV went to the Released state and we got an error that the PVC was claimed by another resource.
We resolved this by patching the PV: kubectl patch pv pv-test -p '{"spec":{"claimRef": null}}'
After that, all Central and Central DB pods are working fine, but we are getting the bad certificate errors in the pods mentioned above.
The secured cluster services are not able to authenticate with Central, so the clusters have gone into an unhealthy state.
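
For anyone hitting the same thing, a rough sketch of how the PV rebind can be verified after clearing the claimRef (the PV name pv-test and the stackrox namespace are from our setup; the PVC name is an assumption):

# Confirm the PV has left the Released state and is Available/Bound again
kubectl get pv pv-test
# Confirm the Central DB PVC is bound again (PVC name may differ in your install)
kubectl -n stackrox get pvc central-db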

Sensor logs below for reference,

pkg/grpc: 2024/02/02 18:58:07.338266 server.go:216: Info: Launching backend gRPC listener
pkg/grpc: 2024/02/02 18:58:07.338573 server.go:329: Warn: failed to register Prometheus collector: descriptor Desc{fqName: "http_incoming_in_flight_requests", help: "Number of http requests which are currently running.", constLabels: {path="/v1/"}, variableLabels: []} already exists with the same fully-qualified name and const label values
pkg/grpc: 2024/02/02 18:58:07.338815 server.go:379: Info: TLS-enabled HTTP server listening on [::]:9443
common/clusterid: 2024/02/02 18:58:07.339798 cluster_id.go:35: Info: Certificate has wildcard subject 00000000-0000-0000-0000-000000000000. Waiting to receive cluster ID from central...
common/centralclient: 2024/02/02 18:58:07.476582 grpc_connection.go:101: Warn: Error fetching centrals TLS certs: verifying tls challenge: validating certificate chain: using a certificate bundle that was generated from a different Central installation than the one it is trying to connect to: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "StackRox Certificate Authority")
common/centralclient: 2024/02/02 18:58:07.476649 grpc_connection.go:128: Info: Did not add central CA cert to gRPC connection
common/sensor: 2024/02/02 18:58:07.477076 central_communication_impl.go:129: Info: Re-using cluster ID 5b2807ad-3d53-4674-96ab-e0dbf7531366 of previous run. If you see the connection to central failing, re-apply a new Helm configuration via 'helm upgrade', or delete the sensor pod.
common/sensor: 2024/02/02 18:58:07.627711 central_communication_impl.go:199: Warn: Central is running a legacy version that might not support all current features
common/clusterid: 2024/02/02 18:58:07.627991 cluster_id.go:51: Panic: Invalid dynamic cluster ID value "": no concrete cluster ID was specified in conjunction with wildcard ID "00000000-0000-0000-0000-000000000000". This may be caused by Central data not being persisted between restarts; you may try deploying Central with STORAGE=pvc. For other potential solutions reffer to https://access.redhat.com/solutions/6972449
panic: Invalid dynamic cluster ID value "": no concrete cluster ID was specified in conjunction with wildcard ID "00000000-0000-0000-0000-000000000000". This may be caused by Central data not being persisted between restarts; you may try deploying Central with STORAGE=pvc. For other potential solutions reffer to https://access.redhat.com/solutions/6972449
goroutine 239 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0019e8580, {0x0, 0x0, 0x0})
go.uber.org/zap@v1.24.0/zapcore/entry.go:230 +0x486
go.uber.org/zap.(*SugaredLogger).log(0xc0007f6420, 0x4, {0x3923599?, 0x41c8c7?}, {0xc001137100?, 0x31d80c0?, 0xc000e4bb01?}, {0x0, 0x0, 0x0})

vinspub commented Feb 5, 2024

Dear StackRox team, any update on the above issue?

@naveen2131-hue

@ludydoo @vikin91 @amedeos @jvdm

Kindly help us resolve this issue.

ludydoo commented Feb 5, 2024

Hey @naveen2131-hue @vinspub, I'm not sure I understand what operation was performed on the PVC. Perhaps this removed the existing init bundle from the database. From the logs, it looks like a new init bundle needs to be applied to the Secured Cluster to apply new certificates. You can check how to do that here: https://docs.openshift.com/acs/3.66/installing/install-ocp-operator.html#generate-init-bundle-operator. After applying the init bundle, you will probably need to restart all the Secured Cluster workloads (sensor, scanner, etc.). Let me know if that resolves the issue!
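
For reference, a rough sketch of that flow for a manifest-based secured cluster (bundle name, endpoint, and workload names are placeholders, authentication flags are omitted, and Helm-based installs would instead pass the bundle as values to helm upgrade):

# Generate a new init bundle from Central (Kubernetes secrets format)
roxctl -e "$ROX_CENTRAL_ADDRESS:443" central init-bundles generate my-cluster \
  --output-secrets cluster-init-bundle.yaml
# Apply it on the secured cluster and restart the secured-cluster workloads
kubectl -n stackrox apply -f cluster-init-bundle.yaml
kubectl -n stackrox rollout restart deploy/sensor deploy/admission-control
kubectl -n stackrox rollout restart ds/collector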

vinspub commented Feb 5, 2024

@ludydoo Thanks for the reply.

It seems like a certificate error in Central and Scanner. We have properly configured the PV and rolled back successfully, and we are able to view all the existing policies we created in the console.

We are getting the bad certificate and "tls: failed to verify certificate: x509: certificate signed by unknown authority" errors in the Scanner and Central pods.

Due to this, the vulnerability definitions are also not being updated and the Scanner certificate expiry status couldn't be fetched.

Warning alert:failed to determine scanner cert expiry error: failed to contact scanner at https://scanner.stackrox.svc:8080: dial tcp 10.100.231.16:8080: connect: connection refused

Refer to the screenshots for your reference:
[screenshot: stackrox 1]
[screenshot: stackrox 2]
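
A quick way to check whether Scanner is actually listening behind that "connection refused" (namespace and label are assumptions for a default install):

# Check whether the Scanner pods are running and backing the scanner service
kubectl -n stackrox get pods -l app=scanner
kubectl -n stackrox get endpoints scanner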

ludydoo commented Feb 5, 2024

@vinspub did you try the steps I mentioned earlier (re-applying an init bundle)?

vinspub commented Feb 5, 2024

@ludydoo We are fine with re-applying an init bundle. Please correct me if my understanding is wrong:

  • Generating a new init bundle and re-applying it to the existing secured cluster will help bring the cluster back to a healthy status.
  • But it doesn't help clear the bad certificate error in Central service pods such as Scanner and Sensor.
  • It doesn't help get the vulnerability definitions up to date.

Because we have been getting this bad certificate error and the vulnerability definitions update error for the past two days, even before the Central DB pod restart.

ludydoo commented Feb 5, 2024

@vinspub that's mostly accurate.

  • Generating the new init bundle and re-applying it will (probably) fix the bad secured cluster TLS error
  • Scanner is not necessarily a Central services pod; it can also be a SecuredCluster pod. Scanner is always deployed as part of Central services, and sometimes also as part of SecuredCluster services (if local scanning is enabled).

So re-applying the init bundle to the SecuredCluster would hopefully fix the TLS connection issues between Central and the secured cluster. If there are also TLS connection issues between the Central components themselves, I would click on the "Reissuing internal certificates" link (shown in the screenshot you attached) and follow those instructions.
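
In case it helps, a rough sketch of that reissue step, assuming you download the regenerated certificate YAML from that banner (the filename is hypothetical and the deployment names assume a default install):

# Apply the regenerated internal certificate secrets downloaded from the banner
kubectl -n stackrox apply -f central-services-new-certs.yaml
# Restart the Central-services workloads so they pick up the new secrets
kubectl -n stackrox rollout restart deploy/central deploy/scanner deploy/scanner-db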


vinspub commented Mar 19, 2024

@ludydoo Even though we recreated the init bundle, we are getting the error below in the Central pod:

root logger: 2024/03/19 14:58:31.672653 logger.go:77: Warn: pkg/grpc/authn/interceptor.go:30 - Cannot extract identity: could not verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "StackRox Certificate Authority")
pkg/telemetry/phonehome/segment: 2024/03/19 14:58:45.061919 segment.go:85: Error: sending request - Post "https://api.segment.io/v1/batch": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
pkg/telemetry/phonehome/segment: 2024/03/19 14:58:55.062934 segment.go:85: Error: 1 messages dropped because they failed to be sent after 10 attempts
pkg/telemetry/phonehome/segment: 2024/03/19 14:58:55.063045 segment.go:46: Error: Failure with message 'track': Post "https://api.segment.io/v1/batch": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
root logger: 2024/03/19 14:59:01.863187 logger.go:77: Warn: pkg/grpc/authn/interceptor.go:30 - Cannot extract identity: could not verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "StackRox Certificate Authority")
[attachment: sensor.txt]

I have attached the Sensor pod log for reference. Kindly assist us in resolving the issue.

vinspub commented Mar 21, 2024

@ludydoo, any update on the error below? We are eagerly waiting for your reply.

root logger: 2024/03/21 05:34:34.836598 logger.go:77: Warn: pkg/grpc/authn/interceptor.go:30 - Cannot extract identity: could not verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "StackRox Certificate Authority")

ludydoo commented Mar 21, 2024

@vinspub what installation method are you using?

vinspub commented Mar 26, 2024

@ludydoo We are deploying using Helm.

More analysis we found, for your reference:

  • We have deleted the existing clusters and the PV from Central services.
  • A new init bundle was created for onboarding a new secured cluster, and it is working fine.
  • A new init bundle was also created for the previously onboarded clusters (re-applied roughly as in the sketch below), but we are getting the same TLS verification error.
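
For completeness, the rough shape of re-applying an init bundle to an already-onboarded Helm-based secured cluster; the release name, chart reference, and bundle filename are assumptions, and the bundle is assumed to be in Helm values format (roxctl central init-bundles generate ... --output):

# Re-apply the newly generated init bundle to the existing secured cluster release
helm upgrade -n stackrox stackrox-secured-cluster-services stackrox/stackrox-secured-cluster-services \
  -f new-cluster-init-bundle.yaml --reuse-values
# Restart the workloads so they pick up the new certificates
kubectl -n stackrox rollout restart deploy/sensor deploy/admission-control
kubectl -n stackrox rollout restart ds/collector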

vinspub commented Mar 27, 2024

@ludydoo Any update on the above queries?
In addition, today we are again getting the internal certificate error in the Scanner pod from the freshly installed Central services.

May I know the reason why we are getting the error with the auto-generated internal certificates? This issue occurs after the Central services have been running successfully for a few days.

Installation Details:
App Version: 4.3.3
K8s Version: 1.27

ludydoo commented Mar 28, 2024

@SimonBaeumer I think we need your touch

vinspub commented Apr 1, 2024

@ludydoo @SimonBaeumer Any update on the above issue?

SimonBaeumer commented Apr 19, 2024

Hi @vinspub,

Do you run a LoadBalancer in front of your Central with another TLS certificate? If yes, the CA of that cert must be added to Central's additional-ca configuration. You can configure this in your Helm values.
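
A minimal sketch of doing that with Helm, assuming the central-services chart's additionalCAs values key and a placeholder CA file name:

# Add the load balancer's CA certificate to Central's additional CAs
helm upgrade -n stackrox stackrox-central-services stackrox/stackrox-central-services \
  --reuse-values \
  --set-file "additionalCAs.lb-ca\.pem=/path/to/loadbalancer-ca.pem"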

Could you update your environment to the latest ACS version? 4.1 is quite old; ACS is at version 4.4. If the issue still persists after upgrading, we need to take a deeper look into the certificates and rotations.
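
If it does persist, one way to start that deeper look is to compare the CA that Central is serving with the CA trusted by Sensor on the secured cluster; the secret and key names below are assumptions for a default stackrox-namespace install:

# Fingerprint of the CA in Central's TLS secret
kubectl -n stackrox get secret central-tls -o jsonpath='{.data.ca\.pem}' | base64 -d | openssl x509 -noout -fingerprint
# Fingerprint of the CA bundled with Sensor on the secured cluster
kubectl -n stackrox get secret sensor-tls -o jsonpath='{.data.ca\.pem}' | base64 -d | openssl x509 -noout -fingerprint
# Differing fingerprints mean the secured cluster still trusts the old Central CA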
