processChild.js dies in production with no replacement #49623

Closed
masterkain opened this issue May 10, 2023 · 5 comments
Labels
bug Issue was opened via the bug report template.

Comments

masterkain commented May 10, 2023

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
      Platform: linux
      Arch: x64
      Version: #1 SMP Tue Sep 13 07:51:46 UTC 2022
    Binaries:
      Node: 20.1.0
      npm: 9.6.4
      Yarn: 1.22.19
      pnpm: N/A
    Relevant packages:
      next: 13.4.2-canary.4
      eslint-config-next: 13.4.1
      react: 18.2.0
      react-dom: 18.2.0
      typescript: 5.0.4

Which area(s) of Next.js are affected? (leave empty if unsure)

Standalone mode (output: "standalone")

Link to the code that reproduces this issue

To Reproduce

Not sure; use Prisma?

Describe the Bug

In my Next.js 13 appDir app I'm using:

  • server-side components
  • route handlers
  • both of them using Prisma (a rough sketch of the setup is below)
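
For context, this is roughly the kind of route handler involved (a minimal sketch only; the app/api/projects/route.js path, the Project model and the shared-client pattern are assumptions for illustration, not the actual app code):

    // app/api/projects/route.js -- hypothetical sketch, not the actual app code
    import { NextResponse } from 'next/server';
    import { PrismaClient } from '@prisma/client';

    // Reuse one PrismaClient per process (the usual pattern to avoid creating a
    // new client, and a new connection pool, on every reload).
    const prisma = globalThis.prisma ?? new PrismaClient();
    if (process.env.NODE_ENV !== 'production') globalThis.prisma = prisma;

    export async function GET() {
      const projects = await prisma.project.findMany();
      return NextResponse.json(projects);
    }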

Once I started getting some production traffic I began observing lock-ups, timeouts and whatnot.
Initially, of course, I went through the whole Prisma/PgBouncer shebang, but after some days of debugging (including inspecting the proxied requests between the server and the jest workers) I noticed that in my pod, after just a few requests, we observe this:

PID   USER     TIME  COMMAND
    1 nextjs    0:01 node server.js
   14 nextjs    0:02 /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   25 nextjs    0:13 /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   26 nextjs    0:00 /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   39 nextjs    0:00 sh
   78 nextjs    0:00 ps axjf

That is the situation when the server starts. Once we get some requests in:

...
Handling request GET /projects/2
Proxy target: http://localhost:33149/projects/2
Handling request GET /projects/14
Proxy target: http://localhost:33149/projects/14
Handling request GET /projects/4
Proxy target: http://localhost:33149/projects/4
Handling request GET /projects
Proxy target: http://localhost:33149/projects
Error: aborted
    at connResetException (node:internal/errors:717:14)
    at Socket.socketCloseListener (node:_http_client:455:19)
    at Socket.emit (node:events:523:35)
    at TCP.<anonymous> (node:net:335:12) {
  code: 'ECONNRESET'
}
Error: socket hang up
    at connResetException (node:internal/errors:717:14)
    at Socket.socketOnEnd (node:_http_client:519:23)
    at Socket.emit (node:events:523:35)
    at endReadableNT (node:internal/streams/readable:1367:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ECONNRESET'
}
(the same "Error: aborted" / ECONNRESET stack trace repeats several more times)
Handling request GET /_next/static/css/65a6d53be77a4768.css
Proxy target: http://localhost:33149/_next/static/css/65a6d53be77a4768.css
Handling request GET /_next/static/chunks/webpack-a75aec2d68f60fab.js
Proxy target: http://localhost:33149/_next/static/chunks/webpack-a75aec2d68f60fab.js
...

And notice how one processChild is now missing:

  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   14     1 nextjs   S    10.6g  32%   5   0% /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   26    14 nextjs   S    10.6g  32%   0   0% /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
    1     0 nextjs   S     686m   2%   3   0% node server.js
   39     0 nextjs   S     1692   0%   3   0% sh
   77    39 nextjs   R     1620   0%   7   0% top

So it seems that a worker crashed, but the server still thinks it's around somehow.
On further requests, however, there are no more ECONNRESET errors in the logs, just the server hanging indefinitely.

It is unclear what is causing this apparent crash, given that the same pages sometimes work, especially right after the server has restarted.

Expected Behavior

The server stays stable.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

No response

masterkain (Author) commented:

https://gist.github.com/masterkain/a3edb77d932fceae51a449082ede65d7

This is the custom_server.js I'm using to debug the proxied requests and as a workaround for #49536.

I also hard-coded const hostname = 'localhost'; just to make sure I did not have DNS issues while debugging on Kubernetes.
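
For reference, this is roughly the shape of that logging proxy (a minimal sketch, not the actual gist contents; the upstream port 33149 is simply the one that appears in the logs above):

    // Minimal sketch of a logging reverse proxy in front of the standalone server.
    // Assumption: requests are forwarded to the internal target seen in the logs.
    const http = require('http');

    const target = { hostname: 'localhost', port: 33149 };

    http.createServer((req, res) => {
      console.log(`Handling request ${req.method} ${req.url}`);
      console.log(`Proxy target: http://${target.hostname}:${target.port}${req.url}`);

      const upstream = http.request(
        { ...target, path: req.url, method: req.method, headers: req.headers },
        (upstreamRes) => {
          res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
          upstreamRes.pipe(res);
        }
      );

      // When the process behind the target port has died, the failures surface
      // here as the ECONNRESET / "socket hang up" traces shown above.
      upstream.on('error', (err) => {
        console.error(err);
        if (!res.headersSent) res.writeHead(502);
        res.end();
      });

      req.pipe(upstream);
    }).listen(3000);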

masterkain (Author) commented May 11, 2023

After some more research, I have a rough idea of what is happening:

  1. The jest-worker based processChild.js might be killed (or exit spontaneously, I'm not sure); for me this happens almost immediately on a small app, but only after some server-side component calls sprinkled with Prisma.
    It does not happen when performing only route handler requests on a fresh server instance.
  2. Killed workers do not get replaced -- work will still be handed to instances that are already dead.
    We need to wait for this to release:

  • The takeaway is that the memory leak has not been fixed in Node.js yet, and it probably never will be, as it is flagged as wontfix.

  • AFAIK we have little say, in our deployments, over how such workers are regulated -- i.e. numWorkers or setting an idleMemoryLimit -- and if a worker prints something to stdout I can't tell that the output belongs to a worker (see the sketch after this list).
    https://github.com/jestjs/jest/tree/main/packages/jest-worker#idlememorylimit-number-optional

  • https://github.com/jestjs/jest/blob/main/packages/jest-worker/src/workers/ChildProcessWorker.ts#L119
    I've never been able to see this; workers just die without printing anything useful.

  • I can't fix this with a Kubernetes health check, because the app becomes unresponsive almost immediately and restarting the pod just means going into an infinite loop as soon as it comes back online.

  • I also have no clue yet what is actually killing the jest workers; memory issues are a strong suspicion, but I'm not sure yet.
    Maybe there is some miscalculation in how memory is measured when running on k8s? Because it's not the pod going OOM -- the pod reports low memory usage and is stable.

  • Another takeaway is that once the jest-worker restart patch lands in Next.js the situation could be even worse in some cases, because processes will be killed and restarted all the time, masking the underlying issue.
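
For context, this is roughly what those knobs look like when constructing a jest-worker Worker directly (a sketch of the jest-worker API only; Next.js does not expose these options to applications, and ./render-worker.js is a hypothetical worker module):

    // Sketch of the jest-worker options mentioned above (hypothetical usage).
    const { Worker } = require('jest-worker');

    const worker = new Worker(require.resolve('./render-worker.js'), {
      numWorkers: 2,
      // Recycle a child once its idle memory grows past this limit, instead of
      // letting it grow until something kills it.
      idleMemoryLimit: 512 * 1024 * 1024, // bytes
    });

    // Worker output is only reachable through these streams, which is why it is
    // hard to tell from the pod logs which lines came from a worker.
    worker.getStdout().pipe(process.stdout);
    worker.getStderr().pipe(process.stderr);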

masterkain (Author) commented:

The issue has for now been "resolved" by assigning a copious amount of RAM to each Next.js instance.

khuezy (Contributor) commented May 11, 2023

jest worker seems really slow; we're noticing ~7s of cold start times.

github-actions (bot) commented:

This closed issue has been automatically locked because it had no new activity for a month. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Jun 11, 2023.