processChild.js dies in production with no replacement #49623

Closed
masterkain opened this issue May 10, 2023 · 5 comments
Labels
bug Issue was opened via the bug report template.

Comments

masterkain commented May 10, 2023

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
      Platform: linux
      Arch: x64
      Version: #1 SMP Tue Sep 13 07:51:46 UTC 2022
    Binaries:
      Node: 20.1.0
      npm: 9.6.4
      Yarn: 1.22.19
      pnpm: N/A
    Relevant packages:
      next: 13.4.2-canary.4
      eslint-config-next: 13.4.1
      react: 18.2.0
      react-dom: 18.2.0
      typescript: 5.0.4

Which area(s) of Next.js are affected? (leave empty if unsure)

Standalone mode (output: "standalone")

Link to the code that reproduces this issue

To Reproduce

Not sure; use Prisma?

Describe the Bug

In my Next.js 13 appDir app I'm using:

  • server-side components
  • route handlers
  • both of them using Prisma (a rough sketch of the setup is below)
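
For context, this is roughly the kind of route handler involved (a minimal sketch only; the app/api/projects/route.js path, the Project model and the shared-client pattern are assumptions for illustration, not the actual app code):

    // app/api/projects/route.js -- hypothetical sketch, not the actual app code
    import { NextResponse } from 'next/server';
    import { PrismaClient } from '@prisma/client';

    // Reuse one PrismaClient per process (the usual pattern to avoid creating a
    // new client, and a new connection pool, on every reload).
    const prisma = globalThis.prisma ?? new PrismaClient();
    if (process.env.NODE_ENV !== 'production') globalThis.prisma = prisma;

    export async function GET() {
      const projects = await prisma.project.findMany();
      return NextResponse.json(projects);
    }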

Once I started getting some production traffic I began observing lock-ups, timeouts and whatnot.
Initially, of course, I went through the whole Prisma/PgBouncer shebang, but after some days of debugging (including inspecting the proxied requests between the server and the jest workers) I noticed that in my pod, after just a few requests, we observe this:

PID   USER     TIME  COMMAND
    1 nextjs    0:01 node server.js
   14 nextjs    0:02 /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   25 nextjs    0:13 /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   26 nextjs    0:00 /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   39 nextjs    0:00 sh
   78 nextjs    0:00 ps axjf

That is the situation when the server starts. Once we get some requests in:

...
Handling request GET /projects/2
Proxy target: http://localhost:33149/projects/2
Handling request GET /projects/14
Proxy target: http://localhost:33149/projects/14
Handling request GET /projects/4
Proxy target: http://localhost:33149/projects/4
Handling request GET /projects
Proxy target: http://localhost:33149/projects
Error: aborted
    at connResetException (node:internal/errors:717:14)
    at Socket.socketCloseListener (node:_http_client:455:19)
    at Socket.emit (node:events:523:35)
    at TCP.<anonymous> (node:net:335:12) {
  code: 'ECONNRESET'
}
Error: socket hang up
    at connResetException (node:internal/errors:717:14)
    at Socket.socketOnEnd (node:_http_client:519:23)
    at Socket.emit (node:events:523:35)
    at endReadableNT (node:internal/streams/readable:1367:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ECONNRESET'
}
(the same "Error: aborted" / ECONNRESET stack trace repeats several more times)
Handling request GET /_next/static/css/65a6d53be77a4768.css
Proxy target: http://localhost:33149/_next/static/css/65a6d53be77a4768.css
Handling request GET /_next/static/chunks/webpack-a75aec2d68f60fab.js
Proxy target: http://localhost:33149/_next/static/chunks/webpack-a75aec2d68f60fab.js
...

And notice how one processChild is now missing:

  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   14     1 nextjs   S    10.6g  32%   5   0% /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
   26    14 nextjs   S    10.6g  32%   0   0% /usr/local/bin/node /app/node_modules/next/dist/compiled/jest-worker/processChild.js
    1     0 nextjs   S     686m   2%   3   0% node server.js
   39     0 nextjs   S     1692   0%   3   0% sh
   77    39 nextjs   R     1620   0%   7   0% top

So it seems that a worker crashed, but the server still thinks it's around somehow.
On further requests, however, there are no more ECONNRESET errors in the logs, just the server hanging indefinitely.

It is unclear what is causing this apparent crash, given that the same pages sometimes work, especially right after the server has restarted.

Expected Behavior

The server stays stable.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

No response

masterkain (Author) commented:

https://gist.github.com/masterkain/a3edb77d932fceae51a449082ede65d7

This is the custom_server.js I'm using to debug the proxied requests and as a workaround for #49536.

I also hard-coded const hostname = 'localhost'; just to make sure I did not have DNS issues while debugging on Kubernetes.
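
For reference, this is roughly the shape of that logging proxy (a minimal sketch, not the actual gist contents; the upstream port 33149 is simply the one that appears in the logs above):

    // Minimal sketch of a logging reverse proxy in front of the standalone server.
    // Assumption: requests are forwarded to the internal target seen in the logs.
    const http = require('http');

    const target = { hostname: 'localhost', port: 33149 };

    http.createServer((req, res) => {
      console.log(`Handling request ${req.method} ${req.url}`);
      console.log(`Proxy target: http://${target.hostname}:${target.port}${req.url}`);

      const upstream = http.request(
        { ...target, path: req.url, method: req.method, headers: req.headers },
        (upstreamRes) => {
          res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
          upstreamRes.pipe(res);
        }
      );

      // When the process behind the target port has died, the failures surface
      // here as the ECONNRESET / "socket hang up" traces shown above.
      upstream.on('error', (err) => {
        console.error(err);
        if (!res.headersSent) res.writeHead(502);
        res.end();
      });

      req.pipe(upstream);
    }).listen(3000);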

masterkain (Author) commented May 11, 2023

After some more research, I have a rough idea of what is happening:

  1. The jest-worker based processChild.js might be killed (or exit spontaneously, I'm not sure); for me this happens almost immediately on a small app, but only after some server-side component calls sprinkled with Prisma.
    It does not happen when performing only route handler requests on a fresh server instance.
  2. Killed workers do not get replaced -- work will still be handed to instances that are already dead.
    We need to wait for this to release:

  • The takeaway is that the memory leak has not been fixed in Node.js yet, and it probably never will be, as it is flagged as wontfix.

  • AFAIK we have little say, in our deployments, over how such workers are regulated -- i.e. numWorkers or setting an idleMemoryLimit -- and if a worker prints something to stdout I can't tell that the output belongs to a worker (see the sketch after this list).
    https://github.com/jestjs/jest/tree/main/packages/jest-worker#idlememorylimit-number-optional

  • https://github.com/jestjs/jest/blob/main/packages/jest-worker/src/workers/ChildProcessWorker.ts#L119
    I've never been able to see this; workers just die without printing anything useful.

  • I can't fix this with a Kubernetes health check, because the app becomes unresponsive almost immediately and restarting the pod just means going into an infinite loop as soon as it comes back online.

  • I also have no clue yet what is actually killing the jest workers; memory issues are a strong suspicion, but I'm not sure yet.
    Maybe there is some miscalculation in how memory is measured when running on k8s? Because it's not the pod going OOM -- the pod reports low memory usage and is stable.

  • Another takeaway is that once the jest-worker restart patch lands in Next.js the situation could be even worse in some cases, because processes will be killed and restarted all the time, masking the underlying issue.
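
For context, this is roughly what those knobs look like when constructing a jest-worker Worker directly (a sketch of the jest-worker API only; Next.js does not expose these options to applications, and ./render-worker.js is a hypothetical worker module):

    // Sketch of the jest-worker options mentioned above (hypothetical usage).
    const { Worker } = require('jest-worker');

    const worker = new Worker(require.resolve('./render-worker.js'), {
      numWorkers: 2,
      // Recycle a child once its idle memory grows past this limit, instead of
      // letting it grow until something kills it.
      idleMemoryLimit: 512 * 1024 * 1024, // bytes
    });

    // Worker output is only reachable through these streams, which is why it is
    // hard to tell from the pod logs which lines came from a worker.
    worker.getStdout().pipe(process.stdout);
    worker.getStderr().pipe(process.stderr);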

masterkain (Author) commented:

The issue has for now been "resolved" by assigning a copious amount of RAM to each Next.js instance.

khuezy (Contributor) commented May 11, 2023

jest worker seems really slow; we're noticing ~7s of cold start times.

github-actions (bot) commented:

This closed issue has been automatically locked because it had no new activity for a month. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Jun 11, 2023.