
[Bug]: gRPC Socket Shutting Down After Many Runs #681

Open

deevashwer opened this issue May 12, 2024 · 7 comments

@deevashwer

Issue Type

Usability

Modules Involved

SPU runtime

Have you reproduced the bug with SPU HEAD?

Yes

Have you searched existing issues?

Yes

SPU Version

spu 0.7.0b0

OS Platform and Distribution

Linux Ubuntu 22.04

Python Version

3.9

Compiler Version

No response

Current Behavior?

Hi!

I'm trying to benchmark SPU performance across 3 machines using PPD. It works well for the most part, but when I do many runs to get more accurate runtimes, one of the gRPC sockets shuts down with the following error message:

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:{IP_ADDRESS} {created_time:"2024-05-10T18:52:46.26229335+00:00", grpc_status:14, grpc_message:"Socket closed"}"
>

I don't think it has anything to do with the application code, because the preceding runs, which perform the exact same computation, complete just fine. It seems to me that there is potentially a limit on how much data can be communicated over these RPC instances. I don't think it's a timing issue, because I've had runs go for several hours without aborting.

Is there an RPC environment variable I can set to prevent the sockets from closing?
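
For reference, the run pattern looks roughly like this (a minimal sketch assuming the standard ppd API from SPU's examples; my actual model and 3pc.json config are omitted):

```python
# Minimal sketch of the benchmark loop (hypothetical; real model and
# config omitted). Uses the ppd API as shown in SPU's examples.
import json

import jax.numpy as jnp
import spu.utils.distributed as ppd

with open("3pc.json") as f:
    conf = json.load(f)
ppd.init(conf["nodes"], conf["devices"])

def forward(x, w):
    return jnp.dot(x, w)

# Each party contributes an input; SPU runs the secure computation.
x = ppd.device("P1")(lambda: jnp.ones((256, 256)))()
w = ppd.device("P2")(lambda: jnp.ones((256, 256)))()

for i in range(100):  # after ~5 runs, one socket closes with the error above
    y = ppd.device("SPU")(forward)(x, w)
    _ = ppd.get(y)
    print(f"run {i} done")
```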

Thanks for your help!

Standalone code to reproduce the issue

N/A

Relevant log output

No response

@anakinxc
Contributor

Hi @deevashwer

Are you running a large model? There is a timeout config here; this can happen when the data is quite large and a send takes more than 100 s.

It is also possible that network jitter causes one of the nodes to take a little longer to receive data.

Thanks
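
Roughly speaking, the knob in question is the receive timeout on the link descriptor (a sketch; whether ppd exposes this directly varies by SPU version, and the addresses below are placeholders):

```python
# Sketch of raising the link receive timeout (the ~100 s default
# mentioned above). Addresses are placeholders.
from spu import libspu

desc = libspu.link.Desc()
desc.add_party("node:0", "127.0.0.1:9920")
desc.add_party("node:1", "127.0.0.1:9921")
desc.add_party("node:2", "127.0.0.1:9922")
desc.recv_timeout_ms = 30 * 60 * 1000  # 30 min instead of the default

# lctx = libspu.link.create_brpc(desc, self_rank)  # per-node link context
```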

@deevashwer
Author

Yes, I'm running a large model in a LAN setting, so I don't expect significant jitter. It's a curious case, because it works just fine for a few runs (say 4 or 5), and then on the 6th run one of the sockets closes. I'll try setting a higher timeout and see if that fixes the issue.

Thanks!

@deevashwer
Author

That did not solve the problem. After a bunch of runs, the same error occurred at around 1 hour and 43 minutes. One of the nodes gets terminated with signal 9, and then the other two abort on a closed socket.
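
Since signal 9 (SIGKILL) on Linux often means the OOM killer, one quick check is to log each node's resident memory between runs (a sketch using psutil; hypothetical, not part of my benchmark code):

```python
# Quick leak check: log this node's resident set size after every run
# (hypothetical helper, not part of the benchmark itself).
import os

import psutil

proc = psutil.Process(os.getpid())

def log_rss(run_idx: int) -> None:
    rss_mib = proc.memory_info().rss / (1024 * 1024)
    print(f"run {run_idx}: rss = {rss_mib:.1f} MiB")
```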

@anakinxc
Contributor

Interesting, we'll try to reproduce this.

@tpppppub
Member

Hi @deevashwer, we have encountered a similar issue before (the report is in Chinese) due to a potential memory-leak problem in glibc. Maybe you can try a different version of glibc, or tcmalloc.
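
For example, one way to switch to tcmalloc without touching the code is to preload it when each node process starts (a sketch; the library path is an assumption for Ubuntu 22.04 with gperftools installed, and the usual alternative is setting LD_PRELOAD directly on the command line):

```python
# Sketch: relaunch the current node process with tcmalloc preloaded.
# The .so path is an assumption (Ubuntu 22.04 with gperftools installed).
import os
import sys

TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

if os.path.exists(TCMALLOC) and os.environ.get("LD_PRELOAD") != TCMALLOC:
    env = dict(os.environ, LD_PRELOAD=TCMALLOC)
    os.execvpe(sys.executable, [sys.executable] + sys.argv, env)

# ... normal node startup continues here with tcmalloc as the allocator
```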

@deevashwer
Author

Hi @tpppppub, thanks for the reference. Switching to tcmalloc unfortunately didn't resolve the issue. It does look like a memory leak, however.

@anakinxc
Contributor

@warriorpaw Can you take a look when you have time? Thanks
