
[Bug]: gRPC Socket Shutting Down After Many Runs #681

Open

deevashwer opened this issue May 12, 2024 · 7 comments

@deevashwer

Issue Type

Usability

Modules Involved

SPU runtime

Have you reproduced the bug with SPU HEAD?

Yes

Have you searched existing issues?

Yes

SPU Version

spu 0.7.0b0

OS Platform and Distribution

Linux Ubuntu 22.04

Python Version

3.9

Compiler Version

No response

Current Behavior?

Hi!

I'm trying to benchmark SPU performance across 3 machines using PPD. It works well for the most part, but when I do many runs to get more accurate runtimes, one of the gRPC sockets shuts down with the following error message:

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:{IP_ADDRESS} {created_time:"2024-05-10T18:52:46.26229335+00:00", grpc_status:14, grpc_message:"Socket closed"}"
>

I don't think it has anything to do with the application code, because the preceding runs, which perform the exact same computation, complete just fine. It seems to me that there is potentially a limit on how much data can be communicated over these RPC instances. I don't think it's a timing issue, because I've had runs go for several hours without aborting.

Is there an RPC environment variable I can set to prevent the sockets from closing?
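
For reference, the run pattern looks roughly like this (a minimal sketch assuming the standard ppd API from SPU's examples; my actual model and 3pc.json config are omitted):

```python
# Minimal sketch of the benchmark loop (hypothetical; real model and
# config omitted). Uses the ppd API as shown in SPU's examples.
import json

import jax.numpy as jnp
import spu.utils.distributed as ppd

with open("3pc.json") as f:
    conf = json.load(f)
ppd.init(conf["nodes"], conf["devices"])

def forward(x, w):
    return jnp.dot(x, w)

# Each party contributes an input; SPU runs the secure computation.
x = ppd.device("P1")(lambda: jnp.ones((256, 256)))()
w = ppd.device("P2")(lambda: jnp.ones((256, 256)))()

for i in range(100):  # after ~5 runs, one socket closes with the error above
    y = ppd.device("SPU")(forward)(x, w)
    _ = ppd.get(y)
    print(f"run {i} done")
```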

Thanks for your help!

Standalone code to reproduce the issue

N/A

Relevant log output

No response

@anakinxc
Contributor

Hi @deevashwer

Are you running a large model? There is a timeout config here; this can happen when the data is quite large and a send takes more than 100 s.

It is also possible that network jitter causes one of the nodes to take a little longer to receive data.

Thanks
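
Roughly speaking, the knob in question is the receive timeout on the link descriptor (a sketch; whether ppd exposes this directly varies by SPU version, and the addresses below are placeholders):

```python
# Sketch of raising the link receive timeout (the ~100 s default
# mentioned above). Addresses are placeholders.
from spu import libspu

desc = libspu.link.Desc()
desc.add_party("node:0", "127.0.0.1:9920")
desc.add_party("node:1", "127.0.0.1:9921")
desc.add_party("node:2", "127.0.0.1:9922")
desc.recv_timeout_ms = 30 * 60 * 1000  # 30 min instead of the default

# lctx = libspu.link.create_brpc(desc, self_rank)  # per-node link context
```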

@deevashwer
Author

Yes, I'm running a large model in a LAN setting, so I don't expect significant jitter. It's a curious case, because it works just fine for a few runs (say 4 or 5), and then on the 6th run one of the sockets closes. I'll try setting a higher timeout and see if that fixes the issue.

Thanks!

@deevashwer
Author

That did not solve the problem. After a bunch of runs, the same error occurred at around 1 hour and 43 minutes. One of the nodes gets terminated with signal 9, and then the other two abort on a closed socket.
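
Since signal 9 (SIGKILL) on Linux often means the OOM killer, one quick check is to log each node's resident memory between runs (a sketch using psutil; hypothetical, not part of my benchmark code):

```python
# Quick leak check: log this node's resident set size after every run
# (hypothetical helper, not part of the benchmark itself).
import os

import psutil

proc = psutil.Process(os.getpid())

def log_rss(run_idx: int) -> None:
    rss_mib = proc.memory_info().rss / (1024 * 1024)
    print(f"run {run_idx}: rss = {rss_mib:.1f} MiB")
```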

@anakinxc
Contributor

Interesting, we'll try to reproduce this.

@tpppppub
Member

Hi @deevashwer, we have encountered a similar issue before (the report is in Chinese) due to a potential memory-leak problem in glibc. Maybe you can try a different version of glibc, or tcmalloc.
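
For example, one way to switch to tcmalloc without touching the code is to preload it when each node process starts (a sketch; the library path is an assumption for Ubuntu 22.04 with gperftools installed, and the usual alternative is setting LD_PRELOAD directly on the command line):

```python
# Sketch: relaunch the current node process with tcmalloc preloaded.
# The .so path is an assumption (Ubuntu 22.04 with gperftools installed).
import os
import sys

TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

if os.path.exists(TCMALLOC) and os.environ.get("LD_PRELOAD") != TCMALLOC:
    env = dict(os.environ, LD_PRELOAD=TCMALLOC)
    os.execvpe(sys.executable, [sys.executable] + sys.argv, env)

# ... normal node startup continues here with tcmalloc as the allocator
```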

@deevashwer
Author

Hi @tpppppub, thanks for the reference. Switching to tcmalloc unfortunately didn't resolve the issue. It does look like a memory leak, however.

@anakinxc
Contributor

@warriorpaw Can you take a look when you have time? Thanks
