
Memory leak on transport.newBufWriter #6761

Closed
pablodz opened this issue Nov 2, 2023 · 20 comments


@pablodz

pablodz commented Nov 2, 2023

What version of gRPC are you using?

module test

go 1.21

require (
	google.golang.org/grpc v1.59.0
	google.golang.org/protobuf v1.31.0
)

...

compiled with

GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o server

What version of Go are you using (go version)?

go version go1.21.3 linux/amd64

What operating system (Linux, Windows, …) and version?

Running inside Alpine Linux (linux/amd64)

What did you do?

proto file

humandetector.proto

syntax = "proto3";

package humandetector;

// Ping-pong request
message PingPongRequest { string message = 1; }

// Ping-pong response
message PingPongResponse { string message = 1; }

// Human Detector service definition
service HumanDetectorService {
  rpc PingPong(PingPongRequest) returns (PingPongResponse);
}

I have a Go gRPC server like the one below

server.go

package server

import (
	"context"
	"fmt"
	"log"
	"net"

	"gitlab.com/vozy/proto-telephony-grpc-models/pkg/humandetector"
	"google.golang.org/grpc"
	"google.golang.org/grpc/reflection"
)

const (
	host = "0.0.0.0"
	port = 8080
)

type srv struct {
	humandetector.HumanDetectorServiceServer
}

func ServeGrpc() {
	fmt.Println("Starting server...")
	fmt.Printf("Listening on %s:%d\n", host, port)
	listener, err := net.Listen("tcp", fmt.Sprintf("%s:%d", host, port))
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	defer listener.Close()

	opts := []grpc.ServerOption{}
	grpcServer := grpc.NewServer(opts...)
	mysrv := &srv{}
	humandetector.RegisterHumanDetectorServiceServer(grpcServer, mysrv)

	// Register reflection service on gRPC server.
	reflection.Register(grpcServer)

	if err := grpcServer.Serve(listener); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}



func (s *srv) PingPong(ctx context.Context, req *humandetector.PingPongRequest) (*humandetector.PingPongResponse, error) {
	return &humandetector.PingPongResponse{
		Message: "PONG!",
	}, nil
}

client.go

package main

import (
	"context"
	"log"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	humandetector_pb2 "gitlab.com/vozy/proto-telephony-grpc-models/pkg/humandetector"
)

func makeRequest(wg *sync.WaitGroup, ch chan string) {
	defer wg.Done()

	// Dial the gRPC server inside the request function
	conn, err := grpc.Dial("localhost:8080", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Printf("Failed to connect: %v", err)
		return
	}
	defer conn.Close()

	// Create a gRPC client
	client := humandetector_pb2.NewHumanDetectorServiceClient(conn)

	// Create the request
	request := &humandetector_pb2.PingPongRequest{}

	// Call the PingPong RPC
	response, err := client.PingPong(context.Background(), request)
	if err != nil {
		log.Printf("Error calling PingPong: %v", err)
		return
	}

	// Send the response message through the channel
	ch <- response.Message
}

func main() {
	// Number of concurrent requests
	numRequests := 50000

	// Create a channel to receive responses
	ch := make(chan string, numRequests)

	// Use a WaitGroup to wait for all goroutines to finish
	var wg sync.WaitGroup

	// Start concurrent requests
	for i := 0; i < numRequests; i++ {
		wg.Add(1)
		go makeRequest(&wg, ch)
	}

	// Close the channel once all goroutines are done
	go func() {
		wg.Wait()
		close(ch)
	}()

	// Collect responses from the channel and print them
	for response := range ch {
		log.Printf("PingPong Response: %s", response)
	}
}

What did you expect to see?

Low memory usage after the 50k requests, with memory staying around that low level.

> docker stats (EXPECTED)
CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O   PIDS
aec46e609fe0   tender_mirzakhani   0.00%     5.105MiB / 62.72GiB   0.01%     3.21kB / 0B   0B / 0B     6

What did you see instead?

> docker stats
CONTAINER ID   NAME                   CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O    PIDS
a203abc32299   intelligent_torvalds   0.00%     777MiB / 62.72GiB   1.21%     132MB / 17.4MB   0B / 325MB   32

After 50k requests of a simple RPC, RAM usage is nearly 700MB.

@pablodz
Author

pablodz commented Nov 2, 2023

Using pprof gives us:

aec46e609fe0:~#  go tool pprof -lines http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /root/pprof/pprof.go-telephony-human-detector.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: go-telephony-human-detector
Type: inuse_space
Time: Nov 2, 2023 at 9:20pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1182.59MB, 95.86% of 1233.60MB total
Dropped 98 nodes (cum <= 6.17MB)
Showing top 10 nodes out of 19
      flat  flat%   sum%        cum   cum%
  619.46MB 50.22% 50.22%   619.46MB 50.22%  google.golang.org/grpc/internal/transport.newBufWriter /go/pkg/mod/google.golang.org/grpc@v1.59.0/internal/transport/http_util.go:315 (inline)
  543.12MB 44.03% 94.24%   543.12MB 44.03%  bufio.NewReaderSize /usr/local/go/src/bufio/bufio.go:57 (inline)
   13.01MB  1.05% 95.30%    13.01MB  1.05%  runtime.malg /usr/local/go/src/runtime/proc.go:4459
       7MB  0.57% 95.86%        7MB  0.57%  google.golang.org/grpc/internal/transport.NewServerTransport /go/pkg/mod/google.golang.org/grpc@v1.59.0/internal/transport/http2_server.go:246
         0     0% 95.86%        8MB  0.65%  gitlab.com/vozy/go-telephony-human-detector/internal/server.ServeGrpc /go/src/gitlab.com/vozy/go-telephony-human-detector/internal/server/server.go:54
         0     0% 95.86%        8MB  0.65%  google.golang.org/grpc.(*Server).Serve /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:852
         0     0% 95.86%  1203.59MB 97.57%  google.golang.org/grpc.(*Server).Serve.func3 /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:894
         0     0% 95.86%  1203.59MB 97.57%  google.golang.org/grpc.(*Server).handleRawConn /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:910
         0     0% 95.86%  1203.59MB 97.57%  google.golang.org/grpc.(*Server).newHTTP2Transport /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:954
         0     0% 95.86%  1178.59MB 95.54%  google.golang.org/grpc/internal/transport.NewServerTransport /go/pkg/mod/google.golang.org/grpc@v1.59.0/internal/transport/http2_server.go:168

heap file:

heap.zip
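
For reference, the heap profile above is served from :6060 via Go's net/http/pprof package. A minimal sketch of how such an endpoint is typically exposed alongside the gRPC server (this wiring is not shown in the original snippets, so the details are assumptions):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux

	"gitlab.com/vozy/go-telephony-human-detector/internal/server"
)

func main() {
	// Serve the pprof handlers on :6060 so that
	// `go tool pprof http://localhost:6060/debug/pprof/heap` (used above) works.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Start the gRPC server from server.go; this call blocks.
	server.ServeGrpc()
}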

@pablodz
Author

pablodz commented Nov 3, 2023

Compiled with Go 1.20 instead of 1.21 (see the Dockerfile below) and now it works as usual.

FROM golang:1.20-alpine AS BUILDER
...

[screenshot]

It looks like it is something related to Go itself and its buffers.

@atollena
Collaborator

atollena commented Nov 3, 2023

Just to understand better, you are suspecting that this is a regression with go 1.21, since you're not observing this problem with go 1.20?

Those buffers are allocated per connection. It would be useful to show us the associated number of connections for each test you run. If you have open connections on the order of thousands, the numbers you're showing are expected.
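
Since the reproduction client dials a new connection inside every makeRequest call, one way to separate per-connection buffer cost from an actual leak is to share a single ClientConn across all goroutines. A minimal sketch of that variant, assuming the same generated humandetector package as the original client:

package main

import (
	"context"
	"log"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	humandetector_pb2 "gitlab.com/vozy/proto-telephony-grpc-models/pkg/humandetector"
)

func main() {
	// Dial once; every goroutine reuses this connection, so the server only
	// allocates one set of per-connection read/write buffers.
	conn, err := grpc.Dial("localhost:8080", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("Failed to connect: %v", err)
	}
	defer conn.Close()

	client := humandetector_pb2.NewHumanDetectorServiceClient(conn)

	numRequests := 50000
	var wg sync.WaitGroup
	for i := 0; i < numRequests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := client.PingPong(context.Background(), &humandetector_pb2.PingPongRequest{}); err != nil {
				log.Printf("Error calling PingPong: %v", err)
			}
		}()
	}
	wg.Wait()
}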

@pablodz
Author

pablodz commented Nov 8, 2023

Just to understand better, you are suspecting that this is a regression with go 1.21, since you're not observing this problem with go 1.20?

Yes, flooding the server with ping requests uses a lot more memory when compiled with 1.21 (~400MB) than with 1.20 (~70MB).

Those buffers are allocated per connection. It would be useful to show us the associated number of connection for each test you run. If you have a number of open connections in the order of 1000s, the numbers you're showing are expected.

Yes. However, when compiled with 1.21 the memory stays stuck at the maximum it reached, while with go1.20 it decreases as expected a few minutes after the flood ends.
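
One way to distinguish heap that is genuinely retained from memory the runtime simply hasn't returned to the OS yet (a diagnostic sketch added here for illustration, not something from the original report) is to force a collection with runtime/debug.FreeOSMemory and compare heap stats before and after the load ends:

package main

import (
	"log"
	"runtime"
	"runtime/debug"
)

// logHeap prints the stats that matter here: HeapInuse is live heap, while
// HeapIdle minus HeapReleased is memory held by the runtime but not yet
// returned to the OS (roughly what docker stats still counts).
func logHeap(tag string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	log.Printf("%s: HeapInuse=%dMB HeapIdle=%dMB HeapReleased=%dMB",
		tag, m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	// ... run the 50k-request load against the server here ...

	logHeap("before FreeOSMemory")
	debug.FreeOSMemory() // force a GC and return as much memory to the OS as possible
	logHeap("after FreeOSMemory")
}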


This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@arvindbr8
Member

We will try to reproduce this on our side and get back to you

@arvindbr8 arvindbr8 self-assigned this Nov 15, 2023
@pablodz
Author

pablodz commented Nov 30, 2023

I suspect that something happens when the function finishes execution:

func (s *srv) PingPong(ctx context.Context, req *humandetector.PingPongRequest) (*humandetector.PingPongResponse, error) {
	return &humandetector.PingPongResponse{
		Message: "PONG!",
	}, nil
}

I added a 60-second time.Sleep to the function, and the memory leak only shows up once all the clients have received their responses. Towards the end there's a sudden increase in memory, even though memory usage while the connections are open looks as expected.
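
A minimal sketch of the modified handler described above (the 60-second sleep is the only change from the original; it also needs the time package in the server imports):

func (s *srv) PingPong(ctx context.Context, req *humandetector.PingPongRequest) (*humandetector.PingPongResponse, error) {
	// Hold every RPC (and therefore its connection) open for a minute before replying.
	time.Sleep(60 * time.Second)
	return &humandetector.PingPongResponse{
		Message: "PONG!",
	}, nil
}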

@krishnachaitanya1996

Do we have a solution for the high memory usage of gRPC in transport.newBufWriter?

@arvindbr8
Member

@krishnachaitanya1996 -- are you also seeing a spike in memory usage when moving between Go versions?

@krishnachaitanya1996

Ours is a newly built app on Go 1.21.3 and gRPC 1.57.0. We are tuning the application with respect to CPU and memory and found that grpc transport.newBufWriter consumes a good amount of memory, so we are investigating this area.

@arvindbr8
Member

The investigation in this issue is focused on the spike in memory usage with go v1.21 vs go v1.20. @krishnachaitanya1996 -- that might be different from what you are describing here.

@Suyghur

Suyghur commented Mar 1, 2024

I also ran into the same issue in my application.
I tried changing the buffer size from 32k to 8k, but that didn't seem to fix it.

@arvindbr8 arvindbr8 removed their assignment Mar 5, 2024
@aranjans
Contributor

aranjans commented Mar 5, 2024

Hi @arvindbr8, I am working on this issue

@jinleileiking

[pprof screenshots]

Our memory-leak profile; it may be related to this issue.

@sywesk

sywesk commented Mar 12, 2024

Hello! I encountered this issue and had posted some details in this thread but eventually removed it as I thought it wasn't really tied to this issue. Nonetheless, I solved it and it might help others:

If you are using Google Cloud Load Balancers and HTTP2 with your LB Backend, be careful, as GFEs (LB Frontends) may keep connections open for as long as 10 minutes (which is the default backend timeout). This means that when dealing with lots of clients that connect and disconnect to GFEs, you might end up with a ton of "residual" HTTP2 connections and bear the costs associated with them (a high number of goroutines and memory allocations).

I partially solved it by tweaking keep-alive settings to send a GOAWAY after 15 seconds of inactivity. I reckon it might not be the best way to solve it, though.
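
A minimal sketch of the keep-alive tweak described above, applied to the ServeGrpc options (the 15-second value matches the description; everything else here is an assumption about the setup):

// In ServeGrpc, instead of the empty option slice; needs the
// google.golang.org/grpc/keepalive and time imports.
opts := []grpc.ServerOption{
	grpc.KeepaliveParams(keepalive.ServerParameters{
		// Send GOAWAY and close connections that have been idle for 15s.
		MaxConnectionIdle: 15 * time.Second,
	}),
}
grpcServer := grpc.NewServer(opts...)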

@atollena
Collaborator

Hello! I encountered this issue and had posted some details in this thread but eventually removed it as I thought it wasn't really tied to this issue. Nonetheless, I solved it and it might help others:

If you are using Google Cloud Load Balancers and HTTP2 with your LB Backend, be careful, as GFEs (LB Frontends) may keep connections open for as long as 10 minutes (which is the default backend timeout). This means that when dealing with lots of clients that connect and disconnect to GFEs, you might end up with a ton of "residual" HTTP2 connections and bear the costs associated with them (a high number of goroutines and memory allocations).

I partially solved it by tweaking keep-alive settings to send a GOAWAY after 15 seconds of inactivity. I reckon it might not be the best way to solve it, though.

The fact that grpc-go isn't terribly efficient at dealing with lots of connections is something a lot of users have run into, including myself. Sometimes it's possible to reduce the number of connections, sometimes it isn't. For your particular case, I would try the SharedWriteBuffer and SharedReadBuffer server options, which should be particularly effective if a lot of connections are idle (which is usually the case when there are many).

This specific issue was opened for a regression in Go 1.21, not about generic per-connection memory usage.
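
A minimal sketch of turning on the shared write buffer mentioned above (grpc.SharedWriteBuffer is an experimental server option; the exact name and availability of the read-side counterpart depend on the grpc-go version, so it is left out here):

opts := []grpc.ServerOption{
	// Release each connection's flush buffer back to a shared pool after
	// flushing, instead of keeping a write buffer alive per connection.
	grpc.SharedWriteBuffer(true),
}
grpcServer := grpc.NewServer(opts...)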

@aranjans
Contributor

aranjans commented Mar 12, 2024

Hi @pablodz, sorry for the delay in response. We have been a bit thin on the ground for a couple of months.

However, it seems we are not able to reproduce the exact issue you are facing. I tried on my side and observed the memory usage metrics below for Go versions 1.20 and 1.21. I also tried varying the number of connections made, as I feel that is an important variable to check here.

Number of requests   Mem Usage go1.20   Mem Usage go1.21
1k                   128M               116M
5k                   766M               778M
10k                  1.5G               1.5G
20k                  2.7G               2.9G
40k                  4.3G               5.1G
50k                  6.4G               6.1G

FYI: I used Docker for these findings.
Kindly let us know if we are missing something here.

@pablodz
Author

pablodz commented Mar 12, 2024

@arvindbr8 it looks like the memory leak was fixed as of go1.21.4, and I'm now using go1.22. According to my investigation, it was something related to the garbage collector running after a gRPC function finishes executing.

@arvindbr8
Member

That's great! Happy that it's resolved for you. Thanks for bringing this to our notice. It seems there could be other people impacted by this.

@jinleileiking

Was this fixed by golang/go#63335?
