OPA out of memory #6753

Open
itayhac opened this issue May 20, 2024 · 10 comments
@itayhac

itayhac commented May 20, 2024

We are working with OPA as our policy agent. We deploy multiple instances of OPA as Docker containers on Kubernetes. Each OPA instance has a k8s memory limit of 4 GB, and each loads a bundle with a data.json file of about 15 MB.

Recently we noticed that some of our OPA instances were being restarted due to OOM. After further investigation, we found that this happens when OPA receives frequent requests and memory is not freed fast enough, which in turn results in OOM very quickly (within 3 seconds).

Disclaimer: the bundle I share here contains mock data that best mimics our use case. I will share the heap dumps we got for the mock data and for actual production data (both with the same Rego code).

Please note, these functions are taking almost 90 percent of the memory, and the service gets OOM-killed within seconds.
[screenshot: pprof heap profile]

This is also true for our production memory profile.

  • OPA version: latest
  • Bundle is provided.
  • Memory profiles (captured with pprof) for both the mock data bundle and the production data run.
  • Go code that sends 100 requests to the local OPA.

Steps To Reproduce

Run the following command to start OPA:
opa run --bundle itay_kenv_files/test_15mb.tar.gz --server --pprof --log-level=info
Run the code below to trigger OPA requests.
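
The attached memory profiles were captured via pprof. With --pprof enabled, a heap snapshot can be pulled from the running server along these lines (the exact invocation here is illustrative):

go tool pprof -top http://localhost:8181/debug/pprof/heap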

Expected behavior

Memory usage should remain low, or at least be freed shortly after the requests are made.

Code that sends 100 requests to OPA

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

const iterationsNumber = 100

func main() {
	log.Println("Starting OPA testing")

	var wg sync.WaitGroup
	wg.Add(iterationsNumber)
	for i := 0; i < iterationsNumber; i++ {
		time.Sleep(40 * time.Millisecond)
		go sendRequest(i, &wg)
	}
	wg.Wait()
	fmt.Println("All goroutines have finished.")
}

func sendRequest(i int, wg *sync.WaitGroup) {
	// Signal completion to the WaitGroup when this request finishes.
	defer wg.Done()

	log.Println("Sending request to opa. iteration number:", i)

	// URL to which the POST request will be sent
	url := "http://localhost:8181/v1/data/test_policy/evaluator/access"

	jsonStr := []byte(`{
		"input": {}
	}`)

	// Create a new HTTP request with POST method, specifying the URL and the request body
	req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonStr))
	if err != nil {
		log.Println("Error creating request:", err)
		return
	}

	// Set the Content-Type header to application/json since we're sending JSON data
	req.Header.Set("Content-Type", "application/json")

	// Create a new HTTP client
	client := &http.Client{}

	// Send the request via the HTTP client
	resp, err := client.Do(req)
	if err != nil {
		log.Println("Error sending request:", err)
		return
	}
	defer resp.Body.Close()

	// Print the HTTP response status code
	log.Println("Response Status:", resp.Status)
}

test_15mb.tar.gz
memory profile.zip

If further information regarding our production setup is required, I'll be happy to provide it.

@itayhac itayhac added the bug label May 20, 2024
@ashutosh-narkar
Member

Thanks for the detailed issue @itayhac. I tried to reproduce this by running OPA in Docker and setting a 4 GB memory limit. I increased the number of goroutines in your script to send more concurrent requests to OPA. The maximum amount of memory consumed by OPA did not cross 200 MB. Is there something different in your actual setup vs. the mock bundle you've provided here? I would expect CPU usage to spike while OPA handles these requests, but it's still unclear why OPA runs OOM.
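
For anyone trying to reproduce this, a memory-capped container can be started roughly like this (the image tag and bundle path are illustrative, not my exact setup):

docker run --rm --memory=4g -p 8181:8181 \
  -v "$PWD/itay_kenv_files:/bundles" \
  openpolicyagent/opa:latest \
  run --server --pprof --log-level=info --bundle /bundles/test_15mb.tar.gz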

@itayhac
Author

itayhac commented May 21, 2024

Hi @ashutosh-narkar, thank you so much for your fast and detailed reply. I changed the files so that the issue reproduces with a 4 GB memory limit (I increased the size and changed the structure of the data.json file).

Please retry and it should be reproducible.

@ashutosh-narkar
Member

One thing I noticed in the policy is that you're using the object.get builtin on the data set instead of just accessing it under data.rules, for example. You can probably avoid using the builtin. Another thing I noticed: when I run the stress test with the openpolicyagent/opa:0.64.1-static image variant, there is no significant increase in memory. Have you seen that as well?
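
To illustrate what I mean (a hypothetical policy fragment, not the actual code from the bundle), the two forms look roughly like this:

package test_policy.evaluator

# Pulls the (potentially large) object out via the builtin.
rule_via_builtin := object.get(data.rules, "some_rule", {})

# Direct reference under data; the builtin can usually be avoided.
rule_direct := data.rules.some_rule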

@itayhac
Author

itayhac commented May 26, 2024

Any further thoughts?
@ashutosh-narkar, can we label this as a bug and prioritize it?

@ashutosh-narkar
Member

@itayhac can you please confirm whether you're able to reproduce this issue with the upstream OPA images, including any differences with the static variant? You mentioned (in a separate thread) that y'all are building your own images. Also, this could be a relevant issue.

@itayhac
Author

itayhac commented May 29, 2024

The problem is reproduced with our own OPA image (we compile latest), and with both of the latest public images (static and non-static).

@ashutosh-narkar
Member

This could be related to #5946. In your policy you're referring to a large object, and this can be replicated if you modify the policy to refer to the object without using the object.get builtin. @johanfylling, did you encounter something like this while working on #6040?

@johanfylling
Contributor

@ashutosh-narkar, the work in #6040 focused solely on the CPU time aspect, and did not look at how memory usage was affected.

@ashutosh-narkar
Member

The data has some objects and arrays, and I wonder whether the interface-to-AST conversions, when those values are referenced inside the policy, are impacting performance in terms of both CPU and memory.
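
For context, the conversion in question can be sketched with the public ast helper (this is only an illustration of the interface-to-AST step, not OPA's actual eval path):

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/ast"
)

func main() {
	// Stand-in for a large data.json document already decoded into Go values.
	var doc interface{}
	if err := json.Unmarshal([]byte(`{"rules": {"r1": {"allow": true}}}`), &doc); err != nil {
		log.Fatal(err)
	}

	// interface{} -> ast.Value conversion; doing this repeatedly for large
	// objects allocates heavily, which is the suspected source of the memory pressure.
	val, err := ast.InterfaceToValue(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(val)
}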

@ashutosh-narkar
Member

We're looking to implement something like what's discussed in #4147. This should probably help with performance, as we'll avoid the interface-to-AST conversion during eval.
