[Ingester] Intermittent KubePersistentVolumeFillingUp alerts #2129
Thanks for filing the issue. Not having any error logs is quite frustrating as it would be the most likely source of good information here. Some things we can do:
- Is the ingester unexpectedly restarting, or even gracefully restarting, before this issue occurs?
- I have seen something similar once, but it occurred after an ingester OOMed and then had issues reading its wal on restart.
- We have also seen issues in AKS where mounted drives will swap to read-only. Perhaps try to manually delete a file from the drive when this issue is occurring? Tempo should log if it fails to remove a file, but I think it's still a good test (see the sketch below).
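For example, a quick check along these lines (just a sketch; the mount path is an assumption about your setup) run from inside the pod would confirm whether the volume still accepts writes and deletes:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical probe: create and then remove a throwaway file on the mounted
// volume to confirm the filesystem has not silently gone read-only.
// The mount path is an assumption; adjust it to your deployment.
func main() {
	probe := filepath.Join("/var/tempo", ".rw-probe")
	if err := os.WriteFile(probe, []byte("probe"), 0o644); err != nil {
		fmt.Println("write failed (volume may be read-only):", err)
		return
	}
	if err := os.Remove(probe); err != nil {
		fmt.Println("remove failed:", err)
		return
	}
	fmt.Println("volume is writable")
}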
Hey @joe-elliott, thanks for your quick reply. I've looked into the metrics you mentioned, using one of the ingester examples from above.
It definitely looks like the rate of flushes slowly degrades while the PVC is filling up:
From what I can see, the only failed flushes occur once the disk is already full. As expected, this also corresponds exactly with the error logs we saw above.
Request duration looks pretty stable: However, looking at the tail end of the yellow data points, this chart seems to also indicate that the rate of flushes slowly degrades.
I'll take a look at this the next time the issue occurs.
Using the
This sounds interesting, I'll try this out as well the next time the issue occurs. I looked around the Tempo dashboards again, and I did find one anomaly: while CPU usage looks fairly steady, the used memory of the ingester definitely seems to spike right around the time the issue starts.
Hi @joe-elliott, we caught a live instance of this issue occurring and I was able to dump the contents of /var/tempo:
I also tested whether the mounted drive was read-only, but I was able to delete one of the files without issue.
Based on settings, that might be a "normal" number of complete blocks, or it might be exceptionally high. Can you check an ingester that is not showing the issue?
I have checked the same ingester whose volume was filling up earlier. After we deleted the volume and recreated it, the ingester is currently in a healthy state. Contents of /var/tempo in the previously unhealthy ingester:
And for comparison, here is a different ingester that has not experienced any problems with volumes filling up recently. Contents of /var/tempo in a healthy ingester:
Ok, so we know that flushes are slowing down, which is causing the completed blocks on disk to pile up and eventually fill the disk. We also can't find any relevant logs that might indicate some obvious error. For fun, we could confirm this by checking the relevant ingester metrics. My guess is that some resource is being saturated during the time period that Tempo is struggling to offload complete blocks from the ingesters. Above you indicated there was a spike in memory usage at the time that Tempo was having issues. This is another potential indicator that some resource is saturated. Things to try/look for:
I also see heap increasing. This has me concerned that there is a bug where something in the ingester is spiraling out of control and doing a ton of unnecessary work. Things we can check:
I apologize for all the homework, but you've stumbled on a tough one. I'm more than willing to work through this with you though!
Hey @joe-elliott. We are still experiencing this issue, and are still trying to figure out the root cause. I'm going to be dumping different relevant (I hope) metrics from a couple of examples we had recently:
- ingester-zone-a-4, which started filling up at 2023-03-22T16:30:00Z and was full at 2023-03-02T03:24:00Z
- ingester-zone-a-3, which started filling up at 2023-03-27T06:00:00Z and was full at 2023-03-27T14:20:00Z
There are some spikes in the number of goroutines, but the count otherwise looks fairly steady:
Roger that, we will try to catch a live one and report back with some pprof output.
@joe-elliott A couple more things:
- We use Goldpinger to monitor our nodes, and have been unable to find a link between node issues and PVCs filling up.
- Our maximum trace size is rather large at 40MB. Even with this limit, our ingesters are still emitting a lot of TRACE_TOO_LARGE errors. We have been wondering - given that our limit is significantly higher than the default - whether lowering this limit could have a positive impact. There is some amount of politics involved with lowering the limit, so we want to have a plausible reason before doing so.
Do you know if the pod moved nodes after it was deleted? I'm still suspicious of resource throttling at the node level. If this pod moved nodes when it was deleted it might suggest something along these lines. Does goldpinger give any indication of resource usage on the nodes? I would recommend using something like node exporter to dump node metrics.
We have a max size of 50MB and do not see this issue. Lowering the max size will improve ingester and compactor memory profiles. Humorously, we too face politics on lowering the limit :). This PR would improve things w/r to max trace size quite a bit but I have not found time to go back and complete it:
I will try to find some time to dig into this deeper, but it would be easier if you were able to pass the binary files retrieved from pprof directly. I took a quick look and didn't see anything out of the ordinary. Interested to see if any of these metrics are showing anything suspicious: are the flushes taking considerably longer during this period?
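If it helps, something along these lines (just a sketch; the local port and the standard Go /debug/pprof paths are assumptions about your deployment) saves the raw profile files so they can be attached directly:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Hypothetical helper: fetch raw (binary) pprof profiles from an ingester
// whose HTTP port has been port-forwarded to localhost. Port and paths are
// assumptions; adjust them to the deployment.
func main() {
	base := "http://localhost:3200/debug/pprof"
	for _, profile := range []string{"heap", "goroutine"} {
		resp, err := http.Get(fmt.Sprintf("%s/%s", base, profile))
		if err != nil {
			panic(err)
		}
		out, err := os.Create(profile + ".pprof")
		if err != nil {
			panic(err)
		}
		if _, err := io.Copy(out, resp.Body); err != nil {
			panic(err)
		}
		resp.Body.Close()
		out.Close()
		// The saved files can later be inspected with: go tool pprof heap.pprof
		fmt.Println("wrote", profile+".pprof")
	}
}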
I don't know about the concrete pod in the example earlier. However, I did run a quick test on a PVC that was exhibiting the issue. During all of this time, the ingester had been running on the same node. I cordoned the ingester's node and deleted the ingester pod, forcing it onto a new node. Immediately after doing this, the PVC's space started freeing again. After letting the PVC empty, I deleted the ingester once again, putting it back on its original node. Once placed there, the ingester exhibited no faulty behaviour and continued operating with no issues.
No, but as you also suggest, we use node_exporter for that. CPU, memory and network are fairly consistent across Tempo nodes, but there is definitely some funny business going on with disk IO. The above metrics are from a node exhibiting poor PVC behaviour; note the Y-axis. Compare that with two nodes on their best behaviour. My current working theory is that PVCs exhibit this behaviour when their underlying Azure premium SSD managed disk runs out of burst credits. Preliminary analysis consisting of digging into metrics from Azure has been unable to confirm this theory.
Roger that. We will try to get some binary files the next time we catch the issue happening.
Doesn't look like it, though the frequency of flushes seems to go down, as we have previously discussed:
Ok, I think things are coming together. Flushes aren't slowing down. The time it takes to cut a block is likely increasing due to disk saturation.
I do not know much about Azure SSDs. In our internal cluster we provision 50GB SSDs per ingester but really only use ~3-5GB. We used to use 15GB SSDs but found that we had to increase their size in order to get more guaranteed IOPS. If there is a similar relationship between size and IOPS in Azure, perhaps just bumping the disks up will solve the problem?
Hi @joe-elliott. We found another interesting behaviour related to this issue: the ingester failed to flush blocks to the storage account every other time. From the logs, we can see that it successfully cut the blocks for all block IDs, but it does not invoke the handleFlush method. We know this because the relevant flush metrics do not increase.
With the information available at this time, I believe this is our best bet. It's always frustrating to make this kind of change without being 100% sure that it addresses the root cause of the issue, but the data does seem to point towards disk IOPS being a (the) limiting factor here. We are going to bump our disks up a performance tier or two and analyse the effect it has on the PVC issue.
I wonder if this is some strange artifact related to disk saturation? One block gets stalled out but the next goes through?
good luck :)! let us know how it goes
We increased the size of our ingesters' disks on April 13th, and were initially hopeful that it had resolved the issue, as we did not observe it for several days. Unfortunately, a PVC started filling up again yesterday. I deleted the pod (but not the PVC), and even though the pod was re-scheduled on the same node, the PVC immediately recovered. The red arrow (showcasing my sublime design skills) indicates when the pod was deleted. Since the PVC was able to recover on the same node that the issue originally appeared on, I guess we can conclude that the node itself is not the issue?

I grabbed some pprof data as you requested. I'm not familiar with the pprof tool, so I hope the data is correct this time. For reference, here is data from a healthy ingester:

A curious observation is that when a PVC does start filling up, the rate at which it fills varies a lot. This one has been slowly filling for two days; other ones take only a few hours to fill completely.

Another theory we were working with is that the PVC issues are somehow linked to surges of traffic. We introduced trace sampling a couple of weeks ago, which drastically reduced the amount of spans being created per second, as can be seen here:

The fact that the issue is still happening despite traffic being consistently low leads us to believe that traffic is not the root cause of the issue. As always, any help is appreciated.
So the issue still occurs, but it seems like it just takes longer to trigger? Is there any way to visualize Azure PVC burst credits to see if they are running out?
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
I see that the bot is eager to close this issue, so I'll chime in with an update. We never managed to get to the bottom of this issue, and we are still not sure exactly where it lies. As previously mentioned, the rate at which a PVC fills up decreased dramatically after bumping our Azure disks from 20GB to 64GB. The PVCs now fill up so slowly that it is rather easy for us to catch them before they become critical. In practice, we have two alerts on PVC usage:
We previously did not have the warning alert, which meant that our on-call personnel would occasionally get paged during the night with a very frustrating PVC alert. We introduced the 50% utilisation alert to catch erroring PVCs before they reach the critical alert threshold. Since it takes days for an erroring PVC to reach the critical threshold, the warning alert makes it easier for us to catch erroring PVCs and fix them. Naturally, this is not a satisfying conclusion to the issue, but this approach has bought our engineering team some valuable time and allowed us to focus on other more important problems.
Marking this "keepalive". Fwiw we still have never seen this issue in any of our clusters (including our Azure clusters). Unsure if this will help but here is our PVC:
@joe-elliott Thanks for sharing that information! Are you able to share what performance tier your Azure disks run with? We use 64GiB disks with performance tier P6. While the performance tier of a disk is tied to its size by default, I believe it is possible to change the performance tier independently of size (i.e. to run a 100GiB disk with performance tier P15, for example). I'd be curious to know if you are doing that.
@joe-elliott Eureka! After an unholy amount of headscratching, I may finally be wise as to what's going on here. Here is the summary:
In-depth analysis
In order for an ingester block to be cleared from disk, it has to have been flushed:
func (i *instance) ClearFlushedBlocks(completeBlockTimeout time.Duration) error {
var err error
i.blocksMtx.Lock()
defer i.blocksMtx.Unlock()
for idx, b := range i.completeBlocks {
flushedTime := b.FlushedTime()
if flushedTime.IsZero() {
continue
}
if flushedTime.Add(completeBlockTimeout).Before(time.Now()) {
i.completeBlocks = append(i.completeBlocks[:idx], i.completeBlocks[idx+1:]...)
err = i.local.ClearBlock(b.BlockMeta().BlockID, i.instanceID)
if err == nil {
metricBlocksClearedTotal.Inc()
}
break
}
}
return err
}
The first smoking gun was that the blocks piling up on disk had no flushed time set: their FlushedTime() is zero, so ClearFlushedBlocks skips them and they are never cleared from disk.
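To spot such blocks on a live pod, a check along these lines could help (a sketch only: the directory layout is taken from the error log earlier in this issue, and the assumption that the flushed time is persisted as a marker file named "flushed" should be verified against the Tempo version in use):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical debugging aid: list block directories without a "flushed"
// marker file. Both the root path and the marker file name are assumptions.
func main() {
	root := "/var/tempo/wal/blocks/single-tenant"
	entries, err := os.ReadDir(root)
	if err != nil {
		panic(err)
	}
	for _, entry := range entries {
		if !entry.IsDir() {
			continue
		}
		marker := filepath.Join(root, entry.Name(), "flushed")
		if _, err := os.Stat(marker); os.IsNotExist(err) {
			fmt.Println("block has never been flushed:", entry.Name())
		}
	}
}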
When a Tempo ingester boots, it starts a configurable number of concurrent flush loops:
i.flushQueuesDone.Add(i.cfg.ConcurrentFlushes)
for j := 0; j < i.cfg.ConcurrentFlushes; j++ {
go i.flushLoop(j)
}
func (i *Ingester) flushLoop(j int) {
defer func() {
level.Debug(log.Logger).Log("msg", "Ingester.flushLoop() exited")
i.flushQueuesDone.Done()
}()
for {
o := i.flushQueues.Dequeue(j)
if o == nil {
return
}
op := o.(*flushOp)
op.attempts++
var retry bool
var err error
if op.kind == opKindComplete {
retry, err = i.handleComplete(op)
} else {
retry, err = i.handleFlush(context.Background(), op.userID, op.blockID)
}
if err != nil {
handleFailedOp(op, err)
}
if retry {
i.requeue(op)
} else {
i.flushQueues.Clear(op)
}
}
}
A dead end I spent an unfortunate amount of time on was a theory that flush operations were not being enqueued at all, but this turned out not to be the case. I eventually built a custom Tempo image with advanced telemetry (read: a bunch of additional log statements) in order to progress. In doing so, I discovered that all operations were being enqueued, but that a flush queue would, on rare occasions, stop processing operations altogether. Compare the graph of a healthy queue to that of an unhealthy queue (i.e. one that has stopped processing operations):
A theory was starting to form: a flush queue gets stuck processing an operation and hangs indefinitely. The next step was to figure out where the operation gets stuck. Using my custom log statements, I was able to pinpoint the issue to the handleFlush method:
func (i *Ingester) handleFlush(ctx context.Context, userID string, blockID uuid.UUID) (retry bool, err error) {
sp, ctx := ot.StartSpanFromContext(ctx, "flush", ot.Tag{Key: "organization", Value: userID}, ot.Tag{Key: "blockID", Value: blockID.String()})
defer sp.Finish()
withSpan(level.Info(log.Logger), sp).Log("msg", "flushing block", "userid", userID, "block", blockID.String())
instance, err := i.getOrCreateInstance(userID)
if err != nil {
return true, err
}
if instance == nil {
return false, fmt.Errorf("instance id %s not found", userID)
}
if block := instance.GetBlockToBeFlushed(blockID); block != nil {
ctx := user.InjectOrgID(ctx, userID)
ctx, cancel := context.WithTimeout(ctx, i.cfg.FlushOpTimeout)
defer cancel()
start := time.Now()
err = i.store.WriteBlock(ctx, block)
metricFlushDuration.Observe(time.Since(start).Seconds())
metricFlushSize.Observe(float64(block.BlockMeta().Size))
if err != nil {
ext.Error.Set(sp, true)
sp.LogFields(otlog.Error(err))
return true, err
}
metricBlocksFlushed.Inc()
} else {
return false, fmt.Errorf("error getting block to flush")
}
return false, nil
}
This method calls getOrCreateInstance:
func (i *Ingester) getOrCreateInstance(instanceID string) (*instance, error) {
inst, ok := i.getInstanceByID(instanceID)
if ok {
return inst, nil
}
i.instancesMtx.Lock()
defer i.instancesMtx.Unlock()
inst, ok = i.instances[instanceID]
if !ok {
var err error
inst, err = newInstance(instanceID, i.limiter, i.overrides, i.store, i.local, i.cfg.AutocompleteFilteringEnabled, i.cfg.DedicatedColumns)
if err != nil {
return nil, err
}
i.instances[instanceID] = inst
}
return inst, nil
}
Once again, I spent a considerable amount of time chasing a theory that a mutex deadlock was happening somewhere in the ingester. My suspicion that a mutex deadlock was at play was further exacerbated by this comment I found:
// Now that we are adding a new block take the blocks mutex.
// A warning about deadlocks!! This area does a hard-acquire of both mutexes.
// To avoid deadlocks this function and all others must acquire them in
// the ** same_order ** or else!!! i.e. another function can't acquire blocksMtx
// then headblockMtx. Even if the likelihood is low it is a statistical certainly
// that eventually a deadlock will occur.
Additional log statements revealed that the operation gets stuck when uploading the block to Azure (we use Azure as our storage backend). Further digging took me to Tempo's Azure writer implementation:
func (rw *readerWriter) writer(ctx context.Context, src io.Reader, name string) error {
blobURL := rw.containerURL.NewBlockBlobURL(name)
if _, err := blob.UploadStreamToBlockBlob(ctx, src, blobURL,
blob.UploadStreamToBlockBlobOptions{
BufferSize: rw.cfg.BufferSize,
MaxBuffers: rw.cfg.MaxBuffers,
},
); err != nil {
return errors.Wrapf(err, "cannot upload blob, name: %s", name)
}
return nil
}
My theory at this point was that the passed context was being overwritten or not respected somewhere along the way. By default, the ingester passes a five minute context timeout to the write operation:
ctx, cancel := context.WithTimeout(ctx, i.cfg.FlushOpTimeout)
defer cancel()
start := time.Now()
err = i.store.WriteBlock(ctx, block)
Following the context's trail, I was not able to find evidence of it being overwritten in Tempo's code, so I started looking at the azure-storage-blob-go SDK we use. Inspecting the SDK's code, I did not find evidence that the context was being overwritten or not respected there either, and this theory ended up being a red herring as well. What I did find were these two concurrency items:
When a Tempo ingester uploads a block to Azure storage, the SDK chunks the block's data into buffers. For each chunk, a goroutine is spawned through the SDK's TransferManager:
// Some data was read, schedule the write.
id := c.id.next()
c.wg.Add(1)
c.o.TransferManager.Run(
func() {
defer c.wg.Done()
c.write(copierChunk{buffer: buffer, id: id, length: n})
},
)
As we can see, a weight of one is added to the WaitGroup for every chunk that is scheduled. This is the write method that each of those goroutines runs:
// write uploads a chunk to blob storage.
func (c *copier) write(chunk copierChunk) {
defer c.o.TransferManager.Put(chunk.buffer)
if err := c.ctx.Err(); err != nil {
return
}
_, err := c.to.StageBlock(c.ctx, chunk.id, bytes.NewReader(chunk.buffer[:chunk.length]), c.o.AccessConditions.LeaseAccessConditions, nil, c.o.ClientProvidedKeyOptions)
if err != nil {
c.errCh <- fmt.Errorf("write error: %w", err)
return
}
}
At the end of the write method, a failed StageBlock call sends its error on c.errCh. The only place that channel is read from is the getErr method:
// getErr returns an error by priority. First, if a function set an error, it returns that error. Next, if the Context has an error
// it returns that error. Otherwise it is nil. getErr supports only returning an error once per copier.
func (c *copier) getErr() error {
select {
case err := <-c.errCh:
return err
default:
}
return c.ctx.Err()
}
This method is called as the first thing when processing each chunk:
if err := c.getErr(); err != nil {
return err
}
If getErr returns an error, the loop that sends chunks breaks, and the outer method then waits for all spawned goroutines to finish:
// Send all our chunks until we get an error.
var err error
for {
if err = cp.sendChunk(); err != nil {
break
}
}
// If the error is not EOF, then we have a problem.
if err != nil && !errors.Is(err, io.EOF) {
cp.wg.Wait()
return nil, err
}
This is the critical moment: the outer method cannot get past cp.wg.Wait(), because any goroutine whose error was never read from errCh is still blocked on its send inside write and therefore never calls wg.Done(). Consider four chunks, more than one of which fails to upload: only one error is ever consumed, so the remaining failed goroutines stay blocked forever and wg.Wait() never returns.
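The pattern can be reproduced in isolation with a small sketch (illustrative only, not the SDK's code; an unbuffered channel and a single receive stand in for errCh and the one-time getErr read):

package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal sketch of the stall described above: several chunk goroutines fail
// and send on an error channel, but only one error is ever read before the
// parent blocks on wg.Wait().
func main() {
	var wg sync.WaitGroup
	errCh := make(chan error)

	for i := 0; i < 4; i++ {
		wg.Add(1) // mirrors c.wg.Add(1) per scheduled chunk
		go func(id int) {
			defer wg.Done()
			errCh <- fmt.Errorf("write error: chunk %d", id) // blocks until someone receives
		}(i)
	}

	fmt.Println("first error observed:", <-errCh) // stands in for the single getErr() read

	done := make(chan struct{})
	go func() {
		wg.Wait() // never returns: the other failed senders are still blocked on errCh
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("all chunk goroutines finished")
	case <-time.After(2 * time.Second):
		fmt.Println("stalled: goroutines blocked on errCh keep wg.Wait() from returning")
	}
}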
In order for this to happen, multiple uploads have to fail, which is exactly what we see.
The azure-storage-blob-go SDK we use has been superseded by the newer azure-sdk-for-go SDK, which contains a different implementation of the chunk error handling:
if err != nil {
select {
case errCh <- err:
// error was set
default:
// some other error is already set
}
cancel()
}
Using a non-blocking send (and cancelling the shared context on error), goroutines can no longer get stuck on the error channel. As such, it is my tentative conclusion that upgrading to the new SDK will solve this problem. I'd be happy to take a look at this task, or at the very least evaluate how large of a task it is.
The new SDK may also help with these two issues:
Wow, this is an impressive analysis. Thank you for digging this out. Since this is not currently impacting us we would likely have never been able to dig to the bottom of it internally. Great work! I am 100% on board with upgrading the azure blob storage client. As you've found the one we are using is no longer maintained.
If you can do a first pass analysis on how big this is, we would appreciate it. Hopefully, it's a small lift.
This cracked me up. I too have dabbled in "advanced telemetry".
Roger that.
I'd be happy to do that. My team is releasing a new internal product this Monday, so I might not get a chance to continue with this until next week, or the week after if things get busy. I've taken the liberty of creating a new issue for tracking work on the SDK upgrade: #2835; I wouldn't want communication related to the upgrade to get buried in this issue.
That's fantastic! Thanks for all your hard work on this @LasseHels. After we get some confidence in the v2 azure backend we will make it default.
Describe the bug
Our ingesters' PersistentVolumeClaims intermittently fill up, seemingly at random.
To Reproduce
We have not been able to consistently reproduce the issue.
Expected behavior
We would expect that the used space on the PersistentVolumeClaim remains fairly consistent, and does not spike.
Environment:
Additional Context
We run a total of nine ingesters; three in each zone across three zones.
For the past couple of months, we have intermittently seen the used space of an ingester's PVC start to increase gradually. This increase continues until it approaches 100%, at which point we have to intervene manually.
Here are a couple of screenshots of the PVC space usage when it happens:
Between 2023-02-21T00:25:00Z and 2023-02-21T04:00:00Z, the PVC of one of our ingesters in zone A went from ~3GB of usage to ~19GB:
Between 2023-02-21T23:50:00Z and 2023-02-22T02:50:00Z, the PVC of one of our ingesters in zone C went from ~3GB of usage to ~18.4GB:
Between 2023-02-12T20:20:00Z and 2023-02-13T15:28:00Z, the PVC of one of our ingesters in zone C went from ~2GB of usage to ~18.9GB:
We have observed this issue across different Tempo versions, namely v1.5.0 and v2.0.0 (which we recently upgraded to).
Initially, we figured that it was an issue with the flushing mechanism of the ingester, but preliminary log analysis has been unable to confirm this theory.
Using our first example from above (the ingester from zone A):
Error logs (excluding TRACE_TOO_LARGE) from the period before and after the increase in space usage:
As we can see, no error logs were emitted in the period leading up to the disk being full. The first error log emitted is at 2023-02-21T04:02:34, which is when there was no space left:
"error completing wal block with local backend: error creating block: write /var/tempo/wal/blocks/single-tenant/f291dded-9dea-4dfa-985b-5984c5421f49/data.parquet: no space left on device"
For the record, I am also attaching the graph of all logs (excluding TRACE_TOO_LARGE):
This graph shows two spikes: one at ~2023-02-21T04:00:00Z, and a second one at ~2023-02-21T06:30:00Z. The first spike is the no space left on device error messages; the second spike is logs from the manual intervention.
We would be much appreciative of any help with this issue, as we are somewhat stuck at the moment, and are unsure of how to proceed.
Please let me know of any additional information that may be of use in debugging the issue. I will make sure it is swiftly provided.
As a final note: I have created this issue as a bug, but I appreciate the fact that the problem can just as well be a misconfiguration on our side as it can be a bug in Tempo itself. Our team is very happy to use Tempo in general, and this is a nut that we cannot seem to crack on our own.
Thanks.