fix: grouping audio frames causes memory problems and is maybe slightly slower? #846

Open
wants to merge 2 commits into master

Conversation

IvanovCosmin

In our testing we found that this grouping allocates around 100 bytes per audio sample, even for incredibly small videos.

For example, an 8-second, 1.36 MB video allocates 18.9 MB (I think it should be around 33 MB; I'm unsure why it is only half).

It doesn't require much imagination to compute what would happen on really big videos.
This will trigger memory limits pretty constantly on production deployments, and it single-handedly makes faster-whisper unsuitable for production systems.
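
The line-by-line numbers below come from memory_profiler. As a rough sketch of how such a measurement can be reproduced (assuming the standard memory_profiler package and an installed faster-whisper; the input file name is a placeholder):

import faster_whisper.audio as audio_module
from memory_profiler import profile

# Wrap the existing decode_audio so that calling it prints the per-line
# "Mem usage / Increment / Occurrences" table quoted below.
audio_module.decode_audio = profile(audio_module.decode_audio)

if __name__ == "__main__":
    audio_module.decode_audio("sample.mp4")  # placeholder input file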

Before my change:

   125    727.3 MiB      3.5 MiB           2       with av.open(input_file, metadata_errors="ignore") as container:
   126    708.4 MiB      0.0 MiB           1           frames = container.decode(audio=0)
   127    708.4 MiB      0.0 MiB           1           frames = _ignore_invalid_frames(frames)
   128    708.4 MiB      0.0 MiB           1           frames = _group_frames(frames, 500000)
   129    708.4 MiB      0.0 MiB           1           frames = _resample_frames(frames, resampler)
   130                                         
   131    727.3 MiB     18.9 MiB           3           for frame in frames:
   132    727.3 MiB      0.0 MiB           2               array = frame.to_ndarray()
   133    727.3 MiB      0.0 MiB           2               dtype = array.dtype
   134    727.3 MiB      0.0 MiB           2               raw_buffer.write(array)
   135                                         
   136                                             # It appears that some objects related to the resampler are not freed
   137                                             # unless the garbage collector is manually run.
   138    727.3 MiB      0.0 MiB           1       del resampler
   139    721.8 MiB     -5.5 MiB           1       gc.collect()

After my change:

   129    709.4 MiB      0.0 MiB           1           frames = _resample_frames(frames, resampler)
   130                                         
   131    710.6 MiB      0.8 MiB         182           for frame in frames:
   132    710.6 MiB      0.0 MiB         181               array = frame.to_ndarray()
   133    710.6 MiB      0.0 MiB         181               dtype = array.dtype
   134    710.6 MiB      0.4 MiB         181               raw_buffer.write(array)
   135                                         
   136                                             # It appears that some objects related to the resampler are not freed
   137                                             # unless the garbage collector is manually run.
   138    710.6 MiB      0.0 MiB           1       del resampler
   139    710.6 MiB      0.0 MiB           1       gc.collect()
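
For clarity, the change presumably amounts to dropping the grouping step from the decode pipeline shown in the profile above, along these lines:

with av.open(input_file, metadata_errors="ignore") as container:
    frames = container.decode(audio=0)
    frames = _ignore_invalid_frames(frames)
    # frames = _group_frames(frames, 500000)  # removed: buffers up to 500,000 samples per group
    frames = _resample_frames(frames, resampler)

    for frame in frames:
        array = frame.to_ndarray()
        dtype = array.dtype
        raw_buffer.write(array)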

The performance difference over a few hundred requests:

without group
max 0.940046314150095
average 0.7971834561880677
min 0.7308510262519121

with group
max 0.9293739395216107
average 0.8034254801739007
min 0.7338719926774502

I won't die on the performance hill.
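
For reference, a minimal sketch of how stats like the ones above can be collected; the request handler and input file are hypothetical placeholders:

import statistics
import time

durations = []
for _ in range(300):                      # "a few hundred requests"
    start = time.perf_counter()
    handle_request("sample.mp4")          # hypothetical request handler
    durations.append(time.perf_counter() - start)

print("max", max(durations))
print("average", statistics.mean(durations))
print("min", min(durations))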

@IvanovCosmin
Author

I am unsure if the gc call is still necessary.
It seems to be related to the resampler, but it also seems to make no difference now.

I'll let you decide.

@trungkienbkhn
Collaborator

@IvanovCosmin, hello. I believe that memory will not increase over time in the _group_frames function, because memory is released after calling fifo.read().

def _group_frames(frames, num_samples=None):
    fifo = av.audio.fifo.AudioFifo()

    for frame in frames:
        frame.pts = None  # Ignore timestamp check.
        fifo.write(frame)

        if num_samples is not None and fifo.samples >= num_samples:
            yield fifo.read()

    if fifo.samples > 0:
        yield fifo.read()

Additionally, I think that grouping frames into a consistent size (500,000 samples) helps standardize the processing pipeline, making the next steps (resampling, ...) more stable and avoiding errors with long audio inputs.
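
For a rough sense of scale (assuming stereo input decoded as 32-bit float planar frames, which is common but by no means guaranteed), a 500,000-sample group held in the AudioFifo is already several megabytes of decoded audio before resampling:

# Back-of-the-envelope size of one grouped frame; the sample format and
# channel count are assumptions for illustration only.
num_samples = 500_000
channels = 2          # assumed stereo source
bytes_per_sample = 4  # assumed 32-bit float ("fltp") decoded frames

group_bytes = num_samples * channels * bytes_per_sample
print(f"{group_bytes / 2**20:.1f} MiB per grouped frame")  # ~3.8 MiB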

@IvanovCosmin
Author

I believe that memory will not increase over time in the _group_frames function, because memory is released after calling fifo.read().

It does not increase over time; it just increases too much, for too long.

I think PyAV does something weird here. I follow your reasoning, but this is not what happens in practice, and it does not work well in production environments.

  1. Memory is not released when it should be: the increase persists beyond the scope of the decode function, when the generator should already have been released. It does seem to be released eventually, but long after it is supposed to be (a minimal illustration follows this list).
  2. It is unjustifiable to allocate more memory than the actual video consumes. My 8-second video is 1.36 MB on disk, and it requires 18 MB just for this step. For 3-minute videos we could regularly see 60-80 MB allocated. Now extrapolate to hour-long videos.
  3. I did not find any issues with stability after removing the grouping.
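
A minimal, self-contained illustration of point 1 (not faster-whisper code, just an assumed analogue): a partially consumed generator keeps its local state, including any buffered data, alive until it is explicitly closed or collected.

import gc

def buffering_generator(chunks, group_size=500_000):
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)          # buffered bytes stay referenced by the generator frame
        if len(buffer) >= group_size:
            yield bytes(buffer)
            buffer.clear()
    if buffer:
        yield bytes(buffer)

gen = buffering_generator(b"x" * 4096 for _ in range(1000))
next(gen)      # partially consumed: the generator frame (and its buffer) stays alive
gen.close()    # explicit close releases that state immediately...
gc.collect()   # ...otherwise it waits for the garbage collector, as with `del resampler` above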

@trungkienbkhn
Collaborator

trungkienbkhn commented May 23, 2024

From my test with device cuda and model large-v3, memory increase is listed below:

  • audio1.flac (00:11s, only sound): 10.6 MiB
  • audio2.mp4 (19:40s, includes both sound and image): 40.7 MiB
  • audio3.m4a (01:00:21s, only sound): 28.6 MiB

Log for audio3.m4a:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    20    252.9 MiB    252.9 MiB           1   @profile
    21                                         def decode_audio(
    22                                             input_file: Union[str, BinaryIO],
    23                                             sampling_rate: int = 16000,
    24                                             split_stereo: bool = False,
    25                                         ):
    26                                             """Decodes the audio.
    27                                         
    28                                             Args:
    29                                               input_file: Path to the input file or a file-like object.
    30                                               sampling_rate: Resample the audio to this sample rate.
    31                                               split_stereo: Return separate left and right channels.
    32                                         
    33                                             Returns:
    34                                               A float32 Numpy array.
    35                                         
    36                                               If `split_stereo` is enabled, the function returns a 2-tuple with the
    37                                               separated left and right channels.
    38                                             """
    39    252.9 MiB      0.0 MiB           2       resampler = av.audio.resampler.AudioResampler(
    40    252.9 MiB      0.0 MiB           1           format="s16",
    41    252.9 MiB      0.0 MiB           1           layout="mono" if not split_stereo else "stereo",
    42    252.9 MiB      0.0 MiB           1           rate=sampling_rate,
    43                                             )
    44                                         
    45    252.9 MiB      0.0 MiB           1       raw_buffer = io.BytesIO()
    46    252.9 MiB      0.0 MiB           1       dtype = None
    47                                         
    48    256.0 MiB      3.1 MiB           1       with av.open(input_file, mode="r", metadata_errors="ignore") as container:
    49    256.0 MiB      0.0 MiB           1           frames = container.decode(audio=0)
    50    256.0 MiB      0.0 MiB           1           frames = _ignore_invalid_frames(frames)
    51    256.0 MiB      0.0 MiB           1           frames = _group_frames(frames, 500000)
    52    256.0 MiB      0.0 MiB           1           frames = _resample_frames(frames, resampler)
    53                                         
    54    410.0 MiB     28.6 MiB         321           for frame in frames:
    55    410.0 MiB      0.0 MiB         320               array = frame.to_ndarray()
    56    410.0 MiB      0.0 MiB         320               dtype = array.dtype
    57    410.0 MiB    125.3 MiB         320               raw_buffer.write(array)
    58                                         
    59                                             # It appears that some objects related to the resampler are not freed
    60                                             # unless the garbage collector is manually run.
    61    410.0 MiB      0.0 MiB           1       del resampler
    62    369.3 MiB    -40.7 MiB           1       gc.collect()
    63                                         
    64    369.3 MiB      0.0 MiB           1       audio = np.frombuffer(raw_buffer.getbuffer(), dtype=dtype)
    65                                         
    66                                             # Convert s16 back to f32.
    67    590.2 MiB    220.9 MiB           1       audio = audio.astype(np.float32) / 32768.0
    68                                         
    69    590.2 MiB      0.0 MiB           1       if split_stereo:
    70                                                 left_channel = audio[0::2]
    71                                                 right_channel = audio[1::2]
    72                                                 return left_channel, right_channel
    73                                         
    74    590.2 MiB      0.0 MiB           1       return audio


Processing audio with duration 01:00:21.013
Detected language 'fr' with probability 1.00
Transcribe time:  11.415625095367432
Processing segment at 00:00.000
[0.00s -> 21.52s]  — Bonjour, M. le Président. — Bonjour.
[21.96s -> 26.72s]  — Merci d'avoir accepté notre invitation ce matin sur RMC et BFM TV.
Processing segment at 00:26.720
[26.72s -> 31.92s]  C'est au fond le coup d'envoi des Jeux olympiques à presque 100 jours de la cérémonie d'ouverture.
[32.04s -> 35.40s]  Nous sommes ici au Grand Palais. C'est presque la réinauguration.
[35.52s -> 38.44s]  C'est plutôt une visite de chantier, pour être tout à fait honnête.
[38.52s -> 42.88s]  On va évidemment parler Jeux olympiques. Rien que les JO, mais toutes les JO.
[42.98s -> 45.14s]  On est bien d'accord, M. le Président ? — Avec un grand bonheur.
[45.24s -> 49.68s]  — Ça, c'est le principe. On est d'accord. On va parler donc sport, enthousiasme, ambition française.
[50.22s -> 55.28s]  Mais on va aussi parler de ce qui inquiète les Français, les questions de sécurité, d'organisation, de coûts et de diplomatie.
[55.28s -> 56.70s]  Plus que jamais, je vous présente.
...

It can be seen that the memory increase might depend on the input (a file containing both audio and video shows a higher increase), and that the increase for long audio is negligible.

Besides, in your example, not using _group_frames takes 182 loop iterations to process all the frames, whereas using that function only takes 3.
I ran a speed benchmark:

  • with this func: Min execution time: 43.642s
  • without this func: Min execution time: 43.808s

It doesn't really seem faster.
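
For reference, a minimal sketch of how a "min over repeated runs" measurement like this can be collected; the model size and device match the setup described above, while the audio path and repetition count are placeholders:

import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

def run_once(path):
    segments, _info = model.transcribe(path)
    for _segment in segments:   # segments is a generator; consume it to do the actual work
        pass

durations = []
for _ in range(5):              # repetition count is a placeholder
    start = time.perf_counter()
    run_once("audio3.m4a")
    durations.append(time.perf_counter() - start)

print(f"Min execution time: {min(durations):.3f}s")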
