Limited GPU Utilization with NVIDIA RTX 4000 Ada Gen #844

Open
James-Shared-Studios opened this issue May 17, 2024 · 13 comments
@James-Shared-Studios

I am experiencing limited GPU utilization with an NVIDIA RTX 4000 Ada Generation card running on Windows 10 1809.

CPU: AMD EPYC 3251 8-Core Processor, 2.5 GHz
RAM: 32 GB
GPU: NVIDIA RTX 4000 Ada Generation, 20 GB
CUDA Toolkit Version: 12.3
GPU Driver Version: 546.12

Python code:

    import os
    import time

    from faster_whisper import WhisperModel

    device = 'cuda'
    compute_type = 'int8_float16'
    model_size = 'medium.en'

    print("Loading model...")

    start_time = time.time()
    model = WhisperModel(model_size, device=device,
                         compute_type=compute_type)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Model loading time: {execution_time:.2f} seconds")

    folder_path = r"C:\Users\XYZ\Downloads\AI voice"
    max_new_tokens = 10
    beam_size = 10
    total_processing_time = 0.0

    for filename in os.listdir(folder_path):
        if filename.endswith((".mp3", ".m4a", ".mp4", ".wav")):
            file_path = os.path.join(folder_path, filename)
            print(f"Transcribing file: {file_path}")
            start_time = time.time()
            segments, _ = model.transcribe(file_path,
                                           beam_size=beam_size,
                                           max_new_tokens=max_new_tokens,
                                           word_timestamps=False,
                                           prepend_punctuations="",
                                           append_punctuations="",
                                           language="en",
                                           condition_on_previous_text=False)
            for segment in segments:
                print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
            end_time = time.time()
            execution_time = end_time - start_time
            print(f"Execution time: {execution_time:.2f} seconds")
            total_processing_time += execution_time

While running my code, I'm only observing around 10% GPU utilization.
[screenshot: Task Manager showing ~10% GPU utilization]

However, the same code achieves 100% utilization on an NVIDIA GeForce RTX 4070.
[screenshot: Task Manager showing 100% GPU utilization on the RTX 4070]

@Napuh
Contributor

Napuh commented May 17, 2024

Try repeating the test, but watch the CUDA graph, which shows CUDA utilization.

To do that, click the drop-down on one of the engine graphs in Task Manager's GPU view:
[screenshot: Task Manager GPU engine drop-down]
and select CUDA.

@James-Shared-Studios
Author

[screenshots: Task Manager CUDA graphs]

With the CUDA graph selected, utilization barely reaches 70%.

@Napuh
Contributor

Napuh commented May 20, 2024

How does it compare with a bigger model?

@phineas-pta

You should compare speed; utilization matters less.

@James-Shared-Studios
Author

> You should compare speed; utilization matters less.

The average processing time with the GeForce RTX 4070 is 0.16 seconds, compared to 0.51 seconds with the RTX 4000 Ada. I would have expected faster performance from the RTX 4000 Ada, which is why I was wondering whether it is limited in some way.

@James-Shared-Studios
Author

James-Shared-Studios commented May 20, 2024

> How does it compare with a bigger model?

The results are the same for large-v1, large-v2, and large-v3:

[screenshot: timing results for the large models]

@phineas-pta

> I would expect faster performance from RTX 4000 Ada

No, you should expect the opposite: the 4070 is faster.

@James-Shared-Studios
Author

> No, you should expect the opposite: the 4070 is faster.

Why is that? Could you provide more context, please? Thank you.

@phineas-pta

Since the model fits in GPU memory, VRAM capacity is not the limiting factor; it comes down to memory bandwidth (which has more impact when the CUDA core counts aren't very different).

You can take a look at their theoretical FP32 and FP16 performance:

@James-Shared-Studios
Author

The 4070's FP16 (half) performance is 29.15 TFLOPS vs. 26.73 TFLOPS for the RTX 4000 Ada (1:1), so the RTX 4000 Ada should not be three times slower than the 4070, correct?
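A quick sanity check on those numbers: the FP16 figures come from the specs quoted above, while the memory-bandwidth values are assumptions taken from public spec listings and should be verified (note the RTX 4000 SFF Ada variant lists lower bandwidth, around 280 GB/s, which would widen the gap).

```python
# Back-of-envelope ratio of the two cards' theoretical specs.
# FP16 TFLOPS come from the figures quoted above; the bandwidth
# numbers are assumptions taken from public spec listings.
fp16_tflops_4070 = 29.15       # GeForce RTX 4070
fp16_tflops_4000_ada = 26.73   # RTX 4000 Ada Generation

mem_bw_4070 = 504.2            # GB/s (assumed spec)
mem_bw_4000_ada = 360.0        # GB/s (assumed; SFF variant is ~280 GB/s)

compute_ratio = fp16_tflops_4070 / fp16_tflops_4000_ada
bandwidth_ratio = mem_bw_4070 / mem_bw_4000_ada

print(f"FP16 compute ratio (4070 / 4000 Ada): {compute_ratio:.2f}x")
print(f"Memory bandwidth ratio:               {bandwidth_ratio:.2f}x")
# Neither ratio approaches 3x, so raw GPU specs alone do not explain
# the 0.16 s vs 0.51 s gap measured on these short clips.
```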

@phineas-pta

The execution time is too short; there is additional I/O overhead.

For a better benchmark, use longer audio/video so the overhead is a smaller share of the total time.
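One way to run that comparison is to time a single long file and discard a warm-up pass, then report a real-time factor so the fixed per-file overhead matters less. A minimal sketch, assuming faster-whisper is installed; `long_audio.wav` is a hypothetical placeholder path:

```python
import time

def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio transcribed per wall-clock second (higher = faster)."""
    return audio_seconds / wall_seconds

def benchmark(path, runs=3):
    """Time transcription of one long file, discarding a warm-up run."""
    from faster_whisper import WhisperModel  # pip install faster-whisper
    model = WhisperModel("medium.en", device="cuda",
                         compute_type="int8_float16")
    for i in range(runs + 1):  # run 0 warms up CUDA kernels and caches
        start = time.time()
        segments, info = model.transcribe(path, language="en")
        for _ in segments:     # segments is a lazy generator; consume it
            pass
        elapsed = time.time() - start
        if i > 0:
            rtf = real_time_factor(info.duration, elapsed)
            print(f"run {i}: {elapsed:.2f} s wall, {rtf:.1f}x real time")

# benchmark("long_audio.wav")  # hypothetical path; use a clip >10 minutes
```

Comparing real-time factors between the two cards on the same long clip should isolate GPU throughput from the per-file setup cost that dominates short clips.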

@James-Shared-Studios
Author

> The execution time is too short; there is additional I/O overhead.
>
> For a better benchmark, use longer audio/video so the overhead is a smaller share of the total time.

That makes sense. I will try longer audio and see if it improves the results. Thank you so much for your help.

@Napuh
Contributor

Napuh commented May 31, 2024

What's the conclusion? @James-Shared-Studios
