Feature: added multiprocessing for creating hf embedddings #15260

shail2512-lm10 · 2024-08-09T15:56:02Z

Description

It would be great if the HuggingFaceEmbedding class had a multiprocessing feature for creating embeddings for large amounts of text, just as SentenceTransformers does. So I added multiprocessing support for it.

The HuggingFaceEmbedding class takes two additional arguments: parallel_process and target_devices. If parallel_process is True, a multi-process pool is started using the methods available in SentenceTransformers, and the pool is stopped once the encoding task is done.
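A minimal sketch of the start / encode / stop lifecycle described above. The method names `start_multi_process_pool`, `encode_multi_process`, and `stop_multi_process_pool` follow SentenceTransformers; the helper `encode_texts` and the `FakeModel` stand-in are hypothetical and are not the code in this PR. `FakeModel` exists only so the sketch runs without downloading a model.

```python
from typing import Any, List, Optional


def encode_texts(
    model: Any,
    texts: List[str],
    parallel_process: bool = False,
    target_devices: Optional[List[str]] = None,
) -> Any:
    """Hypothetical helper: encode texts, optionally through a multi-process pool."""
    if not parallel_process:
        # Single-process path: target_devices is ignored.
        return model.encode(texts)

    # Start the worker pool for the requested devices.
    pool = model.start_multi_process_pool(target_devices=target_devices)
    try:
        return model.encode_multi_process(texts, pool)
    finally:
        # Always stop the pool, even if encoding raised.
        model.stop_multi_process_pool(pool)


class FakeModel:
    """Stand-in (assumption) that records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def encode(self, texts):
        self.calls.append("encode")
        return [[float(len(t))] for t in texts]

    def start_multi_process_pool(self, target_devices=None):
        self.calls.append("start")
        return {"devices": target_devices}

    def encode_multi_process(self, texts, pool):
        self.calls.append("encode_multi_process")
        return [[float(len(t))] for t in texts]

    def stop_multi_process_pool(self, pool):
        self.calls.append("stop")
```

With a real SentenceTransformer instance in place of `FakeModel`, the same helper would exercise the actual pool methods. The `try`/`finally` is a design choice of this sketch so that worker processes are not left running if encoding fails.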

Reference: SentenceTransformer implementation

PS: This is my first PR in LlamaIndex :)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

Yes
No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Yes
No

Type of Change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Added new unit/integration tests
Added new notebook (that tests end-to-end)
I stared at the code and made sure it makes sense

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

nerdai

Thanks @shail2512-lm10 for the contribution. Left a minor comment in my review.

nerdai · 2024-08-10T12:44:44Z

...ons/embeddings/llama-index-embeddings-huggingface/llama_index/embeddings/huggingface/base.py

+    """
+    Args:
+        parallel_process (bool): Defaults to False. If True, a multi-process pool is started to run the encoding
+            in several independent processes.
+
+        target_devices (List[str], optional): Only taken into account if `parallel_process` is `True`. PyTorch
+            target devices, e.g. ["cuda:0", "cuda:1", ...], ["npu:0", "npu:1", ...], or ["cpu", "cpu", "cpu", "cpu"].
+            If target_devices is None and CUDA/NPU is available, then all available CUDA/NPU devices will be used.
+            If target_devices is None and CUDA/NPU is not available, then 4 CPU devices will be used.


Thanks for the class docstring. We should probably expand this to all fields and methods?
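The device fallback described in the docstring under review can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's implementation: `resolve_target_devices` and its `cuda_device_count` parameter are hypothetical names, and the NPU branch is omitted for brevity.

```python
from typing import List, Optional


def resolve_target_devices(
    target_devices: Optional[List[str]] = None,
    cuda_device_count: int = 0,  # assumption: the caller supplies the detected GPU count
) -> List[str]:
    """Hypothetical helper mirroring the documented fallback order."""
    if target_devices is not None:
        # An explicit list always wins.
        return list(target_devices)
    if cuda_device_count > 0:
        # No list given and CUDA is available: use every CUDA device.
        return [f"cuda:{i}" for i in range(cuda_device_count)]
    # No list given and no accelerator: fall back to 4 CPU workers.
    return ["cpu"] * 4
```

For example, `resolve_target_devices(None, cuda_device_count=2)` yields one entry per GPU, while calling it with no arguments yields four `"cpu"` entries.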

nerdai · 2024-08-10T12:50:58Z

@shail2512-lm10: We should also bump the version number in this package's pyproject.toml.

shail2512-lm10 · 2024-08-10T15:53:05Z

Thank you @nerdai. Sure, will do all the modifications!!

nerdai

thanks @shail2512-lm10!

shail2512-lm10 added 2 commits August 9, 2024 10:59

formatted

9c9c7be

docstring formatted

f95cedf

dosubot bot added the size:M label Aug 9, 2024

shail2512-lm10 added 2 commits August 9, 2024 14:23

fix attributes error

09e6904

Merge branch 'main' of https://github.com/run-llama/llama_index into …

b8f86db

…feat/multiprocessing_hfembedddings

nerdai approved these changes Aug 10, 2024

dosubot bot added the lgtm label Aug 10, 2024

nerdai self-assigned this Aug 10, 2024

shail2512-lm10 added 2 commits August 10, 2024 13:37

Merge branch 'main' of https://github.com/run-llama/llama_index into …

06158cb

…feat/multiprocessing_hfembedddings

added docstrings and bumped the pyproject.toml version

1600084

dosubot bot added size:L and removed size:M labels Aug 10, 2024

nerdai approved these changes Aug 10, 2024

nerdai enabled auto-merge (squash) August 10, 2024 18:11

nerdai merged commit d5680c6 into run-llama:main Aug 10, 2024
8 checks passed
