Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: milvus plugin #156

Merged
merged 8 commits into from
Nov 17, 2024
Merged

feat: milvus plugin #156

merged 8 commits into from
Nov 17, 2024

Conversation

zhanshuyou
Copy link
Contributor

@zhanshuyou zhanshuyou commented Nov 7, 2024

Before you submit a pull request, please make sure you have read and understood the contribution guidelines and the code of conduct.

This pull request is related to:

  • A bug
  • A new feature
  • Documentation
  • Other (please specify)

I have checked the following:

  • I have read and understood the contribution guidelines and the code of conduct;
  • I have added new tests (for bug fixes/features);
  • I have added/updated the documentation (for bug fixes / features).

Description:
The Milvus plugin provides indexer and retriever implementatons that use the Milvus vector database.

Related issues:
Please, link the related issues that this pull request will resolve here (if any).

Summary by CodeRabbit

  • New Features

    • Introduced a new Milvus plugin for indexing and retrieving data with the Milvus vector database.
    • Added functionalities for creating retrievers and indexers, along with collection management.
    • Provided examples and detailed instructions in the README for easy setup and usage.
  • Documentation

    • Updated README to include installation instructions, configuration details, and usage examples for the new plugin.
  • Tests

    • Added a suite of unit tests to validate the functionality of the Milvus plugin, including retriever and indexer operations.
  • Chores

    • Created necessary configuration files for TypeScript and build settings specific to the Milvus plugin.

Sorry, something went wrong.

Copy link
Contributor

coderabbitai bot commented Nov 7, 2024

Walkthrough

The changes introduce a new Milvus plugin within the Genkit framework, enabling indexing and retrieval of data using the Milvus vector database. Key components include a README.md detailing installation and configuration, a package.json defining dependencies and scripts, and an index.ts file implementing core functionalities such as retrievers and indexers. Additional files include a test suite for unit testing the plugin's features, TypeScript configuration, and build configuration files. Overall, the updates establish a structured integration of the Milvus database into the Genkit ecosystem.

Changes

File Change Summary
plugins/milvus/README.md Added documentation for the new Milvus plugin, including installation, configuration, and usage examples.
plugins/milvus/package.json Created package.json for the genkitx-milvus plugin, specifying metadata, dependencies, and scripts.
plugins/milvus/src/index.ts Introduced core functionality for the Milvus plugin, including schemas, retriever, indexer, and utility functions.
plugins/milvus/tests/index_test.ts Added unit tests for the Milvus plugin, covering retriever and indexer creation, collection management, and data operations.
plugins/milvus/tsconfig.json Created TypeScript configuration file extending the base config and including the src directory.
plugins/milvus/tsup.config.ts Introduced build configuration for the Milvus plugin, extending common settings and defining build options.

Poem

In the garden of code, a new plugin blooms,
With Milvus at heart, it dispels all the glooms.
Indexing and searching, oh what a delight,
Genkit and Milvus, a powerful sight!
So hop to the rhythm, let data take flight,
In the world of vectors, we dance through the night! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (5)
plugins/milvus/README.md (2)

1-10: Enhance introduction and installation documentation.

Consider adding:

  1. A brief explanation of what vector databases are and their use cases
  2. Required peer dependencies (if any)
  3. Supported Milvus versions

54-84: Enhance usage documentation with more examples.

Consider adding:

  1. TypeScript type information for parameters and return values
  2. Error handling examples
  3. Complete example with actual document structure
  4. Example of how to query with different similarity metrics
plugins/milvus/tests/index_test.ts (2)

31-34: Add documentation for client parameters.

The empty client parameters will cause most tests to be skipped. Consider adding:

  1. A comment explaining how to configure these parameters
  2. Instructions for setting up test credentials
  3. Environment variable support for CI/CD
 const clientParams = {
-  address: '',
-  token: '',
+  address: process.env.MILVUS_ADDRESS || '',
+  token: process.env.MILVUS_TOKEN || '',
 };
+
+// Note: Tests requiring a Milvus connection will be skipped if MILVUS_ADDRESS is not set
+// To run all tests locally:
+// 1. Start a Milvus instance
+// 2. Set MILVUS_ADDRESS (e.g., 'localhost:19530')
+// 3. Set MILVUS_TOKEN if authentication is enabled

105-107: Strengthen assertions for search results.

The current assertion only checks the error code and array length. Consider adding more comprehensive checks:

  1. Type validation for search results
  2. Verification of result structure
  3. Score threshold validation
-      assert(
-        result.status.error_code === 'Success' && result.results.length >= 0
-      );
+      assert(result.status.error_code === 'Success', 'Search should succeed');
+      assert(Array.isArray(result.results), 'Results should be an array');
+      assert(result.results.length > 0, 'Should return at least one result');
+      
+      // Verify result structure
+      const firstResult = result.results[0];
+      assert('score' in firstResult, 'Result should contain a score');
+      assert('id' in firstResult, 'Result should contain an id');
+      assert(firstResult.score >= 0 && firstResult.score <= 1, 'Score should be between 0 and 1');
plugins/milvus/src/index.ts (1)

297-305: Handle Sensitive Information in Error Logging

In the deleteMilvusCollection function, logging the error with console.error might expose sensitive information. Consider sanitizing the error message or logging less sensitive details to avoid potential information leakage.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 25df5a1 and de272c8.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (6)
  • plugins/milvus/README.md (1 hunks)
  • plugins/milvus/package.json (1 hunks)
  • plugins/milvus/src/index.ts (1 hunks)
  • plugins/milvus/tests/index_test.ts (1 hunks)
  • plugins/milvus/tsconfig.json (1 hunks)
  • plugins/milvus/tsup.config.ts (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • plugins/milvus/package.json
  • plugins/milvus/tsconfig.json
  • plugins/milvus/tsup.config.ts
🧰 Additional context used
🪛 LanguageTool
plugins/milvus/README.md

[uncategorized] ~35-~35: Loose punctuation mark.
Context: ...e three optional parameters: - dbName: Specified database - clientParams: I...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e: Specified database - clientParams`: If you're not running your Milvus serve...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~48-~48: Loose punctuation mark.
Context: ...oken: "", } ``` - embedderOptions: Use this parameter to pass options to t...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (4)
plugins/milvus/README.md (1)

1-84: 🛠️ Refactor suggestion

Add essential documentation sections.

Consider adding the following sections to make the documentation more comprehensive:

  1. Troubleshooting guide with common issues and solutions
  2. Migration guide for users coming from similar vector database plugins
  3. Performance considerations and best practices
  4. Security considerations when connecting to remote Milvus instances
🧰 Tools
🪛 LanguageTool

[uncategorized] ~35-~35: Loose punctuation mark.
Context: ...e three optional parameters: - dbName: Specified database - clientParams: I...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e: Specified database - clientParams`: If you're not running your Milvus serve...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~48-~48: Loose punctuation mark.
Context: ...oken: "", } ``` - embedderOptions: Use this parameter to pass options to t...

(UNLIKELY_OPENING_PUNCTUATION)

plugins/milvus/tests/index_test.ts (2)

1-28: LGTM! Proper license and imports.

The license header is complete, and the imports are appropriately structured using built-in Node.js test modules.


37-54: Verify vector dimensions match collection configuration.

The mock entries use 4-dimensional vectors, but there's no validation to ensure this matches the collection's configuration. Consider:

  1. Adding a constant for the vector dimension
  2. Validating mock data dimensions
+const VECTOR_DIMENSION = 4;
+
+// Validate vector dimensions
+const validateVectorDimension = (vector: number[]) => {
+  if (vector.length !== VECTOR_DIMENSION) {
+    throw new Error(`Vector dimension must be ${VECTOR_DIMENSION}`);
+  }
+};
+
 const mockEntries = [
   {
     vector: [0.1, 0.2, 0.3, 0.4],
     document: 'This is a test document',
     metadata: {
✅ Verification successful

Let me check the actual dimension configurations in these files to verify if there's consistency.


Let me check how this test file is using these mock entries and if there's any connection to the embedders.


Mock vector dimensions correctly match collection configuration

The mock entries with 4-dimensional vectors are correctly aligned with the collection configuration. The test file explicitly creates a Milvus collection with dimension: 4 and uses consistent 4-dimensional vectors in the mock data for testing insert and search operations. No changes are needed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify vector dimension consistency across the codebase
rg -l "dimension.*4" --type ts

Length of output: 131


Script:

#!/bin/bash
# Check the context around dimension configurations
rg "dimension.*4" -B 2 -A 2 --type ts

Length of output: 1842


Script:

#!/bin/bash
# Get more context around the test file usage
cat plugins/milvus/tests/index_test.ts

Length of output: 3210

plugins/milvus/src/index.ts (1)

105-159: milvusRetriever Function Implementation Looks Good

The milvusRetriever function is well-implemented, correctly handles collection existence, and performs querying appropriately. Good job on integrating embedding and retrieval logic.

Comment on lines +56 to +118
describe('Test Genkit Milvus Plugin', () => {
it('should create a retriever', async () => {
const retriever = milvusRetrieverRef({
collectionName: COLLECTION_NAME,
});
assert(retriever.name === `milvus/${COLLECTION_NAME}`);
});

it('should create an indexer', async () => {
const indexer = milvusIndexerRef({
collectionName: COLLECTION_NAME,
});
assert(indexer.name === `milvus/${COLLECTION_NAME}`);
});

if (clientParams.address) {
it('should create a collection', async () => {
const isCollectionExists = await hasMilvusCollection({
collectionName: COLLECTION_NAME,
clientParams,
});
if (isCollectionExists) {
console.log('Collection already exists');
assert(true);
return;
}
const result = await createMilvusCollection({
collectionName: COLLECTION_NAME,
dimension: 4,
clientParams,
});
assert(result.error_code === 'Success');
});

it('should insert data', async () => {
const result = await insertMilvusData({
collectionName: COLLECTION_NAME,
entries: mockEntries,
clientParams,
});
assert(result.insert_cnt === `${mockEntries.length}`);
});

it('should search data', async () => {
const result = await searchMilvusData({
collectionName: COLLECTION_NAME,
query: mockEntries[0].vector,
clientParams,
});
assert(
result.status.error_code === 'Success' && result.results.length >= 0
);
});

it('should delete a collection', async () => {
const result = await deleteMilvusCollection({
collectionName: COLLECTION_NAME,
clientParams,
});
assert(result.error_code === 'Success');
});
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance test robustness and coverage.

The test suite could benefit from several improvements:

  1. Add cleanup in an afterAll hook to ensure collection deletion
  2. Add timeout handling for Milvus operations
  3. Include negative test cases
  4. Add more assertions for edge cases

Here's a suggested improvement:

 describe('Test Genkit Milvus Plugin', () => {
+  // Set timeout for long-running operations
+  jest.setTimeout(30000);
+
+  // Ensure cleanup after all tests
+  afterAll(async () => {
+    if (clientParams.address) {
+      try {
+        await deleteMilvusCollection({
+          collectionName: COLLECTION_NAME,
+          clientParams,
+        });
+      } catch (error) {
+        console.error('Cleanup failed:', error);
+      }
+    }
+  });
+
   // Existing tests...
+
+  // Add negative test cases
+  if (clientParams.address) {
+    it('should handle invalid vector dimensions', async () => {
+      const invalidEntry = {
+        vector: [0.1, 0.2], // Wrong dimension
+        document: 'Invalid vector',
+        metadata: { text: 'Invalid vector', pos: 1 },
+      };
+      
+      await assert.rejects(
+        async () => {
+          await insertMilvusData({
+            collectionName: COLLECTION_NAME,
+            entries: [invalidEntry],
+            clientParams,
+          });
+        },
+        /dimension/i
+      );
+    });
+  }
 });

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +333 to +322
address: configOrAddress,
token: '', // Default token is an empty string
username: '', // Default username is an empty string
password: '', // Default password is an empty string
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid Hardcoding Empty Credentials

The getDefaultConfig function sets default credentials (token, username, password) to empty strings. This may pose security risks if default credentials are inadvertently used in production. Recommend requiring explicit credentials or throwing an error when they're not provided.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (7)
plugins/milvus/README.md (3)

1-10: Add prerequisites and version compatibility information.

Consider enhancing the introduction section with:

  1. Required Milvus version compatibility
  2. Prerequisites (e.g., running Milvus instance, minimum Node.js version)
  3. Link to Milvus installation guide for new users

54-84: Enhance usage examples.

Consider adding:

  1. Error handling examples and common error scenarios
  2. Search parameters documentation (e.g., top_k, search radius)
  3. Batch processing examples for efficient indexing
  4. Example of how to delete or update indexed documents

1-84: Add security and performance considerations.

Please add sections covering:

  1. Security best practices:
    • Token management
    • Network security recommendations
    • Access control configuration
  2. Performance optimization:
    • Connection pooling
    • Batch size recommendations
    • Index optimization guidelines
🧰 Tools
🪛 LanguageTool

[uncategorized] ~35-~35: Loose punctuation mark.
Context: ...e three optional parameters: - dbName: Specified database - clientParams: I...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e: Specified database - clientParams`: If you're not running your Milvus serve...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~48-~48: Loose punctuation mark.
Context: ...oken: "", } ``` - embedderOptions: Use this parameter to pass options to t...

(UNLIKELY_OPENING_PUNCTUATION)

plugins/milvus/src/index.ts (4)

206-236: Add retry mechanism and timeout configuration.

Collection management operations could benefit from:

  1. Retry mechanism for transient failures
  2. Configurable timeout
  3. Error handling with specific error types
 export async function createMilvusCollection(params: {
   collectionName: string;
   dimension: number;
   dbName?: string;
   clientParams?: MilvusClientParams;
+  timeout?: number;
+  maxRetries?: number;
 }) {
   const { clientParams, collectionName, dimension, dbName } = params;
   const milvus = getMilvusInstance(clientParams);

-  return milvus.createCollection({
-    collection_name: collectionName,
-    db_name: dbName,
-    dimension,
-    auto_id: true,
-    enable_dynamic_field: true,
-  });
+  const retries = params.maxRetries ?? 3;
+  let lastError;
+
+  for (let i = 0; i < retries; i++) {
+    try {
+      return await milvus.createCollection({
+        collection_name: collectionName,
+        db_name: dbName,
+        dimension,
+        auto_id: true,
+        enable_dynamic_field: true,
+        timeout: params.timeout,
+      });
+    } catch (error) {
+      lastError = error;
+      if (i < retries - 1) {
+        await new Promise(resolve => setTimeout(resolve, Math.pow(2, i) * 1000));
+      }
+    }
+  }
+  throw new Error(`Failed to create collection after ${retries} attempts: ${lastError}`);

238-272: Add input validation and rate limiting.

Data operations could benefit from:

  1. Input validation for entries and query parameters
  2. Rate limiting mechanism
  3. Metrics collection for monitoring

Consider implementing a rate limiter utility:

class RateLimiter {
  private queue: Array<() => Promise<any>> = [];
  private processing = false;
  
  constructor(private rateLimit: number, private interval: number) {}
  
  async execute<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push(async () => {
        try {
          resolve(await fn());
        } catch (error) {
          reject(error);
        }
      });
      this.process();
    });
  }
  
  private async process() {
    if (this.processing) return;
    this.processing = true;
    
    while (this.queue.length > 0) {
      const batch = this.queue.splice(0, this.rateLimit);
      await Promise.all(batch.map(fn => fn()));
      if (this.queue.length > 0) {
        await new Promise(resolve => setTimeout(resolve, this.interval));
      }
    }
    
    this.processing = false;
  }
}

274-291: Enhance error handling and cleanup.

The collection deletion could be improved with:

  1. Specific error types
  2. Cleanup of related resources
  3. Logging of deletion attempts
 export async function deleteMilvusCollection(params: {
   collectionName: string;
   dbName?: string;
   clientParams?: MilvusClientParams;
 }) {
   const { clientParams, collectionName, dbName } = params;
   const milvus = getMilvusInstance(clientParams);

   try {
+    console.log(`Attempting to delete collection: ${collectionName}`);
+    
+    // Release collection resources first
+    await milvus.releaseCollection({
+      collection_name: collectionName,
+      db_name: dbName,
+    });
+
     return await milvus.dropCollection({
       collection_name: collectionName,
       db_name: dbName,
     });
   } catch (error) {
-    console.error('Failed to delete Milvus collection:', error);
+    if (error.message?.includes('collection not exists')) {
+      console.warn(`Collection ${collectionName} does not exist`);
+      return;
+    }
+    console.error(`Failed to delete collection ${collectionName}:`, error);
     throw error;
   }
 }

1-323: Consider implementing monitoring and observability.

To improve operational visibility, consider adding:

  1. Prometheus metrics for operation latencies and error rates
  2. OpenTelemetry tracing for debugging and performance analysis
  3. Health check endpoints for the Milvus connection
  4. Structured logging with correlation IDs
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between de272c8 and 80be313.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (6)
  • plugins/milvus/README.md (1 hunks)
  • plugins/milvus/package.json (1 hunks)
  • plugins/milvus/src/index.ts (1 hunks)
  • plugins/milvus/tests/index_test.ts (1 hunks)
  • plugins/milvus/tsconfig.json (1 hunks)
  • plugins/milvus/tsup.config.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • plugins/milvus/package.json
🚧 Files skipped from review as they are similar to previous changes (3)
  • plugins/milvus/tests/index_test.ts
  • plugins/milvus/tsconfig.json
  • plugins/milvus/tsup.config.ts
🧰 Additional context used
🪛 LanguageTool
plugins/milvus/README.md

[uncategorized] ~35-~35: Loose punctuation mark.
Context: ...e three optional parameters: - dbName: Specified database - clientParams: I...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e: Specified database - clientParams`: If you're not running your Milvus serve...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~48-~48: Loose punctuation mark.
Context: ...oken: "", } ``` - embedderOptions: Use this parameter to pass options to t...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (3)
plugins/milvus/src/index.ts (3)

1-40: LGTM! Well-structured imports and schema definitions.

The imports, schema definitions, and interface declarations are well-organized and properly typed.


42-84: LGTM! Clean plugin registration and reference implementations.

The plugin registration and reference implementations are well-structured and type-safe.


293-307: LGTM! Collection existence check implementation.

The implementation for checking collection existence is clean and well-structured.

Comment on lines +22 to +49
{
collectionName: 'collection_01',
embedder: textEmbeddingGecko,
},
]),
],
// ...
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add technical configuration details.

The configuration section should include:

  1. Required collection schema (dimension size, field types, etc.)
  2. Collection creation behavior (auto-create vs. manual setup)
  3. Performance-related settings (index type, metric type)
  4. Example showing all possible configuration options

Comment on lines +145 to +204
/**
* Configures a Milvus vector store indexer.
*/
export function milvusIndexer<EmbedderCustomOptions extends z.ZodTypeAny>(
params: MilvusPluginOptions<EmbedderCustomOptions>
) {
const { collectionName, embedder, embedderOptions, dbName } = params;
const milvusConfig = params.clientParams ?? getDefaultConfig();

return defineIndexer(
{
name: `milvus/${collectionName}`,
configSchema: MilvusIndexerOptionsSchema.optional(),
},
async (docs) => {
const embeddings = await Promise.all(
docs.map((doc) =>
embed({
embedder,
content: doc,
options: embedderOptions,
})
)
);

const entries = embeddings.map((value, i) => {
const metadata: RowData = {
...docs[i].metadata,
};

return {
vector: value,
document: docs[i].text(),
metadata,
};
});

const isCollectionExists = await hasMilvusCollection({
collectionName,
dbName,
clientParams: milvusConfig,
});
if (!isCollectionExists) {
await createMilvusCollection({
dbName,
collectionName,
dimension: embeddings[0].length,
clientParams: milvusConfig,
});
}

await insertMilvusData({
collectionName: collectionName,
dbName,
entries,
clientParams: milvusConfig,
});
}
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add batch processing and error handling.

The parallel processing of documents could benefit from:

  1. Batch processing to prevent memory issues with large document sets
  2. Validation for empty embeddings array
  3. Error handling for failed embeddings
   async (docs) => {
+    if (!docs.length) {
+      return;
+    }
+
+    const BATCH_SIZE = 1000;
+    for (let i = 0; i < docs.length; i += BATCH_SIZE) {
+      const batch = docs.slice(i, i + BATCH_SIZE);
       const embeddings = await Promise.all(
-        docs.map((doc) =>
+        batch.map((doc) =>
           embed({
             embedder,
             content: doc,
             options: embedderOptions,
           })
         )
       );

+      if (!embeddings.length || !embeddings[0].length) {
+        throw new Error('Failed to generate embeddings');
+      }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +86 to +143
/**
* Configures a Milvus vector store retriever.
*/
export function milvusRetriever<EmbedderCustomOptions extends z.ZodTypeAny>(
params: MilvusPluginOptions<EmbedderCustomOptions>
) {
const { collectionName, embedder, embedderOptions, dbName } = params;
const milvusConfig = params.clientParams ?? getDefaultConfig();

return defineRetriever(
{
name: `milvus/${collectionName}`,
configSchema: MilvusRetrieverOptionsSchema,
},
async (content, options) => {
const queryEmbeddings = await embed({
embedder,
content,
options: embedderOptions,
});

const isCollectionExists = await hasMilvusCollection({
collectionName,
dbName,
clientParams: milvusConfig,
});
if (!isCollectionExists) {
await createMilvusCollection({
dbName,
collectionName,
dimension: queryEmbeddings.length,
clientParams: milvusConfig,
});
}

const response = await searchMilvusData({
collectionName,
dbName,
query: queryEmbeddings,
limit: options.limit,
filter: JSON.stringify(options.filter),
topk: options.k,
clientParams: milvusConfig,
});

return {
documents: response.results
.map((result) => result['$meta'])
.filter(
(m): m is { document: string; metadata: Record<string, any> } => !!m
)
.map((m) => {
return Document.fromText(m.document, m.metadata).toJSON();
}),
};
}
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling and validation.

Consider adding:

  1. Error handling for embedding generation
  2. Validation for empty query embeddings
  3. Validation that queryEmbeddings.length > 0 before creating collection
 async (content, options) => {
+  if (!content) {
+    throw new Error('Content is required for embedding');
+  }
+
   const queryEmbeddings = await embed({
     embedder,
     content,
     options: embedderOptions,
   });
+
+  if (!queryEmbeddings || queryEmbeddings.length === 0) {
+    throw new Error('Failed to generate embeddings');
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Configures a Milvus vector store retriever.
*/
export function milvusRetriever<EmbedderCustomOptions extends z.ZodTypeAny>(
params: MilvusPluginOptions<EmbedderCustomOptions>
) {
const { collectionName, embedder, embedderOptions, dbName } = params;
const milvusConfig = params.clientParams ?? getDefaultConfig();
return defineRetriever(
{
name: `milvus/${collectionName}`,
configSchema: MilvusRetrieverOptionsSchema,
},
async (content, options) => {
const queryEmbeddings = await embed({
embedder,
content,
options: embedderOptions,
});
const isCollectionExists = await hasMilvusCollection({
collectionName,
dbName,
clientParams: milvusConfig,
});
if (!isCollectionExists) {
await createMilvusCollection({
dbName,
collectionName,
dimension: queryEmbeddings.length,
clientParams: milvusConfig,
});
}
const response = await searchMilvusData({
collectionName,
dbName,
query: queryEmbeddings,
limit: options.limit,
filter: JSON.stringify(options.filter),
topk: options.k,
clientParams: milvusConfig,
});
return {
documents: response.results
.map((result) => result['$meta'])
.filter(
(m): m is { document: string; metadata: Record<string, any> } => !!m
)
.map((m) => {
return Document.fromText(m.document, m.metadata).toJSON();
}),
};
}
);
}
/**
* Configures a Milvus vector store retriever.
*/
export function milvusRetriever<EmbedderCustomOptions extends z.ZodTypeAny>(
params: MilvusPluginOptions<EmbedderCustomOptions>
) {
const { collectionName, embedder, embedderOptions, dbName } = params;
const milvusConfig = params.clientParams ?? getDefaultConfig();
return defineRetriever(
{
name: `milvus/${collectionName}`,
configSchema: MilvusRetrieverOptionsSchema,
},
async (content, options) => {
if (!content) {
throw new Error('Content is required for embedding');
}
const queryEmbeddings = await embed({
embedder,
content,
options: embedderOptions,
});
if (!queryEmbeddings || queryEmbeddings.length === 0) {
throw new Error('Failed to generate embeddings');
}
const isCollectionExists = await hasMilvusCollection({
collectionName,
dbName,
clientParams: milvusConfig,
});
if (!isCollectionExists) {
await createMilvusCollection({
dbName,
collectionName,
dimension: queryEmbeddings.length,
clientParams: milvusConfig,
});
}
const response = await searchMilvusData({
collectionName,
dbName,
query: queryEmbeddings,
limit: options.limit,
filter: JSON.stringify(options.filter),
topk: options.k,
clientParams: milvusConfig,
});
return {
documents: response.results
.map((result) => result['$meta'])
.filter(
(m): m is { document: string; metadata: Record<string, any> } => !!m
)
.map((m) => {
return Document.fromText(m.document, m.metadata).toJSON();
}),
};
}
);
}

EPMatt
EPMatt previously approved these changes Nov 15, 2024
Copy link
Contributor

@EPMatt EPMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @zhanshuyou, thanks for your contribution!

Looks LGTM to me. I'll take care of the release.

@EPMatt
Copy link
Contributor

EPMatt commented Nov 15, 2024

@all-contributors add @zhanshuyou as contributor for code, docs, tests

Copy link
Contributor

@EPMatt

I've put up a pull request to add @zhanshuyou! 🎉

davidoort
davidoort previously approved these changes Nov 15, 2024
@EPMatt EPMatt force-pushed the feat/milvus-plugin branch from 3bf84b7 to a449367 Compare November 15, 2024 13:45
@EPMatt EPMatt requested a review from davidoort November 15, 2024 13:49
Copy link
Contributor

@davidoort davidoort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for these contributions!!

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@EPMatt EPMatt enabled auto-merge (squash) November 17, 2024 22:19
@EPMatt EPMatt merged commit 8403d97 into TheFireCo:main Nov 17, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants