
(EAI-506): Refactor full response generation as stateless function #489


Merged
16 commits merged on Sep 5, 2024

Conversation

@mongodben (Collaborator) commented Aug 28, 2024

Jira: https://jira.mongodb.org/browse/EAI-506

Changes

  • Refactor response generation as a stateless function so that we can more easily evaluate outputs without running a server (see the sketch after this list)
  • Evaluate our whole RAG pipeline with Braintrust autoevals
  • Refactor autoevals to use Azure OpenAI instead of plain OpenAI
  • Upgrade typechat to "^0.1.0" because the older version was causing issues with the Braintrust CLI (see CLI Error: [ERROR] Could not resolve "readline/promises" braintrustdata/braintrust-sdk#356)
  • Add Braintrust instrumentation to our server implementation to trace results in the evals.
    • Note: We can also easily trace with the Braintrust observability module if we ever choose to use that.
  • Refactor server unit tests for simplicity
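
As a rough illustration of the stateless shape: the names, types, and dependency-injection style below are assumptions for the sketch, not the exact API in this PR.

```ts
// Hypothetical sketch of the stateless shape. The names and types here
// (generateResponse, Message, etc.) are assumptions, not the PR's exact API.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerateResponseParams {
  conversation: Message[]; // prior turns, passed in explicitly
  latestMessageText: string; // the new user message
  generate: (messages: Message[]) => Promise<string>; // injected LLM call
}

// Stateless: no Express request/response objects and no database handles.
// Everything the function needs arrives through its arguments, so an eval
// harness can call it directly with fixture conversations, no server required.
async function generateResponse({
  conversation,
  latestMessageText,
  generate,
}: GenerateResponseParams): Promise<{ messages: Message[] }> {
  const messages: Message[] = [
    ...conversation,
    { role: "user", content: latestMessageText },
  ];
  const answer = await generate(messages);
  return { messages: [...messages, { role: "assistant", content: answer }] };
}
```

Because all dependencies come in as arguments, the Express route handler and the Braintrust eval task can share the same code path.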

Notes

@mongodben marked this pull request as ready for review September 4, 2024 17:43
Comment on lines 116 to 121
// Load the end-to-end conversation eval cases from the YAML fixture files.
const miscCases = getConversationsEvalCasesFromYaml(
  fs.readFileSync(path.resolve(basePath, "conversations.yml"), "utf8")
);
const faqCases = getConversationsEvalCasesFromYaml(
  fs.readFileSync(path.resolve(basePath, "faq_conversations.yml"), "utf8")
);
Collaborator

When do you see us putting a set of evaluation cases in this file vs the processors/*.eval.ts approach?

I'm a bit confused on the name of the file too. Could we call this e.g. conversations.eval.ts or is there something particular about config.ts that we're evaluating here?

Collaborator Author

> When do you see us putting a set of evaluation cases in this file vs the processors/*.eval.ts approach?

I see this file as our "end-to-end" evaluation.

In contrast, the processors/*.eval.ts files are for each AI component; I see those as more like "unit" evaluations.

> I'm a bit confused on the name of the file too. Could we call this e.g. conversations.eval.ts or is there something particular about config.ts that we're evaluating here?

We can definitely change the name; I wasn't loving it either. I just put it as is because it's largely evaluating the stuff in the config.ts file. I think conversations.eval.ts makes sense here.


More philosophically, I think there's a larger conversation that we ought to have about:

  1. How to programmatically run the evals
  2. How to structure evals for different features. For example, for the OM/CM stuff you're doing, I think we agree that we should accompany that work with some evals. But where do we put them? In this conversations.eval.ts file/evaluation, in a separate file, etc.

Collaborator Author

My $0.02: put them in the same YAML files and include tags for the relevant products, then run the whole eval suite in the conversations.eval.ts file. See how we perform on the new eval cases and make sure there's no regression elsewhere.
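
A rough sketch of what that could look like; the tags field, the ConversationEvalCase shape, and the helper below are illustrative assumptions rather than code from this PR:

```ts
// Hypothetical shape for a case loaded from the shared YAML files.
// The `tags` field is the assumption: each case lists the products it covers.
interface ConversationEvalCase {
  name: string;
  messages: { role: "user" | "assistant"; content: string }[];
  tags?: string[]; // e.g. ["faq"], ["om"], ["cm"]
}

// Run everything by default; optionally narrow to one product's cases while iterating.
function selectCases(
  allCases: ConversationEvalCase[],
  onlyTags?: string[]
): ConversationEvalCase[] {
  if (!onlyTags || onlyTags.length === 0) {
    return allCases;
  }
  return allCases.filter((evalCase) =>
    (evalCase.tags ?? []).some((tag) => onlyTags.includes(tag))
  );
}

// Example: full suite for regression checks, or just the new OM/CM cases locally.
// const cases = selectCases([...miscCases, ...faqCases], ["om", "cm"]);
```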

Comment on lines 109 to 111
// Adapt the autoevals ContextRelevancy (RAGAS) scorer to the conversation eval args.
const ConversationContextRelevancy: ConversationEvalScorer = async (args) => {
  return ContextRelevancy(getConversationRagasConfig(args, judgeModelConfig));
};

Collaborator Author

LLM rate limits were hit (HTTP status 429). The error is reported somewhere in the UI, but you have to look around/scroll a bit to find it.

Collaborator

Not a blocker for this PR, but do we have retries for this? Maybe a good future-work EAI ticket?

Collaborator Author

By not running the tests in parallel we should be able to avoid this. It's fairly annoying for spiky workloads like this, but AI magic isn't infinite (yet).
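
On the retry question above, a minimal sketch of a retry-with-backoff wrapper around a rate-limited call; this is a hypothetical helper, not code from this PR or from the Braintrust SDK:

```ts
// Hypothetical helper: retry an async call when the model API returns HTTP 429.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: unknown) {
      const status = (err as { status?: number }).status;
      if (status !== 429 || attempt >= maxAttempts) {
        throw err;
      }
      // Exponential backoff with a little jitter before the next attempt.
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example usage with one of the scorers above:
// const score = await withRetry(() =>
//   ConversationContextRelevancy(args)
// );
```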

@nlarew (Collaborator) left a comment

LGTM modulo the merge conflict fix & some non-blocking discussion points. Maybe we can talk about them in standup soon.

mongodben and others added 3 commits September 5, 2024 15:41
Co-authored-by: Nick Larew <nick.larew@mongodb.com>