
(EAI-506): Refactor full response generation as stateless function #489


Merged
16 commits merged on Sep 5, 2024

Conversation

@mongodben (Collaborator) commented Aug 28, 2024

Jira: https://jira.mongodb.org/browse/EAI-506

Changes

  • Refactor response generation as a stateless function so that we can more easily evaluate outputs without running a server (see the sketch after this list)
  • Evaluate our whole RAG pipeline with Braintrust autoevals
  • Refactor autoevals to use Azure OpenAI instead of plain OpenAI
  • Upgrade typechat to "^0.1.0" because the older version was causing issues with the Braintrust CLI (see CLI Error: [ERROR] Could not resolve "readline/promises" braintrustdata/braintrust-sdk#356)
  • Add Braintrust instrumentation to our server implementation to trace results in the evals.
    • Note: We can also easily trace with the Braintrust observability module if we ever choose to use that.
  • Refactor server unit tests for simplicity
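
As a rough illustration of the stateless shape: the names, types, and dependency-injection style below are assumptions for the sketch, not the exact API in this PR.

```ts
// Hypothetical sketch of the stateless shape. The names and types here
// (generateResponse, Message, etc.) are assumptions, not the PR's exact API.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerateResponseParams {
  conversation: Message[]; // prior turns, passed in explicitly
  latestMessageText: string; // the new user message
  generate: (messages: Message[]) => Promise<string>; // injected LLM call
}

// Stateless: no Express request/response objects and no database handles.
// Everything the function needs arrives through its arguments, so an eval
// harness can call it directly with fixture conversations, no server required.
async function generateResponse({
  conversation,
  latestMessageText,
  generate,
}: GenerateResponseParams): Promise<{ messages: Message[] }> {
  const messages: Message[] = [
    ...conversation,
    { role: "user", content: latestMessageText },
  ];
  const answer = await generate(messages);
  return { messages: [...messages, { role: "assistant", content: answer }] };
}
```

Because all dependencies come in as arguments, the Express route handler and the Braintrust eval task can share the same code path.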

Notes

@mongodben marked this pull request as ready for review September 4, 2024 17:43
Comment on lines 116 to 121
// Load the end-to-end conversation eval cases from the YAML fixture files.
const miscCases = getConversationsEvalCasesFromYaml(
  fs.readFileSync(path.resolve(basePath, "conversations.yml"), "utf8")
);
const faqCases = getConversationsEvalCasesFromYaml(
  fs.readFileSync(path.resolve(basePath, "faq_conversations.yml"), "utf8")
);
Collaborator

When do you see us putting a set of evaluation cases in this file vs the processors/*.eval.ts approach?

I'm a bit confused on the name of the file too. Could we call this e.g. conversations.eval.ts or is there something particular about config.ts that we're evaluating here?

Collaborator Author

> When do you see us putting a set of evaluation cases in this file vs the processors/*.eval.ts approach?

I see this file as our "end-to-end" evaluation.

In contrast, the processors/*.eval.ts files are for each AI component; I see those as more like "unit" evaluations.

> I'm a bit confused on the name of the file too. Could we call this e.g. conversations.eval.ts or is there something particular about config.ts that we're evaluating here?

We can definitely change the name; I wasn't loving it either. I just put it as is because it's largely evaluating the stuff in the config.ts file. I think conversations.eval.ts makes sense here.


More philosophically, I think there's a larger conversation that we ought to have about:

  1. How to programmatically run the evals
  2. How to structure evals for different features. For example, for the OM/CM stuff you're doing, I think we agree that we should accompany that work with some evals. But where do we put them? In this conversations.eval.ts file/evaluation, in a separate file, etc.

Collaborator Author

My $0.02: put them in the same YAML files and include tags for the relevant products, then run the whole eval suite in the conversations.eval.ts file. See how we perform on the new eval cases and make sure there's no regression elsewhere.
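
A rough sketch of what that could look like; the tags field, the ConversationEvalCase shape, and the helper below are illustrative assumptions rather than code from this PR:

```ts
// Hypothetical shape for a case loaded from the shared YAML files.
// The `tags` field is the assumption: each case lists the products it covers.
interface ConversationEvalCase {
  name: string;
  messages: { role: "user" | "assistant"; content: string }[];
  tags?: string[]; // e.g. ["faq"], ["om"], ["cm"]
}

// Run everything by default; optionally narrow to one product's cases while iterating.
function selectCases(
  allCases: ConversationEvalCase[],
  onlyTags?: string[]
): ConversationEvalCase[] {
  if (!onlyTags || onlyTags.length === 0) {
    return allCases;
  }
  return allCases.filter((evalCase) =>
    (evalCase.tags ?? []).some((tag) => onlyTags.includes(tag))
  );
}

// Example: full suite for regression checks, or just the new OM/CM cases locally.
// const cases = selectCases([...miscCases, ...faqCases], ["om", "cm"]);
```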

Comment on lines 109 to 111
// Adapt the autoevals ContextRelevancy (RAGAS) scorer to the conversation eval args.
const ConversationContextRelevancy: ConversationEvalScorer = async (args) => {
  return ContextRelevancy(getConversationRagasConfig(args, judgeModelConfig));
};

Collaborator Author

LLM rate limits were hit (HTTP status 429). The error is reported somewhere in the UI, but you have to look around/scroll a bit to find it.

Collaborator

Not a blocker for this PR, but do we have retries for this? Maybe a good future-work EAI ticket?

Collaborator Author

By not running the tests in parallel we should be able to avoid this. It's fairly annoying for spiky workloads like this, but AI magic isn't infinite (yet).
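
On the retry question above, a minimal sketch of a retry-with-backoff wrapper around a rate-limited call; this is a hypothetical helper, not code from this PR or from the Braintrust SDK:

```ts
// Hypothetical helper: retry an async call when the model API returns HTTP 429.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: unknown) {
      const status = (err as { status?: number }).status;
      if (status !== 429 || attempt >= maxAttempts) {
        throw err;
      }
      // Exponential backoff with a little jitter before the next attempt.
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example usage with one of the scorers above:
// const score = await withRetry(() =>
//   ConversationContextRelevancy(args)
// );
```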

@nlarew (Collaborator) left a comment

LGTM modulo the merge conflict fix & some non-blocking discussion points. Maybe we can talk about them in standup soon.

mongodben and others added 3 commits September 5, 2024 15:41
Co-authored-by: Nick Larew <nick.larew@mongodb.com>