LangSmith Client SDK Implementations (Python, updated Jun 12, 2024)
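Since the listing opens with the LangSmith client SDK for Python, a minimal usage sketch follows as a point of reference. It is a hedged illustration only: it assumes an API key is already configured in the environment (e.g. LANGCHAIN_API_KEY) and shows just the basic Client and traceable entry points.

```python
# Minimal sketch of the LangSmith Python client (illustrative; assumes
# LANGCHAIN_API_KEY is set so Client() can authenticate).
from langsmith import Client, traceable

client = Client()  # reads endpoint and API key from environment variables

@traceable(run_type="chain")  # logs inputs/outputs of each call as a run in LangSmith
def answer(question: str) -> str:
    # Placeholder "model" call; in real use this would invoke an LLM.
    return f"echo: {question}"

if __name__ == "__main__":
    print(answer("What does the evaluation topic cover?"))
    # Datasets for evaluation runs can be created through the same client, e.g.:
    # dataset = client.create_dataset("example-eval-dataset")
    # client.create_example(
    #     inputs={"question": "..."}, outputs={"answer": "..."}, dataset_id=dataset.id
    # )
```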
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. YC W23
Documentation for langsmith
A fairly robust mathematics parsing engine for C++ projects.
A task generation and model evaluation system.
The official evaluation suite and dynamic data release for MixEval.
Python client for Kolena's machine learning testing platform
The RAG Experiment Accelerator is a versatile tool for running and evaluating experiments with Azure Cognitive Search and the RAG pattern.
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
CyclOps for clinical ML evaluation & monitoring workshop
[ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
Python SDK for running evaluations on LLM generated responses
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
Evaluation tools for time series machine learning algorithms.
[ACL 2024] CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
Trajectopy - Trajectory Evaluation in Python
Build AI applications with confidence. DSPy Visualizer. Understand how your users are using your LLM-app, get a full picture of its quality and performance, collaborate with your stakeholders in ONE platform, and iterate towards the most valuable & reliable LLM-app.
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.