llm-evaluation-framework · GitHub Topics

#大语言模型#Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command ...

大语言模型 prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework 持续集成 CI/CD pentesting red-teaming vulnerability-scanners

TypeScript 6.17 k

7 小时前

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python 5.93 k

8 小时前

msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

llm-security ai-red-team llm-evaluation llm-evaluation-framework prompt-testing agent-framework

Python 1.28 k

1 天前

JinjieNi / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference

Python 234

5 个月前

cvs-health / langfair

#大语言模型#LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

人工智能 bias bias-detection fairness fairness-ai fairness-ml fairness-testing large-language-models 大语言模型 responsible-ai Python ai-safety llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python 197

1 个月前

parea-ai / parea-sdk-py

#大语言模型#Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

大语言模型 llm-evaluation llm-tools llmops llm-eval llm-evaluation-framework prompt-engineering generative-ai good-first-issue 监控

Python 76

2 个月前

Addepto / contextcheck

#大语言模型# MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable into CI pipelines for automated testing.

大语言模型 llm-evaluation rag Testing chatbot-framework Open Source ai-chat ai-testing-tool large-language-models 持续集成 llm-evaluation-framework

Python 64

4 个月前

zhuohaoyu / KIEval

#大语言模型#[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

explainable-ai 大语言模型 llm-evaluation llm-evaluation-framework llm-evaluation-metrics 机器学习

Python 36

9 个月前

multinear / multinear

#大语言模型#Develop reliable AI apps

evaluation 大语言模型 llms reliability llm-eval llm-evaluation llm-evaluation-framework

Svelte 36

3 天前

flexpa / llm-fhir-eval

#大语言模型#Benchmarking Large Language Models for FHIR

evals fhir 大语言模型 llm-evaluation-framework

4 个月前

aws-samples / fm-leaderboarder

FM-Leaderboard-er allows you to create leaderboard to find the best LLM/prompt for your own business use case based on your data, task, prompts

llm-evaluation llm-evaluation-framework

Python 18

5 个月前

honeyhiveai / realign

Realign is a testing and simulation framework for AI applications.

人工智能 alignment evaluation llms prompt-engineering red-teaming Simulation llm-eval llm-evaluation llm-evaluation-framework llmops rag

Python 16

4 个月前

Networks-Learning / prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

llm-eval llm-evaluation llm-evaluation-framework ranking-algorithm

Jupyter Notebook 9

6 个月前

pyladiesams / eval-llm-based-apps-jan2025

#大语言模型#Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

大语言模型 llmops llms workshop llm-eval llm-evaluation-framework llm-evaluation-metrics llm-monitoring

Jupyter Notebook 7

3 个月前

parea-ai / parea-sdk-ts

#大语言模型#TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

大语言模型 llm-evaluation llm-evaluation-framework llm-tools llms llm-eval prompt-engineering

TypeScript 4

3 个月前

yukinagae / genkitx-promptfoo

#大语言模型#Community Plugin for Genkit to use Promptfoo

人工智能 evaluation evaluation-framework Firebase genkit 大语言模型 llm-eval llm-evaluation llm-evaluation-framework llmops 插件 prompt prompt-testing prompts Testing

TypeScript 3

3 个月前

stair-lab / melt

Multilingual Evaluation Toolkits

llm-evaluation-framework multilingual

Python 3

5 个月前

jaaack-wang / multi-problem-eval-llm

#大语言模型#Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

explainable-ai large-language-models 大语言模型 llm-eval llm-evaluation-framework llm-prompting

Jupyter Notebook 2

9 个月前

yuzu-ai / ShinRakuda

#大语言模型#Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across div...

大语言模型 llm-eval llm-evaluation llm-evaluation-framework japanese

Python 2

7 个月前

yukinagae / promptfoo-sample

#大语言模型#Sample project demonstrates how to use Promptfoo, a test framework for evaluating the output of generative AI models

evaluation evaluation-framework 大语言模型 llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing prompts Testing

7 个月前