llm-evaluation · GitHub Topics

#LLM# 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

analytics llm llmops large-language-models openai self-hosted ycombinator monitoring observability Open Source langchain llama-index evaluation prompt-engineering prompt-management playground llm-evaluation llm-observability autogen

TypeScript 13.49 k

10 hours ago

comet-ml / opik

#LLM# Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Open Source langchain openai playground prompt-engineering llama-index llm llm-evaluation llm-observability llmops

Python 10.9 k

13 hours ago

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python 9.05 k

13 hours ago

promptfoo / promptfoo

#LLM# Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command ...

llm prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework continuous integration CI/CD pentesting red-teaming vulnerability-scanners

TypeScript 7.48 k

7 hours ago

Arize-ai / phoenix

#Data warehouse# AI Observability & Evaluation

llmops ai-monitoring ai-observability llm-eval datasets agents llm prompt-engineering anthropic evals llm-evaluation openai langchain llamaindex smolagents

Jupyter Notebook 6.25 k

2 hours ago

NVIDIA / garak

The LLM vulnerability scanner

ai llm-evaluation llm-security security-scanners vulnerability-assessment

Python 4.71 k

5 days ago

Giskard-AI / giskard

#LLM# 🐢 Open-Source Evaluation & Testing for AI & LLM systems

mlops ml-validation ml-testing llmops responsible-ai fairness-ai llm-eval llm-evaluation rag-evaluation ai-security llm-security ai-red-team red-team-tools llm

Python 4.69 k

20 hours ago

Helicone / helicone

#LLM# 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

large-language-models prompt-engineering agent-monitoring analytics evaluation gpt langchain llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring Open Source openai playground prompt-management ycombinator

TypeScript 4.09 k

1 day ago

Marker-Inc-Korea / AutoRAG

#LLM# AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops Open Source ops optimization pipeline Python qa rag rag-evaluation retrieval-augmented-generation

Python 4.08 k

5 days ago

PacktPublishing / LLM-Engineers-Handbook

#LLM# The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

genai llm llmops mlops rag Amazon Web Services fine-tuning-llm llm-evaluation ml-system-design

Python 3.64 k

4 months ago

Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

llm-tools prompt-engineering prompt-management llm-evaluation llm-framework rag-evaluation llm-observability llm-as-a-judge llm-monitoring llm-platform llm-playground llmops-platform

Python 2.92 k

2 days ago

truera / trulens

#Computer science# Evaluation and Tracking for LLM Experiments and AI Agents

machine learning neural-networks explainable-ml llmops ai-monitoring ai-observability evals llm-evaluation llm ai-agents llm-eval agentops

Python 2.61 k

1 day ago

lmnr-ai / lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

aiops developer-tools observability agents ai Rust analytics llm-evaluation llm-observability monitoring Open Source self-hosted ai-observability llmops evals evaluation TypeScript ts

TypeScript 2.14 k

16 hours ago

msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

llm-security ai-red-team llm-evaluation llm-evaluation-framework prompt-testing agent-framework

Python 1.52 k

4 days ago

microsoft / prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandab...

generative-ai llm-evaluation llm promptengineering

Python 950

1 month ago

cvs-health / uqlm

#LLM# UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

ai-safety hallucination llm llm-evaluation uncertainty-estimation uncertainty-quantification

Python 772

5 days ago

cyberark / FuzzyAI

#LLM# A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

jailbreak jailbreaking llm ai security Fuzzing/Fuzz testing llm-evaluation llm-security ai-red-team

Jupyter Notebook 630

10 days ago

onejune2018 / Awesome-LLM-Eval

#Natural language processing# Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating foundation LLMs and aimed at exploring the technical boundaries of generative AI.