#LLM#The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.
#Data Warehouse#AI Observability & Evaluation
#LLM#Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including OpenAI Agents SDK, CrewAI, Langchain, Autogen, AG2, and CamelAI.
#Computer Science#The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, evals, datasets, labels. YC S24.
#LLM#🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite
Test your LLM-powered apps with TypeScript. No API key required.
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional approaches.
#大语言模型#Evalica, your favourite evaluation toolkit
#LLM#Benchmarking Large Language Models for FHIR
#LLM#Go Artificial Intelligence (GAI) helps you work with foundation models, large language models, and other AI models.
An implementation of Anthropic's paper and essay "A Statistical Approach to Model Evaluations"
#LLM#Root Signals Python SDK
Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.
MCP for Root Signals Evaluation Platform