llm-as-a-judge · GitHub Topics

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

llm-tools prompt-engineering prompt-management llm-evaluation llm-framework rag-evaluation llm-observability llm-as-a-judge llm-monitoring llm-platform llm-playground llmops-platform

Python 2.44 k

3 days ago

prometheus-eval / prometheus-eval

#LLM# Evaluate your LLM's response with Prometheus and GPT4 💯

evaluation llm llmops Python vllm gpt4 llm-as-a-judge

Python 901

1 month ago

metauto-ai / agent-as-a-judge

🤠 Agent-as-a-Judge and DevAI dataset

llm-as-a-judge llms

Python 398

3 months ago

IAAR-Shanghai / xFinder

#LLM# [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

evaluation gpt llm large-language-models Regular expression reliability benchmark dataset chatglm phi qwen llm-as-a-judge

Python 163

2 months ago

martin-wey / CodeUltraFeedback

CodeUltraFeedback: aligning large language models to coding preferences

alignment code-generation dpo large-language-models llm-as-a-judge

Python 71

10 months ago

KID-22 / LLM-IR-Bias-Fairness-Survey

#LLM# This is the repo for the survey of Bias and Fairness in IR with LLMs.

bias fairness information-retrieval large-language-models recommender-systems ChatGPT llm llm-as-a-judge

9 days ago

MJ-Bench / MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

llm-as-a-judge

Jupyter Notebook 43

2 months ago

zhaochen0110 / Timo

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

llm-as-a-judge llms rlhf

Python 21

6 months ago

IAAR-Shanghai / xVerify

#LLM# xVerify: Efficient Answer Verifier for Large Language Model Evaluations

llm-as-a-judge benchmark evaluation Regular expression reliability ChatGPT llm open-r1

Python 20

16 days ago

minnesotanlp / cobbler

#NLP# Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

bias evaluation llm nlp bias-detection llm-as-a-judge llm-evaluation llms

Jupyter Notebook 20

1 year ago

PKU-ONELab / Themis

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

evaluation llm-as-a-judge nlg

Python 20

2 months ago

OussamaSghaier / CuREV

Harnessing Large Language Models for Curated Code Reviews

code-review large-language-models llm-as-a-judge

Python 12

25 days ago

root-signals / rs-python-sdk

#LLM# Root Signals Python SDK

evaluation llm llm-as-a-judge observability evals

Python 11

4 days ago

UMass-Meta-LLM-Eval / llm_eval

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.

large-language-models llm-as-a-judge nlp

Python 8

6 months ago

aws-samples / genai-system-evaluation

A set of examples demonstrating how to evaluate Generative AI augmented systems using traditional information retrieval and LLM-As-A-Judge validation techniques

genai generative-ai information-retrieval llm-as-a-judge llm-evaluation

Jupyter Notebook 8

7 months ago

PKU-ONELab / LLM-evaluator-reliability

The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?

evaluation llm-as-a-judge nlg

Python 7

2 months ago

docling-project / docling-sdg

A set of tools to create synthetically-generated data from documents

ai documents llm-as-a-judge question-answering sdg

Python 6

9 days ago

HillPhelmuth / LlmAsJudgeEvalPlugins

LLM-as-judge evals as Semantic Kernel Plugins

llm-as-a-judge llm-evaluation semantickernel

C# 6

3 months ago

root-signals / root-signals-mcp

MCP for Root Signals Evaluation Platform

evals llm-as-a-judge mcp model-context-protocol

Python 4

2 days ago

aws-samples / model-as-a-judge-eval

#LLM# Notebooks for evaluating LLM-based applications using the Model (LLM) as a judge pattern.

evaluation generative-ai llm llm-as-a-judge

Jupyter Notebook 3

10 months ago