#Large Language Model# Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command ...
#Large Language Model# OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Arbitrary expression evaluation for golang
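To make the idea concrete, here is a minimal sketch of safe arbitrary expression evaluation. The listed library is for Go; this is a conceptual Python illustration using the standard-library ast module, not that library's API:

```python
import ast
import operator

# Map AST operator node types to their Python implementations.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Gt: operator.gt,
    ast.Lt: operator.lt,
}

def eval_expr(source, variables=None):
    """Safely evaluate an arithmetic/comparison expression string."""
    variables = variables or {}

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]  # variable lookup
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Compare):
            left, result = _eval(node.left), True
            for op, comp in zip(node.ops, node.comparators):
                right = _eval(comp)
                result = result and _OPS[type(op)](left, right)
                left = right
            return result
        raise ValueError(f"unsupported node: {type(node).__name__}")

    return _eval(ast.parse(source, mode="eval"))

print(eval_expr("price * qty > 100", {"price": 12.5, "qty": 10}))  # True
```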
Python package for the evaluation of odometry and SLAM
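For intuition, a central metric in odometry/SLAM evaluation is the absolute trajectory error (ATE). A hedged sketch of its RMSE form, as a conceptual illustration rather than this package's API (real tools also SE(3)-align the two trajectories before comparing):

```python
import numpy as np

# ATE RMSE: root-mean-square of translational differences between
# time-aligned reference and estimated positions.
def ate_rmse(ref_xyz, est_xyz):
    diffs = np.asarray(ref_xyz) - np.asarray(est_xyz)
    return float(np.sqrt((diffs ** 2).sum(axis=1).mean()))

print(ate_rmse([[0, 0, 0], [1, 0, 0]],
               [[0, 0.1, 0], [1, -0.1, 0]]))  # 0.1
```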
#Large Language Model# AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
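As a rough illustration of what AutoML-style RAG optimization involves, the sketch below grid-searches retrieval settings against a small QA set and keeps the configuration with the best retrieval recall. It is conceptual only; make_retriever, the parameter grid, and qa_pairs are hypothetical, not AutoRAG's actual API:

```python
from itertools import product

def retrieval_recall(retrieve, qa_pairs, top_k):
    """Fraction of questions whose gold passage appears in the top_k hits."""
    hits = 0
    for question, gold_passage_id in qa_pairs:
        retrieved_ids = retrieve(question, top_k=top_k)
        hits += gold_passage_id in retrieved_ids
    return hits / len(qa_pairs)

def tune(make_retriever, qa_pairs, chunk_sizes=(256, 512), top_ks=(3, 5)):
    """Try each (chunk_size, top_k) combination; return the best by recall."""
    best = None
    for chunk_size, top_k in product(chunk_sizes, top_ks):
        retrieve = make_retriever(chunk_size=chunk_size)  # hypothetical factory
        score = retrieval_recall(retrieve, qa_pairs, top_k)
        if best is None or score > best[0]:
            best = (score, {"chunk_size": chunk_size, "top_k": top_k})
    return best

# Toy usage: a fake retriever that always returns the same passage ids.
fake_factory = lambda chunk_size: (lambda q, top_k: ["p1", "p2", "p3"][:top_k])
print(tune(fake_factory, [("q1", "p2"), ("q2", "p9")]))  # (0.5, {...})
```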
#Large Language Model# A unified evaluation framework for large language models
An open-source visual programming environment for battle-testing prompts to LLMs.
#Computer Science# 🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
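Typical usage is to load a metric from the Hub by name and score predictions against references; a minimal sketch, assuming the package is installed and the accuracy metric is available:

```python
import evaluate

# Load a metric by name from the Hugging Face Hub, then compute it
# over parallel lists of predictions and references.
accuracy = evaluate.load("accuracy")
results = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}
```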
#Computer Science# (IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
#Natural Language Processing# An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
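The headline number such auto-evaluators report is a win rate over a baseline model. A hedged sketch of that bookkeeping, where judge is a hypothetical stand-in for an LLM pairwise preference call, not AlpacaEval's API:

```python
# Win rate: fraction of instructions where a pairwise judge prefers the
# model's output over the baseline's.
def win_rate(judge, pairs):
    """pairs: iterable of (model_output, baseline_output) strings."""
    pairs = list(pairs)
    wins = sum(judge(m, b) == "model" for m, b in pairs)
    return wins / len(pairs)

# Example with a trivial length-based judge:
toy_judge = lambda m, b: "model" if len(m) >= len(b) else "baseline"
print(win_rate(toy_judge, [("longer answer", "short"), ("a", "bb")]))  # 0.5
```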
FuzzBench - Fuzzer benchmarking as a service.
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".