#Data Warehouse# The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Translation - Find label errors in datasets and learn with noisy labels.
#Computer Science# Refine high-quality datasets and visual AI models
Translation - An open-source tool for building high-quality datasets and computer vision models
A Doctor for your data
#Natural Language Processing# The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
#Computer Science# Interactively explore unstructured datasets from your dataframe.
#Computer Science# A curated, but incomplete, list of data-centric AI resources.
#Computer Science# Automatically find issues in image datasets and practice data-centric computer vision.
#Natural Language Processing# Curated list of open source tooling for data-centric AI on unstructured data.
#Computer Science# Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function" (NCFM, Rating: 555) in CVPR 2025.
#Natural Language Processing# [NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
#Large Language Model# Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
#Natural Language Processing# [NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
#Computer Science# pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
#Computer Science# Introduction to Data-Centric AI, MIT IAP 2023 🤖
#Computer Science# OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
#Natural Language Processing# Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
#Computer Science# [ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning