#Data warehouse# The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Translation - Find label errors in datasets and learn with noisy labels.
#Computer science# Refine high-quality datasets and visual AI models
Translation - Open-source tool for building high-quality datasets and computer vision models
A Doctor for your data
#Computer science# fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data...
#Computer science# Interactively explore unstructured datasets from your dataframe.
#Computer science# A curated, but incomplete, list of data-centric AI resources.
#Large language models# Scalable data preprocessing and curation toolkit for LLMs
#Natural language processing# Curated list of open source tooling for data-centric AI on unstructured data.
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
#Computer science# A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
Lesson guide and textbook for "Data as a Science" course.
A tool for downloading from public image boards (which allow scraping) / preview your images & tags / edit your images & tags. Additional tabs for downloading other desired code repositories as well a...
Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)
#Computer science# 🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
A web service for semi-automated conversion of raw imaging data to BIDS
#Natural language processing# Client interface to Cleanlab Studio and the Trustworthy Language Model
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
Curated list of known efforts to collect and/or curate chemical/materials data