#Awesome#A curated list of awesome responsible machine learning resources.
#数据仓库#Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
#自然语言处理#Deliver safe & effective language models
#大语言模型#Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
#计算机科学#PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Sa...
#计算机科学#[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
RuLES: a benchmark for evaluating rule-following in language models
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
[AAAI 2025] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
#大语言模型#LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Code accompanying the paper Pretraining Language Models with Human Preferences
#自然语言处理#📚 A curated list of papers & technical articles on AI Quality & Safety
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
#大语言模型#Toolkits to create a human-in-the-loop approval layer to monitor and guide AI agents workflow in real-time.
#自然语言处理#Attack to induce LLMs within hallucinations
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
#数据仓库#BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).