#Large Language Models# The Cradle framework is a first attempt at General Computer Control (GCC). Cradle helps agents ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, ...
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
#Large Language Models# Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
#Large Language Models# [CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
#Large Language Models# PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
Official code for the paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]
😎 A curated list of awesome LMM hallucination papers, methods & resources.
[ICLR 2025] What do we expect from LMMs as AIGI evaluators and how do they perform?
#Large Language Models# MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
#Large Language Models# 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant
[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
[COLING 2025] Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs
#Large Language Models# 🤖 Discord AI assistant with OpenAI, Gemini, Claude & DeepSeek integration, multilingual support, multimodal chat, image generation, web search, and deep thinking | A powerful Discord AI assistant that integrates multiple top-tier AI models and supports...
LMM solved catastrophic forgetting (AAAI 2025)