MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs).
Official implementation of the paper "MagicQuill: An Intelligent Interactive Image Editing System".
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Happy experimenting with MLLMs and LLMs!
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
Accepted by the IJCAI-24 Survey Track.
This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use".
Dataset and code for our ACL 2024 paper "Multimodal Table Understanding". We propose the first large-scale multimodal IFT and pre-training dataset for table understanding and develop a generalist tabular MLLM.
Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train...
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLMs). It covers datasets, tuning techniques, in-context learning, and more.