MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs).
[CVPR'25] Official implementation of the paper "MagicQuill: An Intelligent Interactive Image Editing System"
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Happy experimenting with MLLMs and LLMs!
This repository provides a valuable reference for researchers in the field of multimodality. Start your exploration of RL-based reasoning MLLMs!
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
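As a rough illustration of what such alignment can look like (a generic sketch under common conventions, not this repository's actual code), one widespread approach is to project vision-encoder features into the LLM's token-embedding space with a small learned module; all dimensions and names below are illustrative:

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Generic two-layer MLP that maps vision-encoder features into the
    LLM embedding space, so image patches can be consumed as soft tokens.
    Dimensions are illustrative, not taken from any specific model."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)  # -> (batch, num_patches, llm_dim)

# Usage: concatenate the projected patch tokens with text embeddings along
# the sequence dimension before feeding the LLM.
projector = VisionToTextProjector()
image_tokens = projector(torch.randn(1, 576, 1024))
print(image_tokens.shape)  # torch.Size([1, 576, 4096])
```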
Accepted to the IJCAI-24 Survey Track.
This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use".
Dataset and code for our ACL 2024 paper "Multimodal Table Understanding". We propose the first large-scale multimodal IFT and pre-training dataset for table understanding and develop a generalist tabular MLLM.
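For readers unfamiliar with instruction fine-tuning (IFT) data for tables, a sample typically pairs a rendered table image with an instruction and a target answer. The schema below is a hypothetical illustration of that shape, not this dataset's actual format:

```python
# Hypothetical multimodal table-IFT sample (all field names and values
# are illustrative, not the dataset's actual schema).
sample = {
    "image": "tables/quarterly_revenue.png",  # rendered table image
    "instruction": "Which quarter had the highest revenue?",
    "answer": "Q3, with 4.2M USD.",
}
```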
Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own MLLM on a budget.
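Pipeline parallelism here refers to splitting a model's layers across GPUs so each device holds only part of the network. The minimal PyTorch sketch below shows the idea in its simplest form (two stages, manual tensor movement); it is a generic illustration, not MPP-Qwen's implementation, and real systems also micro-batch inputs to keep both stages busy:

```python
import torch
import torch.nn as nn

# Minimal two-stage pipeline-parallel sketch: each stage lives on its own
# device, and activations are handed between devices explicitly. Falls
# back to CPU when fewer than two GPUs are available.
dev0 = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
dev1 = "cuda:1" if torch.cuda.device_count() > 1 else dev0

stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(dev0)
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(dev1)

x = torch.randn(8, 512, device=dev0)
h = stage0(x).to(dev1)  # move activations to the next stage's device
y = stage1(h)
print(y.shape)  # torch.Size([8, 512])
```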
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLMs). It covers datasets, tuning techniques, in-context learning, and more.