#LLM# LLaVA is a large language-and-vision assistant with GPT-4V-level capabilities.
#LLM# [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Align Anything: Training All-modality Model with Feedback
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
#LLM# InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
#Computer Science# Collection of AWESOME vision-language models for vision tasks
#LLM# The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model built on linear attention.
#LLM# 🚀 Train a 26M-parameter vision-language model (VLM) from scratch in just 1 hour! 🌏
#LLM# The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation, ...
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
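For the ColVision entry above, a minimal late-interaction retrieval sketch may help show what the library does. It follows the colpali-engine README as I recall it; the checkpoint name, page image, and query text are placeholder assumptions, not the repo's exact example.

```python
# Sketch: embed a document-page image and a text query with ColPali, then
# score them via multi-vector (late-interaction) similarity.
# Assumes `pip install colpali-engine` and a local page image.
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

checkpoint = "vidore/colpali-v1.2"  # assumed checkpoint name
model = ColPali.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
processor = ColPaliProcessor.from_pretrained(checkpoint)

images = [Image.open("page.png")]           # placeholder document page
queries = ["What was revenue in 2023?"]     # placeholder query

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # one embedding per image patch
    query_embeddings = model(**batch_queries)  # one embedding per query token

# MaxSim-style late interaction: each query token matches its best-scoring patch.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)  # (num_queries, num_images) similarity matrix
```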
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
#LLM# MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
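As a quick illustration of the MLX-VLM entry, here is a minimal inference sketch in the shape of the repo's README. The community checkpoint name and image path are assumptions, and the `generate` call may differ slightly across package versions.

```python
# Sketch: describe a local image with an MLX-converted VLM on Apple Silicon.
# Assumes `pip install mlx-vlm`; checkpoint and image path are placeholders.
from mlx_vlm import load, generate

# Load a 4-bit community checkpoint (any MLX-converted VLM path should work).
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# Run inference: a text prompt plus one image.
output = generate(model, processor, prompt="Describe this image.", image="photo.jpg")
print(output)
```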
#LLM# 日本語LLMまとめ - Overview of Japanese LLMs
#NLP# This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.