A state-of-the-art open visual language model | multimodal pretrained model
GPT-4V-level open-source multimodal model based on Llama3-8B
Simple CogVLM client script
Famous Vision Language Models and Their Architectures
Using CogVLM and CogAgent for image captioning (a minimal captioning-client sketch follows this list)
CogVLM2 Autocaptioning Tools
Streamline the fine-tuning process for multimodal models: PaliGemma, Florence-2, and Qwen2-VL (a LoRA fine-tuning sketch follows this list)
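
To make the captioning-client entry above concrete, here is a minimal sketch of what a simple CogVLM/CogVLM2 captioning client might look like, assuming the model is served behind an OpenAI-compatible chat-completions endpoint (for example a locally hosted demo server). The URL, port, and model identifier below are placeholders, not confirmed defaults of any of the linked projects.

```python
# Minimal captioning-client sketch, assuming CogVLM/CogVLM2 is served behind an
# OpenAI-compatible chat-completions endpoint on localhost. The URL, port, and
# model name are placeholders -- adjust them to your deployment.
import base64
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed local demo server
MODEL_NAME = "cogvlm2-llama3-chat-19B"                  # placeholder model id

def caption_image(image_path: str, prompt: str = "Describe this image in detail.") -> str:
    # Encode the local image as a base64 data URL, the format the OpenAI
    # vision-style message schema expects for inline images.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": MODEL_NAME,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    response = requests.post(API_URL, json=payload, timeout=120)
    response.raise_for_status()
    # Standard chat-completions response layout: first choice, message content.
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(caption_image("example.jpg"))
```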
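
The fine-tuning entry above points to a tool that wraps this kind of setup; as a rough illustration of what such a pipeline configures under the hood, the sketch below attaches LoRA adapters to Qwen2-VL with transformers and peft. This is not the linked tool's API, and the hyperparameters and target modules are illustrative assumptions rather than recommended defaults.

```python
# Rough sketch of the kind of LoRA setup that multimodal fine-tuning tools
# automate. Assumes transformers >= 4.45 (Qwen2-VL support) and peft installed;
# hyperparameters and target modules are illustrative, not tuned defaults.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach low-rank adapters to the language model's attention projections so that
# only a small fraction of parameters is updated during fine-tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, a standard training loop (or the Trainer API) over image-text pairs
# prepared with `processor` would update only the adapter weights.
```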