A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!
LAVIS - A One-stop Library for Language-Vision Intelligence
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
Build multimodal language agents for fast prototyping and production
Code for ALBEF: a new vision-language pre-training method
Multimodal-GPT
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Real-time and accurate open-vocabulary end-to-end object detection
Recent Advances in Vision and Language Pre-Trained Models (VL-PTMs)
日本語LLMまとめ - Overview of Japanese LLMs
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Codebase for Aria - an Open Multimodal Native MoE
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, and visual commonsense reasoning).
My Reading Lists of Deep Learning and Natural Language Processing
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".