GME: General Multimodal Embedding

gme-Qwen2-VL-7B

We are excited to present the GME-Qwen2VL series of unified multimodal embedding models, which are based on the advanced Qwen2-VL multimodal large language models (MLLMs).

The GME models accept three types of input: text, image, and image-text pair. Each input type is encoded into a universal vector representation that delivers strong retrieval performance.

Key Enhancements of GME Models:

  • Unified Multimodal Representation: GME models can process both single-modal and combined-modal inputs, resulting in a unified vector representation. This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image searches.
  • High Performance: GME models achieve state-of-the-art (SOTA) results on our universal multimodal retrieval benchmark (UMRB) and demonstrate strong scores on the Massive Text Embedding Benchmark (MTEB).
  • Dynamic Image Resolution: Benefiting from Qwen2-VL and our training data, GME models support dynamic resolution image input.
  • Strong Visual Retrieval Performance: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots. This capability is particularly beneficial for complex document understanding scenarios, such as multimodal retrieval-augmented generation (RAG) applications focused on academic papers.

Developed by: Tongyi Lab, Alibaba Group

Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Model List

Model            Model Size  Max Seq. Length  Dimension  MTEB-en  MTEB-zh  UMRB
gme-Qwen2-VL-2B  2.21B       32768            1536       65.27    68.41    64.45
gme-Qwen2-VL-7B  8.29B       32768            3584       67.48    71.36    67.44
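
The Dimension column directly affects how much storage a vector index needs. As a rough back-of-the-envelope sketch (assuming dense vectors stored as float16, i.e. 2 bytes per value, with no index overhead; the helper name `index_size_bytes` is hypothetical and used only for illustration):

```python
# Embedding dimensions taken from the model list above.
DIMENSIONS = {"gme-Qwen2-VL-2B": 1536, "gme-Qwen2-VL-7B": 3584}


def index_size_bytes(num_vectors, dimension, bytes_per_value=2):
    """Raw size of a dense embedding matrix (assumption: float16, no overhead)."""
    return num_vectors * dimension * bytes_per_value


for name, dim in DIMENSIONS.items():
    gib = index_size_bytes(1_000_000, dim) / 2**30
    print(f"{name}: {gib:.2f} GiB per million float16 vectors")
```

Real vector databases add their own overhead (graph links, metadata, quantization tables), so these figures are only a lower bound for planning.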

Usage

Use with custom code

# You can find the script gme_inference.py in https://huggingface.co/Alibaba-NLP/gme-Qwen2VL-2B/blob/main/scripts/gme_inference.py
from gme_inference import GmeQwen2VL

# Bind the model to the name `gme`, which the calls below use.
gme = GmeQwen2VL('Alibaba-NLP/gme-Qwen2-VL-7B-Instruct')

texts = [
    "What kind of car is this?",
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
]
images = [
    'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
    'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
]

# Single-modal embedding
e_text = gme.get_text_embeddings(texts=texts)
e_image = gme.get_image_embeddings(images=images)
print((e_text * e_image).sum(-1))
## tensor([0.1702, 0.5278], dtype=torch.float16)

# How to set embedding instruction
e_query = gme.get_text_embeddings(texts=texts, instruction='Find an image that matches the given text.')
# If is_query=False, we always use the default instruction.
e_corpus = gme.get_image_embeddings(images=images, is_query=False)
print((e_query * e_corpus).sum(-1))
## tensor([0.2000, 0.5752], dtype=torch.float16)

# Fused-modal embedding
e_fused = gme.get_fused_embeddings(texts=texts, images=images)
print((e_fused[0] * e_fused[1]).sum())
## tensor(0.6826, dtype=torch.float16)
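
The element-wise products above score each text only against the image at the same position. For retrieval, every query usually has to be scored against every candidate. The following is a minimal sketch of that ranking step, under these assumptions: embeddings are available as arrays of shape (n, dim), small hand-written NumPy vectors stand in for real model outputs, and `l2_normalize` and `rank_candidates` are hypothetical helpers that are not part of gme_inference.py.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def rank_candidates(query_emb, corpus_emb, top_k=3):
    """Return (indices, scores) of the top_k candidates for each query."""
    q = l2_normalize(np.asarray(query_emb, dtype=np.float32))
    c = l2_normalize(np.asarray(corpus_emb, dtype=np.float32))
    scores = q @ c.T                      # shape: (num_queries, num_candidates)
    k = min(top_k, c.shape[0])
    order = np.argsort(-scores, axis=1)[:, :k]
    return order, np.take_along_axis(scores, order, axis=1)


# Stand-in vectors; in practice these would come from the embedding calls above.
corpus = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=np.float32)
queries = np.array([[0.9, 0.1, 0.0], [0.1, 0.1, 0.9]], dtype=np.float32)

top_idx, top_scores = rank_candidates(queries, corpus, top_k=2)
print(top_idx)  # top_idx is [[0, 3], [2, 3]]
```

Because all modalities share one embedding space, the same ranking step would apply whether the candidates are texts, images, or image-text pairs.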

Evaluation

We evaluated the models on our universal multimodal retrieval benchmark (UMRB), among others.

Column groups: Single-modal (T→T, I→I), Cross-modal (T→I, T→VD, I→T), Fused-modal (T→IT, IT→T, IT→I, IT→IT).

Model            Size  T→T (16)  I→I (1)  T→I (4)  T→VD (10)  I→T (4)  T→IT (2)  IT→T (5)  IT→I (2)  IT→IT (3)  Avg. (47)
VISTA            0.2B  55.15     31.98    32.88    10.12      31.23    45.81     53.32     8.97      26.26      37.32
CLIP-SF          0.4B  39.75     31.42    59.05    24.09      62.95    66.41     53.32     34.9      55.65      43.66
One-Peace        4B    43.54     31.27    61.38    42.9       65.59    42.72     28.29     6.73      23.41      42.01
DSE              4.2B  48.94     27.92    40.75    78.21      52.54    49.62     35.44     8.36      40.18      50.04
E5-V             8.4B  52.41     27.36    46.56    41.22      47.95    54.13     32.9      23.17     7.23       42.52
GME-Qwen2-VL-2B  2.2B  55.93     29.86    57.36    87.84      61.93    76.47     64.58     37.02     66.47      64.45
GME-Qwen2-VL-7B  8.3B  58.19     31.89    61.35    89.92      65.83    80.94     66.18     42.56     73.62      67.44
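
For the two GME rows, the Avg. column appears to agree with a mean weighted by the numbers in parentheses, assuming those numbers are per-column task counts (they sum to 47). This is an inference from the reported figures, not something the model card states, and small differences are expected because the per-column scores are already rounded. A short check:

```python
# Assumption: the number in parentheses after each column name is its task count.
TASK_COUNTS = {
    "T→T": 16, "I→I": 1, "T→I": 4, "T→VD": 10, "I→T": 4,
    "T→IT": 2, "IT→T": 5, "IT→I": 2, "IT→IT": 3,
}

# Per-column scores copied from the table above.
GME_2B = {
    "T→T": 55.93, "I→I": 29.86, "T→I": 57.36, "T→VD": 87.84, "I→T": 61.93,
    "T→IT": 76.47, "IT→T": 64.58, "IT→I": 37.02, "IT→IT": 66.47,
}
GME_7B = {
    "T→T": 58.19, "I→I": 31.89, "T→I": 61.35, "T→VD": 89.92, "I→T": 65.83,
    "T→IT": 80.94, "IT→T": 66.18, "IT→I": 42.56, "IT→IT": 73.62,
}


def weighted_average(scores, counts):
    """Mean of column scores, each weighted by its task count."""
    total_tasks = sum(counts[name] for name in scores)
    return sum(scores[name] * counts[name] for name in scores) / total_tasks


print(round(weighted_average(GME_2B, TASK_COUNTS), 2))  # reported Avg.: 64.45
print(round(weighted_average(GME_7B, TASK_COUNTS), 2))  # reported Avg.: 67.44
```

The recomputed values land within about 0.01 of the reported averages for both GME models.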

The English tab of the MTEB Leaderboard shows the text embedding performance of our model.

More detailed experimental results can be found in the paper.

Limitations

  • Single Image Input: In Qwen2-VL, a single image can be converted into a very large number of visual tokens. We limit the number of visual tokens to 1024 to keep training efficient. Due to the lack of relevant data, our models and evaluations are limited to a single image per input.
  • English-only Training: Our models are trained on English data only. Although the Qwen2-VL models are multilingual, multilingual multimodal embedding performance is not guaranteed.

We plan to extend the models to multi-image input, interleaved image-text data, and multilingual data in a future version.
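
To get a feel for the 1024 visual-token limit mentioned above, the sketch below estimates how many visual tokens an image of a given size would use and how far it would be downscaled to fit. It rests on an assumption that is not stated in this model card: that one Qwen2-VL visual token covers roughly a 28x28 pixel area. The function `estimate_visual_tokens` is a hypothetical helper, and the actual preprocessing in gme_inference.py may differ.

```python
import math

PIXELS_PER_TOKEN_SIDE = 28  # assumption: one visual token covers ~28x28 pixels
MAX_VISUAL_TOKENS = 1024    # limit stated in the Limitations section


def estimate_visual_tokens(height, width,
                           max_tokens=MAX_VISUAL_TOKENS,
                           side=PIXELS_PER_TOKEN_SIDE):
    """Return (resized_height, resized_width, token_count) under the token budget.

    Simplified: extreme aspect ratios (one side shorter than `side` after
    scaling) are not handled and may still exceed the budget.
    """
    # Snap both sides to a multiple of the token side length (at least one block).
    h = max(side, round(height / side) * side)
    w = max(side, round(width / side) * side)
    max_pixels = max_tokens * side * side
    if h * w > max_pixels:
        # Shrink both sides by the same factor, rounding down to stay in budget.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(side, math.floor(height / scale / side) * side)
        w = max(side, math.floor(width / scale / side) * side)
    return h, w, (h // side) * (w // side)


print(estimate_visual_tokens(224, 224))    # small image, no downscaling: (224, 224, 64)
print(estimate_visual_tokens(1080, 1920))  # Full HD screenshot is downscaled
print(estimate_visual_tokens(3000, 4000))  # large photo is downscaled
```

Under these assumptions, a dense document screenshot is reduced to well under one megapixel before encoding, which is worth keeping in mind when small text must remain legible.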

Redistribution and Use

We encourage and value diverse applications of GME models and continuous enhancements to the models themselves.

  • If you distribute or make GME models (or any derivative works) available, or if you create a product or service (including another AI model) that incorporates them, you must prominently display Built with GME on your website, user interface, blog post, About page, or product documentation.

  • If you utilize GME models or their outputs to develop, train, fine-tune, or improve an AI model that is distributed or made available, you must prefix the name of any such AI model with GME.

Cloud API Services

In addition to the open-source release, GME series models are also available as commercial API services on Alibaba Cloud.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.

Hiring

We have open positions for Research Interns and Full-Time Researchers to join our team at Tongyi Lab. We are seeking passionate individuals with expertise in representation learning, LLM-driven information retrieval, Retrieval-Augmented Generation (RAG), and agent-based systems. Our team is located in the vibrant cities of Beijing and Hangzhou, offering a collaborative and dynamic work environment where you can contribute to cutting-edge advancements in artificial intelligence and machine learning. If you are driven by curiosity and eager to make a meaningful impact through your work, we would love to hear from you. Please submit your resume along with a brief introduction to [email protected].

Citation

If you find our paper or models helpful, please consider citing:

@misc{zhang2024gme,
      title={GME: Improving Universal Multimodal Retrieval by Multimodal LLMs}, 
      author={Zhang, Xin and Zhang, Yanzhao and Xie, Wen and Li, Mingxin and Dai, Ziqi and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Li, Wenjie and Zhang, Min},
      year={2024},
      eprint={2412.16855},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={http://arxiv.org/abs/2412.16855}, 
}