visheratin/mexma-siglip2

Zero-Shot Image Classification

visheratin committed 8 days ago

Commit 87d91da · verified · 1 Parent(s): f01b1a7

Create README.md

README.md +138 -0

	@@ -0,0 +1,138 @@

+---
+license: mit
+language:
+- ar
+- kn
+- ar
+- ka
+- af
+- kk
+- am
+- km
+- ar
+- ky
+- ar
+- ko
+- as
+- lo
+- az
+- ml
+- az
+- mr
+- be
+- mk
+- bn
+- my
+- bs
+- nl
+- bg
+- ca
+- 'no'
+- cs
+- ne
+- ku
+- pl
+- cy
+- pt
+- da
+- ro
+- de
+- ru
+- el
+- sa
+- en
+- si
+- eo
+- sk
+- et
+- sl
+- eu
+- sd
+- fi
+- so
+- fr
+- es
+- gd
+- sr
+- ga
+- su
+- gl
+- sv
+- gu
+- sw
+- ha
+- ta
+- he
+- te
+- hi
+- th
+- hr
+- tr
+- hu
+- ug
+- hy
+- uk
+- id
+- ur
+- is
+- vi
+- it
+- xh
+- jv
+- zh
+- ja
+pipeline_tag: zero-shot-image-classification
+tags:
+- siglip2
+- clip
+- mexma
+model-index:
+  - name: mexma-siglip2
+    results:
+      - task:
+          type: zero-shot retrieval
+        dataset:
+          name: Crossmodal-3600
+          type: Crossmodal-3600
+        metrics:
+          - name: Image retrieval R@1
+            type: Image retrieval R@1
+            value: 62.54%
+          - name: Text retrieval R@1
+            type: Text retrieval R@1
+            value: 59.99%
+---
+## Model Summary
+MEXMA-SigLIP2 combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the
+[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch16-512/) model. The result is a high-performance CLIP-style model for 80 languages.
+MEXMA-SigLIP2 sets a new state of the art on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset, with 62.54% R@1 for image retrieval and
+59.99% R@1 for text retrieval.
+## How to use
+```python
+from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
+from PIL import Image
+import requests
+import torch
+# Load the model (uses custom code from the repository), the tokenizer, and the image processor.
+model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
+tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
+processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")
+# Download the example image and preprocess it into a bfloat16 tensor on the GPU.
+img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
+img = processor(images=img, return_tensors="pt")["pixel_values"]
+img = img.to(torch.bfloat16).to("cuda")
+with torch.inference_mode():
+    # Candidate captions can be in different languages; padding aligns their lengths.
+    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
+    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
+    # Softmax over the last dimension (assumed to index the captions) turns scores into probabilities.
+    probs = image_logits.softmax(dim=-1)
+    print(probs)
+```
+## Acknowledgements
+I thank [ML Collective](https://mlcollective.org/) for providing compute resources to train the model.
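
The `probs` tensor printed by the How to use snippet is easier to read once each score is paired with its caption. The sketch below illustrates that step in plain Python, assuming that one row of `image_logits` holds one score per caption, in the order the captions were tokenized. The logit values and the helper names `softmax` and `best_label` are hypothetical and are not part of the model's API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_label(logits, labels):
    """Return (label, probability) for the highest-scoring caption."""
    if len(logits) != len(labels):
        raise ValueError("expected exactly one logit per label")
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return labels[index], probs[index]


# Same captions as in the README snippet.
labels = ["кошка", "a dog", "एफिल टॉवर"]

# Hypothetical logits for a single image (made-up numbers, not model output).
example_logits = [1.5, 0.5, 9.0]

label, prob = best_label(example_logits, labels)
print(label, round(prob, 3))
```

With the real model, the hypothetical list could be replaced by one row of `image_logits` converted to Python floats, for example via `image_logits[0].float().tolist()`.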