visheratin committed on
Commit
87d91da
·
verified ·
1 Parent(s): f01b1a7

Create README.md

Files changed (1)
  1. README.md +138 -0
README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ license: mit
+ language:
+ - ar
+ - kn
+ - ar
+ - ka
+ - af
+ - kk
+ - am
+ - km
+ - ar
+ - ky
+ - ar
+ - ko
+ - as
+ - lo
+ - az
+ - ml
+ - az
+ - mr
+ - be
+ - mk
+ - bn
+ - my
+ - bs
+ - nl
+ - bg
+ - ca
+ - 'no'
+ - cs
+ - ne
+ - ku
+ - pl
+ - cy
+ - pt
+ - da
+ - ro
+ - de
+ - ru
+ - el
+ - sa
+ - en
+ - si
+ - eo
+ - sk
+ - et
+ - sl
+ - eu
+ - sd
+ - fi
+ - so
+ - fr
+ - es
+ - gd
+ - sr
+ - ga
+ - su
+ - gl
+ - sv
+ - gu
+ - sw
+ - ha
+ - ta
+ - he
+ - te
+ - hi
+ - th
+ - hr
+ - tr
+ - hu
+ - ug
+ - hy
+ - uk
+ - id
+ - ur
+ - is
+ - vi
+ - it
+ - xh
+ - jv
+ - zh
+ - ja
+ pipeline_tag: zero-shot-image-classification
+ tags:
+ - siglip2
+ - clip
+ - mexma
+ model-index:
+ - name: mexma-siglip2
+   results:
+   - task:
+       type: zero-shot retrieval
+     dataset:
+       name: Crossmodal-3600
+       type: Crossmodal-3600
+     metrics:
+     - name: Image retrieval R@1
+       type: Image retrieval R@1
+       value: 62.54%
+     - name: Text retrieval R@1
+       type: Text retrieval R@1
+       value: 59.99%
+ ---
+
+ ## Model Summary
+
+ MEXMA-SigLIP2 combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the
+ [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch16-512/) model. The result is a high-performance CLIP model that covers 80 languages.
+ MEXMA-SigLIP2 sets a new state of the art on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset, with 62.54% R@1 for image retrieval and
+ 59.99% R@1 for text retrieval.
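
For readers unfamiliar with the metric, R@1 (recall at 1) is the share of queries whose top-ranked candidate is the correct match. The sketch below is not the evaluation code behind the numbers above. It assumes a square similarity matrix in which the correct match for query `i` is candidate `i`, and `recall_at_1` is a hypothetical helper name; the actual benchmark protocol may differ.

```python
from typing import Sequence


def recall_at_1(similarity: Sequence[Sequence[float]]) -> float:
    """Fraction of queries whose highest-scoring candidate is the correct one.

    Assumes similarity[i][j] scores query i against candidate j and that the
    correct candidate for query i sits at index i (an illustrative convention).
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the best-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


# Toy scores: 3 text queries against 3 images (made-up numbers).
toy = [
    [0.9, 0.1, 0.2],  # query 0 ranks image 0 first (hit)
    [0.3, 0.2, 0.8],  # query 1 ranks image 2 first (miss)
    [0.1, 0.4, 0.7],  # query 2 ranks image 2 first (hit)
]
print(recall_at_1(toy))
```

In this toy case two of the three queries retrieve their own image first. Swapping the roles of rows and columns gives the other retrieval direction.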
+
+
+ ## How to use
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
+ from PIL import Image
+ import requests
+ import torch
+
+ model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")  # trust_remote_code=True runs the model code shipped in the repository
+ tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
+ processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")
+
+ img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
+ img = processor(images=img, return_tensors="pt")["pixel_values"]
+ img = img.to(torch.bfloat16).to("cuda")  # match the model's dtype and device
+ with torch.inference_mode():  # no autograd tracking during inference
+     text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
+     image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
+     probs = image_logits.softmax(dim=-1)  # softmax over the last dimension
+     print(probs)
+ ```
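
The last two lines of the snippet turn logits into probabilities. The sketch below reproduces that step without dependencies to show how the printed probabilities map back to the candidate captions. It assumes that the last dimension of `image_logits` indexes the candidate texts for one image, which the source does not state. The logit values are made up and `rank_captions` is a hypothetical helper; neither is part of the model's API.

```python
import math
from typing import List, Sequence, Tuple


def rank_captions(
    logits: Sequence[float], captions: Sequence[str]
) -> List[Tuple[str, float]]:
    """Softmax the logits for one image and pair each probability with its caption."""
    if len(logits) != len(captions):
        raise ValueError("need exactly one logit per caption")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Most likely caption first.
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)


captions = ["кошка", "a dog", "एफिल टॉवर"]
made_up_logits = [1.5, 0.5, 9.0]  # illustrative values only, not model output
ranked = rank_captions(made_up_logits, captions)
for caption, prob in ranked:
    print(f"{caption}: {prob:.4f}")
```

With these illustrative logits the third caption ranks first. On real model output the same ordering step applies to each row of `probs`.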
+
+ ## Acknowledgements
+
+ I thank [ML Collective](https://mlcollective.org/) for providing compute resources to train the model.