---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- google/siglip-so400m-patch14-384
---
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM.png" width="800" height="auto" alt="SmolVLM banner">

# SmolVLM

SmolVLM is a compact, open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded in multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.

## Model Summary

- **Developed by:** Hugging Face 🤗
- **Model type:** Multimodal model (image + text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see the technical summary below)

## Resources

- **Demo:** [SmolVLM Demo](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm)

## Uses

SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.

To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
<!-- todo: add link to fine-tuning tutorial -->

### Technical Summary

SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to previous Idefics models:

- **Image compression:** We introduce more aggressive image compression than in Idefics3, enabling faster inference and lower RAM usage.
- **Visual token encoding:** SmolVLM uses 81 visual tokens to encode each image patch of size 384×384. Larger images are divided into patches, each encoded separately, improving efficiency without compromising performance (see the back-of-the-envelope sketch below).

More details about the training and architecture are available in our technical report.
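
As a back-of-the-envelope illustration of the token budget (a sketch under the simplifying assumption that an image is only tiled into 384×384 patches, ignoring any extra global-view tokens), you can estimate how many visual tokens an input consumes:

```python
import math

TOKENS_PER_PATCH = 81  # visual tokens per 384x384 patch, per the summary above
PATCH_SIZE = 384

def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough visual-token estimate for an image tiled into 384x384 patches."""
    cols = math.ceil(width / PATCH_SIZE)
    rows = math.ceil(height / PATCH_SIZE)
    return cols * rows * TOKENS_PER_PATCH

# A 1536x1536 input (the default longest_edge = 4 * 384) gives a 4x4 grid:
print(estimate_visual_tokens(1536, 1536))  # 16 patches * 81 = 1296 tokens
```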

### How to get started

You can use transformers to load SmolVLM, run inference, and fine-tune it.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Base")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
"""
User:<image>Can you describe the two images?
Assistant: I can describe the first one, but I can't describe the second one.
"""
```

### Model optimizations

**Precision**: For better performance, load and run the model in half precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.

```python
from transformers import AutoModelForVision2Seq
import torch

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16
).to("cuda")
```

You can also load SmolVLM with 4-bit or 8-bit quantization using bitsandbytes, torchao, or Quanto. Refer to [this page](https://huggingface.co/docs/transformers/en/main_classes/quantization) for other options.

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```
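
For 4-bit loading, a minimal sketch along the same lines (the NF4 settings here are illustrative choices, not tuned recommendations):

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# Illustrative 4-bit (NF4) configuration; adjust to your hardware and accuracy needs.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```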

**Vision Encoder Efficiency**: Adjust the image resolution by setting `size={"longest_edge": N*384}` when initializing the processor, where N is your desired value. The default `N=4` works well and results in input images of size 1536×1536. For documents, `N=5` might be beneficial. Decreasing N saves GPU memory and is appropriate for lower-resolution images; it is also useful if you want to fine-tune on videos.
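
For example, a minimal sketch capping inputs at 1152×1152 (`N=3` here is just an illustrative choice):

```python
from transformers import AutoProcessor

# N=3 caps the longest image edge at 3 * 384 = 1152 pixels,
# trading resolution for lower GPU memory use.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    size={"longest_edge": 3 * 384},
)
```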

## Misuse and Out-of-scope Use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance

### License

SmolVLM is built on [the shape-optimized SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as its image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) as its text decoder.

We release the SmolVLM checkpoints under the Apache 2.0 license.

## Training Details

### Training Data

The training data comes from [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), with an emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage across other crucial capabilities such as visual reasoning, chart comprehension, and general instruction following.

<img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Training data mixture" style="width:90%;" />
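
To inspect the training mixture yourself, a minimal sketch using the datasets library (the `ai2d` subset name is an illustrative assumption; The Cauldron is organized into per-source subsets):

```python
from datasets import load_dataset

# Load one per-source subset of The Cauldron; each record pairs one or more
# images with user/assistant conversation turns.
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
print(ds[0]["texts"])
```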

## Evaluation

| Model              | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|--------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
| SmolVLM            | 38.8       | 44.6                 | 42.1         | 81.6          | 72.7          | 5.02                      |
| Qwen-VL 2B         | 41.1       | 47.8                 | 47.5         | 90.1          | 79.7          | 13.70                     |
| InternVL2 2B       | 34.3       | 46.3                 | 49.8         | 86.9          | 73.4          | 10.52                     |
| PaliGemma 3B 448px | 34.9       | 28.7                 | 48.3         | 32.2          | 56.0          | 6.72                      |
| moondream2         | 32.4       | 24.3                 | 40.3         | 70.5          | 65.2          | 3.87                      |
| MiniCPM-V-2        | 38.2       | 39.8                 | 39.1         | 71.9          | 74.1          | 7.88                      |
| MM1.5 1B           | 35.8       | 37.2                 | 0.0          | 81.0          | 72.5          | N/A                       |