SigLIP 2 Does Not Reproduce Expected Results
I am using `AutoModel` from `transformers` to load SigLIP 2, with `AutoProcessor` for image preprocessing and `AutoTokenizer` for text tokenization:
from transformers import AutoModel, AutoProcessor, AutoTokenizer

# model_name carries a short prefix, stripped by [3:] to form the Hub id
self.model = AutoModel.from_pretrained("google/" + model_name[3:], device_map="auto")
self.image_processor = AutoProcessor.from_pretrained("google/" + model_name[3:])
self.text_processor = AutoTokenizer.from_pretrained("google/" + model_name[3:])
For text features, I process the class names with the OpenAI ImageNet templates and compute their embeddings:
# SigLIP tokenizers expect fixed-length padding to 64 tokens
texts = self.text_processor(text=texts, padding="max_length", max_length=64, return_tensors="pt").to(device)
class_embeddings = self.model.get_text_features(**texts)
class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
# average the per-template embeddings into one class embedding, then re-normalize
class_embedding = class_embeddings.mean(dim=0)
class_embedding /= class_embedding.norm()
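For context, each class name is formatted into every prompt template before tokenization. A minimal sketch of that expansion, using two representative entries from the OpenAI list (the full list has 80 templates; `OPENAI_IMAGENET_TEMPLATES` and `build_prompts` are my own names, not from the original code):

```python
# Two representative templates; the full OpenAI ImageNet list has 80 of these.
OPENAI_IMAGENET_TEMPLATES = [
    "a bad photo of a {}.",
    "a photo of many {}.",
]

def build_prompts(class_name: str) -> list[str]:
    # Format the class name into every template.
    return [template.format(class_name) for template in OPENAI_IMAGENET_TEMPLATES]

texts = build_prompts("tabby cat")
# -> ["a bad photo of a tabby cat.", "a photo of many tabby cat."]
```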
For classification, I compute image embeddings and take the dot product with precomputed text features:
image_output = self.model.get_image_features(images)
image_output /= image_output.norm(dim=-1, keepdim=True)
# self._text_features holds the stacked class embeddings, shape [embed_dim, num_classes]
logits = image_output @ self._text_features
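For the matmul above to produce per-class logits, `self._text_features` has to be the per-class embeddings stacked column-wise. A sketch of how it could be assembled, reusing the text-embedding steps from above (the `embed_class` helper, `class_names` list, and the `OPENAI_IMAGENET_TEMPLATES` from the earlier sketch are assumptions, not part of the original code):

```python
import torch

def embed_class(model, tokenizer, class_name, device):
    # Template expansion, tokenization, text encoding, normalization,
    # then averaging over templates -- the same steps as in the question.
    prompts = [t.format(class_name) for t in OPENAI_IMAGENET_TEMPLATES]
    tokens = tokenizer(text=prompts, padding="max_length", max_length=64,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model.get_text_features(**tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    emb = emb.mean(dim=0)
    return emb / emb.norm()

# One column per class: shape [embed_dim, num_classes], so that
# image_output @ text_features yields [batch, num_classes] logits.
text_features = torch.stack(
    [embed_class(model, tokenizer, name, device) for name in class_names], dim=1
)
```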
This approach works correctly for SigLIP 1, but does not reproduce the expected results for SigLIP 2. Any insights into differences in text processing or model behavior would be appreciated. The accuracy I get on IN-1K is 0.69738.
Hey! Did you come up with any solution?
Preprocess text to lowercase to reproduce the results.
I was reproducing the results for SigLIP 2 on ImageNet-1k; you need to lowercase the text and remove any punctuation to get the reported scores.
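In case it helps anyone landing here, a minimal sketch of that preprocessing, applied to the prompt strings before tokenization (the `canonicalize_text` helper is my own name for it):

```python
import string

def canonicalize_text(text: str) -> str:
    # Lowercase and strip punctuation (this also drops the trailing "."
    # that the templates add), then collapse repeated whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

texts = [canonicalize_text(t) for t in texts]  # run before the tokenizer
```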