SigLIP 2 Does Not Reproduce Expected Results
I am using `AutoModel` from `transformers` to load SigLIP 2, with `AutoProcessor` for image preprocessing and `AutoTokenizer` for text tokenization:
from transformers import AutoModel, AutoProcessor, AutoTokenizer

# model_name carries a short prefix, stripped by [3:] to form the Hub id
self.model = AutoModel.from_pretrained("google/" + model_name[3:], device_map="auto")
self.image_processor = AutoProcessor.from_pretrained("google/" + model_name[3:])
self.text_processor = AutoTokenizer.from_pretrained("google/" + model_name[3:])
For text features, I process the class names with the OpenAI ImageNet templates and compute their embeddings:
# SigLIP tokenizers expect fixed-length padding to 64 tokens
texts = self.text_processor(text=texts, padding="max_length", max_length=64, return_tensors="pt").to(device)
class_embeddings = self.model.get_text_features(**texts)
class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
# average the per-template embeddings into one class embedding, then re-normalize
class_embedding = class_embeddings.mean(dim=0)
class_embedding /= class_embedding.norm()
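For context, each class name is formatted into every prompt template before tokenization. A minimal sketch of that expansion, using two representative entries from the OpenAI list (the full list has 80 templates; `OPENAI_IMAGENET_TEMPLATES` and `build_prompts` are my own names, not from the original code):

```python
# Two representative templates; the full OpenAI ImageNet list has 80 of these.
OPENAI_IMAGENET_TEMPLATES = [
    "a bad photo of a {}.",
    "a photo of many {}.",
]

def build_prompts(class_name: str) -> list[str]:
    # Format the class name into every template.
    return [template.format(class_name) for template in OPENAI_IMAGENET_TEMPLATES]

texts = build_prompts("tabby cat")
# -> ["a bad photo of a tabby cat.", "a photo of many tabby cat."]
```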
For classification, I compute image embeddings and take the dot product with precomputed text features:
image_output = self.model.get_image_features(images)
image_output /= image_output.norm(dim=-1, keepdim=True)
# self._text_features holds the stacked class embeddings, shape [embed_dim, num_classes]
logits = image_output @ self._text_features
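For the matmul above to produce per-class logits, `self._text_features` has to be the per-class embeddings stacked column-wise. A sketch of how it could be assembled, reusing the text-embedding steps from above (the `embed_class` helper, `class_names` list, and the `OPENAI_IMAGENET_TEMPLATES` from the earlier sketch are assumptions, not part of the original code):

```python
import torch

def embed_class(model, tokenizer, class_name, device):
    # Template expansion, tokenization, text encoding, normalization,
    # then averaging over templates -- the same steps as in the question.
    prompts = [t.format(class_name) for t in OPENAI_IMAGENET_TEMPLATES]
    tokens = tokenizer(text=prompts, padding="max_length", max_length=64,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model.get_text_features(**tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    emb = emb.mean(dim=0)
    return emb / emb.norm()

# One column per class: shape [embed_dim, num_classes], so that
# image_output @ text_features yields [batch, num_classes] logits.
text_features = torch.stack(
    [embed_class(model, tokenizer, name, device) for name in class_names], dim=1
)
```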
This approach works correctly for SigLIP 1, but does not reproduce the expected results for SigLIP 2. Any insights into differences in text processing or model behavior would be appreciated. The accuracy I get on IN-1K is 0.69738.
Hey! Did you come up with any solution?
Preprocess text to lowercase to reproduce the results.
I was reproducing the results for SigLIP 2 on ImageNet-1k; you need to lowercase the text and remove any punctuation to get the reported scores.
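In case it helps anyone landing here, a minimal sketch of that preprocessing, applied to the prompt strings before tokenization (the `canonicalize_text` helper is my own name for it):

```python
import string

def canonicalize_text(text: str) -> str:
    # Lowercase and strip punctuation (this also drops the trailing "."
    # that the templates add), then collapse repeated whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

texts = [canonicalize_text(t) for t in texts]  # run before the tokenizer
```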