Seems like the user prompt is ignored

#80
by jlmeunier - opened

Thanks for contributing the HuggingFaceM4/idefics2-8b model. I'm having trouble with it, though.

It seems to me that the user prompt is ignored and that the model always answers as if it had been asked something like "Describe the image".

I ran into this problem while computing a VQAv2 metric.

To convince myself, I ran the provided decoding example, in which the user says "What’s the difference between these two images?".
Here is the script output:

User: What’s the difference between these two images?<image><image><end_of_utterance>
Assistant:
Generated text: User: What’s the difference between these two images? 
Assistant: A dog and a cat are sleeping on a couch.

I would have expected something closer to the ground-truth answer given in the next (training) example: "The difference is that one image is about dogs and the other one about cats."

Any ideas?

Thanks
JL

Actually, this is definitely a problem. Changing the prompt gives essentially the same answer.

User: Is there a cow in the image? Yes or no?<image><image><end_of_utterance>
Assistant:
Generated text: User: Is there a cow in the image? Yes or no? 
Assistant: The dog and cat are sleeping on the couch.
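
To check this over many prompts (for instance in a VQAv2 loop), it can help to strip the echoed prompt and inspect only what follows the final "Assistant:". Below is a minimal sketch, assuming the decoded string has the "User: ... Assistant: ..." shape shown above; `extract_answer` and `is_yes_no` are hypothetical helper names, not part of transformers.

```python
def extract_answer(generated_text: str) -> str:
    """Return the text after the last 'Assistant:' marker, stripped."""
    # rpartition returns ('', '', text) when the marker is absent,
    # so the whole string is used as a fallback in that case.
    _, _, tail = generated_text.rpartition("Assistant:")
    return tail.strip()


def is_yes_no(answer: str) -> bool:
    """True if the first word of the answer is 'yes' or 'no' (case-insensitive)."""
    words = answer.lower().split()
    first = words[0].strip(".,!") if words else ""
    return first in {"yes", "no"}


# Decoded output copied from the run above.
decoded = (
    "User: Is there a cow in the image? Yes or no? \n"
    "Assistant: The dog and cat are sleeping on the couch."
)
answer = extract_answer(decoded)
print(answer)
print(is_yes_no(answer))
```

On the output above, the extracted answer is the couch description and the yes/no check fails, which is one way to count how often the question is ignored.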

In case it helps, my settings are:

- `transformers` version: 4.47.0
- Platform: Linux-4.18.0-553.27.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.12.8
- Huggingface_hub version: 0.27.0
- Safetensors version: 0.4.5
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: single process
- Using GPU in script?: 1 GPU
- GPU type: NVIDIA A100-SXM4-80GB

Can you try with transformers==4.40.0 to see if you have the same problem? It should work with this version.

Thanks for your answer.
Actually, transformers==4.40 does not change the output.

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
User: Is there a cow in the image? Yes or no?<image><image><end_of_utterance>
Assistant:
Generated text: User: Is there a cow in the image? Yes or no? 
Assistant: The dog and cat are sleeping on the couch.
/home/meunier/miniconda3/envs/fdm_Tr440/bin/python
transformers.__version__ 4.40.2
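
When switching between conda environments, it is worth confirming which interpreter and package version a script actually picks up. Here is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and it reads the installed package metadata of the running environment rather than importing the package.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Path of the interpreter running this script.
print(sys.executable)
# Version string, or None when the package is missing from this environment.
print("transformers", installed_version("transformers"))
```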

Do you have the full code you're using for the inference? Are you using exactly the same format as what's in the model card?

I think I'm using the example shown in the model card. I wanted to attach my code for reference, but oops, that's not possible, so I'm copying it below.

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        #{"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "text", "text": "Is there a cow in the image? Yes or no?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
model.to(device)

# at inference time, one needs to pass `add_generation_prompt=True` in order to make sure the model completes the prompt
text = processor.apply_chat_template(messages, add_generation_prompt=True)
print(text)
# 'User: Is there a cow in the image? Yes or no?<image><image><end_of_utterance>\nAssistant:'

inputs = processor(images=images, text=text, return_tensors="pt").to(device)

# `generate` returns token ids (prompt included), so keep them separate from the decoded string
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Generated text:", generated_text)
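
Since the decoded output echoes the prompt, a metric such as VQAv2 is easier to compute on the continuation alone. A common approach is to drop the first `prompt_len` token ids of each generated sequence before decoding. The sketch below shows the slicing on toy lists with made-up ids; `new_tokens_only` is a hypothetical helper, and the commented lines at the end are an untested guess at how it would map onto the script above.

```python
from typing import List, Sequence


def new_tokens_only(sequences: Sequence[Sequence[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first `prompt_len` ids of each sequence, keeping only the continuation."""
    return [list(seq[prompt_len:]) for seq in sequences]


# Toy ids standing in for real tokenizer output (made-up values).
prompt_ids = [1, 15, 27, 99]
generated = [prompt_ids + [42, 43, 2]]
continuation = new_tokens_only(generated, prompt_len=len(prompt_ids))
print(continuation)  # [[42, 43, 2]]

# With real tensors the same idea would look roughly like this (assumption, not run here):
#   output_ids = model.generate(**inputs, max_new_tokens=500)
#   prompt_len = inputs["input_ids"].shape[1]
#   new_ids = output_ids[:, prompt_len:]
#   answer = processor.batch_decode(new_ids, skip_special_tokens=True)[0]
```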

Also, the settings I used:

- `transformers` version: 4.40.2
- Platform: Linux-4.18.0-553.22.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.12.0
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.5
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: CUDA 12.7   + A100-SXM4-80GB
- Using distributed or parallel set-up in script?: 1 GPU
