KeyError: 'kosmos-2'

#1
by yingss - opened

Thank you for the great work!

I am trying to run the example in the README. However, I get KeyError: 'kosmos-2' after running model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").

My transformers version is 4.33.3. Is this a version issue? If so, which version should I install?

Also, could you provide an input example with interleaved text and multiple images? I am confused how to construct the input for an input sequence with text and multiple images.

Thank you for the help!

Hi, you have to use the latest dev version (from the main branch). There will be a release this week if you could wait.

Also, could you provide an input example with interleaved text and multiple images? I am confused how to construct the input for an input sequence with text and multiple images.

This is not shown explicitly in the paper IIRC. The original Microsoft GitHub repository contains some code for this, but it is not easy to run it to see what format it uses.
The current Kosmos2Processor is designed to handle a text, or an image with a text, but not interleaved data.

I will try to see what I can provide regarding this part in the next few days.

The release will take place on Thursday @yingss

Will this release support interleaved text and multiple images?

I am mostly interested in the capability of accepting interleaved text and multiple images showcased in kosmos-1. However, I could not find the checkpoint for kosmos-1 in the official repo or huggingface. I am assuming kosmos-2 will have similar capabilities in terms of handling interleaved text and multiple images?

I am assuming kosmos-2 will have similar capabilities in terms of handling interleaved text and multiple images?

Kosmos-2 indeed is also trained on the interleaved data, but the official demo never shows how this is used, as you can see

https://github.com/microsoft/unilm/blob/7ae2ee53bf7fff85e730c72083b7e999b0b9ba44/kosmos-2/demo/gradio_app.py#L100C8-L100C9

The Kosmos-1 paper mentions:

1.png

and

2.png

I can provide a helper method to deal with this case, but so far it won't be in the official release: I will post here in the next comment.

Thank you so much!

Transformers 4.35.0 has been released on PyPI and works with Kosmos-2.
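For anyone hitting the same KeyError, a quick way to check whether an installed version is recent enough is to compare it against 4.35.0, the release reported in this thread to work with Kosmos-2. The sketch below is plain Python; `parse_release` and `supports_kosmos2` are hypothetical helper names, and only the numeric part of the version string is compared. A dev build such as `4.35.0.dev0` is therefore treated as new enough, which is an assumption: a dev build only contains the model if it was installed after the model was merged.

```python
import re

# Release reported in this thread to work with Kosmos-2.
MIN_KOSMOS2_VERSION = (4, 35, 0)


def parse_release(version):
    """Return the leading numeric components of a version string as a 3-tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component (e.g. "5.0") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def supports_kosmos2(version):
    """Hypothetical helper: True if `version` is at least 4.35.0 (numeric part only)."""
    return parse_release(version) >= MIN_KOSMOS2_VERSION
```

With the library installed, the check would be called as supports_kosmos2(transformers.__version__); for the version in the original question, supports_kosmos2("4.33.3") is False.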

I need to take a final look, but the following should work
(remember that this implementation is based on what I see in the paper, not on an original implementation!)

The helper function

import re

from transformers import BatchFeature


def process_interleaved_example(processor, prompt, images, placeholder="<i>", num_image_tokens=64, add_special_tokens=True, add_eos_token=False, return_tensors=None):
    tokenizer = processor.tokenizer

    # the ids of the image tokens start right after the unknown token id
    first_image_token_id = tokenizer.unk_token_id + 1

    # one image slot: `boi` + `num_image_tokens` image tokens + `eoi`
    image_input_ids = (
        [tokenizer.convert_tokens_to_ids(processor.boi_token)]
        + list(range(first_image_token_id, num_image_tokens + first_image_token_id))
        + [tokenizer.convert_tokens_to_ids(processor.eoi_token)]
    )
    image_attention_mask = [1] * len(image_input_ids)
    # `-2`: not including `boi` and `eoi`
    image_embeds_position_mask = [0] + [1] * (len(image_input_ids) - 2) + [0]

    # split on the placeholder (escaped, so it is matched literally) and keep it as its own component
    components = re.split(f"({re.escape(placeholder)})", prompt)

    outputs = {"input_ids": [], "attention_mask": [], "image_embeds_position_mask": []}
    for component in components:
        if component != placeholder:
            # add text tokens: no special tokens -> add them at the end
            encoded = processor(text=component, add_special_tokens=False)
            for key in ["input_ids", "attention_mask"]:
                outputs[key].extend(encoded[key])
            outputs["image_embeds_position_mask"].extend([0] * len(encoded["input_ids"]))
        else:
            # add tokens to indicate image placeholder
            outputs["input_ids"].extend(image_input_ids)
            outputs["attention_mask"].extend(image_attention_mask)
            outputs["image_embeds_position_mask"].extend(image_embeds_position_mask)

    if add_special_tokens:
        # `bos` at the start, and optionally `eos` at the end, of the whole sequence
        eos = [tokenizer.eos_token_id] if add_eos_token else []
        outputs["input_ids"] = [tokenizer.bos_token_id] + outputs["input_ids"] + eos
        outputs["attention_mask"] = [1] + outputs["attention_mask"] + [1] * len(eos)
        outputs["image_embeds_position_mask"] = [0] + outputs["image_embeds_position_mask"] + [0] * len(eos)

    outputs["pixel_values"] = processor.image_processor(images).pixel_values

    # add a batch dimension (of size 1) to the text inputs
    for k in ["input_ids", "attention_mask", "image_embeds_position_mask"]:
        outputs[k] = [outputs[k]]
    outputs = BatchFeature(data=outputs, tensor_type=return_tensors)

    return outputs
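To make the layout of a single image slot easier to see, here is a toy sketch in plain Python that builds only the image block and its position mask, without loading the processor. The function name `build_image_block` and the token ids used in the call are made up for illustration; the real ids come from the tokenizer as in the helper above.

```python
def build_image_block(boi_id, eoi_id, first_image_token_id, num_image_tokens=64):
    """Return (input_ids, position_mask) for one image slot: boi + image tokens + eoi."""
    image_ids = list(range(first_image_token_id, first_image_token_id + num_image_tokens))
    input_ids = [boi_id] + image_ids + [eoi_id]
    # Only the image-token positions are marked with 1; `boi` and `eoi` stay 0.
    position_mask = [0] + [1] * num_image_tokens + [0]
    return input_ids, position_mask


# Toy ids chosen only for illustration.
ids, mask = build_image_block(boi_id=1000, eoi_id=1001, first_image_token_id=4, num_image_tokens=3)
print(ids)   # [1000, 4, 5, 6, 1001]
print(mask)  # [0, 1, 1, 1, 0]
```

With the default of 64 image tokens, each placeholder in the prompt expands to 66 ids, of which 64 are marked in the position mask.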

An example of how to use it:

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq


url_1 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image_1 = Image.open(requests.get(url_1, stream=True).raw)

url_2 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/two_dogs.jpg"
image_2 = Image.open(requests.get(url_2, stream=True).raw)

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

prompt = "<grounding> There are <i> two dogs want to play with <i> this lonely snowman."
inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], return_tensors="pt")
print(inputs)

inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], add_eos_token=True, return_tensors="pt")
print(inputs)

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
outputs = model(**inputs)
print(outputs[0].shape)

Thank you so much!!

Could you please provide the implementation for decoding?

@zhaominxiao

The code example in the model card should work well

https://huggingface.co/microsoft/kosmos-2-patch14-224

but let me know if there is anything missing

Thanks for your prompt response. Yes, the code can generate the output tensor. But when I tried to decode the output tensor with the method from the README file, it raised "TypeError: PreTrainedTokenizerBase.decode() missing 1 required positional argument: 'token_ids'."

The decoding method I used is as follows.

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq


url_1 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image_1 = Image.open(requests.get(url_1, stream=True).raw)

url_2 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/two_dogs.jpg"
image_2 = Image.open(requests.get(url_2, stream=True).raw)

prompt = "<grounding> There are <i> two dogs want to play with <i> this lonely snowman."
inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], return_tensors="pt")
print(inputs)

inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], add_eos_token=True, return_tensors="pt").to('cuda')
print(inputs)

outputs = model(**inputs)
print(outputs[0].shape)

generated_text = processor.decode(**outputs, skip_special_tokens=True)[0]
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)
processed_text, _ = processor.post_process_generation(generated_text)

Hi, first: I am not sure whether you really intend to use model(**inputs) instead of model.generate.

model(**inputs) returns a dictionary-like output object, whereas processor.decode expects a list of token ids.

I would suggest you follow the code example, use model.generate, and see how its outputs are decoded there.

If you intend to use model(**inputs), you will have to do extra work to make it work, for which I won't have the bandwidth to help.
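To make the difference concrete: a forward pass returns scores (logits) over the vocabulary for each position rather than token ids, while generation repeatedly picks a next token and appends it to the sequence. The toy sketch below shows one greedy step in plain Python. The logits list and the name `greedy_next_token` are made up for illustration and are not part of transformers; greedy selection is only one of several decoding strategies.

```python
def greedy_next_token(last_position_logits):
    """Return the token id (index) with the highest score."""
    return max(range(len(last_position_logits)), key=lambda i: last_position_logits[i])


# Pretend logits over a 5-token vocabulary for the last position of a sequence.
logits = [0.1, 2.5, -1.0, 0.7, 2.4]
print(greedy_next_token(logits))  # 1
```

A decode call needs ids like the one returned here, collected over many such steps, which is the loop that model.generate takes care of.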

@ydshieh
Thanks for your help. I am totally OK with using model.generate. I checked the code example in the model card. It shows how to use one image and one text sequence for visual question answering (or image captioning). My use case needs the model to consume interleaved text-and-image sequences. For example, for few-shot learning, I need to provide multiple examples before asking the "real" question. In this case, I need to make sure the text and images are presented to KOSMOS in order. I went through your helper function, and my understanding is that I can pass the input_ids, attention_mask, and image_embeds_position_mask from the dictionary it returns as the corresponding parameters of model.generate. But I am not sure how to set pixel_values and image_embeds. Should I set them to None?

@ydshieh
Oh, I think I got the answer. pixel_values is already set in the helper function. As for image_embeds, I think I should leave it as None.

Thank you very much!

Yes, usually you don't need image_embeds. Passing pixel_values is the usual case.

@ydshieh I used the process_interleaved_example function given above to process the images and the text inputs, then used model.generate to generate the ids, which are decoded and post-processed into the final output. I am getting results, but they are not good at all. I passed a series of images taken from a video, and the descriptions I get are different from what I expect. Please let me know if I am applying the code incorrectly anywhere.

def run_example_kosmos(model, processor, image, prompt):
    inputs = process_interleaved_example(processor, prompt, images=image, add_eos_token=True, return_tensors="pt")
    generated_ids = model.generate(
      pixel_values=inputs["pixel_values"],
      input_ids=inputs["input_ids"],
      attention_mask=inputs["attention_mask"],
      image_embeds=None,
      image_embeds_position_mask=inputs["image_embeds_position_mask"],
      use_cache=True,
      max_new_tokens=300,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    _processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)
    processed_text, entities = processor.post_process_generation(generated_text)
    print(processed_text)
    return processed_text
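One thing that may be worth ruling out in a setup like this: the helper function does not verify that the number of placeholders in the prompt equals the number of images passed in. If they differ, the number of positions reserved for image embeddings no longer matches the number of images. The check below is a minimal sketch in plain Python; `check_placeholder_count` is a hypothetical name and is not part of the helper or of transformers.

```python
def check_placeholder_count(prompt, images, placeholder="<i>"):
    """Raise ValueError if the prompt's placeholder count differs from the number of images."""
    num_placeholders = prompt.count(placeholder)
    if num_placeholders != len(images):
        raise ValueError(
            f"Prompt has {num_placeholders} placeholder(s) but {len(images)} image(s) were given."
        )
    return num_placeholders


# Strings stand in for PIL images here; only the count matters.
frames = ["frame_1", "frame_2", "frame_3"]
prompt = "<i> <i> <i> Describe what happens across these frames."
print(check_placeholder_count(prompt, frames))  # 3
```

Calling it right before process_interleaved_example would surface a mismatch early instead of letting it show up as an error or as odd output later.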

Hi @ydshieh, thank you for the wrapper function example for interleaved text and images. I am currently trying few-shot prompting for Referring Expression Comprehension with Kosmos-2, and I am confused about how to include bounding boxes inside the interleaved example.

The paper explains the bounding box representation as in the image below; however, I am confused about how it is actually implemented (I couldn't find it at https://github.com/microsoft/unilm/tree/master/kosmos-2).
image.png

Thank you

################################################
################################################
Hi @ydshieh, thanks a lot for the example code!
I am trying to fine-tune KOSMOS-2 on a dataset of interleaved text and multiple images (for example, each sample has the format "image1+text1+image2+text2").
With the code you suggested above, I am able to run KOSMOS-2 inference with interleaved text and multiple images as input. But to fine-tune on this format I need to form batches: for example, with batch_size=8, I need a batch of 8 "image1+text1+image2+text2" samples to feed to the model.
Doing so, I get "RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [8, 2, 3, 224, 224]", because each model input contains 2 images.
So what should be done to batch interleaved text and multiple images (such as "image1+text1+image2+text2") for fine-tuning KOSMOS-2?
Best!
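The error message indicates that the convolution in the vision encoder receives a 5D tensor where it expects at most 4D. In the helper function earlier in this thread, the two images of a single example are passed stacked along the first dimension. One possible approach, untested here and therefore an assumption, is to merge the batch dimension and the per-example image dimension so that pixel_values has shape [16, 3, 224, 224], keeping the images in the same order as their placeholders appear across the batch. The sketch below shows only the ordering, in plain Python, with strings standing in for images; `flatten_image_batch` is a hypothetical name.

```python
def flatten_image_batch(batch_of_image_lists):
    """Flatten [[img, img], [img, img], ...] into one list, keeping example order."""
    return [image for example_images in batch_of_image_lists for image in example_images]


# Two examples with two images each; strings stand in for image arrays.
batch = [["ex0_img1", "ex0_img2"], ["ex1_img1", "ex1_img2"]]
print(flatten_image_batch(batch))
# ['ex0_img1', 'ex0_img2', 'ex1_img1', 'ex1_img2']
```

For a PyTorch tensor of shape [8, 2, 3, 224, 224], pixel_values.flatten(0, 1) performs the same merge. The text inputs of the examples would additionally need padding to a common length, which this sketch does not cover.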
