Getting Bounding Boxes for Vision

#29
by sujan2023 - opened

# return_dict_in_generate=True returns an output object, not a bare tensor of token ids
generate_output = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
    output_scores=True,
    return_dict_in_generate=True,
)

After generating the output, I tried to fetch the bounding boxes like this:
    bounding_boxes = getattr(generate_output, "box_coordinates", None)
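
To see which fields the returned object actually carries, rather than guessing at attribute names such as "box_coordinates", a small check along these lines may help. This is only a sketch: populated_fields is a hypothetical helper name, and fake_output is a stand-in object, since I am assuming the real generate() result behaves either like a mapping or like a plain object with attributes.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def populated_fields(output):
    """Return the sorted names of public fields on `output` that are not None."""
    # Mapping-like outputs expose items(); plain objects are read via vars().
    if isinstance(output, Mapping):
        items = output.items()
    else:
        items = vars(output).items()
    return sorted(
        name for name, value in items
        if not name.startswith("_") and value is not None
    )


# Stand-in for a generate() result (hypothetical values, not real model output).
fake_output = SimpleNamespace(sequences=[[1, 2, 3]], scores=[0.1], attentions=None)
print(populated_fields(fake_output))
```

If no box-like field shows up in that list, the getattr call above will simply return None.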

I am fairly sure that Phi-4-multimodal-instruct doesn't provide bounding boxes the way Florence-2 does.
However, it would be great if Phi-4-multimodal-instruct provided that information, since it already performs optical character recognition.

Any idea how to get bounding boxes from the model would be a great help for vision use cases.
Or am I missing something?

Any suggestions or ideas would be highly appreciated.
Regards.

I have also attempted to get the attention patterns like this:
    with torch.no_grad():
        outputs = self.model(
            **inputs,
            output_attentions=True,
            output_hidden_states=True
        )
    attention_patterns = outputs.attentions

But it seems that outputs doesn't have an attentions attribute.
Any suggestions would be highly appreciated.
Thanks.
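
For reference, this is roughly how I check whether attentions came back without hitting an AttributeError. It is a sketch only: describe_attentions is a hypothetical helper, and the stand-in objects below just mimic an output whose attentions field is a tuple of per-layer tensors with a .shape attribute. My understanding, which I have not confirmed for this model, is that some attention backends (SDPA, FlashAttention) do not return weights at all and that loading the model with attn_implementation="eager" is the usual workaround.

```python
from types import SimpleNamespace


def describe_attentions(outputs):
    """Summarize outputs.attentions, tolerating a missing or empty field."""
    attentions = getattr(outputs, "attentions", None)
    if not attentions:
        # Covers a missing attribute, None, and an empty tuple.
        return "no attentions returned"
    first_shape = tuple(attentions[0].shape)
    return f"{len(attentions)} layers, first shape {first_shape}"


# Stand-ins for model outputs (hypothetical shapes, not real model data).
layer = SimpleNamespace(shape=(1, 2, 3, 3))
print(describe_attentions(SimpleNamespace(attentions=(layer, layer))))
print(describe_attentions(SimpleNamespace(attentions=None)))
```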
