Demo example in the paper

#6
by willsky - opened

Great work!
May I ask how to get the results in Figure 4 and Figure 5 in the paper? I.e., retrieve the specific frames corresponding to the prompts.
Many thanks!

OpenGVLab org

To achieve more detailed video understanding than plain conversation, you need to load the third-party modules from TPO; see https://huggingface.co/OpenGVLab/VideoChat-TPO/tree/main

It seems the third-party modules in TPO are cgdetr and sam2. How should I proceed after loading these two modules?

OpenGVLab org

After loading the corresponding task decoder, the model will decide whether the task decoder needs to be called and use it to help produce the response.

Could you give example code for this? I just want to get the specific frame numbers or timestamps corresponding to a prompt, for example: "In this video, in which frames does a man appear?" or "In this video, from which second to which second does a man appear?" Currently, the demo cannot output the right frames/seconds.

OpenGVLab org

You can try this:

Based on the video content, determine the start and end times of **various activity events** in the video, accompanied by descriptions.

I have tried this, but it cannot output the right times. For a 6-second video, it outputs "25 to 30 seconds".
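One way to catch such out-of-range predictions is to post-process the model's textual answer against the known clip length. Below is a minimal, hypothetical sketch (not part of VideoChat-TPO) that extracts "X to Y seconds" spans from a response and discards or clamps any span beyond the video duration:

```python
import re

def parse_time_spans(response: str, video_duration: float):
    """Extract 'X to Y seconds' style spans from a model response
    and clamp them to the actual video duration."""
    spans = []
    pattern = r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*seconds?"
    for m in re.finditer(pattern, response):
        start, end = float(m.group(1)), float(m.group(2))
        # Clamp out-of-range predictions, e.g. "25 to 30 seconds" for a 6 s clip
        start = min(start, video_duration)
        end = min(end, video_duration)
        if start < end:  # drop spans that collapse after clamping
            spans.append((start, end))
    return spans

print(parse_time_spans("The man appears from 25 to 30 seconds.", 6.0))  # []
print(parse_time_spans("The man appears from 2 to 5 seconds.", 6.0))    # [(2.0, 5.0)]
```

This only sanity-checks the text output; it does not fix the underlying grounding, but it makes hallucinated timestamps easy to detect.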
