Demo example in the paper
Great work!
May I ask how to get the results in Figure 4 and Figure 5 of the paper? I.e., retrieving the specific frames corresponding to the prompts.
Many thanks!
To get more detailed video understanding than plain conversation, you need to load the third-party modules from TPO; see https://huggingface.co/OpenGVLab/VideoChat-TPO/tree/main
It seems the third-party modules in TPO are cgdetr and sam2. How should I proceed after loading these two modules?
Once the corresponding task decoder is loaded, the model will decide whether the decoder needs to be called and use it to produce the corresponding response.
Could you give example code for doing this? I just want to get the specific frame numbers or the specific time span corresponding to a prompt, for example: "In this video, in which frames does a man appear?" or "In this video, from which second to which second does a man appear?" Currently, the demo cannot output the right frames/seconds.
You can try this:
Based on the video content, determine the start and end times of **various activity events** in the video, accompanied by descriptions.
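If the model's reply comes back as free text (e.g. "the man appears from 2 to 5 seconds"), the interval still has to be pulled out of the string. This is a minimal sketch of such a parser, assuming that kind of phrasing; it is not part of the TPO codebase, and the example replies are hypothetical:

```python
import re

def parse_time_span(answer: str):
    """Extract a (start, end) pair in seconds from a free-text answer.

    Returns None when no "X to Y second(s)" pattern is found.
    """
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*seconds?", answer)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

# Hypothetical model replies, for illustration only
print(parse_time_span("The man appears from 2 to 5 seconds."))  # (2.0, 5.0)
print(parse_time_span("25 to 30 seconds"))                      # (25.0, 30.0)
```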
I have tried this, but it cannot output the right time. For a 6-second video, it outputs "25 to 30 seconds".
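One way to at least detect this kind of failure is to check the returned span against the clip's actual duration (which your pipeline already knows, e.g. from frame count / fps). A small sketch with hypothetical helper names, not tied to the TPO code:

```python
def span_is_plausible(start: float, end: float, duration: float) -> bool:
    """Reject spans that are inverted or fall outside the video."""
    return 0.0 <= start < end <= duration

def frame_range(start: float, end: float, fps: float):
    """Convert a time span in seconds to approximate frame indices."""
    return int(start * fps), int(end * fps)

# The reported failure case: a 6-second video answered with "25 to 30 seconds"
print(span_is_plausible(25.0, 30.0, 6.0))  # False -> the prediction cannot be right
print(span_is_plausible(2.0, 5.0, 6.0))    # True
print(frame_range(2.0, 5.0, 30.0))         # (60, 150) at 30 fps
```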