---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
model-index:
- name: VideoChat-Flash-Qwen2_5-2B_res448
  results:
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 65.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 70.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Perception Test
      type: percepTest
    metrics:
    - type: accuracy
      value: 70.5
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 58.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME (wo sub)
      type: videomme
    metrics:
    - type: accuracy
      value: 57.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LVBench
      type: lvbench
    metrics:
    - type: accuracy
      value: 42.9
      name: accuracy
      verified: true
---
# 🦜VideoChat-Flash-Qwen2_5-2B_res448⚡
[\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) [\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-Flash) [\[📜 Tech Report\]](https://www.arxiv.org/abs/2501.00574) [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash)

VideoChat-Flash-2B is built upon UMT-L (300M) and Qwen2_5-2B, using only **16 tokens per frame**. By leveraging YaRN to extend the context window to 128k (the base Qwen LLM's native context window is 32k), the model supports input sequences of up to approximately **10,000 frames**.
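
As a rough, illustrative sanity check of the frame budget above (assumptions: text tokens are ignored and "128k" is read as 128×1024 positions), dividing the raw window by the per-frame token count already lands in the same order of magnitude as the quoted figure:

```python
# Back-of-envelope only; not an exact accounting of the model's token budget.
tokens_per_frame = 16            # visual tokens per frame
context_window = 128 * 1024      # approximate window after YaRN extension
print(context_window // tokens_per_frame)  # 8192 frames before any text tokens;
# the gap to the quoted ~10,000 frames depends on the additional token
# compression described in the tech report.
```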
> Note: Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension; for best performance, we recommend interacting in English.
## 📈 Performance
| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| [VideoChat-Flash-Qwen2_5-2B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) | 70.0 | 58.3 | 57.0 |
| [VideoChat-Flash-Qwen2-7B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res224) | 73.2 | 64.2 | 64.0 |
| [VideoChat-Flash-Qwen2-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) | 74.0 | 64.7 | 65.3 |
## 🚀 How to use the model
First, install [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) and a few other dependencies. A simple installation example is given below:
```bash
pip install transformers==4.39.2
pip install timm
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
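Optionally, a quick import check (a minimal sketch; it assumes PyTorch with CUDA is already installed, since flash-attn builds against it) confirms the packages above are available:

```python
# Sanity check: some import names differ from the pip package names
# (opencv-python -> cv2, flash-attn -> flash_attn).
import av, cv2, decord, flash_attn, imageio, timm, torch, transformers

print(transformers.__version__)   # should match the pinned version installed above
print(torch.cuda.is_available())  # flash-attn and the .cuda() call below need a CUDA GPU
```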
Then you can use the model as follows:
```python
from transformers import AutoModel, AutoTokenizer
# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_processor = model.get_vision_tower().image_processor
mm_llm_compress = False  # whether to enable global token compression inside the LLM
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False
# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)
video_path = "your_video.mp4"
# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question1,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output1)
# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question2,
    chat_history=chat_history,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output2)
```
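The same `model.chat` call can be reused across several clips; the sketch below (the file names and prompt are placeholders, not part of the model card) simply loops the single-turn example above:

```python
# Hypothetical list of clips; reuses the single-turn chat API shown above.
video_paths = ["clip1.mp4", "clip2.mp4"]

for path in video_paths:
    answer, _ = model.chat(
        video_path=path,
        tokenizer=tokenizer,
        user_prompt="Summarize the main events in this video.",
        return_history=True,
        max_num_frames=max_num_frames,
        generation_config=generation_config,
    )
    print(path, "->", answer)
```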
## ✏️ Citation
```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```