arxiv:2402.08875

Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Published on Feb 14, 2024

Abstract

The growing variety and quantity of tagged multimedia content on online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Fine-tuned on established action recognition benchmarks, our model achieves state-of-the-art results with the ViT-giant backbone: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2. These results highlight the potential of unstructured, unlabeled videos as a valuable source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset continues to grow. Our findings emphasize two key lessons in self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing returns for some datasets, and (2) quality matters more than quantity, especially when building foundation models.
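
For readers who want to experiment with the same recipe, the sketch below illustrates the two stages the abstract describes — MAE-style self-supervised pre-training on unlabeled clips, followed by supervised fine-tuning for action classification — using the VideoMAE classes from Hugging Face transformers. It is a minimal illustration under stated assumptions, not the authors' implementation: the public MCG-NJU/videomae-base checkpoint stands in for the paper's VideoMAE V2 ViT-giant model, random tensors stand in for real TikTok clips and benchmark videos, and the masking ratio is a placeholder rather than the paper's setting.

```python
# Sketch of the two-stage recipe described in the abstract:
# (1) masked self-supervised pre-training on unlabeled video clips,
# (2) supervised fine-tuning for action classification (e.g., UCF101's 101 classes).
# Assumptions: "MCG-NJU/videomae-base" stands in for the paper's VideoMAE V2
# ViT-giant model, and random tensors stand in for real video data.
import torch
from transformers import VideoMAEForPreTraining, VideoMAEForVideoClassification

num_frames = 16  # VideoMAE models typically consume 16-frame clips

# --- Stage 1: self-supervised pre-training on an unlabeled clip ---------------
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

# A fake clip: (batch, frames, channels, height, width)
pixel_values = torch.randn(1, num_frames, 3, 224, 224)

# Tube masking: hide most space-time patches and reconstruct them.
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
mask_ratio = 0.9  # placeholder; MAE-style methods typically mask most patches
bool_masked_pos = torch.rand(1, seq_length) < mask_ratio

outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
outputs.loss.backward()  # reconstruction loss on the masked patches

# --- Stage 2: supervised fine-tuning on a labeled benchmark -------------------
# In the paper's pipeline this stage would initialize from the stage-1 encoder;
# here a classifier head is simply attached to the public checkpoint.
classifier = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=101
)
labels = torch.randint(0, 101, (1,))
cls_outputs = classifier(pixel_values=pixel_values, labels=labels)
cls_outputs.loss.backward()  # cross-entropy loss for action recognition
```

The key property the sketch tries to make concrete is that stage 1 needs no labels at all, which is why a large corpus of unlabeled public clips can be used directly for pre-training before any benchmark-specific fine-tuning.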

