MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos

Overview

Anomaly detection for public safety requires modeling fine-grained motion and contextual information across multiple time scales. We propose Multi-Timescale Feature Learning (MTFL), which leverages short-, medium-, and long-term temporal tubelets within a Video Swin Transformer to enhance spatio-temporal representations. MTFL achieves 89.78% AUC on UCF-Crime, outperforming existing methods, and demonstrates complementary performance with 95.32% AUC on ShanghaiTech and 84.57% AP on XD-Violence. In addition, we introduce the Video Anomaly Detection Dataset (VADD), an extended version of UCF-Crime containing 2,591 videos across 18 anomaly classes.

Workflow of the Multi-Timescale Feature Learning (MTFL) model. The input video is segmented into 𝑇 snippets. The Multi-Timescale Feature Generator (MTFG) produces three sets of 𝑇 features of 𝐷 dimensions each, 𝐅_L, 𝐅_M, and 𝐅_S, extracted within long, medium, and short temporal tubelets respectively. The Multi-Timescale Feature Fusion (MTFF) module then captures the correlations among the three feature sets and the dependencies among video snippets, and fuses them into the output feature matrix 𝐗. A classifier maps 𝐗 to the final anomaly scores for the 𝑇 snippets. The MTFF and the classifier are trained with a loss function that combines a feature magnitude loss and a classification loss.
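The data flow in this workflow can be followed through its shapes alone. The sketch below is not the MTFL implementation: the three feature sets are random placeholders for the Video Swin Transformer outputs, the fusion is a plain element-wise mean standing in for MTFF, and the classifier is a single logistic unit. The values of T and D and the names branch_features, fuse, and classify are hypothetical and chosen only for illustration.

```python
import math
import random

T, D = 32, 8  # hypothetical: number of snippets and feature dimension


def branch_features(num_snippets, dim, seed):
    """Stand-in for one MTFG branch: a num_snippets x dim matrix of random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_snippets)]


def fuse(f_long, f_medium, f_short):
    """Placeholder for MTFF: element-wise mean of the three T x D matrices.

    The real module models correlations across timescales and dependencies
    across snippets; none of that is reproduced here.
    """
    return [
        [(a + b + c) / 3.0 for a, b, c in zip(row_l, row_m, row_s)]
        for row_l, row_m, row_s in zip(f_long, f_medium, f_short)
    ]


def classify(x, weights, bias):
    """Assumed per-snippet logistic classifier: one score in (0, 1) per row of x."""
    scores = []
    for row in x:
        logit = sum(w * v for w, v in zip(weights, row)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Three T x D feature sets, one per timescale (long, medium, short).
F_L, F_M, F_S = (branch_features(T, D, seed) for seed in (0, 1, 2))

# Fused feature matrix X keeps the T x D shape in this sketch.
X = fuse(F_L, F_M, F_S)

# One anomaly score per snippet.
scores = classify(X, weights=[0.1] * D, bias=0.0)

print(len(X), len(X[0]), len(scores))  # prints: 32 8 32
```

What matters here is the shape contract: three T x D inputs go in, and one score per snippet comes out. The training loss is left out because this description does not give its exact form.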

Citation

@article{mtfl2024,
  title   = {MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos},
  author  = {Zhang, Yiling and Akdag, Erkut and Bondarev, Egor and De With, Peter H. N.},
  journal = {arXiv preprint arXiv:2410.05900},
  year    = {2024}
}
