Audio-based Action Recognition Models¶

ResNet for Audio¶

Audiovisual SlowFast Networks for Video Recognition

Abstract¶

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Au- dio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.

Results and Models¶

Kinetics-400¶

frame sampling strategy	n_fft	gpus	backbone	pretrain	top1 acc	top5 acc	testing protocol	FLOPs	params	config	ckpt	log
64x1x1	1024	8	Resnet18	None	13.7	27.3	1 clips	0.37G	11.4M	config	ckpt	log

Train¶

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train ResNet model on Kinetics-400 audio dataset in a deterministic option with periodic validation.

python tools/train.py configs/recognition_audio/resnet/tsn_r18_8xb320-64x1x1-100e_kinetics400-audio-feature.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test¶

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test ResNet model on Kinetics-400 audio dataset and dump the result to a pkl file.

python tools/test.py configs/recognition_audio/resnet/tsn_r18_8xb320-64x1x1-100e_kinetics400-audio-feature.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation¶

@article{xiao2020audiovisual,
  title={Audiovisual SlowFast Networks for Video Recognition},
  author={Xiao, Fanyi and Lee, Yong Jae and Grauman, Kristen and Malik, Jitendra and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2001.08740},
  year={2020}
}