Spatio Temporal Action Detection Models

ACRN

Actor-centric relation network

Abstract

Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Results and Models

AVA2.1

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 27.65 | config | ckpt | log |

AVA2.2

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 27.71 | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this flag scales the learning rate according to the ratio between the actual batch size and the original batch size (the related config field is sketched below).
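
For reference, the scaling that --auto-scale-lr performs is driven by a field in the config. The snippet below is a hedged sketch of that field following the MMEngine auto_scale_lr convention, not a verbatim excerpt of this config; the value 64 corresponds to the original batch size implied by the config name (8 GPUs x 8 videos per GPU).

# Hedged sketch: the base batch size the reported learning rate was tuned for.
# Passing --auto-scale-lr to tools/train.py rescales the learning rate by
# actual_batch_size / base_batch_size.
auto_scale_lr = dict(enable=False, base_batch_size=64)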

For more details on data preparation, you can refer to AVA.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train ACRN with a SlowFast backbone on AVA2.1 with deterministic behavior and periodic validation.

python tools/train.py configs/detection/acrn/slowfast-acrn_kinetics400-pretrained-r50_8xb8-8x8x1-cosine-10e_ava21-rgb.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test ACRN with a SlowFast backbone on AVA2.1 and dump the results to a pkl file.

python tools/test.py configs/detection/acrn/slowfast-acrn_kinetics400-pretrained-r50_8xb8-8x8x1-cosine-10e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
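
If you want to inspect the dumped predictions offline, the following minimal sketch loads them; it only assumes the result.pkl file written by --dump above, and the exact structure of each entry depends on the dataset and detection head.

# Hedged sketch: load and inspect the dumped predictions.
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

# Typically a list with one entry per test sample; print the first entry to see
# the structure produced by your config.
print(len(results))
print(results[0])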

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{sun2018actor,
  title={Actor-centric relation network},
  author={Sun, Chen and Shrivastava, Abhinav and Vondrick, Carl and Murphy, Kevin and Sukthankar, Rahul and Schmid, Cordelia},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={318--334},
  year={2018}
}

LFB

Long-term feature banks for detailed video understanding

Abstract

To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

Results and Models

AVA2.1

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (with Nonlocal LFB) | Kinetics-400 | 24.11 | 8620 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (with Max LFB) | Kinetics-400 | 22.15 | 8425 | config | ckpt | log |

Note:

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. According to the Linear Scaling Rule, you may set the learning rate proportional to the batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.

  2. We use slowonly_r50_4x16x1 instead of I3D-R50-NL from the original paper as the backbone of LFB, but achieve a similar improvement (ours: 20.1 -> 24.05 vs. authors: 22.1 -> 25.8).

  3. Because the long-term features are randomly sampled at test time, the test accuracy may vary slightly between runs.

  4. Before training or testing LFB, you need to infer the feature bank with slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py. For more details on inferring the feature bank, refer to the Train section.

  5. The ROIHead now supports single-label classification (i.e. the network outputs at most one label per actor). This can be done by (a) setting multilabel=False during training and (b) setting test_cfg.rcnn.action_thr for testing; a sketch of the corresponding config fields follows this list.
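
The snippet below is a hypothetical config sketch of where the two options from note 5 might live, based only on the key names mentioned there; the nesting and the threshold value are assumptions and may differ between versions, so check the config you are actually using.

# Hedged sketch of the single-label options named in note 5; key paths and the
# threshold value are assumptions, not tuned defaults.
model = dict(
    roi_head=dict(
        bbox_head=dict(
            multilabel=False)),   # train with single-label targets
    test_cfg=dict(
        rcnn=dict(
            action_thr=0.002)))   # per-actor score threshold applied at test time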

Train

a. Infer long-term feature bank for training

Before training or testing LFB, you need to infer the long-term feature bank first. Alternatively, you can download a pre-computed long-term feature bank from AVA_train_val_float32_lfb or AVA_train_val_float16_lfb and put it under lfb_prefix_path; in that case you can skip this step.

Specifically, run the test on the training and validation datasets with the config file slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py (by default the config infers the feature bank of the training dataset only; set dataset_mode = 'val' in the config to infer the feature bank of the validation dataset), and the shared head LFBInferHead will generate the feature bank.
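
The snippet below is a hedged excerpt of what the relevant part of the infer config looks like, using only the names mentioned above (lfb_prefix_path, dataset_mode, LFBInferHead); the path value is a placeholder and the exact nesting may differ from the actual config file.

# Hypothetical excerpt of the infer config; the real file also defines the full
# SlowOnly detection model, data pipeline, etc.
lfb_prefix_path = 'data/ava/lfb'   # placeholder: where lfb_train.pkl / lfb_val.pkl are written
dataset_mode = 'train'             # set to 'val' to infer the validation feature bank

model = dict(
    roi_head=dict(
        shared_head=dict(
            type='LFBInferHead',               # collects RoI features into the bank
            lfb_prefix_path=lfb_prefix_path,
            dataset_mode=dataset_mode)))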

A long-term feature bank file of the AVA training and validation datasets with float32 precision occupies 3.3 GB. If the features are stored with float16 precision, the feature bank occupies 1.65 GB.

You can use the following commands to infer the feature bank of the AVA training and validation datasets; the feature banks will be stored in lfb_prefix_path/lfb_train.pkl and lfb_prefix_path/lfb_val.pkl.

## set `dataset_mode = 'train'` in the infer config file
python tools/test.py configs/detection/lfb/slowonly-lfb-infer_r50_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth

## set `dataset_mode = 'val'` in the infer config file
python tools/test.py configs/detection/lfb/slowonly-lfb-infer_r50_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth

We use the slowonly_r50_4x16x1 checkpoint from slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb to infer the feature bank.

b. Train LFB

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the LFB model on AVA with a half-precision long-term feature bank.

python tools/train.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
  --seed 0 --deterministic

For more details and optional arguments, you can refer to the Training part in the Training and Test Tutorial.

Test

a. Infer long-term feature bank for testing

Before testing LFB, you also need to infer the long-term feature bank first. If you have already generated the feature bank file, you can skip this step.

The procedure is the same as in the Infer long-term feature bank for training part of the Train section.

b. Test LFB

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the LFB model on AVA with a half-precision long-term feature bank and dump the results to a pkl file.

python tools/test.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}
@inproceedings{wu2019long,
  title={Long-term feature banks for detailed video understanding},
  author={Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={284--293},
  year={2019}
}

SlowFast

Slowfast networks for video recognition

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.

Results and Models

AVA2.1

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 24.32 | config | ckpt | log |
| 4x16x1 | 8 | SlowFast ResNet50 (with context) | Kinetics-400 | 25.34 | config | ckpt | log |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 25.80 | config | ckpt | log |

AVA2.2

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 25.90 | config | ckpt | log |
| 8x8x1 | 8 | SlowFast ResNet50 (temporal-max) | Kinetics-400 | 26.41 | config | ckpt | log |
| 8x8x1 | 8 | SlowFast ResNet50 (temporal-max, focal loss) | Kinetics-400 | 26.65 | config | ckpt | log |

MultiSports

| frame sampling strategy | gpus | backbone | pretrain | f-mAP | v-mAP@0.2 | v-mAP@0.5 | v-mAP@0.1:0.9 | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 36.88 | 22.83 | 16.9 | 14.74 | 18618 | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this flag scales the learning rate according to the ratio between the actual batch size and the original batch size.

  2. with context means that both the RoI feature and the global pooled feature are used for classification; temporal-max means that max pooling is applied along the temporal dimension of the feature (a generic sketch of this pooling is shown after this list).

  3. The MultiSports dataset uses frame-mAP (f-mAP) and video-mAP (v-mAP) to evaluate performance. Frame-mAP evaluates the detection results of each frame, while video-mAP uses 3D IoU to evaluate tube-level results under several thresholds. You can refer to the competition page for details.
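
For intuition, here is a generic sketch of the temporal-max trick mentioned in note 2. It is not the MMAction2 implementation, just an illustration on a dummy RoI feature map; the tensor sizes are made up.

import torch

# Dummy RoI-aligned feature map with layout (N, C, T, H, W); sizes are illustrative.
feat = torch.randn(4, 2048, 8, 7, 7)

# "temporal-max": take the maximum over the temporal axis instead of averaging,
# producing an (N, C, H, W) feature that is then pooled and classified as usual.
pooled = feat.max(dim=2).values
print(pooled.shape)  # torch.Size([4, 2048, 7, 7])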

For more details on data preparation, you can refer to AVA and MultiSports.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowFast model on AVA2.1 with deterministic behavior and periodic validation.

python tools/train.py configs/detection/slowfast/slowfast_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowFast model on AVA2.1 and dump the result to a pkl file.

python tools/test.py configs/detection/slowfast/slowfast_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={ICCV},
  pages={6202--6211},
  year={2019}
}
@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={CVPR},
  pages={6047--6056},
  year={2018}
}

SlowOnly

Slowfast networks for video recognition

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.

Results and Models

AVA2.1

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 20.72 | config | ckpt | log |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 22.77 | config | ckpt | log |
| 4x16x1 | 8 | SlowOnly ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 21.55 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 23.77 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet101 | Kinetics-400 | 24.83 | config | ckpt | log |

AVA2.2 (Trained on AVA-Kinetics)

Currently, we only use the training set of AVA-Kinetics and evaluate on the AVA2.2 validation dataset. The AVA-Kinetics validation dataset will be supported soon.

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 24.53 | config | ckpt | log |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 25.87 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 26.10 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 27.82 | config | ckpt | log |

AVA2.2 (Trained on AVA-Kinetics with tricks)

We conduct ablation studies to show the improvements brought by training tricks, using SlowOnly 8x8 pretrained on the Kinetics-700 dataset. The baseline is the last row of AVA2.2 (Trained on AVA-Kinetics).

| method | frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 27.82 | config | ckpt | log |
| + context | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 28.31 | config | ckpt | log |
| + temporal max pooling | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 28.48 | config | ckpt | log |
| + nonlinear head | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 29.83 | config | ckpt | log |
| + focal loss | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 30.33 | config | ckpt | log |
| + more frames | 16x4x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 31.29 | config | ckpt | log |

MultiSports

| frame sampling strategy | gpus | backbone | pretrain | f-mAP | v-mAP@0.2 | v-mAP@0.5 | v-mAP@0.1:0.9 | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 26.40 | 15.48 | 10.62 | 9.65 | 8509 | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this flag scales the learning rate according to the ratio between the actual batch size and the original batch size.

  2. + context means that both the RoI feature and the global pooled feature are used for classification; + temporal max pooling means that max pooling is applied along the temporal dimension of the feature; + nonlinear head means that a 2-layer MLP is used instead of a linear classifier (a generic sketch of this head is shown after this list).

  3. The MultiSports dataset uses frame-mAP (f-mAP) and video-mAP (v-mAP) to evaluate performance. Frame-mAP evaluates the detection results of each frame, while video-mAP uses 3D IoU to evaluate tube-level results under several thresholds. You can refer to the competition page for details.
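
For intuition, here is a generic sketch of the nonlinear head mentioned in note 2: the single linear classifier on the pooled RoI feature is replaced by a small 2-layer MLP. This is an illustration only, not the MMAction2 implementation, and the feature/class sizes are made up.

import torch.nn as nn

in_channels, hidden_channels, num_classes = 2048, 2048, 81  # illustrative sizes

# Baseline: a single linear classifier on the pooled RoI feature.
linear_head = nn.Linear(in_channels, num_classes)

# "+ nonlinear head": a 2-layer MLP in place of the linear classifier.
mlp_head = nn.Sequential(
    nn.Linear(in_channels, hidden_channels),
    nn.ReLU(inplace=True),
    nn.Linear(hidden_channels, num_classes),
)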

For more details on data preparation, you can refer to AVA and MultiSports.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowOnly model on AVA2.1 with deterministic behavior and periodic validation.

python tools/train.py configs/detection/slowonly/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowOnly model on AVA2.1 and dump the result to a pkl file.

python tools/test.py configs/detection/slowonly/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={ICCV},
  pages={6202--6211},
  year={2019}
}
@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={CVPR},
  pages={6047--6056},
  year={2018}
}
@article{li2020ava,
  title={The ava-kinetics localized human actions video dataset},
  author={Li, Ang and Thotakuri, Meghana and Ross, David A and Carreira, Jo{\~a}o and Vostrikov, Alexander and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2005.00214},
  year={2020}
}

VideoMAE

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data.

Results and Models

AVA2.2

Currently, we use the training set of AVA-Kinetics and evaluate on the AVA2.2 validation dataset.

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16x4x1 | raw | 8 | ViT Base | Kinetics-400 | 33.6 | config | ckpt | log |
| 16x4x1 | raw | 8 | ViT Large | Kinetics-400 | 38.7 | config | ckpt | log |

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the ViT base model on AVA-Kinetics with deterministic behavior.

python tools/train.py configs/detection/ava_kinetics/vit-base-p16_videomae-k400-pre_8xb8-16x4x1-20e-adamw_ava-kinetics-rgb.py \
    --cfg-options randomness.seed=0 randomness.deterministic=True

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the ViT base model on AVA-Kinetics and dump the result to a pkl file.

python tools/test.py configs/detection/ava_kinetics/vit-base-p16_videomae-k400-pre_8xb8-16x4x1-20e-adamw_ava-kinetics-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}