Spatio Temporal Action Detection Models

ACRN

Actor-centric relation network

Abstract

Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Results and Models

AVA2.1

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 27.65 | config | ckpt | log |

AVA2.2

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 27.71 | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this flag scales the learning rate according to the ratio between the actual batch size and the original batch size (the related config field is sketched below).
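
For reference, the scaling that --auto-scale-lr performs is driven by a field in the config. The snippet below is a hedged sketch of that field following the MMEngine auto_scale_lr convention, not a verbatim excerpt of this config; the value 64 corresponds to the original batch size implied by the config name (8 GPUs x 8 videos per GPU).

# Hedged sketch: the base batch size the reported learning rate was tuned for.
# Passing --auto-scale-lr to tools/train.py rescales the learning rate by
# actual_batch_size / base_batch_size.
auto_scale_lr = dict(enable=False, base_batch_size=64)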

For more details on data preparation, you can refer to AVA.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train ACRN with a SlowFast backbone on AVA2.1 with deterministic behavior and periodic validation.

python tools/train.py configs/detection/acrn/slowfast-acrn_kinetics400-pretrained-r50_8xb8-8x8x1-cosine-10e_ava21-rgb.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test ACRN with a SlowFast backbone on AVA2.1 and dump the results to a pkl file.

python tools/test.py configs/detection/acrn/slowfast-acrn_kinetics400-pretrained-r50_8xb8-8x8x1-cosine-10e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
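
If you want to inspect the dumped predictions offline, the following minimal sketch loads them; it only assumes the result.pkl file written by --dump above, and the exact structure of each entry depends on the dataset and detection head.

# Hedged sketch: load and inspect the dumped predictions.
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

# Typically a list with one entry per test sample; print the first entry to see
# the structure produced by your config.
print(len(results))
print(results[0])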

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{sun2018actor,
  title={Actor-centric relation network},
  author={Sun, Chen and Shrivastava, Abhinav and Vondrick, Carl and Murphy, Kevin and Sukthankar, Rahul and Schmid, Cordelia},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={318--334},
  year={2018}
}

LFB

Long-term feature banks for detailed video understanding

Abstract

To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

Results and Models

AVA2.1

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (with Nonlocal LFB) | Kinetics-400 | 24.11 | 8620 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (with Max LFB) | Kinetics-400 | 22.15 | 8425 | config | ckpt | log |

Note:

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. According to the Linear Scaling Rule, you may set the learning rate proportional to the batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.

  2. We use slowonly_r50_4x16x1 instead of I3D-R50-NL from the original paper as the backbone of LFB, but achieve a similar improvement (ours: 20.1 -> 24.05 vs. authors: 22.1 -> 25.8).

  3. Because the long-term features are randomly sampled at test time, the test accuracy may vary slightly between runs.

  4. Before training or testing LFB, you need to infer the feature bank with slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py. For more details on inferring the feature bank, refer to the Train section.

  5. The ROIHead now supports single-label classification (i.e. the network outputs at most one label per actor). This can be done by (a) setting multilabel=False during training and (b) setting test_cfg.rcnn.action_thr for testing; a sketch of the corresponding config fields follows this list.
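
The snippet below is a hypothetical config sketch of where the two options from note 5 might live, based only on the key names mentioned there; the nesting and the threshold value are assumptions and may differ between versions, so check the config you are actually using.

# Hedged sketch of the single-label options named in note 5; key paths and the
# threshold value are assumptions, not tuned defaults.
model = dict(
    roi_head=dict(
        bbox_head=dict(
            multilabel=False)),   # train with single-label targets
    test_cfg=dict(
        rcnn=dict(
            action_thr=0.002)))   # per-actor score threshold applied at test time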

Train

a. Infer long-term feature bank for training

Before training or testing LFB, you need to infer the long-term feature bank first. Alternatively, you can download a pre-computed long-term feature bank from AVA_train_val_float32_lfb or AVA_train_val_float16_lfb and put it under lfb_prefix_path; in that case you can skip this step.

Specifically, run the test on the training and validation datasets with the config file slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py (by default the config infers the feature bank of the training dataset only; set dataset_mode = 'val' in the config to infer the feature bank of the validation dataset), and the shared head LFBInferHead will generate the feature bank.
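
The snippet below is a hedged excerpt of what the relevant part of the infer config looks like, using only the names mentioned above (lfb_prefix_path, dataset_mode, LFBInferHead); the path value is a placeholder and the exact nesting may differ from the actual config file.

# Hypothetical excerpt of the infer config; the real file also defines the full
# SlowOnly detection model, data pipeline, etc.
lfb_prefix_path = 'data/ava/lfb'   # placeholder: where lfb_train.pkl / lfb_val.pkl are written
dataset_mode = 'train'             # set to 'val' to infer the validation feature bank

model = dict(
    roi_head=dict(
        shared_head=dict(
            type='LFBInferHead',               # collects RoI features into the bank
            lfb_prefix_path=lfb_prefix_path,
            dataset_mode=dataset_mode)))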

A long-term feature bank file of the AVA training and validation datasets with float32 precision occupies 3.3 GB. If the features are stored with float16 precision, the feature bank occupies 1.65 GB.

You can use the following commands to infer the feature bank of the AVA training and validation datasets; the feature banks will be stored in lfb_prefix_path/lfb_train.pkl and lfb_prefix_path/lfb_val.pkl.

## set `dataset_mode = 'train'` in the infer config file
python tools/test.py configs/detection/lfb/slowonly-lfb-infer_r50_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth

## set `dataset_mode = 'val'` in the infer config file
python tools/test.py configs/detection/lfb/slowonly-lfb-infer_r50_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth

We use the slowonly_r50_4x16x1 checkpoint from slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb to infer the feature bank.

b. Train LFB

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the LFB model on AVA with a half-precision long-term feature bank.

python tools/train.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
  --seed 0 --deterministic

For more details and optional arguments, you can refer to the Training part in the Training and Test Tutorial.

Test

a. Infer long-term feature bank for testing

Before testing LFB, you also need to infer the long-term feature bank first. If you have already generated the feature bank file, you can skip this step.

The procedure is the same as in the Infer long-term feature bank for training part of the Train section.

b. Test LFB

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the LFB model on AVA with a half-precision long-term feature bank and dump the results to a pkl file.

python tools/test.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}
@inproceedings{wu2019long,
  title={Long-term feature banks for detailed video understanding},
  author={Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={284--293},
  year={2019}
}

SlowFast

Slowfast networks for video recognition

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.

Results and Models

AVA2.1

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 24.32 | config | ckpt | log |
| 4x16x1 | 8 | SlowFast ResNet50 (with context) | Kinetics-400 | 25.34 | config | ckpt | log |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 25.80 | config | ckpt | log |

AVA2.2

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8x8x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 25.90 | config | ckpt | log |
| 8x8x1 | 8 | SlowFast ResNet50 (temporal-max) | Kinetics-400 | 26.41 | config | ckpt | log |
| 8x8x1 | 8 | SlowFast ResNet50 (temporal-max, focal loss) | Kinetics-400 | 26.65 | config | ckpt | log |

MultiSports

| frame sampling strategy | gpus | backbone | pretrain | f-mAP | v-mAP@0.2 | v-mAP@0.5 | v-mAP@0.1:0.9 | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowFast ResNet50 | Kinetics-400 | 36.88 | 22.83 | 16.9 | 14.74 | 18618 | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this flag scales the learning rate according to the ratio between the actual batch size and the original batch size.

  2. with context means that both the RoI feature and the global pooled feature are used for classification; temporal-max means that max pooling is applied along the temporal dimension of the feature (a generic sketch of this pooling is shown after this list).

  3. The MultiSports dataset uses frame-mAP (f-mAP) and video-mAP (v-mAP) to evaluate performance. Frame-mAP evaluates the detection results of each frame, while video-mAP uses 3D IoU to evaluate tube-level results under several thresholds. You can refer to the competition page for details.
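
For intuition, here is a generic sketch of the temporal-max trick mentioned in note 2. It is not the MMAction2 implementation, just an illustration on a dummy RoI feature map; the tensor sizes are made up.

import torch

# Dummy RoI-aligned feature map with layout (N, C, T, H, W); sizes are illustrative.
feat = torch.randn(4, 2048, 8, 7, 7)

# "temporal-max": take the maximum over the temporal axis instead of averaging,
# producing an (N, C, H, W) feature that is then pooled and classified as usual.
pooled = feat.max(dim=2).values
print(pooled.shape)  # torch.Size([4, 2048, 7, 7])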

For more details on data preparation, you can refer to AVA and MultiSports.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowFast model on AVA2.1 with deterministic behavior and periodic validation.

python tools/train.py configs/detection/slowfast/slowfast_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowFast model on AVA2.1 and dump the result to a pkl file.

python tools/test.py configs/detection/slowfast/slowfast_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={ICCV},
  pages={6202--6211},
  year={2019}
}
@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={CVPR},
  pages={6047--6056},
  year={2018}
}

SlowOnly

Slowfast networks for video recognition

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.

Results and Models

AVA2.1

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 20.72 | config | ckpt | log |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 22.77 | config | ckpt | log |
| 4x16x1 | 8 | SlowOnly ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 21.55 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 23.77 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet101 | Kinetics-400 | 24.83 | config | ckpt | log |

AVA2.2 (Trained on AVA-Kinetics)

Currently, we only use the training set of AVA-Kinetics and evaluate on the AVA2.2 validation dataset. The AVA-Kinetics validation dataset will be supported soon.

| frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 24.53 | config | ckpt | log |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 25.87 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 26.10 | config | ckpt | log |
| 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 27.82 | config | ckpt | log |

AVA2.2 (Trained on AVA-Kinetics with tricks)

We conduct ablation studies to show the improvements brought by training tricks, using SlowOnly 8x8 pretrained on the Kinetics-700 dataset. The baseline is the last row of AVA2.2 (Trained on AVA-Kinetics).

| method | frame sampling strategy | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 27.82 | config | ckpt | log |
| + context | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 28.31 | config | ckpt | log |
| + temporal max pooling | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 28.48 | config | ckpt | log |
| + nonlinear head | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 29.83 | config | ckpt | log |
| + focal loss | 8x8x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 30.33 | config | ckpt | log |
| + more frames | 16x4x1 | 8 | SlowOnly ResNet50 | Kinetics-700 | 31.29 | config | ckpt | log |

MultiSports

| frame sampling strategy | gpus | backbone | pretrain | f-mAP | v-mAP@0.2 | v-mAP@0.5 | v-mAP@0.1:0.9 | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | 8 | SlowOnly ResNet50 | Kinetics-400 | 26.40 | 15.48 | 10.62 | 9.65 | 8509 | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this flag scales the learning rate according to the ratio between the actual batch size and the original batch size.

  2. + context means that both the RoI feature and the global pooled feature are used for classification; + temporal max pooling means that max pooling is applied along the temporal dimension of the feature; + nonlinear head means that a 2-layer MLP is used instead of a linear classifier (a generic sketch of this head is shown after this list).

  3. The MultiSports dataset uses frame-mAP (f-mAP) and video-mAP (v-mAP) to evaluate performance. Frame-mAP evaluates the detection results of each frame, while video-mAP uses 3D IoU to evaluate tube-level results under several thresholds. You can refer to the competition page for details.
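
For intuition, here is a generic sketch of the nonlinear head mentioned in note 2: the single linear classifier on the pooled RoI feature is replaced by a small 2-layer MLP. This is an illustration only, not the MMAction2 implementation, and the feature/class sizes are made up.

import torch.nn as nn

in_channels, hidden_channels, num_classes = 2048, 2048, 81  # illustrative sizes

# Baseline: a single linear classifier on the pooled RoI feature.
linear_head = nn.Linear(in_channels, num_classes)

# "+ nonlinear head": a 2-layer MLP in place of the linear classifier.
mlp_head = nn.Sequential(
    nn.Linear(in_channels, hidden_channels),
    nn.ReLU(inplace=True),
    nn.Linear(hidden_channels, num_classes),
)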

For more details on data preparation, you can refer to AVA and MultiSports.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowOnly model on AVA2.1 with deterministic behavior and periodic validation.

python tools/train.py configs/detection/slowonly/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowOnly model on AVA2.1 and dump the result to a pkl file.

python tools/test.py configs/detection/slowonly/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={ICCV},
  pages={6202--6211},
  year={2019}
}
@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={CVPR},
  pages={6047--6056},
  year={2018}
}
@article{li2020ava,
  title={The ava-kinetics localized human actions video dataset},
  author={Li, Ang and Thotakuri, Meghana and Ross, David A and Carreira, Jo{\~a}o and Vostrikov, Alexander and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2005.00214},
  year={2020}
}

VideoMAE

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data.

Results and Models

AVA2.2

Currently, we use the training set of AVA-Kinetics and evaluate on the AVA2.2 validation dataset.

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16x4x1 | raw | 8 | ViT Base | Kinetics-400 | 33.6 | config | ckpt | log |
| 16x4x1 | raw | 8 | ViT Large | Kinetics-400 | 38.7 | config | ckpt | log |

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the ViT base model on AVA-Kinetics with deterministic behavior.

python tools/train.py configs/detection/ava_kinetics/vit-base-p16_videomae-k400-pre_8xb8-16x4x1-20e-adamw_ava-kinetics-rgb.py \
    --cfg-options randomness.seed=0 randomness.deterministic=True

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the ViT base model on AVA-Kinetics and dump the result to a pkl file.

python tools/test.py configs/detection/ava_kinetics/vit-base-p16_videomae-k400-pre_8xb8-16x4x1-20e-adamw_ava-kinetics-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}