Spatio Temporal Action Detection Models¶
ACRN¶
Actor-centric relation network
Abstract¶
Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Results and Models¶
AVA2.1¶
Model | Modality | Pretrained | Backbone | Input | gpus | mAP | log | json | ckpt |
---|---|---|---|---|---|---|---|---|---|
slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8 | 27.1 | log | json | ckpt |
AVA2.2¶
Model | Modality | Pretrained | Backbone | Input | gpus | mAP | log | json | ckpt |
---|---|---|---|---|---|---|---|---|---|
slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8 | 27.8 | log | json | ckpt |
Note
The gpus column gives the number of GPUs used to train the checkpoint. According to the Linear Scaling Rule, you should set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.
For more details on data preparation, you can refer to AVA in Data Preparation.
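The Linear Scaling Rule in the note above amounts to a single multiplication. The sketch below is a minimal illustration that takes the note's own pair (lr=0.01 at 4 GPUs x 2 videos/GPU) as the reference point; the helper name scale_lr and its parameters are hypothetical and not part of MMAction2.

```python
def scale_lr(base_lr, base_gpus, base_videos_per_gpu, gpus, videos_per_gpu):
    """Scale the learning rate linearly with the total batch size."""
    base_batch = base_gpus * base_videos_per_gpu
    new_batch = gpus * videos_per_gpu
    return base_lr * new_batch / base_batch


# Reference point taken from the note: lr=0.01 for 4 GPUs x 2 videos/GPU.
lr = scale_lr(0.01, base_gpus=4, base_videos_per_gpu=2, gpus=16, videos_per_gpu=4)
print(round(lr, 4))  # 0.08
```

The resulting value would then be written into the optimizer setting of the config you train with.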
Train¶
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train ACRN with SlowFast backbone on AVA with periodic validation.
python tools/train.py configs/detection/acrn/slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb.py --validate
For more details and a description of the optional arguments, refer to the Training setting part of getting_started.
Test¶
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test ACRN with SlowFast backbone on AVA and dump the result to a csv file.
python tools/test.py configs/detection/acrn/slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb.py checkpoints/SOME_CHECKPOINT.pth --eval mAP --out results.csv
For more details and a description of the optional arguments, refer to the Test a dataset part of getting_started.
Citation¶
@inproceedings{gu2018ava,
title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={6047--6056},
year={2018}
}
@inproceedings{sun2018actor,
title={Actor-centric relation network},
author={Sun, Chen and Shrivastava, Abhinav and Vondrick, Carl and Murphy, Kevin and Sukthankar, Rahul and Schmid, Cordelia},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
pages={318--334},
year={2018}
}
AVA¶
Ava: A video dataset of spatio-temporally localized atomic visual actions

Abstract¶
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

@inproceedings{feichtenhofer2019slowfast,
title={Slowfast networks for video recognition},
author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
booktitle={Proceedings of the IEEE international conference on computer vision},
pages={6202--6211},
year={2019}
}
Results and Models¶
AVA2.1¶
Model | Modality | Pretrained | Backbone | Input | gpus | Resolution | mAP | log | json | ckpt |
---|---|---|---|---|---|---|---|---|---|---|
slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 4x16 | 8 | short-side 256 | 20.1 | log | json | ckpt |
slowonly_omnisource_pretrained_r50_4x16x1_20e_ava_rgb | RGB | OmniSource | ResNet50 | 4x16 | 8 | short-side 256 | 21.8 | log | json | ckpt |
slowonly_nl_kinetics_pretrained_r50_4x16x1_10e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 4x16 | 8 | short-side 256 | 21.75 | log | json | ckpt |
slowonly_nl_kinetics_pretrained_r50_8x8x1_10e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 8x8 | 8x2 | short-side 256 | 23.79 | log | json | ckpt |
slowonly_kinetics_pretrained_r101_8x8x1_20e_ava_rgb | RGB | Kinetics-400 | ResNet101 | 8x8 | 8x2 | short-side 256 | 24.6 | log | json | ckpt |
slowonly_omnisource_pretrained_r101_8x8x1_20e_ava_rgb | RGB | OmniSource | ResNet101 | 8x8 | 8x2 | short-side 256 | 25.9 | log | json | ckpt |
slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8x2 | short-side 256 | 24.4 | log | json | ckpt |
slowfast_context_kinetics_pretrained_r50_4x16x1_20e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8x2 | short-side 256 | 25.4 | log | json | ckpt |
slowfast_kinetics_pretrained_r50_8x8x1_20e_ava_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8x2 | short-side 256 | 25.5 | log | json | ckpt |
AVA2.2¶
Model | Modality | Pretrained | Backbone | Input | gpus | mAP | log | json | ckpt |
---|---|---|---|---|---|---|---|---|---|
slowfast_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8 | 26.1 | log | json | ckpt |
slowfast_temporal_max_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8 | 26.4 | log | json | ckpt |
slowfast_temporal_max_focal_alpha3_gamma1_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb | RGB | Kinetics-400 | ResNet50 | 32x2 | 8 | 26.8 | log | json | ckpt |
Note
The gpus column gives the number of GPUs used to train the checkpoint. According to the Linear Scaling Rule, you should set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.
Context indicates that both the RoI feature and the global pooled feature are used for classification, which generally improves mAP by around 1%.
For more details on data preparation, you can refer to AVA in Data Preparation.
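As a rough illustration of the Context variant, the sketch below average-pools a feature map into a global vector and joins it with an actor's RoI feature before classification. Combining the two by concatenation, and every name used here, are assumptions made for illustration; the actual head may fuse the features differently.

```python
def global_average_pool(feature_map):
    """Average a list of per-position feature vectors into one vector."""
    num_positions = len(feature_map)
    channels = len(feature_map[0])
    return [sum(pos[c] for pos in feature_map) / num_positions
            for c in range(channels)]


def context_feature(roi_feature, feature_map):
    """Concatenate an actor's RoI feature with the global pooled feature."""
    return roi_feature + global_average_pool(feature_map)


# Toy example: 2-channel features at 3 positions, one actor RoI feature.
feature_map = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
roi_feature = [0.5, 0.25]
print(context_feature(roi_feature, feature_map))  # [0.5, 0.25, 3.0, 4.0]
```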
Train¶
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train SlowOnly model on AVA with periodic validation.
python tools/train.py configs/detection/ava/slowonly_kinetics_pretrained_r50_8x8x1_20e_ava_rgb.py --validate
For more details and a description of the optional arguments, refer to the Training setting part of getting_started.
Train Custom Classes From Ava Dataset¶
You can train on a custom subset of the AVA classes. AVA suffers from class imbalance: classes such as stand, listen to (a person), talk to (e.g., self, a person, a group) and watch (a person) each have more than 100,000 samples, whereas half of all classes have fewer than 500 samples. In most cases, training only on custom classes with fewer samples leads to better results for those classes.
Three steps to train custom classes:
Step 1: Select the custom classes from the original classes and name the list custom_classes. Class 0 should not be selected, since it is reserved for further usage (to identify whether a proposal is positive or negative, not implemented yet) and will be added automatically.
Step 2: Set num_classes. To be compatible with the current code, please make sure num_classes == len(custom_classes) + 1.
The new class 0 corresponds to original class 0. The new class i (i > 0) corresponds to original class custom_classes[i-1].
There are three num_classes entries in the ava config: model -> roi_head -> bbox_head -> num_classes, data -> train -> num_classes and data -> val -> num_classes.
If num_classes <= 5, the input arg topk of BBoxHeadAVA should be modified. The default value of topk is (3, 5), and all elements of topk must be smaller than num_classes.
Step 3: Make sure all custom classes are in label_file. It is worth mentioning that there are two label files: ava_action_list_v2.1_for_activitynet_2018.pbtxt (contains 60 classes, 20 classes are missing) and ava_action_list_v2.1.pbtxt (contains all 80 classes).
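The rules in Steps 1 and 2 can be checked with a few lines of Python. This is a minimal sketch: build_class_mapping is a hypothetical helper, not an MMAction2 function, and the class ids in the usage line are arbitrary.

```python
def build_class_mapping(custom_classes, topk=(3, 5)):
    """Return (num_classes, new_to_original) for a custom class subset."""
    if 0 in custom_classes:
        raise ValueError("class 0 is reserved and is added automatically")
    num_classes = len(custom_classes) + 1
    # New class 0 -> original class 0; new class i -> custom_classes[i - 1].
    new_to_original = {0: 0}
    for i, original in enumerate(custom_classes, start=1):
        new_to_original[i] = original
    # Every element of topk must be smaller than num_classes.
    bad = [k for k in topk if k >= num_classes]
    if bad:
        raise ValueError(f"topk values {bad} must be < num_classes={num_classes}")
    return num_classes, new_to_original


num_classes, mapping = build_class_mapping([11, 12, 14], topk=(1, 3))
print(num_classes)  # 4
print(mapping)      # {0: 0, 1: 11, 2: 12, 3: 14}
```

With only three custom classes the default topk of (3, 5) would be rejected, which is why the usage line passes a smaller tuple.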
Take slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb as an example: we train on the custom classes whose AP is in the range (0.1, 0.3), namely [3, 6, 10, 27, 29, 38, 41, 48, 51, 53, 54, 59, 61, 64, 70, 72]. Note that these APs are computed with the original checkpoint, which was trained on all 80 classes. The results are listed as follows.
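For the example above, the three num_classes entries could be set as in the fragment below. It uses plain nested dicts that mirror the config paths named in Step 2; passing custom_classes in the data entries is an assumption about where the config expects it, and the rest of the config is omitted.

```python
custom_classes = [3, 6, 10, 27, 29, 38, 41, 48, 51, 53, 54, 59, 61, 64, 70, 72]
num_classes = len(custom_classes) + 1  # 16 custom classes + reserved class 0

# Only the keys discussed in Step 2 are shown; everything else is omitted.
model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=num_classes)))
data = dict(
    train=dict(custom_classes=custom_classes, num_classes=num_classes),
    val=dict(custom_classes=custom_classes, num_classes=num_classes))

print(num_classes)  # 17
```

Since 17 is larger than both elements of the default topk (3, 5), topk does not need to be changed in this case.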
training classes | mAP(custom classes) | config | log | json | ckpt |
---|---|---|---|---|---|
All 80 classes | 0.1948 | slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb | log | json | ckpt |
custom classes | 0.3311 | slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb_custom_classes | log | json | ckpt |
All 80 classes | 0.1864 | slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb.py | log | json | ckpt |
custom classes | 0.3785 | slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb_custom_classes | log | json | ckpt |
Test¶
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test SlowOnly model on AVA and dump the result to a csv file.
python tools/test.py configs/detection/ava/slowonly_kinetics_pretrained_r50_8x8x1_20e_ava_rgb.py checkpoints/SOME_CHECKPOINT.pth --eval mAP --out results.csv
For more details and a description of the optional arguments, refer to the Test a dataset part of getting_started.
Citation¶
@inproceedings{gu2018ava,
title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={6047--6056},
year={2018}
}
@article{duan2020omni,
title={Omni-sourced Webly-supervised Learning for Video Recognition},
author={Duan, Haodong and Zhao, Yue and Xiong, Yuanjun and Liu, Wentao and Lin, Dahua},
journal={arXiv preprint arXiv:2003.13042},
year={2020}
}
LFB¶
Long-term feature banks for detailed video understanding
Abstract¶
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

Results and Models¶
AVA2.1¶
Model | Modality | Pretrained | Backbone | Input | gpus | Resolution | mAP | log | json | ckpt |
---|---|---|---|---|---|---|---|---|---|---|
lfb_nl_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py | RGB | Kinetics-400 | slowonly_r50_4x16x1 | 4x16 | 8 | short-side 256 | 24.11 | log | json | ckpt |
lfb_avg_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py | RGB | Kinetics-400 | slowonly_r50_4x16x1 | 4x16 | 8 | short-side 256 | 20.17 | log | json | ckpt |
lfb_max_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py | RGB | Kinetics-400 | slowonly_r50_4x16x1 | 4x16 | 8 | short-side 256 | 22.15 | log | json | ckpt |
Note
The gpus column gives the number of GPUs used to train the checkpoint. According to the Linear Scaling Rule, you should set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.
We use slowonly_r50_4x16x1 instead of I3D-R50-NL in the original paper as the backbone of LFB, but we have achieved a similar improvement (ours: 20.1 -> 24.11 vs. author: 22.1 -> 25.8).
Because the long-term features are randomly sampled in testing, the test accuracy may vary between runs.
Before training or testing LFB, you need to infer the feature bank with lfb_slowonly_r50_ava_infer.py. For more details on inferring the feature bank, refer to the Train part.
You can also download the long-term feature bank from AVA_train_val_float32_lfb or AVA_train_val_float16_lfb, and then put it under lfb_prefix_path.
The ROIHead now supports single-label classification (i.e. the network outputs at most one label per actor). This can be done by setting multilabel=False during training and setting test_cfg.rcnn.action_thr for testing.
Train¶
a. Infer long-term feature bank for training¶
Before training or testing LFB, you need to infer the long-term feature bank first.
Specifically, run the test script on the training, validation and testing datasets with the config file lfb_slowonly_r50_ava_infer; the shared head LFBInferHead will generate the feature bank. By default the config file only infers the feature bank of the training dataset, so you need to set dataset_mode = 'val' in the config file to infer the feature bank of the validation dataset.
A long-term feature bank file of the AVA training and validation datasets with float32 precision occupies 3.3 GB. If the features are stored with float16 precision, the feature bank occupies 1.65 GB.
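The halving follows directly from the element size: a float16 value takes 2 bytes and a float32 value takes 4. A minimal check using only the Python standard library is sketched below; the helper name feature_bank_size_gb is hypothetical.

```python
import struct


def feature_bank_size_gb(float32_size_gb, fmt):
    """Rescale a float32 feature bank size to another float format."""
    bytes_per_value = struct.calcsize(fmt)  # 'f' -> 4 bytes, 'e' -> 2 bytes
    return float32_size_gb * bytes_per_value / struct.calcsize("f")


print(feature_bank_size_gb(3.3, "e"))  # 1.65
```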
You can use the following commands to infer the feature banks of the AVA training and validation datasets; the feature banks will be stored in lfb_prefix_path/lfb_train.pkl and lfb_prefix_path/lfb_val.pkl.
## set `dataset_mode = 'train'` in lfb_slowonly_r50_ava_infer.py
python tools/test.py configs/detection/lfb/lfb_slowonly_r50_ava_infer.py \
checkpoints/YOUR_BASELINE_CHECKPOINT.pth --eval mAP
## set `dataset_mode = 'val'` in lfb_slowonly_r50_ava_infer.py
python tools/test.py configs/detection/lfb/lfb_slowonly_r50_ava_infer.py \
checkpoints/YOUR_BASELINE_CHECKPOINT.pth --eval mAP
We use the slowonly_r50_4x16x1 checkpoint from slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb to infer the feature bank.
b. Train LFB¶
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train LFB model on AVA with half-precision long-term feature bank.
python tools/train.py configs/detection/lfb/lfb_nl_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py \
--validate --seed 0 --deterministic
For more details and a description of the optional arguments, refer to the Training setting part of getting_started.
Test¶
a. Infer long-term feature bank for testing¶
Before training or testing LFB, you also need to infer the long-term feature bank first. If you have already generated the feature bank file, you can skip this step.
The step is the same as the Infer long-term feature bank for training part in Train.
b. Test LFB¶
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test LFB model on AVA with half-precision long-term feature bank and dump the result to a csv file.
python tools/test.py configs/detection/lfb/lfb_nl_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py \
checkpoints/SOME_CHECKPOINT.pth --eval mAP --out results.csv
For more details, refer to the Test a dataset part of getting_started.
Citation¶
@inproceedings{gu2018ava,
title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={6047--6056},
year={2018}
}
@inproceedings{wu2019long,
title={Long-term feature banks for detailed video understanding},
author={Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={284--293},
year={2019}
}