Action Localization Models¶
BMN¶
Bmn: Boundary-matching network for temporal action proposal generation
Abstract¶
Temporal action proposal generation is a challenging and promising task which aims to locate temporal regions in real-world videos where actions or events may occur. Current bottom-up proposal generation methods can generate proposals with precise boundaries, but cannot efficiently generate sufficiently reliable confidence scores for retrieving proposals. To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denotes a proposal as a matching pair of starting and ending boundaries and combines all densely distributed BM pairs into the BM confidence map. Based on the BM mechanism, we propose an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously. The two branches of BMN are jointly trained in a unified framework. We conduct experiments on two challenging datasets, THUMOS-14 and ActivityNet-1.3, where BMN shows significant performance improvement with remarkable efficiency and generalizability. Further, combined with an existing action classifier, BMN achieves state-of-the-art temporal action detection performance.

Results and Models¶
ActivityNet feature¶
| config | feature | gpus | AR@100 | AUC | AP@0.5 | AP@0.75 | AP@0.95 | mAP | gpu_mem(M) | iter time(s) | ckpt | log | json |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bmn_400x100_2x8_9e_activitynet_feature | cuhk_mean_100 | 2 | 75.28 | 67.22 | 42.47 | 31.31 | 9.92 | 30.34 | 5420 | 3.27 | ckpt | log | json |
| | mmaction_video | 2 | 75.43 | 67.22 | 42.62 | 31.56 | 10.86 | 30.77 | 5420 | 3.27 | ckpt | log | json |
| | mmaction_clip | 2 | 75.35 | 67.38 | 43.08 | 32.19 | 10.73 | 31.15 | 5420 | 3.27 | ckpt | log | json |
| BMN-official (for reference)* | cuhk_mean_100 | - | 75.27 | 67.49 | 42.22 | 30.98 | 9.22 | 30.00 | - | - | - | - | - |
Note
The gpus column indicates the number of GPUs used to obtain the checkpoint. According to the Linear Scaling Rule, you may set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/gpu and lr=0.08 for 16 GPUs x 4 videos/gpu (see the sketch after this note).
For the feature column, cuhk_mean_100 denotes the widely used CUHK ActivityNet feature extracted by anet2016-cuhk; mmaction_video and mmaction_clip denote features extracted by MMAction, with a video-level or clip-level ActivityNet-finetuned model, respectively.
We evaluate the action detection performance of BMN, using the anet_cuhk_2017 submission for the ActivityNet 2017 Untrimmed Video Classification Track to assign a label to each action proposal.
*We train BMN with the official repo, and evaluate its proposal generation and action detection performance with anet_cuhk_2017 for label assignment.
For more details on data preparation, you can refer to ActivityNet feature in Data Preparation.
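As a quick illustration of the scaling arithmetic, here is a minimal shell sketch; the reference setting (lr=0.01 at 4 GPUs x 2 videos/gpu, i.e. 8 videos per iteration) is taken from the note above, while the 16 GPUs x 4 videos/gpu target is only an example.
# reference: lr=0.01 at 4 GPUs x 2 videos/gpu -> total batch size 8
# target:    16 GPUs x 4 videos/gpu           -> total batch size 64
REF_LR=0.01; REF_BATCH=8; NEW_BATCH=64
python -c "print(${REF_LR} * ${NEW_BATCH} / ${REF_BATCH})"  # prints 0.08
Set the resulting value as the learning rate in your config before training.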
Train¶
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train the BMN model on the ActivityNet features dataset.
python tools/train.py configs/localization/bmn/bmn_400x100_2x8_9e_activitynet_feature.py
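If you need to customize the run, below is a hedged example with a few common optional arguments; --work-dir, --validate and --seed are standard tools/train.py options in MMAction2, but run python tools/train.py -h to confirm the exact set supported by your version.
python tools/train.py configs/localization/bmn/bmn_400x100_2x8_9e_activitynet_feature.py \
    --work-dir work_dirs/bmn_400x100_2x8_9e_activitynet_feature \
    --validate \
    --seed 0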
For more details and optional arguments, you can refer to the Training setting part in getting_started.
Test¶
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test BMN on the ActivityNet features dataset.
## Note: If evaluation is enabled, make sure the annotation file for the test data contains ground truth.
python tools/test.py configs/localization/bmn/bmn_400x100_2x8_9e_activitynet_feature.py checkpoints/SOME_CHECKPOINT.pth --eval AR@AN --out results.json
You can also test the action detection performance of the model, with the anet_cuhk_2017 prediction file and the generated proposal file (results.json in the last command).
python tools/analysis/report_map.py --proposal path/to/proposal_file
Note
(Optional) You can use the following command to generate a formatted proposal file, which will be fed into the action classifier (currently only SSN and P-GCN are supported; TSN, I3D, etc. are not) to get the classification results of the proposals.
python tools/data/activitynet/convert_proposal_format.py
For more details and optional arguments, you can refer to the Test a dataset part in getting_started.
Citation¶
@inproceedings{lin2019bmn,
title={Bmn: Boundary-matching network for temporal action proposal generation},
author={Lin, Tianwei and Liu, Xiao and Li, Xin and Ding, Errui and Wen, Shilei},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
pages={3889--3898},
year={2019}
}
@article{zhao2017cuhk,
title={Cuhk \& ethz \& siat submission to activitynet challenge 2017},
author={Zhao, Y and Zhang, B and Wu, Z and Yang, S and Zhou, L and Yan, S and Wang, L and Xiong, Y and Lin, D and Qiao, Y and others},
journal={arXiv preprint arXiv:1710.08011},
volume={8},
year={2017}
}
BSN¶
Bsn: Boundary sensitive network for temporal action proposal generation
Abstract¶
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and a high proportion of irrelevant content. This problem requires methods that not only generate proposals with precise temporal boundaries, but also retrieve proposals that cover ground-truth action instances with high recall and high overlap using relatively few proposals. To address these difficulties, we introduce an effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts a “local to global” fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries into proposals. Globally, with Boundary-Sensitive Proposal features, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, where BSN outperforms other state-of-the-art temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that, by combining existing action classifiers, our method significantly improves the state-of-the-art temporal action detection performance.

Results and Models¶
ActivityNet feature¶
| config | feature | gpus | pretrain | AR@100 | AUC | gpu_mem(M) | iter time(s) | ckpt | log | json |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bsn_400x100_1x16_20e_activitynet_feature | cuhk_mean_100 | 1 | None | 74.66 | 66.45 | 41(TEM)+25(PEM) | 0.074(TEM)+0.036(PEM) | ckpt_tem ckpt_pem | log_tem log_pem | json_tem json_pem |
| | mmaction_video | 1 | None | 74.93 | 66.74 | 41(TEM)+25(PEM) | 0.074(TEM)+0.036(PEM) | ckpt_tem ckpt_pem | log_tem log_pem | json_tem json_pem |
| | mmaction_clip | 1 | None | 75.19 | 66.81 | 41(TEM)+25(PEM) | 0.074(TEM)+0.036(PEM) | ckpt_tem ckpt_pem | log_tem log_pem | json_tem json_pem |
Note
The gpus column indicates the number of GPUs used to obtain the checkpoint. According to the Linear Scaling Rule, you may set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/gpu and lr=0.08 for 16 GPUs x 4 videos/gpu.
For the feature column, cuhk_mean_100 denotes the widely used CUHK ActivityNet feature extracted by anet2016-cuhk; mmaction_video and mmaction_clip denote features extracted by MMAction, with a video-level or clip-level ActivityNet-finetuned model, respectively.
For more details on data preparation, you can refer to ActivityNet feature in Data Preparation.
Train¶
You can use the following commands to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Examples:
train BSN(TEM) on the ActivityNet features dataset.
python tools/train.py configs/localization/bsn/bsn_tem_400x100_1x16_20e_activitynet_feature.py
train BSN(PEM) on PGM results.
python tools/train.py configs/localization/bsn/bsn_pem_400x100_1x16_20e_activitynet_feature.py
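Since PEM is trained on the output of PGM, which in turn consumes TEM predictions, the modules must be processed in order. Below is a minimal sketch of the full training workflow, built only from the configs and scripts referenced on this page; the checkpoint path is a placeholder.
# 1. train the temporal evaluation module (TEM)
python tools/train.py configs/localization/bsn/bsn_tem_400x100_1x16_20e_activitynet_feature.py
# 2. run TEM inference to produce boundary probabilities
python tools/test.py configs/localization/bsn/bsn_tem_400x100_1x16_20e_activitynet_feature.py checkpoints/SOME_CHECKPOINT.pth
# 3. generate candidate proposals and BSP features with PGM
python tools/misc/bsn_proposal_generation.py configs/localization/bsn/bsn_pgm_400x100_activitynet_feature.py --mode train
# 4. train the proposal evaluation module (PEM) on the PGM results
python tools/train.py configs/localization/bsn/bsn_pem_400x100_1x16_20e_activitynet_feature.py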
For more details and optional arguments, you can refer to the Training setting part in getting_started.
Inference¶
You can use the following commands to run inference with a model.
For TEM Inference
## Note: TEM inference results cannot be evaluated directly.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
For PGM Inference
python tools/misc/bsn_proposal_generation.py ${CONFIG_FILE} [--mode ${MODE}]
For PEM Inference
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Examples:
Run inference on BSN(TEM) with a pretrained model.
python tools/test.py configs/localization/bsn/bsn_tem_400x100_1x16_20e_activitynet_feature.py checkpoints/SOME_CHECKPOINT.pth
Run inference on BSN(PGM) with a pretrained model.
python tools/misc/bsn_proposal_generation.py configs/localization/bsn/bsn_pgm_400x100_activitynet_feature.py --mode train
Run inference on BSN(PEM) with the evaluation metric ‘AR@AN’ and output the results.
## Note: If evaluation is enabled, make sure the annotation file for the test data contains ground truth.
python tools/test.py configs/localization/bsn/bsn_pem_400x100_1x16_20e_activitynet_feature.py checkpoints/SOME_CHECKPOINT.pth --eval AR@AN --out results.json
Test¶
You can use the following commands to test a model.
TEM
## Note: TEM test results cannot be evaluated directly.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
PGM
python tools/misc/bsn_proposal_generation.py ${CONFIG_FILE} [--mode ${MODE}]
PEM
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Examples:
Test a TEM model on the ActivityNet dataset.
python tools/test.py configs/localization/bsn/bsn_tem_400x100_1x16_20e_activitynet_feature.py checkpoints/SOME_CHECKPOINT.pth
Test a PGM model on the ActivityNet dataset.
python tools/misc/bsn_proposal_generation.py configs/localization/bsn/bsn_pgm_400x100_activitynet_feature.py --mode test
Test a PEM model with the evaluation metric ‘AR@AN’ and output the results.
python tools/test.py configs/localization/bsn/bsn_pem_400x100_1x16_20e_activitynet_feature.py checkpoints/SOME_CHECKPOINT.pth --eval AR@AN --out results.json
Note
(Optional) You can use the following command to generate a formatted proposal file, which will be fed into the action classifier (currently only SSN and P-GCN are supported; TSN, I3D, etc. are not) to get the classification results of the proposals.
python tools/data/activitynet/convert_proposal_format.py
For more details and optional arguments, you can refer to the Test a dataset part in getting_started.
Citation¶
@inproceedings{lin2018bsn,
title={Bsn: Boundary sensitive network for temporal action proposal generation},
author={Lin, Tianwei and Zhao, Xu and Su, Haisheng and Wang, Chongjing and Yang, Ming},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
pages={3--19},
year={2018}
}
SSN¶
Temporal Action Detection With Structured Segment Networks
Abstract¶
Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG), is devised to generate high-quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.

Results and Models¶
config | gpus | backbone | pretrain | mAP@0.3 | mAP@0.4 | mAP@0.5 | reference mAP@0.3 | reference mAP@0.4 | reference mAP@0.5 | gpu_mem(M) | ckpt | log | json | reference ckpt | reference json |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ssn_r50_450e_thumos14_rgb | 8 | ResNet50 | ImageNet | 29.37 | 22.15 | 15.69 | 27.61 | 21.28 | 14.57 | 6352 | ckpt | log | json | ckpt | json |
Note
The gpus column indicates the number of GPUs used to obtain the checkpoint. According to the Linear Scaling Rule, you may set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/gpu and lr=0.08 for 16 GPUs x 4 videos/gpu.
Since SSN uses different structured temporal pyramid pooling methods at training and testing, use ssn_r50_450e_thumos14_rgb_train for training and ssn_r50_450e_thumos14_rgb_test for testing (see the sketch after this note).
We evaluate the action detection performance of SSN using action proposals from TAG. For more details on data preparation, you can refer to thumos14 TAG proposals in Data Preparation.
The reference SSN is evaluated with the ResNet50 backbone in MMAction, which is the same backbone as ours. Note that the original setting of MMAction SSN uses the BNInception backbone.
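As a minimal sketch of the train/test split described in the note above (the checkpoint path is a placeholder):
python tools/train.py configs/localization/ssn/ssn_r50_450e_thumos14_rgb_train.py
python tools/test.py configs/localization/ssn/ssn_r50_450e_thumos14_rgb_test.py checkpoints/SOME_CHECKPOINT.pth --eval mAP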
Train¶
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train the SSN model on the THUMOS14 dataset.
python tools/train.py configs/localization/ssn/ssn_r50_450e_thumos14_rgb_train.py
For more details and optional arguments, you can refer to the Training setting part in getting_started.
Test¶
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test SSN on the THUMOS14 dataset.
## Note: If evaluation is enabled, make sure the annotation file for the test data contains ground truth.
python tools/test.py configs/localization/ssn/ssn_r50_450e_thumos14_rgb_test.py checkpoints/SOME_CHECKPOINT.pth --eval mAP
For more details and optional arguments, you can refer to the Test a dataset part in getting_started.
Citation¶
@InProceedings{Zhao_2017_ICCV,
author = {Zhao, Yue and Xiong, Yuanjun and Wang, Limin and Wu, Zhirong and Tang, Xiaoou and Lin, Dahua},
title = {Temporal Action Detection With Structured Segment Networks},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}