# Spatio-Temporal Action Detection Models

## ACRN

Actor-centric relation network

### Abstract

Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
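
The core idea can be pictured with a minimal sketch (illustration only, with toy shapes; not the implementation used in this repository): the pooled actor feature is paired with every location of the global scene feature map, a small convolution turns each pair into a relation feature, and the per-location relation features are accumulated for classification.

```python
import torch
import torch.nn as nn

# Toy shapes for illustration.
channels, height, width = 256, 8, 8
actor_feat = torch.randn(1, channels)                  # pooled RoI feature of one actor
scene_feat = torch.randn(1, channels, height, width)   # global scene feature map

# Pair the actor with every scene location by broadcasting and concatenation.
actor_map = actor_feat[:, :, None, None].expand(-1, -1, height, width)
pairs = torch.cat([actor_map, scene_feat], dim=1)      # (1, 2C, H, W)

# A small conv computes a relation feature per location; accumulate over locations.
relation_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)
relation_feat = relation_conv(pairs).mean(dim=(2, 3))  # (1, C), fed to the classifier
print(relation_feat.shape)
```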

### Results and Models

#### AVA2.1

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| :---------------------: | :--------: | :--: | :------: | :------: | :-: | :--------: | :----: | :--: | :-: |
| 8x8x1 | raw | 8 | SlowFast ResNet50 | Kinetics-400 | 27.58 | 15263 | config | ckpt | log |

#### AVA2.2

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| :---------------------: | :--------: | :--: | :------: | :------: | :-: | :--------: | :----: | :--: | :-: |
| 8x8x1 | raw | 8 | SlowFast ResNet50 | Kinetics-400 | 27.63 | 15263 | config | ckpt | log |

Note:

1. The **gpus** indicates the number of GPUs we used to get the checkpoint. According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you may set the learning rate proportional to the batch size if you use different GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu (see the sketch below).
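
As a quick reference, the scaled learning rate can be computed with a small helper (a sketch; the default values mirror the example in the note above):

```python
# Linear Scaling Rule helper (illustration only): scale the learning rate
# linearly with the total batch size, using lr=0.01 at batch size 8 as reference.
def scaled_lr(num_gpus, videos_per_gpu, base_lr=0.01, base_batch_size=8):
    return base_lr * (num_gpus * videos_per_gpu) / base_batch_size


print(scaled_lr(4, 2))   # 0.01
print(scaled_lr(16, 4))  # 0.08
```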

For more details on data preparation, you can refer to [AVA Data Preparation](https://github.com/open-mmlab/mmaction2/tree/master/tools/data/ava/README.md).

### Train

You can use the following command to train a model.

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example: train ACRN using the SlowFast backbone on AVA with a deterministic option.

```shell
python tools/train.py configs/detection/acrn/slowfast-acrn_kinetics400-pretrained-r50_8xb8-8x8x1-cosine-10e_ava21-rgb.py \
    --cfg-options randomness.seed=0 randomness.deterministic=True
```

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test ACRN using the SlowFast backbone on AVA and dump the result to a pkl file.

```shell
python tools/test.py configs/detection/acrn/slowfast-acrn_kinetics400-pretrained-r50_8xb8-8x8x1-cosine-10e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
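
The dumped `result.pkl` is a plain pickle file; a minimal way to inspect it is sketched below (the exact structure of each entry depends on the model and dataset, so treat any field access as an assumption):

```python
import pickle

# Load the predictions dumped by tools/test.py (a plain pickle file).
with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

# The dump is typically a list with one entry per test sample; look at the first
# entry to see which fields (e.g. predicted boxes and per-class scores) it holds.
print(type(results), len(results))
print(results[0])
```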

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

Citation

@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}
@inproceedings{sun2018actor,
  title={Actor-centric relation network},
  author={Sun, Chen and Shrivastava, Abhinav and Vondrick, Carl and Murphy, Kevin and Sukthankar, Rahul and Schmid, Cordelia},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={318--334},
  year={2018}
}

## AVA

Ava: A video dataset of spatio-temporally localized atomic visual actions

### Abstract

This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

```BibTeX
@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={Proceedings of the IEEE international conference on computer vision},
  pages={6202--6211},
  year={2019}
}
```

### Results and Models

#### AVA2.1

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| :---------------------: | :--------: | :--: | :------: | :------: | :-: | :--------: | :----: | :--: | :-: |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 | Kinetics-400 | 20.76 | 8503 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 | Kinetics-700 | 22.77 | 8503 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 21.49 | 11870 | config | ckpt | log |
| 8x8x1 | raw | 8 | SlowOnly ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 23.74 | 25375 | config | ckpt | log |
| 8x8x1 | raw | 8 | SlowOnly ResNet101 | Kinetics-400 | 24.82 | 23477 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowFast ResNet50 | Kinetics-400 | 24.27 | 18616 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowFast ResNet50 (with context) | Kinetics-400 | 25.25 | 18616 | config | ckpt | log |
| 8x8x1 | raw | 8 | SlowFast ResNet50 | Kinetics-400 | 25.73 | 13802 | config | ckpt | log |

#### AVA2.2

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| :---------------------: | :--------: | :--: | :------: | :------: | :-: | :--------: | :----: | :--: | :-: |
| 8x8x1 | raw | 8 | SlowFast ResNet50 | Kinetics-400 | 25.82 | 10484 | config | ckpt | log |
| 8x8x1 | raw | 8 | SlowFast ResNet50 (temporal-max) | Kinetics-400 | 26.32 | 10484 | config | ckpt | log |
| 8x8x1 | raw | 8 | SlowFast ResNet50 (temporal-max, focal loss) | Kinetics-400 | 26.58 | 10484 | config | ckpt | log |

Note:

1. The **gpus** indicates the number of GPUs we used to get the checkpoint. According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you may set the learning rate proportional to the batch size if you use different GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.

2. **With context** indicates that both the RoI feature and the global pooled feature are used for classification, which generally leads to around 1% mAP improvement.


For more details on data preparation, you can refer to [AVA Data Preparation](https://github.com/open-mmlab/mmaction2/tree/master/tools/data/ava/README.md).

### Train

You can use the following command to train a model.

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example: train the SlowOnly model on AVA with a deterministic option.

```shell
python tools/train.py configs/detection/ava/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    --cfg-options randomness.seed=0 randomness.deterministic=True
```

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the SlowOnly model on AVA and dump the result to a pkl file.

```shell
python tools/test.py configs/detection/ava/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Citation

<!-- [DATASET] -->

```BibTeX
@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}
```

```BibTeX
@article{duan2020omni,
  title={Omni-sourced Webly-supervised Learning for Video Recognition},
  author={Duan, Haodong and Zhao, Yue and Xiong, Yuanjun and Liu, Wentao and Lin, Dahua},
  journal={arXiv preprint arXiv:2003.13042},
  year={2020}
}
```
## AVA-Kinetics

[The AVA-Kinetics Localized Human Actions Video Dataset](https://arxiv.org/abs/2005.00214)

<!-- [ALGORITHM] -->

<div align="center">
  <img src="https://user-images.githubusercontent.com/35267818/205511687-8cafd48c-7f4a-4a4c-a8e6-8182635b0411.png" width="800px"/>
</div>

### Abstract

<!-- [ABSTRACT] -->

This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set.

```BibTeX
@article{li2020ava,
  title={The ava-kinetics localized human actions video dataset},
  author={Li, Ang and Thotakuri, Meghana and Ross, David A and Carreira, Jo{\~a}o and Vostrikov, Alexander and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2005.00214},
  year={2020}
}
```

### Results and Models

#### AVA2.2

Currently, we only use the training set of AVA-Kinetics and evaluate on the AVA2.2 validation dataset. The AVA-Kinetics validation dataset will be supported soon.

<table border="1" class="docutils">
<thead>
<tr>
<th style="text-align: center;">frame sampling strategy</th>
<th style="text-align: center;">resolution</th>
<th style="text-align: center;">gpus</th>
<th style="text-align: center;">backbone</th>
<th style="text-align: center;">pretrain</th>
<th style="text-align: center;">mAP</th>
<th style="text-align: center;">config</th>
<th style="text-align: center;">ckpt</th>
<th style="text-align: center;">log</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4x16x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-400</td>
<td style="text-align: center;">24.53</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb_20221205-33e3ca7c.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">4x16x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">25.87</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb_20221205-a07e8c15.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-400</td>
<td style="text-align: center;">26.10</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k400-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb_20221205-8f8dff3b.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k400-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">27.82</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb_20221205-16a01c37.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
</tbody>
</table>

#### Training with tricks

We conduct ablation studies to show the improvements from training tricks, using SlowOnly 8x8x1 pretrained on the Kinetics-700 dataset. The baseline is the last row of the AVA2.2 table above.

<table border="1" class="docutils">
<thead>
<tr>
<th style="text-align: center;">method</th>
<th style="text-align: center;">frame sampling strategy</th>
<th style="text-align: center;">resolution</th>
<th style="text-align: center;">gpus</th>
<th style="text-align: center;">backbone</th>
<th style="text-align: center;">pretrain</th>
<th style="text-align: center;">mAP</th>
<th style="text-align: center;">config</th>
<th style="text-align: center;">ckpt</th>
<th style="text-align: center;">log</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">baseline</td>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">27.82</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb_20221205-16a01c37.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-8x8x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">+ context</td>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">28.31</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50-context_8xb8-8x8x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context_8xb8-8x8x1-10e_ava-kinetics-rgb_20221205-5d514f8c.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context_8xb8-8x8x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">+ temporal max pooling</td>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">28.48</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max_8xb8-8x8x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context-temporal-max_8xb8-8x8x1-10e_ava-kinetics-rgb_20221205-5b5e71eb.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context-temporal-max_8xb8-8x8x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">+ nonlinear head</td>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">29.83</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-10e_ava-kinetics-rgb_20221205-87624265.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">+ focal loss</td>
<td style="text-align: center;">8x8x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">30.33</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-focal-10e_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-focal-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-focal-10e_ava-kinetics-rgb_20221205-37aa8395.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-focal-10e_ava-kinetics-rgb/slowonly_k700-pre-r50-context-temporal-max-nl-head_8xb8-8x8x1-focal-10e_ava-kinetics-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">+ more frames</td>
<td style="text-align: center;">16x4x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50</td>
<td style="text-align: center;">Kinetics-700</td>
<td style="text-align: center;">31.29</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-16x4x1-10e-tricks_ava-kinetics-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-16x4x1-10e-tricks_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-16x4x1-10e-tricks_ava-kinetics-rgb_20221205-dd652f81.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/ava_kinetics/slowonly_k700-pre-r50_8xb8-16x4x1-10e-tricks_ava-kinetics-rgb/slowonly_k700-pre-r50_8xb8-16x4x1-10e-tricks_ava-kinetics-rgb.log">log</a></td>
</tr>
</tbody>
</table>

Note:

1. The **gpus** indicates the number of GPUs we used to get the checkpoint.
2. **+ context** indicates that both the RoI feature and the global pooled feature are used for classification.
3. **+ temporal max pooling** indicates that max pooling is applied in the temporal dimension of the feature.
4. **+ nonlinear head** indicates that a 2-layer MLP is used instead of a linear classifier.

A schematic sketch of these tricks follows.
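
This is a toy illustration with made-up shapes and a plain PyTorch head; the actual implementations live in the linked config files.

```python
import torch
import torch.nn as nn

num_classes, channels = 81, 2048  # 80 AVA action classes plus a background class (assumed)
roi_feat = torch.randn(4, channels, 8, 7, 7)     # per-actor RoI feature (N, C, T, H, W)
global_feat = torch.randn(4, channels, 8, 7, 7)  # global feature map of the clip

# + temporal max pooling: max over the temporal dimension instead of averaging.
roi_pooled = roi_feat.amax(dim=2).mean(dim=(2, 3))        # (N, C)
global_pooled = global_feat.amax(dim=2).mean(dim=(2, 3))  # (N, C)

# + context: concatenate the RoI feature with the global pooled feature.
feat = torch.cat([roi_pooled, global_pooled], dim=1)      # (N, 2C)

# + nonlinear head: a 2-layer MLP instead of a single linear classifier.
head = nn.Sequential(
    nn.Linear(2 * channels, channels),
    nn.ReLU(),
    nn.Linear(channels, num_classes),
)
logits = head(feat)
print(logits.shape)  # torch.Size([4, 81])
```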

For more details on data preparation, you can refer to [AVA-Kinetics Data Preparation](https://github.com/open-mmlab/mmaction2/tree/master/tools/data/ava_kinetics/README.md).

### Train

You can use the following command to train a model.

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example: train the SlowOnly model on AVA-Kinetics with a deterministic option.

```shell
python tools/train.py configs/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb.py \
    --cfg-options randomness.seed=0 randomness.deterministic=True
```

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the SlowOnly model on AVA-Kinetics and dump the result to a pkl file.

```shell
python tools/test.py configs/detection/ava_kinetics/slowonly_k400-pre-r50_8xb8-4x16x1-10e_ava-kinetics-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Citation

<!-- [DATASET] -->

```BibTeX
@article{li2020ava,
  title={The ava-kinetics localized human actions video dataset},
  author={Li, Ang and Thotakuri, Meghana and Ross, David A and Carreira, Jo{\~a}o and Vostrikov, Alexander and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2005.00214},
  year={2020}
}
```
## LFB

[Long-term feature banks for detailed video understanding](https://openaccess.thecvf.com/content_CVPR_2019/html/Wu_Long-Term_Feature_Banks_for_Detailed_Video_Understanding_CVPR_2019_paper.html)

<!-- [ALGORITHM] -->

### Abstract

<!-- [ABSTRACT] -->

To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/34324155/143016220-21d90fb3-fd9f-499c-820f-f6c421bda7aa.png" width="800"/>
</div>
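
The idea can be summarized with a minimal sketch (toy shapes, illustration only; the repository uses a dedicated feature bank operator): the short-clip feature attends over a bank of features collected across a long temporal window, and the attended summary is fused back for classification.

```python
import torch

channels, bank_size = 512, 60            # e.g. one bank entry per second over a minute
clip_feat = torch.randn(1, channels)     # feature of the current short clip
bank = torch.randn(bank_size, channels)  # long-term feature bank

# Attend over the bank with the clip feature as the query, then fuse.
attn = torch.softmax(clip_feat @ bank.t() / channels ** 0.5, dim=-1)  # (1, bank_size)
long_term = attn @ bank                                               # (1, C)
fused = torch.cat([clip_feat, long_term], dim=1)                      # (1, 2C)
print(fused.shape)
```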

### Results and Models

#### AVA2.1

<table border="1" class="docutils">
<thead>
<tr>
<th style="text-align: center;">frame sampling strategy</th>
<th style="text-align: center;">resolution</th>
<th style="text-align: center;">gpus</th>
<th style="text-align: center;">backbone</th>
<th style="text-align: center;">pretrain</th>
<th style="text-align: center;">mAP</th>
<th style="text-align: center;">gpu_mem(M)</th>
<th style="text-align: center;">config</th>
<th style="text-align: center;">ckpt</th>
<th style="text-align: center;">log</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4x16x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50 (with Nonlocal LFB)</td>
<td style="text-align: center;">Kinetics-400</td>
<td style="text-align: center;">24.05</td>
<td style="text-align: center;">8620</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb_20220906-4c5b9f25.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.log">log</a></td>
</tr>
<tr>
<td style="text-align: center;">4x16x1</td>
<td style="text-align: center;">raw</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">SlowOnly ResNet50 (with Max LFB)</td>
<td style="text-align: center;">Kinetics-400</td>
<td style="text-align: center;">22.15</td>
<td style="text-align: center;">8425</td>
<td style="text-align: center;"><a href="https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/lfb/slowonly-lfb-max_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py">config</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/lfb/slowonly-lfb-max_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb/slowonly-lfb-max_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb_20220906-4963135b.pth">ckpt</a></td>
<td style="text-align: center;"><a href="https://download.openmmlab.com/mmaction/v1.0/detection/lfb/slowonly-lfb-max_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb/slowonly-lfb-max_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.log">log</a></td>
</tr>
</tbody>
</table>

Note:

1. The **gpus** indicates the number of GPUs we used to get the checkpoint.
   According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you may set the learning rate proportional to the batch size if you use different GPUs or videos per GPU,
   e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
2. We use `slowonly_r50_4x16x1` instead of `I3D-R50-NL` in the original paper as the backbone of LFB, but we achieve a similar improvement (ours: 20.1 -> 24.05 vs. authors: 22.1 -> 25.8).
3. Because the long-term features are randomly sampled in testing, the test accuracy may vary slightly between runs.
4. Before training or testing LFB, you need to infer the feature bank with [slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py](https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/lfb/slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py). For more details on inferring the feature bank, you can refer to the [Train](#train) part.
5. You can also download the long-term feature bank from [AVA_train_val_float32_lfb](https://download.openmmlab.com/mmaction/detection/lfb/AVA_train_val_float32_lfb.rar) or [AVA_train_val_float16_lfb](https://download.openmmlab.com/mmaction/detection/lfb/AVA_train_val_float16_lfb.rar), and then put it under `lfb_prefix_path`.
6. The ROIHead now supports single-label classification (i.e. the network outputs at most one label per actor). This can be done by setting `multilabel=False` during training and setting `test_cfg.rcnn.action_thr` for testing (see the sketch below).
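
The difference between the two modes can be sketched as follows (conceptual illustration only; the actual behaviour is controlled by `multilabel` and `test_cfg.rcnn.action_thr` in the config):

```python
import torch

logits = torch.randn(3, 81)  # toy scores for 3 actors over 81 classes

# Multi-label (default): independent sigmoids, keep every class above the threshold.
action_thr = 0.5
multilabel_pred = torch.sigmoid(logits) > action_thr  # (3, 81) boolean mask

# Single-label: softmax over classes, keep at most one action per actor.
singlelabel_pred = torch.softmax(logits, dim=1).argmax(dim=1)  # (3,) class indices
print(multilabel_pred.shape, singlelabel_pred.shape)
```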

### Train

#### a. Infer long-term feature bank for training

Before training or testing LFB, you need to infer the long-term feature bank first.

Specifically, run the test on the training and validation datasets with the config file [slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py](https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/lfb/slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py), and the shared head [LFBInferHead](https://github.com/open-mmlab/mmaction2/tree/master/mmaction/models/roi_heads/shared_heads/lfb_infer_head.py) will generate the feature bank. By default the config file only infers the feature bank of the training dataset; set `dataset_mode = 'val'` in the config file to infer the feature bank of the validation dataset.

The long-term feature bank of the AVA training and validation datasets occupies 3.3 GB with float32 precision, or 1.65 GB if the features are stored with float16 precision.

You can use the following commands to infer the feature banks of the AVA training and validation datasets; they will be stored in `lfb_prefix_path/lfb_train.pkl` and `lfb_prefix_path/lfb_val.pkl`.

```shell
# set `dataset_mode = 'train'` in slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py
python tools/test.py configs/detection/lfb/slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth --eval mAP

# set `dataset_mode = 'val'` in slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py
python tools/test.py configs/detection/lfb/slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth --eval mAP
```

We use the [slowonly_r50_4x16x1 checkpoint](https://download.openmmlab.com/mmaction/detection/ava/slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb/slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb_20201217-40061d5f.pth) from [slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb](https://github.com/open-mmlab/mmaction2/tree/master/configs/detection/ava/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb.py) to infer the feature bank.
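
After inference, you can sanity-check that the feature bank files exist and have a plausible size (a sketch; `data/ava/lfb` below is a hypothetical value of `lfb_prefix_path`):

```python
import os

# The text above reports ~3.3 GB for float32 and ~1.65 GB for float16 precision.
lfb_prefix_path = 'data/ava/lfb'  # hypothetical path; use your own lfb_prefix_path
for name in ('lfb_train.pkl', 'lfb_val.pkl'):
    path = os.path.join(lfb_prefix_path, name)
    if os.path.exists(path):
        print(f'{name}: {os.path.getsize(path) / 1024 ** 3:.2f} GB')
    else:
        print(f'{name}: not found')
```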

#### b. Train LFB

You can use the following command to train a model.

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example: train the LFB model on AVA with the half-precision long-term feature bank.

```shell
python tools/train.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
    --cfg-options randomness.seed=0 randomness.deterministic=True
```

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Test

#### a. Infer long-term feature bank for testing

Before training or testing LFB, you also need to infer the long-term feature bank first. If you have already generated the feature bank files, you can skip this step.

The steps are the same as in the **Infer long-term feature bank for training** part of [Train](#train).

#### b. Test LFB

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the LFB model on AVA with the half-precision long-term feature bank and dump the result to a pkl file.

```shell
python tools/test.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](en/user_guides/4_train_test.md).

### Citation

<!-- [DATASET] -->

```BibTeX
@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}
```

```BibTeX
@inproceedings{wu2019long,
  title={Long-term feature banks for detailed video understanding},
  author={Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={284--293},
  year={2019}
}
```