Spatio-Temporal Action Detection Models

ACRN

Introduction

@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}

@inproceedings{sun2018actor,
  title={Actor-centric relation network},
  author={Sun, Chen and Shrivastava, Abhinav and Vondrick, Carl and Murphy, Kevin and Sukthankar, Rahul and Schmid, Cordelia},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={318--334},
  year={2018}
}

Model Zoo

AVA2.1

Config	Modality	Pretrain	Backbone	Input	GPUs	mAP	log	json	ckpt
slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava_rgb	RGB	Kinetics-400	ResNet50	32x2	8	27.1	log	json	ckpt

AVA2.2

Config	Modality	Pretrain	Backbone	Input	GPUs	mAP	log	json	ckpt
slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb	RGB	Kinetics-400	ResNet50	32x2	8	27.8	log	json	ckpt

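Notes:

The GPUs column gives the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training with 8 GPUs. Following the Linear Scaling Rule, the learning rate should be scaled in proportion to the total batch size when a different number of GPUs or a different number of videos per GPU is used, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.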
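
The scaling rule in the note can be sketched in Python as follows. The function name `scale_lr` and its parameters are hypothetical helpers for illustration and are not part of MMAction2; the reference point (lr=0.01 at 4 GPUs x 2 videos per GPU) is the one quoted in the note.

```python
def scale_lr(base_lr, base_gpus, base_videos_per_gpu, gpus, videos_per_gpu):
    """Scale a learning rate linearly with the total batch size.

    Hypothetical helper: the total batch size is GPUs x videos per GPU.
    """
    base_batch = base_gpus * base_videos_per_gpu
    new_batch = gpus * videos_per_gpu
    return base_lr * new_batch / base_batch


# Reference point from the note: lr=0.01 at 4 GPUs x 2 videos per GPU.
lr_16x4 = scale_lr(0.01, 4, 2, gpus=16, videos_per_gpu=4)
lr_8x2 = scale_lr(0.01, 4, 2, gpus=8, videos_per_gpu=2)
print(f"16 GPUs x 4 videos/gpu: lr={lr_16x4:.4f}")
print(f"8 GPUs x 2 videos/gpu: lr={lr_8x2:.4f}")
```

The learning rate in the config would then be set to the value computed for the actual hardware setup.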

For details on dataset preparation, see Data Preparation.

How to Train

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train ACRN with a SlowFast backbone on AVA, with periodic validation.

python tools/train.py configs/detection/acrn/slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb.py --validate

For more training details, see the Training setting section of the Getting Started guide.

How to Test

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test ACRN with a SlowFast backbone on AVA and save the results to a csv file.

python tools/test.py configs/detection/acrn/slowfast_acrn_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb.py checkpoints/SOME_CHECKPOINT.pth --eval mAP --out results.csv

For more testing details, see the Test a dataset section of the Getting Started guide.

AVA

Introduction

@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}

@article{duan2020omni,
  title={Omni-sourced Webly-supervised Learning for Video Recognition},
  author={Duan, Haodong and Zhao, Yue and Xiong, Yuanjun and Liu, Wentao and Lin, Dahua},
  journal={arXiv preprint arXiv:2003.13042},
  year={2020}
}

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={Proceedings of the IEEE international conference on computer vision},
  pages={6202--6211},
  year={2019}
}

Model Zoo

AVA2.1

Config	Modality	Pretrain	Backbone	Input	GPUs	Resolution	mAP	log	json	ckpt
slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb	RGB	Kinetics-400	ResNet50	4x16	8	short-side 256	20.1	log	json	ckpt
slowonly_omnisource_pretrained_r50_4x16x1_20e_ava_rgb	RGB	OmniSource	ResNet50	4x16	8	short-side 256	21.8	log	json	ckpt
slowonly_nl_kinetics_pretrained_r50_4x16x1_10e_ava_rgb	RGB	Kinetics-400	ResNet50	4x16	8	short-side 256	21.75	log	json	ckpt
slowonly_nl_kinetics_pretrained_r50_8x8x1_10e_ava_rgb	RGB	Kinetics-400	ResNet50	8x8	8x2	short-side 256	23.79	log	json	ckpt
slowonly_kinetics_pretrained_r101_8x8x1_20e_ava_rgb	RGB	Kinetics-400	ResNet101	8x8	8x2	short-side 256	24.6	log	json	ckpt
slowonly_omnisource_pretrained_r101_8x8x1_20e_ava_rgb	RGB	OmniSource	ResNet101	8x8	8x2	short-side 256	25.9	log	json	ckpt
slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb	RGB	Kinetics-400	ResNet50	32x2	8x2	short-side 256	24.4	log	json	ckpt
slowfast_context_kinetics_pretrained_r50_4x16x1_20e_ava_rgb	RGB	Kinetics-400	ResNet50	32x2	8x2	short-side 256	25.4	log	json	ckpt
slowfast_kinetics_pretrained_r50_8x8x1_20e_ava_rgb	RGB	Kinetics-400	ResNet50	32x2	8x2	short-side 256	25.5	log	json	ckpt

AVA2.2

Config	Modality	Pretrain	Backbone	Input	GPUs	mAP	log	json	ckpt
slowfast_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb	RGB	Kinetics-400	ResNet50	32x2	8	26.1	log	json	ckpt
slowfast_temporal_max_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb	RGB	Kinetics-400	ResNet50	32x2	8	26.4	log	json	ckpt
slowfast_temporal_max_focal_alpha3_gamma1_kinetics_pretrained_r50_8x8x1_cosine_10e_ava22_rgb	RGB	Kinetics-400	ResNet50	32x2	8	26.8	log	json	ckpt

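Notes:

The GPUs column gives the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training with 8 GPUs. Following the Linear Scaling Rule, the learning rate should be scaled in proportion to the total batch size when a different number of GPUs or a different number of videos per GPU is used, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
Context means that both the RoI feature and the global feature are used for classification, which brings an improvement of about 1% mAP.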

For details on dataset preparation, see Data Preparation.

How to Train

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train SlowOnly on AVA, with periodic validation.

python tools/train.py configs/detection/ava/slowonly_kinetics_pretrained_r50_8x8x1_20e_ava_rgb.py --validate

For more training details, see the Training setting section of the Getting Started guide.

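Train Custom Classes From the AVA Dataset

Users can train on a custom subset of the AVA classes. The class distribution in AVA is highly imbalanced: some classes have more than 100,000 samples (stand, listen to (a person), talk to (e.g., self, a person, a group), watch (a person)), while others have few (half of the classes have fewer than 500 samples). In most cases, training only on the classes with fewer samples yields better accuracy on those classes.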

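Training custom classes from the AVA dataset takes 3 steps:

1. Choose the classes to train from the original class list and put them in the custom_classes field of the config. Class 0 does not denote a specific action and should not be selected.
2. Set num_classes = len(custom_classes) + 1.
   - In the new class-to-index mapping, index 0 still corresponds to the original class 0, and index i (i > 0) corresponds to the original class custom_classes[i-1].
   - num_classes must be changed in 3 places in the config: model -> roi_head -> bbox_head -> num_classes, data -> train -> num_classes, and data -> val -> num_classes.
   - If num_classes <= 5, the topk parameter of BBoxHeadAVA in the config should be changed. The default value of topk is (3, 5), and every element of topk should be smaller than num_classes.
3. Make sure all custom classes are in label_file.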
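
The steps above can be sketched in Python as follows. This is a minimal illustration, not an actual MMAction2 config: the helper names `new_to_original` and `topk_is_valid` are hypothetical, `config_fragment` shows only the fields named in the steps, and the class list is the one quoted for the example config in this section.

```python
# Class list quoted for the example config in this section.
custom_classes = [3, 6, 10, 27, 29, 38, 41, 48, 51, 53, 54, 59, 61, 64, 70, 72]

# Step 1: class 0 is not a real action and must not be selected.
assert 0 not in custom_classes

# Step 2: one extra slot for index 0.
num_classes = len(custom_classes) + 1


def new_to_original(classes):
    """Map new indices to original AVA class ids (hypothetical helper)."""
    mapping = {0: 0}
    for new_id, original_id in enumerate(classes, start=1):
        mapping[new_id] = original_id
    return mapping


def topk_is_valid(topk, n_classes):
    """Every element of topk should be smaller than num_classes."""
    return all(k < n_classes for k in topk)


label_map = new_to_original(custom_classes)

# Only the fields named in the steps are shown; a real config has many more.
config_fragment = dict(
    model=dict(roi_head=dict(bbox_head=dict(num_classes=num_classes, topk=(3, 5)))),
    data=dict(
        train=dict(num_classes=num_classes),
        val=dict(num_classes=num_classes),
    ),
)

print(num_classes, label_map[1], topk_is_valid((3, 5), num_classes))
```

With 16 custom classes the default topk of (3, 5) stays valid; with 4 or fewer custom classes (num_classes <= 5) it would have to be reduced.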

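Take slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb as an example. This config trains all classes whose AP lies in (0.1, 0.3) (the AP here is that of the model trained on all 80 AVA classes), namely [3, 6, 10, 27, 29, 38, 41, 48, 51, 53, 54, 59, 61, 64, 70, 72]. The table below lists the accuracy of the models trained on the custom classes: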

Training classes	mAP (custom classes)	Config	log	json	ckpt
All 80 classes	0.1948	slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb	log	json	ckpt
Custom classes	0.3311	slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb_custom_classes	log	json	ckpt
All 80 classes	0.1864	slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb	log	json	ckpt
Custom classes	0.3785	slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb_custom_classes	log	json	ckpt

How to Test

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowOnly model on AVA and save the results to a csv file.

python tools/test.py configs/detection/ava/slowonly_kinetics_pretrained_r50_8x8x1_20e_ava_rgb.py checkpoints/SOME_CHECKPOINT.pth --eval mAP --out results.csv

For more testing details, see the Test a dataset section of the Getting Started guide.
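
A results file written with `--out results.csv` can be inspected with Python's standard `csv` module. The sketch below assumes each row follows the AVA-style layout `video_id, timestamp, x1, y1, x2, y2, action_id, score` with no header row; that layout is an assumption and should be checked against the actual file. The helper name `load_detections` and the sample rows are hypothetical.

```python
import csv
import io
from collections import namedtuple

Detection = namedtuple("Detection", "video_id timestamp box action_id score")


def load_detections(fileobj):
    """Parse detection rows from an open csv file (assumed AVA-style layout)."""
    detections = []
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip blank lines
        video_id, timestamp = row[0], row[1]
        box = tuple(float(v) for v in row[2:6])
        detections.append(
            Detection(video_id, timestamp, box, int(row[6]), float(row[7]))
        )
    return detections


# Made-up sample rows standing in for a real results.csv.
sample = io.StringIO(
    "vid_a,0902,0.100,0.200,0.500,0.900,12,0.870\n"
    "vid_a,0902,0.100,0.200,0.500,0.900,80,0.120\n"
)
dets = load_detections(sample)
best = max(dets, key=lambda d: d.score)
print(best.action_id, best.score)
```

For a real file, the same helper could be called as `load_detections(open("results.csv", newline=""))`.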

LFB

Introduction

@inproceedings{wu2019long,
  title={Long-term feature banks for detailed video understanding},
  author={Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={284--293},
  year={2019}
}

Model Zoo

AVA2.1

Config	Modality	Pretrain	Backbone	Input	GPUs	Resolution	mAP	log	json	ckpt
lfb_nl_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py	RGB	Kinetics-400	slowonly_r50_4x16x1	4x16	8	short-side 256	24.11	log	json	ckpt
lfb_avg_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py	RGB	Kinetics-400	slowonly_r50_4x16x1	4x16	8	short-side 256	20.17	log	json	ckpt
lfb_max_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py	RGB	Kinetics-400	slowonly_r50_4x16x1	4x16	8	short-side 256	22.15	log	json	ckpt

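Notes:

The GPUs column gives the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training with 8 GPUs. Following the Linear Scaling Rule, the learning rate should be scaled in proportion to the total batch size when a different number of GPUs or a different number of videos per GPU is used, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
This LFB model does not yet use the I3D-R50-NL backbone from the original paper. It uses slowonly_r50_4x16x1 instead and obtains a comparable improvement (this model: 20.1 -> 24.11; original paper: 22.1 -> 25.8).
Because the long-term features are randomly sampled at test time, the test accuracy may vary slightly between runs.
Before training or testing LFB, users need to infer the long-term feature bank with the config lfb_slowonly_r50_ava_infer.py. For more details on inferring the feature bank, see the Train section.
Users can also download the float32 or float16 long-term feature bank directly from AVA_train_val_float32_lfb or AVA_train_val_float16_lfb and place it under lfb_prefix_path.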

Train

a. Infer the long-term feature bank for training LFB

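Before training or testing LFB, users first need to infer the long-term feature bank.

Specifically, run model testing on the training, validation, and test sets with the config lfb_slowonly_r50_ava_infer.

By default, the config infers the feature bank for the training set; set dataset_mode to 'val' to infer the feature bank for the validation set. During inference, the shared head LFBInferHead generates the long-term feature bank.

The float32 long-term feature bank files for the AVA training and validation sets take about 3.3 GB. Stored in half precision, they take about 1.65 GB.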
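
The halving follows from the element sizes: a float32 value takes 4 bytes and a float16 value takes 2. A minimal sketch of that arithmetic, using the standard `struct` module to look up the sizes (the helper name `half_precision_size` is hypothetical, and per-file overhead is ignored):

```python
import struct

BYTES_FLOAT32 = struct.calcsize("f")  # size of one single-precision float
BYTES_FLOAT16 = struct.calcsize("e")  # size of one half-precision float


def half_precision_size(float32_size_gb):
    """Estimate the float16 size of data stored as float32 (overhead ignored)."""
    return float32_size_gb * BYTES_FLOAT16 / BYTES_FLOAT32


estimate_gb = half_precision_size(3.3)
print(f"{estimate_gb:.2f} GB")
```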

Use the following commands to infer the long-term feature banks for the AVA training and validation sets. The feature banks are saved as lfb_prefix_path/lfb_train.pkl and lfb_prefix_path/lfb_val.pkl.

# Set `dataset_mode = 'train'` in lfb_slowonly_r50_ava_infer.py
python tools/test.py configs/detection/lfb/lfb_slowonly_r50_ava_infer.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth --eval mAP

# Set `dataset_mode = 'val'` in lfb_slowonly_r50_ava_infer.py
python tools/test.py configs/detection/lfb/lfb_slowonly_r50_ava_infer.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth --eval mAP

MMAction2 uses the slowonly_r50_4x16x1 checkpoint from the config slowonly_kinetics_pretrained_r50_4x16x1_20e_ava_rgb as the pretrained backbone weights of the LFB model that infers the long-term feature bank.

b. Train LFB

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the LFB model on AVA with the half-precision long-term feature bank.

python tools/train.py configs/detection/lfb/lfb_nl_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py \
  --validate --seed 0 --deterministic

For more training details, see the Training setting section of the Getting Started guide.

Test

a. Infer the long-term feature bank for testing LFB

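Before training or testing LFB, users first need to infer the long-term feature bank. If the feature bank files have already been generated, this step can be skipped.

The procedure is the same as in "Infer the long-term feature bank for training LFB" in the Train section.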
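
Whether this step can be skipped comes down to whether the feature bank files already exist. A minimal check, assuming the file names lfb_train.pkl and lfb_val.pkl under lfb_prefix_path given in the Train section (the helper name `missing_lfb_files` is hypothetical, and the temporary directory only stands in for a real lfb_prefix_path):

```python
import tempfile
from pathlib import Path


def missing_lfb_files(lfb_prefix_path, modes=("train", "val")):
    """Return the expected feature bank files that are not on disk yet."""
    prefix = Path(lfb_prefix_path)
    expected = [prefix / f"lfb_{mode}.pkl" for mode in modes]
    return [str(path) for path in expected if not path.is_file()]


# Demo: a temporary directory where only the training bank exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lfb_train.pkl").touch()
    missing = missing_lfb_files(tmp)

print(missing)
```

An empty list means both feature banks are present and the inference step can be skipped.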

b. Test LFB

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the LFB model on AVA with the half-precision long-term feature bank and dump the results to a csv file.

python tools/test.py configs/detection/lfb/lfb_nl_kinetics_pretrained_slowonly_r50_4x16x1_20e_ava_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval mAP --out results.csv

For more testing details, see the Test a dataset section of the Getting Started guide.