Action Recognition Models

C3D

Introduction

@ARTICLE{2014arXiv1412.0767T,
author = {Tran, Du and Bourdev, Lubomir and Fergus, Rob and Torresani, Lorenzo and Paluri, Manohar},
title = {Learning Spatiotemporal Features with 3D Convolutional Networks},
keywords = {Computer Science - Computer Vision and Pattern Recognition},
year = 2014,
month = dec,
eid = {arXiv:1412.0767}
}

Model Zoo

UCF-101

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Testing protocol	Inference time (video/s)	GPU memory (M)	ckpt	log	json
c3d_sports1m_16x1x1_45e_ucf101_rgb.py	128x171	8	c3d	sports1m	83.27	95.90	10 clips x 1 crop	x	6053	ckpt	log	json

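Notes:

The original C3D paper normalizes data with the UCF-101 mean and classifies videos with an SVM. MMAction2 normalizes data with the ImageNet RGB mean and uses a linear classifier.
The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.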
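The linear scaling rule in the notes above can be written as a one-line computation. The following is a minimal sketch, not part of MMAction2; the function name `scale_learning_rate` and its parameters are assumptions made for illustration. It reproduces the two reference points given in the note.

```python
def scale_learning_rate(base_lr, base_gpus, base_videos_per_gpu,
                        gpus, videos_per_gpu):
    """Scale the learning rate in proportion to the total batch size."""
    base_batch = base_gpus * base_videos_per_gpu  # batch size the base lr was tuned for
    new_batch = gpus * videos_per_gpu             # batch size actually used
    return base_lr * new_batch / base_batch


# Reference point from the note: lr=0.01 for 4 GPUs x 2 video/gpu (batch size 8).
lr_16x4 = scale_learning_rate(0.01, 4, 2, gpus=16, videos_per_gpu=4)
print(round(lr_16x4, 6))  # 0.08, matching the second reference point in the note
```

The scaled value would then replace the `lr` field of the optimizer in the chosen config.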

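For details on dataset preparation, users can refer to the UCF-101 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the C3D model on the UCF-101 dataset in a deterministic manner, with periodic validation.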

python tools/train.py configs/recognition/c3d/c3d_sports1m_16x1x1_45e_ucf101_rgb.py \
    --validate --seed 0 --deterministic

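For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the C3D model on the UCF-101 dataset and dump the result to a json file.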

python tools/test.py configs/recognition/c3d/c3d_sports1m_16x1x1_45e_ucf101_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

CSN

Introduction

@inproceedings{inproceedings,
author = {Wang, Heng and Feiszli, Matt and Torresani, Lorenzo},
year = {2019},
month = {10},
pages = {5551-5560},
title = {Video Classification With Channel-Separated Convolutional Networks},
doi = {10.1109/ICCV.2019.00565}
}

@inproceedings{ghadiyaram2019large,
  title={Large-scale weakly-supervised pre-training for video action recognition},
  author={Ghadiyaram, Deepti and Tran, Du and Mahajan, Dhruv},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={12046--12055},
  year={2019}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
ircsn_bnfrozen_r50_32x2x1_180e_kinetics400_rgb	short-side 320	x	ResNet50	None	73.6	91.3	x	x	ckpt	log	json
ircsn_ig65m_pretrained_bnfrozen_r50_32x2x1_58e_kinetics400_rgb	short-side 320	x	ResNet50	IG65M	79.0	94.2	x	x	infer_ckpt	x	x
ircsn_bnfrozen_r152_32x2x1_180e_kinetics400_rgb	short-side 320	x	ResNet152	None	76.5	92.1	x	x	infer_ckpt	x	x
ircsn_sports1m_pretrained_bnfrozen_r152_32x2x1_58e_kinetics400_rgb	short-side 320	x	ResNet152	Sports1M	78.2	93.0	x	x	infer_ckpt	x	x
ircsn_ig65m_pretrained_bnfrozen_r152_32x2x1_58e_kinetics400_rgb.py	short-side 320	8x4	ResNet152	IG65M	82.76/82.6	95.68/95.3	x	8516	ckpt/infer_ckpt	log	json
ipcsn_bnfrozen_r152_32x2x1_180e_kinetics400_rgb	short-side 320	x	ResNet152	None	77.8	92.8	x	x	infer_ckpt	x	x
ipcsn_sports1m_pretrained_bnfrozen_r152_32x2x1_58e_kinetics400_rgb	short-side 320	x	ResNet152	Sports1M	78.8	93.5	x	x	infer_ckpt	x	x
ipcsn_ig65m_pretrained_bnfrozen_r152_32x2x1_58e_kinetics400_rgb	short-side 320	x	ResNet152	IG65M	82.5	95.3	x	x	infer_ckpt	x	x
ircsn_ig65m_pretrained_r152_32x2x1_58e_kinetics400_rgb.py	short-side 320	8x4	ResNet152	IG65M	80.14	94.93	x	8517	ckpt	log	json

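Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.
Here, infer_ckpt indicates that the checkpoint is imported from VMZ.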
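The data list and label map described in the notes above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: fields in the data list are taken to be whitespace-separated, and the label map is taken to hold one class name per line with the line index as the class index. Neither detail is specified above, and the names `VideoRecord`, `parse_data_list`, and `parse_label_map` as well as the sample lines are hypothetical.

```python
from typing import Dict, List, NamedTuple


class VideoRecord(NamedTuple):
    video_id: str
    num_frames: int
    label: int


def parse_data_list(lines: List[str]) -> List[VideoRecord]:
    """Parse lines of the form: video ID, number of frames, class index."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the three fields are separated by whitespace.
        video_id, num_frames, label = line.split()
        records.append(VideoRecord(video_id, int(num_frames), int(label)))
    return records


def parse_label_map(lines: List[str]) -> Dict[int, str]:
    """Map class index to class name (assumed: one name per line, in index order)."""
    names = [line.strip() for line in lines if line.strip()]
    return dict(enumerate(names))


# Hypothetical sample content, for illustration only.
records = parse_data_list(["video_a 300 0", "video_b 250 1"])
label_map = parse_label_map(["class_zero", "class_one"])
for record in records:
    print(record.video_id, record.num_frames, label_map[record.label])
```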

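For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the CSN model on the Kinetics400 dataset in a deterministic manner, with periodic validation.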

python tools/train.py configs/recognition/csn/ircsn_ig65m_pretrained_r152_32x2x1_58e_kinetics400_rgb.py \
    --work-dir work_dirs/ircsn_ig65m_pretrained_r152_32x2x1_58e_kinetics400_rgb \
    --validate --seed 0 --deterministic

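For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the CSN model on the Kinetics400 dataset and dump the result to a json file.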

python tools/test.py configs/recognition/csn/ircsn_ig65m_pretrained_r152_32x2x1_58e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob
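The `--average-clips prob` option in the command above controls how per-clip scores are merged into one video-level prediction. As a rough illustration of that idea, the sketch below applies softmax to each clip's class scores and then averages the resulting probabilities. It is written in plain Python, is not MMAction2's implementation, and the function names and sample scores are assumptions.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_clips_prob(clip_scores):
    """Average per-clip softmax probabilities into one video-level distribution."""
    probs = [softmax(scores) for scores in clip_scores]
    num_clips = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_clips for c in range(num_classes)]


# Hypothetical raw scores for 2 clips over 3 classes.
clip_scores = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
video_prob = average_clips_prob(clip_scores)
predicted = max(range(len(video_prob)), key=lambda c: video_prob[c])
print(predicted)
```

Averaging probabilities rather than raw scores keeps a single clip with unusually large logits from dominating the video-level result.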

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

I3D

Introduction

@inproceedings{inproceedings,
  author = {Carreira, J. and Zisserman, Andrew},
  year = {2017},
  month = {07},
  pages = {4724-4733},
  title = {Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset},
  doi = {10.1109/CVPR.2017.502}
}

@article{NonLocal2018,
  author =   {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title =    {Non-local Neural Networks},
  journal =  {CVPR},
  year =     {2018}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
i3d_r50_32x2x1_100e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	72.68	90.78	1.7 (320x3 frames)	5170	ckpt	log	json
i3d_r50_32x2x1_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	73.27	90.92	x	5170	ckpt	log	json
i3d_r50_video_32x2x1_100e_kinetics400_rgb	short-side 256p	8	ResNet50	ImageNet	72.85	90.75	x	5170	ckpt	log	json
i3d_r50_dense_32x2x1_100e_kinetics400_rgb	340x256	8x2	ResNet50	ImageNet	72.77	90.57	1.7 (320x3 frames)	5170	ckpt	log	json
i3d_r50_dense_32x2x1_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	73.48	91.00	x	5170	ckpt	log	json
i3d_r50_lazy_32x2x1_100e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	72.32	90.72	1.8 (320x3 frames)	5170	ckpt	log	json
i3d_r50_lazy_32x2x1_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	73.24	90.99	x	5170	ckpt	log	json
i3d_nl_embedded_gaussian_r50_32x2x1_100e_kinetics400_rgb	short-side 256p	8x4	ResNet50	ImageNet	74.71	91.81	x	6438	ckpt	log	json
i3d_nl_gaussian_r50_32x2x1_100e_kinetics400_rgb	short-side 256p	8x4	ResNet50	ImageNet	73.37	91.26	x	4944	ckpt	log	json
i3d_nl_dot_product_r50_32x2x1_100e_kinetics400_rgb	short-side 256p	8x4	ResNet50	ImageNet	73.92	91.59	x	4832	ckpt	log	json

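Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the I3D model on the Kinetics400 dataset in a deterministic manner, with periodic validation.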

python tools/train.py configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py \
    --work-dir work_dirs/i3d_r50_32x2x1_100e_kinetics400_rgb \
    --validate --seed 0 --deterministic

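For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the I3D model on the Kinetics400 dataset and dump the result to a json file.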

python tools/test.py configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob

For more testing details, refer to the Test a dataset part of the getting-started tutorial.
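The test commands in this document request the `top_k_accuracy` and `mean_class_accuracy` metrics. The sketch below shows what these two quantities measure, in plain Python. It is an illustration only and not the evaluation code used by MMAction2; the function signatures and sample scores are assumptions.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for score, label in zip(scores, labels):
        ranked = sorted(range(len(score)), key=lambda c: score[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def mean_class_accuracy(scores, labels):
    """Average of per-class top-1 accuracies over the classes present in labels."""
    correct, total = {}, {}
    for score, label in zip(scores, labels):
        pred = max(range(len(score)), key=lambda c: score[c])
        total[label] = total.get(label, 0) + 1
        correct[label] = correct.get(label, 0) + (pred == label)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical scores for 4 videos over 3 classes.
scores = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [0, 1, 1, 2]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
mean_cls = mean_class_accuracy(scores, labels)
print(top1, top2, round(mean_cls, 4))
```

Mean class accuracy weights every class equally, so it differs from top-1 accuracy whenever the classes are imbalanced, which is why both are reported for datasets such as GYM99.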

Omni-sourced Webly-supervised Learning for Video Recognition

Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, Dahua Lin

In ECCV, 2020. Paper, Dataset

(Figure: OmniSource pipeline)

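Model Zoo

Kinetics-400

MMAction2 currently releases 4 models trained with the OmniSource framework, covering both 2D and 3D architectures. The table below compares the Kinetics-400 accuracy of models trained with and without OmniSource:

Model	Modality	Pretrain	Backbone	Input	Resolution	Top-1 acc (Baseline / OmniSource (Delta))	Top-5 acc (Baseline / OmniSource (Delta))	Download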
TSN	RGB	ImageNet	ResNet50	3seg	340x256	70.6 / 73.6 (+ 3.0)	89.4 / 91.0 (+ 1.6)	Baseline / OmniSource
TSN	RGB	IG-1B	ResNet50	3seg	short-side 320	73.1 / 75.7 (+ 2.6)	90.4 / 91.9 (+ 1.5)	Baseline / OmniSource
SlowOnly	RGB	None	ResNet50	4x16	short-side 320	72.9 / 76.8 (+ 3.9)	90.9 / 92.5 (+ 1.6)	Baseline / OmniSource
SlowOnly	RGB	None	ResNet101	8x8	short-side 320	76.5 / 80.4 (+ 3.9)	92.7 / 94.4 (+ 1.7)	Baseline / OmniSource

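The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

Benchmark on Mini-Kinetics

The OmniSource project has released a subset of the collected web data, covering the 200 action classes in Mini-Kinetics. Detailed statistics of these datasets are recorded in the OmniSource dataset preparation document. Users can obtain the data by filling in the request form; once it is completed, a download link is sent to the user's email address. For more information about the OmniSource web datasets, refer to the OmniSource dataset preparation document.

MMAction2 benchmarks the OmniSource framework on the released data. The table below records the detailed results (accuracy on the Mini-Kinetics validation set), which can serve as baselines for video recognition trained with web data.

TSN-8seg-ResNet50

Model	Modality	Pretrain	Backbone	Input	Resolution	Top-1 acc	Top-5 acc	ckpt	json	log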
tsn_r50_1x1x8_100e_minikinetics_rgb	RGB	ImageNet	ResNet50	3seg	short-side 320	77.4	93.6	ckpt	json	log
tsn_r50_1x1x8_100e_minikinetics_googleimage_rgb	RGB	ImageNet	ResNet50	3seg	short-side 320	78.0	93.6	ckpt	json	log
tsn_r50_1x1x8_100e_minikinetics_webimage_rgb	RGB	ImageNet	ResNet50	3seg	short-side 320	78.6	93.6	ckpt	json	log
tsn_r50_1x1x8_100e_minikinetics_insvideo_rgb	RGB	ImageNet	ResNet50	3seg	short-side 320	80.6	95.0	ckpt	json	log
tsn_r50_1x1x8_100e_minikinetics_kineticsraw_rgb	RGB	ImageNet	ResNet50	3seg	short-side 320	78.6	93.2	ckpt	json	log
tsn_r50_1x1x8_100e_minikinetics_omnisource_rgb	RGB	ImageNet	ResNet50	3seg	short-side 320	81.3	94.8	ckpt	json	log

SlowOnly-8x8-ResNet50

Model	Modality	Pretrain	Backbone	Input	Resolution	Top-1 acc	Top-5 acc	ckpt	json	log
slowonly_r50_8x8x1_256e_minikinetics_rgb	RGB	None	ResNet50	8x8	short-side 320	78.6	93.9	ckpt	json	log
slowonly_r50_8x8x1_256e_minikinetics_googleimage_rgb	RGB	None	ResNet50	8x8	short-side 320	80.8	95.0	ckpt	json	log
slowonly_r50_8x8x1_256e_minikinetics_webimage_rgb	RGB	None	ResNet50	8x8	short-side 320	81.3	95.2	ckpt	json	log
slowonly_r50_8x8x1_256e_minikinetics_insvideo_rgb	RGB	None	ResNet50	8x8	short-side 320	82.4	95.6	ckpt	json	log
slowonly_r50_8x8x1_256e_minikinetics_kineticsraw_rgb	RGB	None	ResNet50	8x8	short-side 320	80.3	94.5	ckpt	json	log
slowonly_r50_8x8x1_256e_minikinetics_omnisource_rgb	RGB	None	ResNet50	8x8	short-side 320	82.9	95.8	ckpt	json	log

The table below lists the Kinetics-400 benchmark results from the original paper for reference:

Model	Baseline	+GG-img	+[GG-IG]-img	+IG-vid	+KRaw	OmniSource
TSN-3seg-ResNet50	70.6 / 89.4	71.5 / 89.5	72.0 / 90.0	72.0 / 90.3	71.7 / 89.6	73.6 / 91.0
SlowOnly-4x16-ResNet50	73.8 / 90.9	74.5 / 91.4	75.2 / 91.6	75.2 / 91.7	74.5 / 91.1	76.6 / 92.5

Citation

If the OmniSource project is helpful to your research, please cite it using the following BibTeX entry:

@article{duan2020omni,
  title={Omni-sourced Webly-supervised Learning for Video Recognition},
  author={Duan, Haodong and Zhao, Yue and Xiong, Yuanjun and Liu, Wentao and Lin, Dahua},
  journal={arXiv preprint arXiv:2003.13042},
  year={2020}
}

R2plus1D

Introduction

@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  pages={6450--6459},
  year={2018}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
r2plus1d_r34_8x8x1_180e_kinetics400_rgb	short-side 256	8x4	ResNet34	None	67.30	87.65	x	5019	ckpt	log	json
r2plus1d_r34_video_8x8x1_180e_kinetics400_rgb	short-side 256	8	ResNet34	None	67.3	87.8	x	5019	ckpt	log	json
r2plus1d_r34_8x8x1_180e_kinetics400_rgb	short-side 320	8x2	ResNet34	None	68.68	88.36	1.6 (80x3 frames)	5019	ckpt	log	json
r2plus1d_r34_32x2x1_180e_kinetics400_rgb	short-side 320	8x2	ResNet34	None	74.60	91.59	0.5 (320x3 frames)	12975	ckpt	log	json

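Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.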
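The inference-time convention in the notes above (batch size 1, model time only, no IO or pre-processing) can be illustrated with a small timing harness. This is a sketch under stated assumptions and is not the MMAction2 benchmark script: `infer_fn` stands in for a model forward pass, the inputs are assumed to be already loaded and pre-processed, and the warm-up count is an arbitrary choice. On a GPU, the device would also need to be synchronized before each clock read, which this CPU-only sketch omits.

```python
import time


def measure_videos_per_second(infer_fn, videos, warmup=2):
    """Time infer_fn on one video at a time and return throughput in video/s."""
    # Warm-up runs are excluded from timing (assumption: 2 runs are enough).
    for video in videos[:warmup]:
        infer_fn(video)
    timed = videos[warmup:]
    elapsed = 0.0
    for video in timed:
        start = time.perf_counter()
        infer_fn(video)  # only the model call sits inside the timed region
        elapsed += time.perf_counter() - start
    return len(timed) / elapsed if elapsed > 0 else float("inf")


def dummy_model(video):
    """Hypothetical stand-in for a forward pass."""
    return sum(x * x for x in video)


# Pre-built inputs, so data loading and pre-processing are not timed.
videos = [list(range(10000)) for _ in range(12)]
rate = measure_videos_per_second(dummy_model, videos)
print(f"{rate:.1f} video/s")
```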

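For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the R(2+1)D model on the Kinetics400 dataset in a deterministic manner, with periodic validation.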

python tools/train.py configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py \
    --work-dir work_dirs/r2plus1d_r34_8x8x1_180e_kinetics400_rgb \
    --validate --seed 0 --deterministic

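For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the R(2+1)D model on the Kinetics400 dataset and dump the result to a json file.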

python tools/test.py configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips=prob

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

SlowFast

Introduction

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={Proceedings of the IEEE international conference on computer vision},
  pages={6202--6211},
  year={2019}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
slowfast_r50_4x16x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	None	74.75	91.73	x	6203	ckpt	log	json
slowfast_r50_video_4x16x1_256e_kinetics400_rgb	short-side 256	8	ResNet50	None	73.95	91.50	x	6203	ckpt	log	json
slowfast_r50_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	76.0	92.54	1.6 ((32+4)x10x3 frames)	6203	ckpt	log	json
slowfast_prebn_r50_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	76.34	92.67	x	6203	ckpt	log	json
slowfast_r50_8x8x1_256e_kinetics400_rgb	short-side 320	8x3	ResNet50	None	76.94	92.8	1.3 ((32+8)x10x3 frames)	9062	ckpt	log	json
slowfast_r50_8x8x1_256e_kinetics400_rgb_steplr	short-side 320	8x4	ResNet50	None	76.34	92.61	1.3 ((32+8)x10x3 frames)	9062	ckpt	log	json
slowfast_multigrid_r50_8x8x1_358e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	76.07	92.21	x	9062	ckpt	log	json
slowfast_prebn_r50_8x8x1_256e_kinetics400_rgb_steplr	short-side 320	8x4	ResNet50	None	76.58	92.85	1.3 ((32+8)x10x3 frames)	9062	ckpt	log	json
slowfast_r101_r50_4x16x1_256e_kinetics400_rgb	short-side 256	8x1	ResNet101 + ResNet50	None	76.69	93.07	x	16628	ckpt	log	json
slowfast_r101_8x8x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet101	None	77.90	93.51	x	25994	ckpt	log	json
slowfast_r152_r50_4x16x1_256e_kinetics400_rgb	short-side 256	8x1	ResNet152 + ResNet50	None	77.13	93.20	x	10077	ckpt	log	json

Something-Something V1

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
slowfast_r50_16x8x1_22e_sthv1_rgb	height 100	8	ResNet50	Kinetics400	49.67	79.00	x	9293	ckpt	log	json

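Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowFast model on the Kinetics400 dataset in a deterministic manner, with periodic validation.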

python tools/train.py configs/recognition/slowfast/slowfast_r50_4x16x1_256e_kinetics400_rgb.py \
    --work-dir work_dirs/slowfast_r50_4x16x1_256e_kinetics400_rgb \
    --validate --seed 0 --deterministic

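For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowFast model on the Kinetics400 dataset and dump the result to a json file.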

python tools/test.py configs/recognition/slowfast/slowfast_r50_4x16x1_256e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips=prob

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

SlowOnly

Introduction

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={Proceedings of the IEEE international conference on computer vision},
  pages={6202--6211},
  year={2019}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	None	72.76	90.51	x	3168	ckpt	log	json
slowonly_r50_video_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	72.90	90.82	x	8472	ckpt	log	json
slowonly_r50_8x8x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	None	74.42	91.49	x	5820	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	73.02	90.77	4.0 (40x3 frames)	3168	ckpt	log	json
slowonly_r50_8x8x1_256e_kinetics400_rgb	short-side 320	8x3	ResNet50	None	74.93	91.92	2.3 (80x3 frames)	5820	ckpt	log	json
slowonly_imagenet_pretrained_r50_4x16x1_150e_kinetics400_rgb	short-side 320	8x2	ResNet50	ImageNet	73.39	91.12	x	3168	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x8x1_150e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	75.55	92.04	x	5820	ckpt	log	json
slowonly_nl_embedded_gaussian_r50_4x16x1_150e_kinetics400_rgb	short-side 320	8x2	ResNet50	ImageNet	74.54	91.73	x	4435	ckpt	log	json
slowonly_nl_embedded_gaussian_r50_8x8x1_150e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	76.07	92.42	x	8895	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_flow	short-side 320	8x2	ResNet50	ImageNet	61.79	83.62	x	8450	ckpt	log	json
slowonly_r50_8x8x1_196e_kinetics400_flow	short-side 320	8x4	ResNet50	ImageNet	65.76	86.25	x	8455	ckpt	log	json

Kinetics-400 Data Benchmark

The data benchmark compares three data pre-processing settings: (1) video resolution 340x256, (2) video resolution short-side 320px, and (3) video resolution short-side 256px.

Config	Resolution	GPUs	Backbone	Input	Pretrain	Top-1 acc	Top-5 acc	Testing protocol	ckpt	log	json
slowonly_r50_randomresizedcrop_340x256_4x16x1_256e_kinetics400_rgb	340x256	8x2	ResNet50	4x16	None	71.61	90.05	10 clips x 3 crops	ckpt	log	json
slowonly_r50_randomresizedcrop_320p_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	4x16	None	73.02	90.77	10 clips x 3 crops	ckpt	log	json
slowonly_r50_randomresizedcrop_256p_4x16x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	4x16	None	72.76	90.51	10 clips x 3 crops	ckpt	log	json

Kinetics-400 OmniSource Experiments

Config	Resolution	Backbone	Pretrain	With OmniSource	Top-1 acc	Top-5 acc	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_rgb	short-side 320	ResNet50	None	✗	73.0	90.8	ckpt	log	json
x	x	ResNet50	None	✓	76.8	92.5	ckpt	x	x
slowonly_r101_8x8x1_196e_kinetics400_rgb	x	ResNet101	None	✗	76.5	92.7	ckpt	x	x
x	x	ResNet101	None	✓	80.4	94.4	ckpt	x	x

Kinetics-600

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	ckpt	log	json
slowonly_r50_video_8x8x1_256e_kinetics600_rgb	short-side 256	8x4	ResNet50	None	77.5	93.7	ckpt	log	json

Kinetics-700

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	ckpt	log	json
slowonly_r50_video_8x8x1_256e_kinetics700_rgb	short-side 256	8x4	ResNet50	None	65.0	86.1	ckpt	log	json

GYM99

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Mean class acc	ckpt	log	json
slowonly_imagenet_pretrained_r50_4x16x1_120e_gym99_rgb	short-side 256	8x2	ResNet50	ImageNet	79.3	70.2	ckpt	log	json
slowonly_kinetics_pretrained_r50_4x16x1_120e_gym99_flow	short-side 256	8x2	ResNet50	Kinetics	80.3	71.0	ckpt	log	json
1:1 fusion					83.7	74.8
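The last row of the GYM99 table reports a 1:1 fusion of the RGB and flow models. A common way to realize such a fusion is a weighted average of the two models' per-class scores; the sketch below shows that idea with equal weights. It is an assumption about how the fusion could be computed, not a description of the script that produced the numbers above, and the function name and sample scores are hypothetical.

```python
def fuse_scores(rgb_scores, flow_scores, rgb_weight=1.0, flow_weight=1.0):
    """Weighted average of per-class scores from two modalities (1:1 by default)."""
    total = rgb_weight + flow_weight
    return [
        (rgb_weight * r + flow_weight * f) / total
        for r, f in zip(rgb_scores, flow_scores)
    ]


# Hypothetical class probabilities for one video from each modality.
rgb = [0.6, 0.3, 0.1]
flow = [0.2, 0.7, 0.1]
fused = fuse_scores(rgb, flow)
predicted = max(range(len(fused)), key=lambda c: fused[c])
print(predicted)  # the flow model's confident vote for class 1 wins
```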

Jester

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x8x1_64e_jester_rgb	height 100	8	ResNet50	ImageNet	97.2	ckpt	log	json

HMDB51

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x4x1_64e_hmdb51_rgb	8	ResNet50	ImageNet	37.52	71.50	5812	ckpt	log	json
slowonly_k400_pretrained_r50_8x4x1_40e_hmdb51_rgb	8	ResNet50	Kinetics400	65.95	91.05	5812	ckpt	log	json

UCF101

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x4x1_64e_ucf101_rgb	8	ResNet50	ImageNet	71.35	89.35	5812	ckpt	log	json
slowonly_k400_pretrained_r50_8x4x1_40e_ucf101_rgb	8	ResNet50	Kinetics400	92.78	99.42	5812	ckpt	log	json

Something-Something V1

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x4x1_64e_sthv1_rgb	8	ResNet50	ImageNet	47.76	77.49	7759	ckpt	log	json

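Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowOnly model on the Kinetics400 dataset in a deterministic manner, with periodic validation.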

python tools/train.py configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py \
    --work-dir work_dirs/slowonly_r50_4x16x1_256e_kinetics400_rgb \
    --validate --seed 0 --deterministic

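For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowOnly model on the Kinetics400 dataset and dump the result to a json file.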

python tools/test.py configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips=prob

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

TANet

Introduction

@article{liu2020tam,
  title={TAM: Temporal Adaptive Module for Video Recognition},
  author={Liu, Zhaoyang and Wang, Limin and Wu, Wayne and Qian, Chen and Lu, Tong},
  journal={arXiv preprint arXiv:2005.06803},
  year={2020}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tanet_r50_dense_1x1x8_100e_kinetics400_rgb	short-side 320	8	TANet	ImageNet	76.28	92.60	76.22	92.53	x	7124	ckpt	log	json

Something-Something V1

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	Top-5 acc (efficient/accurate)	GPU memory (M)	ckpt	log	json
tanet_r50_1x1x8_50e_sthv1_rgb	height 100	8	TANet	ImageNet	47.34/49.58	75.72/77.31	7127	ckpt	log	json
tanet_r50_1x1x16_50e_sthv1_rgb	height 100	8	TANet	ImageNet	49.05/50.91	77.90/79.13	7127	ckpt	log	json

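Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The reference results are obtained by training on the original codebase with the same model settings. The corresponding checkpoints can be downloaded here.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TANet model on the Kinetics400 dataset in a deterministic manner, with periodic validation.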

python tools/train.py configs/recognition/tanet/tanet_r50_dense_1x1x8_100e_kinetics400_rgb.py \
    --work-dir work_dirs/tanet_r50_dense_1x1x8_100e_kinetics400_rgb \
    --validate --seed 0 --deterministic

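For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TANet model on the Kinetics400 dataset and dump the result to a json file.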

python tools/test.py configs/recognition/tanet/tanet_r50_dense_1x1x8_100e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

TimeSformer

Introduction

@misc{bertasius2021spacetime,
    title   = {Is Space-Time Attention All You Need for Video Understanding?},
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    year    = {2021},
    eprint  = {2102.05095},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
timesformer_divST_8x32x1_15e_kinetics400_rgb	short-side 320	8	TimeSformer	ImageNet-21K	77.92	93.29	x	17874	ckpt	log	json
timesformer_jointST_8x32x1_15e_kinetics400_rgb	short-side 320	8	TimeSformer	ImageNet-21K	77.01	93.08	x	25658	ckpt	log	json
timesformer_spaceOnly_8x32x1_15e_kinetics400_rgb	short-side 320	8	TimeSformer	ImageNet-21K	76.93	92.90	x	12750	ckpt	log	json

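Notes:

The GPUs column indicates the number of GPUs (32G V100) used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.005 for 8 GPUs x 8 video/gpu and lr=0.004375 for 8 GPUs x 7 video/gpu.
MMAction2 keeps the test setting consistent with the original code (three crop x 1 clip).
The pretrained model vit_base_patch16_224.pth used by TimeSformer is converted from vision_transformer.

For details on dataset preparation, users can refer to the Kinetics400 part of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TimeSformer model on the Kinetics400 dataset in a deterministic manner, with periodic validation.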

python tools/train.py configs/recognition/timesformer/timesformer_divST_8x32x1_15e_kinetics400_rgb.py \
    --work-dir work_dirs/timesformer_divST_8x32x1_15e_kinetics400_rgb \
    --validate --seed 0 --deterministic

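For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TimeSformer model on the Kinetics400 dataset and dump the result to a json file.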

python tools/test.py configs/recognition/timesformer/timesformer_divST_8x32x1_15e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

TIN

Introduction

@article{shao2020temporal,
    title={Temporal Interlacing Network},
    author={Hao Shao and Shengju Qian and Yu Liu},
    year={2020},
    journal={AAAI},
}

Model Zoo

Something-Something V1

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	GPU memory (M)	ckpt	log	json
tin_r50_1x1x8_40e_sthv1_rgb	height 100	8x4	ResNet50	ImageNet	44.25	73.94	44.04	72.72	6181	ckpt	log	json

Something-Something V2

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	GPU memory (M)	ckpt	log	json
tin_r50_1x1x8_40e_sthv2_rgb	height 240	8x4	ResNet50	ImageNet	56.70	83.62	56.48	83.45	6185	ckpt	log	json

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tin_tsm_finetune_r50_1x1x8_50e_kinetics400_rgb	short-side 256	8x4	ResNet50	TSM-Kinetics400	70.89	89.89	6187	ckpt	log	json

Here, the term finetune in the config name indicates that the TIN model is fine-tuned from the TSM model trained on Kinetics400.

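Notes:

The reference results are obtained by training with the original repo after fixing an AverageMeter issue that causes incorrect accuracy calculation.
The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The reference results are obtained by training on the original codebase with the same model settings.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

For details on dataset preparation, users can refer to the Kinetics400, Something-Something V1, and Something-Something V2 parts of the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TIN model on the Something-Something V1 dataset in a deterministic manner, with periodic validation.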

python tools/train.py configs/recognition/tin/tin_r50_1x1x8_40e_sthv1_rgb.py \
    --work-dir work_dirs/tin_r50_1x1x8_40e_sthv1_rgb \
    --validate --seed 0 --deterministic

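For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TIN model on the Something-Something V1 dataset and dump the result to a json file.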

python tools/test.py configs/recognition/tin/tin_r50_1x1x8_40e_sthv1_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

TPN

Introduction

@inproceedings{yang2020tpn,
  title={Temporal Pyramid Network for Action Recognition},
  author={Yang, Ceyuan and Xu, Yinghao and Shi, Jianping and Dai, Bo and Zhou, Bolei},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tpn_slowonly_r50_8x8x1_150e_kinetics_rgb	short-side 320	8x2	ResNet50	None	73.58	91.35	x	x	x	6916	ckpt	log	json
tpn_imagenet_pretrained_slowonly_r50_8x8x1_150e_kinetics_rgb	short-side 320	8	ResNet50	ImageNet	76.59	92.72	75.49	92.05	x	6916	ckpt	log	json

Something-Something V1

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tpn_tsm_r50_1x1x8_150e_sthv1_rgb	height 100	8x6	ResNet50	TSM	51.50	79.15	8828	ckpt	log	json

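Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
The inference time is obtained with the benchmark script, using the test-time frame sampling strategy. It covers only model inference time and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU with a batch size (videos per GPU) of 1 to measure inference time.
The reference results are obtained by training on the original codebase with the same model settings.
The Kinetics400 validation set used here contains 19796 videos, which users can download from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TPN model on the Kinetics-400 dataset in a deterministic manner, with periodic validation.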

python tools/train.py configs/recognition/tpn/tpn_slowonly_r50_8x8x1_150e_kinetics_rgb.py \
    --work-dir work_dirs/tpn_slowonly_r50_8x8x1_150e_kinetics_rgb \
    --validate --seed 0 --deterministic

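For more training details, refer to the Training setting part of the getting-started tutorial.

Test

Users can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TPN model on the Kinetics-400 dataset and dump the result to a json file.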

python tools/test.py configs/recognition/tpn/tpn_slowonly_r50_8x8x1_150e_kinetics_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob

For more testing details, refer to the Test a dataset part of the getting-started tutorial.

TRN

Introduction

@article{zhou2017temporalrelation,
    title = {Temporal Relational Reasoning in Videos},
    author = {Zhou, Bolei and Andonian, Alex and Oliva, Aude and Torralba, Antonio},
    journal={European Conference on Computer Vision},
    year={2018}
}

Model Zoo

Something-Something V1

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	Top-5 acc (efficient/accurate)	GPU memory (M)	ckpt	log	json
trn_r50_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	31.62 / 33.88	60.01 / 62.12	11010	ckpt	log	json

Something-Something V2

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	Top-5 acc (efficient/accurate)	GPU memory (M)	ckpt	log	json
trn_r50_1x1x8_50e_sthv2_rgb	height 256	8	ResNet50	ImageNet	48.39 / 51.28	76.58 / 78.65	11010	ckpt	log	json

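Notes:

The GPUs column indicates the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the linear scaling rule, users need to scale the learning rate in proportion to the batch size when using a different number of GPUs or a different number of videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
For the Something-Something datasets, there are two testing protocols: efficient (center crop x 1 clip) and accurate (Three crop x 2 clip).
In the original codebase, the authors apply random horizontal flip on the Something-Something datasets. This augmentation is problematic because the datasets contain direction-sensitive actions, such as pushing something from left to right. MMAction2 therefore replaces random horizontal flip with a horizontal flip that uses a label mapping, and changes the test-time data processing from cropping 10 patches (5 of which are flipped) to sampling frames twice and cropping 3 patches.
MMAction2 uses ResNet50 instead of BNInception as the TRN backbone. Training TRN-ResNet50 on the sthv1 dataset with the original code yields a top-1 (top-5) accuracy of 30.542 (58.627), whereas MMAction2 reaches 31.62 (60.01).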
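The flip with label mapping described in the notes above can be sketched as follows. The idea is that when a clip is mirrored, a direction-sensitive label is swapped for its mirrored counterpart so the label still matches the pixels. This is a minimal illustration in plain Python, not the MMAction2 transform: the class indices in `FLIP_LABEL_MAP`, the function names, and the nested-list frame representation are all hypothetical.

```python
import random

# Hypothetical pairs of mirrored classes, for illustration only:
# 0 = "pushing something from left to right", 1 = "pushing something from right to left".
FLIP_LABEL_MAP = {0: 1, 1: 0}


def flip_frame(frame):
    """Mirror one frame horizontally; a frame is a list of pixel rows."""
    return [row[::-1] for row in frame]


def random_flip_with_label(frames, label, flip_ratio=0.5, rng=random):
    """Flip all frames with probability flip_ratio and remap the label if needed."""
    if rng.random() < flip_ratio:
        frames = [flip_frame(frame) for frame in frames]
        # Labels without a mirrored counterpart are left unchanged.
        label = FLIP_LABEL_MAP.get(label, label)
    return frames, label


# A tiny 1-frame, 2x3 "video" with label 0.
clip = [[[1, 2, 3], [4, 5, 6]]]
flipped, new_label = random_flip_with_label(clip, 0, flip_ratio=1.0)
print(flipped, new_label)
```

Flipping every frame of a clip together, rather than frame by frame, keeps the motion direction consistent within the clip.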

For more details on data processing, users can refer to the dataset preparation documentation.

Train

Users can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TRN model on the sthv1 dataset in a deterministic manner, with periodic validation.

python tools/train.py configs/recognition/trn/trn_r50_1x1x8_50e_sthv1_rgb.py \
    --work-dir work_dirs/trn_r50_1x1x8_50e_sthv1_rgb \
    --validate --seed 0 --deterministic

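For more training details, refer to the Training setting part of the Getting Started guide.

Test¶

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TRN model on the sthv1 dataset and dump the results to a json file.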

python tools/test.py configs/recognition/trn/trn_r50_1x1x8_50e_sthv1_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the Getting Started guide.

TSM¶

Introduction¶

@inproceedings{lin2019tsm,
  title={TSM: Temporal Shift Module for Efficient Video Understanding},
  author={Lin, Ji and Gan, Chuang and Han, Song},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2019}
}

@article{NonLocal2018,
  author =   {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title =    {Non-local Neural Networks},
  journal =  {CVPR},
  year =     {2018}
}

Model Zoo¶

Kinetics-400¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tsm_r50_1x1x8_50e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	70.24	89.56	70.36	89.49	74.0 (8x1 frames)	7079	ckpt	log	json
tsm_r50_1x1x8_50e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.59	89.52	x	x	x	7079	ckpt	log	json
tsm_r50_1x1x8_50e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	70.73	89.81	x	x	x	7079	ckpt	log	json
tsm_r50_1x1x8_100e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	71.90	90.03	x	x	x	7079	ckpt	log	json
tsm_r50_gpu_normalize_1x1x8_50e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.48	89.40	x	x	x	7076	ckpt	log	json
tsm_r50_video_1x1x8_50e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.25	89.66	70.36	89.49	74.0 (8x1 frames)	7077	ckpt	log	json
tsm_r50_dense_1x1x8_50e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	73.46	90.84	x	x	x	7079	ckpt	log	json
tsm_r50_dense_1x1x8_100e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	74.55	91.74	x	x	x	7079	ckpt	log	json
tsm_r50_1x1x16_50e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	72.09	90.37	70.67	89.98	47.0 (16x1 frames)	10404	ckpt	log	json
tsm_r50_1x1x16_50e_kinetics400_rgb	short-side 256	8x4	ResNet50	ImageNet	71.89	90.73	x	x	x	10398	ckpt	log	json
tsm_r50_1x1x16_100e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	72.80	90.75	x	x	x	10398	ckpt	log	json
tsm_nl_embedded_gaussian_r50_1x1x8_50e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	72.03	90.25	71.81	90.36	x	8931	ckpt	log	json
tsm_nl_gaussian_r50_1x1x8_50e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	70.70	89.90	x	x	x	10125	ckpt	log	json
tsm_nl_dot_product_r50_1x1x8_50e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	71.60	90.34	x	x	x	8358	ckpt	log	json
tsm_mobilenetv2_dense_1x1x8_100e_kinetics400_rgb	short-side 320	8	MobileNetV2	ImageNet	68.46	88.64	x	x	x	3385	ckpt	log	json
tsm_mobilenetv2_dense_1x1x8_kinetics400_rgb_port	short-side 320	8	MobileNetV2	ImageNet	69.89	89.01	x	x	x	3385	infer_ckpt	x	x

Diving48¶

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsm_r50_video_1x1x8_50e_diving48_rgb	8	ResNet50	ImageNet	75.99	97.16	7070	ckpt	log	json
tsm_r50_video_1x1x16_50e_diving48_rgb	8	ResNet50	ImageNet	81.62	97.66	7070	ckpt	log	json

Something-Something V1¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	Top-5 acc (efficient/accurate)	Reference top-1 acc (efficient/accurate)	Reference top-5 acc (efficient/accurate)	GPU memory (M)	ckpt	log	json
tsm_r50_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	45.58 / 47.70	75.02 / 76.12	45.50 / 47.33	74.34 / 76.60	7077	ckpt	log	json
tsm_r50_flip_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	47.10 / 48.51	76.02 / 77.56	45.50 / 47.33	74.34 / 76.60	7077	ckpt	log	json
tsm_r50_randaugment_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	47.16 / 48.90	76.07 / 77.92	45.50 / 47.33	74.34 / 76.60	7077	ckpt	log	json
tsm_r50_ptv_randaugment_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	47.65 / 48.66	76.67 / 77.41	45.50 / 47.33	74.34 / 76.60	7077	ckpt	log	json
tsm_r50_ptv_augmix_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	46.26 / 47.68	75.92 / 76.49	45.50 / 47.33	74.34 / 76.60	7077	ckpt	log	json
tsm_r50_flip_randaugment_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	47.85 / 50.31	76.78 / 78.18	45.50 / 47.33	74.34 / 76.60	7077	ckpt	log	json
tsm_r50_1x1x16_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	47.77 / 49.03	76.82 / 77.83	47.05 / 48.61	76.40 / 77.96	10390	ckpt	log	json
tsm_r101_1x1x8_50e_sthv1_rgb	height 100	8	ResNet101	ImageNet	46.09 / 48.59	75.41 / 77.10	46.64 / 48.13	75.40 / 77.31	9800	ckpt	log	json

Something-Something V2¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	Top-5 acc (efficient/accurate)	Reference top-1 acc (efficient/accurate)	Reference top-5 acc (efficient/accurate)	GPU memory (M)	ckpt	log	json
tsm_r50_1x1x8_50e_sthv2_rgb	height 256	8	ResNet50	ImageNet	59.11 / 61.82	85.39 / 86.80	xx / 61.2	xx / xx	7069	ckpt	log	json
tsm_r50_1x1x16_50e_sthv2_rgb	height 256	8	ResNet50	ImageNet	61.06 / 63.19	86.66 / 87.93	xx / 63.1	xx / xx	10400	ckpt	log	json
tsm_r101_1x1x8_50e_sthv2_rgb	height 256	8	ResNet101	ImageNet	60.88 / 63.84	86.56 / 88.30	xx / 63.3	xx / xx	9727	ckpt	log	json

MixUp & CutMix on Something-Something V1¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	Top-5 acc (efficient/accurate)	Top-1 acc change (efficient/accurate)	Top-5 acc change (efficient/accurate)	ckpt	log	json
tsm_r50_mixup_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	46.35 / 48.49	75.07 / 76.88	+0.77 / +0.79	+0.05 / +0.70	ckpt	log	json
tsm_r50_cutmix_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	45.92 / 47.46	75.23 / 76.71	+0.34 / -0.24	+0.21 / +0.59	ckpt	log	json
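
For readers unfamiliar with MixUp, the following is a minimal sketch of the general technique, using alpha=0.2 as stated in the notes of this section. It is not MMAction2's implementation: the function name and the flat-list clip representation are assumptions made for illustration.

```python
import random


def mixup(clip_a, label_a, clip_b, label_b, alpha=0.2, rng=random):
    """Blend two clips and their one-hot labels with a Beta(alpha, alpha) weight.

    clip_a, clip_b:   flat lists of pixel values with equal length.
    label_a, label_b: one-hot label vectors with equal length.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed_clip = [lam * a + (1 - lam) * b for a, b in zip(clip_a, clip_b)]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_clip, mixed_label, lam


random.seed(0)
clip, label, lam = mixup([1.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0])
print(lam, clip, label)
```

With a small alpha such as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals.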

Jester¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc (efficient/accurate)	ckpt	log	json
tsm_r50_1x1x8_50e_jester_rgb	height 100	8	ResNet50	ImageNet	96.5 / 97.2	ckpt	log	json

HMDB51¶

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsm_k400_pretrained_r50_1x1x8_25e_hmdb51_rgb	8	ResNet50	Kinetics400	72.68	92.03	10388	ckpt	log	json
tsm_k400_pretrained_r50_1x1x16_25e_hmdb51_rgb	8	ResNet50	Kinetics400	74.77	93.86	10388	ckpt	log	json

UCF101¶

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsm_k400_pretrained_r50_1x1x8_25e_ucf101_rgb	8	ResNet50	Kinetics400	94.50	99.58	10389	ckpt	log	json
tsm_k400_pretrained_r50_1x1x16_25e_ucf101_rgb	8	ResNet50	Kinetics400	94.58	99.37	10389	ckpt	log	json

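Notes:

Here, GPUs is the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the Linear Scaling Rule, when you use a different number of GPUs or a different number of videos per GPU, scale the learning rate in proportion to the batch size, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
Here, inference time is measured with the benchmark script, using the test-time frame sampling strategy. It covers only model inference and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU and a batch size (videos per GPU) of 1 to measure inference time.
The reference results are obtained by training with the same model settings on the original repository. The corresponding checkpoints can be downloaded here.
For the Something-Something datasets, there are two testing protocols: efficient (center crop x 1 clip) and accurate (three crops x 2 clips). Both follow the original repository. MMAction2 uses the efficient protocol by default in its configs; users can switch to the accurate protocol as follows: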

...
test_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,   # set `num_clips=8` when using 8 segments
        twice_sample=True,    # set `twice_sample=True` for Twice Sample in the accurate protocol
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    # dict(type='CenterCrop', crop_size=224),  # used in the efficient protocol
    dict(type='ThreeCrop', crop_size=256),  # used in the accurate protocol
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]

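When the MixUp and CutMix augmentations are used, the hyperparameter alpha is set to 0.2.
The Kinetics400 validation set used here contains 19796 videos; users can download these videos from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and the label map (class index to class name) are also provided.
Here, infer_ckpt means that the checkpoint was ported from TSM.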
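
To make the effect of `num_clips` and `twice_sample` in the test pipeline above more concrete, here is a rough sketch of segment-based test-time sampling. It assumes that Twice Sample appends a second set of offsets taken at the start of each segment, in addition to the centered ones; the function name is hypothetical and the exact rounding inside MMAction2's `SampleFrames` may differ.

```python
def sample_test_offsets(num_frames, num_clips, clip_len=1, frame_interval=1,
                        twice_sample=False):
    """Return start offsets of the test clips for a video with num_frames frames.

    Assumes num_frames >= clip_len * frame_interval.
    """
    ori_clip_len = clip_len * frame_interval
    # Length of each of the num_clips equal segments.
    avg_interval = (num_frames - ori_clip_len + 1) / num_clips
    base = [i * avg_interval for i in range(num_clips)]
    # First pass: one offset at the center of every segment.
    centered = [int(b + avg_interval / 2) for b in base]
    if twice_sample:
        # Second pass (assumed behaviour): one offset at the start of every segment.
        return centered + [int(b) for b in base]
    return centered


print(sample_test_offsets(32, 8))
print(sample_test_offsets(32, 8, twice_sample=True))
```

In this sketch, enabling `twice_sample` doubles the number of sampled offsets, which, combined with three crops, gives the 3 crops x 2 clips of the accurate protocol.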

For details on dataset preparation, refer to the Kinetics400, Something-Something V1 and Something-Something V2 parts of the dataset preparation documentation.

Train¶

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TSM model on the Kinetics-400 dataset deterministically, with periodic validation.

python tools/train.py configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py \
    --work-dir work_dirs/tsm_r50_1x1x8_50e_kinetics400_rgb \
    --validate --seed 0 --deterministic

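For more training details, refer to the Training setting part of the Getting Started guide.

Test¶

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TSM model on the Kinetics-400 dataset and dump the results to a json file.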

python tools/test.py configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the Getting Started guide.

TSN¶

Introduction¶

@inproceedings{wang2016temporal,
  title={Temporal segment networks: Towards good practices for deep action recognition},
  author={Wang, Limin and Xiong, Yuanjun and Wang, Zhe and Qiao, Yu and Lin, Dahua and Tang, Xiaoou and Van Gool, Luc},
  booktitle={European conference on computer vision},
  pages={20--36},
  year={2016},
  organization={Springer}
}

Model Zoo¶

UCF-101¶

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x3_75e_ucf101_rgb [1]	8	ResNet50	ImageNet	83.03	96.78	8332	ckpt	log	json

[1] The results reported here are on split1 of UCF-101.

Diving48¶

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_video_1x1x8_100e_diving48_rgb	8	ResNet50	ImageNet	71.27	95.74	5699	ckpt	log	json
tsn_r50_video_1x1x16_100e_diving48_rgb	8	ResNet50	ImageNet	76.75	96.95	5705	ckpt	log	json

HMDB51¶

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x8_50e_hmdb51_imagenet_rgb	8	ResNet50	ImageNet	48.95	80.19	21535	ckpt	log	json
tsn_r50_1x1x8_50e_hmdb51_kinetics400_rgb	8	ResNet50	Kinetics400	56.08	84.31	21535	ckpt	log	json
tsn_r50_1x1x8_50e_hmdb51_mit_rgb	8	ResNet50	Moments	54.25	83.86	21535	ckpt	log	json

Kinetics-400¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x3_100e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	70.60	89.26	x	x	4.3 (25x10 frames)	8344	ckpt	log	json
tsn_r50_1x1x3_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.42	89.03	x	x	x	8343	ckpt	log	json
tsn_r50_dense_1x1x5_50e_kinetics400_rgb	340x256	8x3	ResNet50	ImageNet	70.18	89.10	69.15	88.56	12.7 (8x10 frames)	7028	ckpt	log	json
tsn_r50_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8x2	ResNet50	ImageNet	70.91	89.51	x	x	10.7 (25x3 frames)	8344	ckpt	log	json
tsn_r50_320p_1x1x3_110e_kinetics400_flow	short-side 320	8x2	ResNet50	ImageNet	55.70	79.85	x	x	x	8471	ckpt	log	json
tsn_r50_320p_1x1x3_kinetics400_twostream [1: 1]*	x	x	ResNet50	ImageNet	72.76	90.52	x	x	x	x	x	x	x
tsn_r50_1x1x8_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	71.80	90.17	x	x	x	8343	ckpt	log	json
tsn_r50_320p_1x1x8_100e_kinetics400_rgb	short-side 320	8x3	ResNet50	ImageNet	72.41	90.55	x	x	11.1 (25x3 frames)	8344	ckpt	log	json
tsn_r50_320p_1x1x8_110e_kinetics400_flow	short-side 320	8x4	ResNet50	ImageNet	57.76	80.99	x	x	x	8473	ckpt	log	json
tsn_r50_320p_1x1x8_kinetics400_twostream [1: 1]*	x	x	ResNet50	ImageNet	74.64	91.77	x	x	x	x	x	x	x
tsn_r50_video_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	71.11	90.04	x	x	x	8343	ckpt	log	json
tsn_r50_dense_1x1x8_100e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	70.77	89.3	68.75	88.42	12.2 (8x10 frames)	8344	ckpt	log	json
tsn_r50_video_1x1x8_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	71.14	89.63	x	x	x	21558	ckpt	log	json
tsn_r50_video_dense_1x1x8_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.40	89.12	x	x	x	21553	ckpt	log	json

Here, MMAction2 uses [1: 1] to indicate that the predictions of the RGB and optical flow branches are fused with a 1: 1 ratio (without applying softmax before fusion).
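
A 1: 1 late fusion of raw (pre-softmax) class scores can be sketched as below. The function names and the example scores are made up for illustration and are not taken from MMAction2.

```python
def fuse_scores(rgb_scores, flow_scores, weights=(1.0, 1.0)):
    """Weighted sum of raw class scores from the RGB and optical flow branches."""
    w_rgb, w_flow = weights
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_scores, flow_scores)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up raw scores for a 3-class problem.
rgb = [2.0, 0.5, -1.0]   # RGB branch prefers class 0
flow = [0.1, 2.5, 0.0]   # flow branch prefers class 1
fused = fuse_scores(rgb, flow)  # weights 1: 1
print(fused, argmax(fused))
```

Because no softmax is applied first, a branch that is very confident (large score margin) has more influence on the fused prediction.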

Using third-party backbones in TSN¶

Users can train TSN with third-party backbones within the MMAction2 framework, for example:

[x] Backbones from MMClassification
[x] Backbones from TorchVision
[x] Backbones from pytorch-image-models (timm)

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	ckpt	log	json
tsn_rn101_32x4d_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8x2	ResNeXt101-32x4d [MMCls]	ImageNet	73.43	91.01	ckpt	log	json
tsn_dense161_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8x2	Densenet-161 [TorchVision]	ImageNet	72.78	90.75	ckpt	log	json
tsn_swin_transformer_video_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8	Swin Transformer Base [timm]	ImageNet	77.51	92.92	ckpt	log	json

For various reasons, some models in timm are not supported; see PR #880 for details.

Kinetics-400 data benchmark (8 GPUs, ResNet50, ImageNet pretrain; 3 segments)¶

The data benchmark compares:

Different data pre-processing methods: (1) videos resized to 340x256, (2) videos resized to short-side 320px, (3) videos resized to short-side 256px;
Different data augmentation methods: (1) MultiScaleCrop, (2) RandomResizedCrop;
Different testing protocols: (1) 25 frames x 10 crops, (2) 25 frames x 3 crops.

Config	Resolution	Training augmentation	Testing protocol	Top-1 acc	Top-5 acc	ckpt	log	json
tsn_r50_multiscalecrop_340x256_1x1x3_100e_kinetics400_rgb	340x256	MultiScaleCrop	25x10 frames	70.60	89.26	ckpt	log	json
x	340x256	MultiScaleCrop	25x3 frames	70.52	89.39	x	x	x
tsn_r50_randomresizedcrop_340x256_1x1x3_100e_kinetics400_rgb	340x256	RandomResizedCrop	25x10 frames	70.11	89.01	ckpt	log	json
x	340x256	RandomResizedCrop	25x3 frames	69.95	89.02	x	x	x
tsn_r50_multiscalecrop_320p_1x1x3_100e_kinetics400_rgb	short-side 320	MultiScaleCrop	25x10 frames	70.32	89.25	ckpt	log	json
x	short-side 320	MultiScaleCrop	25x3 frames	70.54	89.39	x	x	x
tsn_r50_randomresizedcrop_320p_1x1x3_100e_kinetics400_rgb	short-side 320	RandomResizedCrop	25x10 frames	70.44	89.23	ckpt	log	json
x	short-side 320	RandomResizedCrop	25x3 frames	70.91	89.51	x	x	x
tsn_r50_multiscalecrop_256p_1x1x3_100e_kinetics400_rgb	short-side 256	MultiScaleCrop	25x10 frames	70.42	89.03	ckpt	log	json
x	short-side 256	MultiScaleCrop	25x3 frames	70.79	89.42	x	x	x
tsn_r50_randomresizedcrop_256p_1x1x3_100e_kinetics400_rgb	short-side 256	RandomResizedCrop	25x10 frames	69.80	89.06	ckpt	log	json
x	short-side 256	RandomResizedCrop	25x3 frames	70.48	89.89	x	x	x

Kinetics-400 OmniSource experiments¶

Config	Resolution	Backbone	Pretrain	w. OmniSource	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x3_100e_kinetics400_rgb	340x256	ResNet50	ImageNet	✗	70.6	89.3	4.3 (25x10 frames)	8344	ckpt	log	json
x	340x256	ResNet50	ImageNet	✓	73.6	91.0	x	8344	ckpt	x	x
x	short-side 320	ResNet50	IG-1B [1]	✗	73.1	90.4	x	8344	ckpt	x	x
x	short-side 320	ResNet50	IG-1B [1]	✓	75.7	91.9	x	8344	ckpt	x	x

[1] MMAction2 uses the resnet50_swsl pretrained model provided by torch-hub.

Kinetics-600¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_video_1x1x8_100e_kinetics600_rgb	short-side 256	8x2	ResNet50	ImageNet	74.8	92.3	11.1 (25x3 frames)	8344	ckpt	log	json

Kinetics-700¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_video_1x1x8_100e_kinetics700_rgb	short-side 256	8x2	ResNet50	ImageNet	61.7	83.6	11.1 (25x3 frames)	8344	ckpt	log	json

Something-Something V1¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	18.55	44.80	17.53	44.29	10978	ckpt	log	json
tsn_r50_1x1x16_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	15.77	39.85	13.33	35.58	5691	ckpt	log	json

Something-Something V2¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x8_50e_sthv2_rgb	height 256	8	ResNet50	ImageNet	28.59	59.56	x	x	10966	ckpt	log	json
tsn_r50_1x1x16_50e_sthv2_rgb	height 256	8	ResNet50	ImageNet	20.89	49.16	x	x	8337	ckpt	log	json

Moments in Time¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x6_100e_mit_rgb	short-side 256	8x2	ResNet50	ImageNet	26.84	51.6	8339	ckpt	log	json

Multi-Moments in Time¶

Config	Resolution	GPUs	Backbone	Pretrain	mAP	GPU memory (M)	ckpt	log	json
tsn_r101_1x1x5_50e_mmit_rgb	short-side 256	8x2	ResNet101	ImageNet	61.09	10467	ckpt	log	json

ActivityNet v1.3¶

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_320p_1x1x8_50e_activitynet_video_rgb	short-side 320	8x1	ResNet50	Kinetics400	73.93	93.44	5692	ckpt	log	json
tsn_r50_320p_1x1x8_50e_activitynet_clip_rgb	short-side 320	8x1	ResNet50	Kinetics400	76.90	94.47	5692	ckpt	log	json
tsn_r50_320p_1x1x8_150e_activitynet_video_flow	340x256	8x2	ResNet50	Kinetics400	57.51	83.02	5780	ckpt	log	json
tsn_r50_320p_1x1x8_150e_activitynet_clip_flow	340x256	8x2	ResNet50	Kinetics400	59.51	82.69	5780	ckpt	log	json

HVU¶

Config[1]	Tag category	Resolution	GPUs	Backbone	Pretrain	mAP	HATNet[2]	HATNet-multi[2]	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_action_rgb	action	short-side 256	8x2	ResNet18	ImageNet	57.5	51.8	53.5	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_scene_rgb	scene	short-side 256	8	ResNet18	ImageNet	55.2	55.8	57.2	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_object_rgb	object	short-side 256	8	ResNet18	ImageNet	45.7	34.2	35.1	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_event_rgb	event	short-side 256	8	ResNet18	ImageNet	63.7	38.5	39.8	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_concept_rgb	concept	short-side 256	8	ResNet18	ImageNet	47.5	26.1	27.3	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_attribute_rgb	attribute	short-side 256	8	ResNet18	ImageNet	46.1	33.6	34.9	ckpt	log	json
-	All tags	short-side 256	-	ResNet18	ImageNet	52.6	40.0	41.3	-	-	-

[1] For simplicity, MMAction2 trains a separate model for each tag category as the HVU baseline.

[2] The HATNet and HATNet-multi results are taken from the paper Large Scale Holistic Video Understanding. HATNet is a two-branch convolutional network (one 2D branch, one 3D branch) and uses the same backbone (ResNet18) as MMAction2. Its inputs are 16-frame or 32-frame clips (longer than those MMAction2 uses), at a coarser resolution (112px instead of 224px). HATNet is trained on each task individually (one per tag category), while HATNet-multi is trained on multiple tasks. Since no open-source code or models are available for HATNet, only the accuracy reported in the original paper is listed.

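Notes:

Here, GPUs is the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the Linear Scaling Rule, when you use a different number of GPUs or a different number of videos per GPU, scale the learning rate in proportion to the batch size, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
Here, inference time is measured with the benchmark script, using the test-time frame sampling strategy. It covers only model inference and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU and a batch size (videos per GPU) of 1 to measure inference time.
The reference results are obtained by training with the same model settings on the original repository.
The Kinetics400 validation set used here contains 19796 videos; users can download these videos from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and the label map (class index to class name) are also provided.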
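
The Linear Scaling Rule from the notes above amounts to a one-line computation. The helper below is a small sketch with a hypothetical function name; it simply reproduces the lr=0.01 / lr=0.08 example given in the notes.

```python
def scale_lr(base_lr, base_gpus, base_videos_per_gpu, gpus, videos_per_gpu):
    """Scale the learning rate in proportion to the total batch size."""
    base_batch = base_gpus * base_videos_per_gpu
    new_batch = gpus * videos_per_gpu
    return base_lr * new_batch / base_batch


# Reference point from the notes: lr=0.01 for 4 GPUs x 2 video/gpu (batch size 8).
# Moving to 16 GPUs x 4 video/gpu (batch size 64) multiplies the rate by 8.
new_lr = scale_lr(0.01, base_gpus=4, base_videos_per_gpu=2, gpus=16, videos_per_gpu=4)
print(new_lr)
```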

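For details on dataset preparation, refer to the dataset preparation documentation.

Train¶

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TSN model on the Kinetics-400 dataset deterministically, with periodic validation.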

python tools/train.py configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py \
    --work-dir work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb \
    --validate --seed 0 --deterministic

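For more training details, refer to the Training setting part of the Getting Started guide.

Test¶

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TSN model on the Kinetics-400 dataset and dump the results to a json file.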

python tools/test.py configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, refer to the Test a dataset part of the Getting Started guide.

X3D¶

Introduction¶

@misc{feichtenhofer2020x3d,
      title={X3D: Expanding Architectures for Efficient Video Recognition},
      author={Christoph Feichtenhofer},
      year={2020},
      eprint={2004.04730},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

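Model Zoo¶

Kinetics-400¶

Config	Resolution	Backbone	Top-1 10-view	Top-1 30-view	Reference top-1 10-view	Reference top-1 30-view	ckpt
x3d_s_13x6x1_facebook_kinetics400_rgb	short-side 320	X3D_S	72.7	73.2	73.1 [SlowFast]	73.5 [SlowFast]	ckpt[1]
x3d_m_16x5x1_facebook_kinetics400_rgb	short-side 320	X3D_M	75.0	75.6	75.1 [SlowFast]	76.2 [SlowFast]	ckpt[1]

[1] These models are ported from the SlowFast repository and tested on the data used by MMAction2. Currently only testing of X3D models is supported; training will be provided soon.

Notes:

The reference results are obtained by testing the checkpoints released by the original repository on the same data.
The Kinetics400 validation set used here contains 19796 videos; users can download these videos from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and the label map (class index to class name) are also provided.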
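
A data list in the format described above (video ID, number of frames, class index per line) could be read as follows. This is only a sketch: it assumes the three fields are separated by whitespace, and the record type, function name, and sample lines are made up for illustration.

```python
from typing import NamedTuple


class VideoRecord(NamedTuple):
    video_id: str
    num_frames: int
    label: int


def parse_data_list(lines):
    """Parse lines of the form '<video ID> <number of frames> <class index>'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right so that a video ID containing spaces stays intact.
        video_id, num_frames, label = line.rsplit(maxsplit=2)
        records.append(VideoRecord(video_id, int(num_frames), int(label)))
    return records


# Made-up sample lines, not taken from the real list.
sample = ["abseiling/clip_0001 300 0", "", "zumba/clip_0042 250 399"]
records = parse_data_list(sample)
print(records)
```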

For details on dataset preparation, refer to the Kinetics400 part of the dataset preparation documentation.

Test¶

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the X3D model on the Kinetics-400 dataset and dump the results to a json file.

python tools/test.py configs/recognition/x3d/x3d_s_13x6x1_facebook_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob
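
The `--average-clips prob` option in the command above controls how per-clip predictions are combined into one video-level prediction. The sketch below shows the general idea under the assumption that `prob` averages softmax probabilities and `score` averages raw scores; the function names are hypothetical and this is not MMAction2's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_clips(clip_scores, mode="prob"):
    """Average per-clip class scores into a single video-level score vector.

    clip_scores: list of per-clip score lists, all with the same length.
    mode: "prob" applies softmax to each clip first; "score" averages raw scores.
    """
    if mode == "prob":
        clip_scores = [softmax(s) for s in clip_scores]
    elif mode != "score":
        raise ValueError(f"unsupported mode: {mode}")
    n = len(clip_scores)
    return [sum(column) / n for column in zip(*clip_scores)]


clips = [[1.0, 3.0], [3.0, 1.0]]
print(average_clips(clips, mode="score"))
print(average_clips(clips, mode="prob"))
```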

For more testing details, refer to the Test a dataset part of the Getting Started guide.

ResNet for Audio¶

Introduction¶

@article{xiao2020audiovisual,
  title={Audiovisual SlowFast Networks for Video Recognition},
  author={Xiao, Fanyi and Lee, Yong Jae and Grauman, Kristen and Malik, Jitendra and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2001.08740},
  year={2020}
}

Model Zoo¶

Kinetics-400¶

Config	n_fft	GPUs	Backbone	Pretrain	Top-1 acc/delta	Top-5 acc/delta	Inference time (video/s)	GPU memory (M)	ckpt	log	json
tsn_r18_64x1x1_100e_kinetics400_audio_feature	1024	8	ResNet18	None	19.7	35.75	x	1897	ckpt	log	json
tsn_r18_64x1x1_100e_kinetics400_audio_feature + tsn_r50_video_320p_1x1x3_100e_kinetics400_rgb	1024	8	ResNet(18+50)	None	71.50(+0.39)	90.18(+0.14)	x	x	x	x	x

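Notes:

Here, GPUs is the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs. According to the Linear Scaling Rule, when you use a different number of GPUs or a different number of videos per GPU, scale the learning rate in proportion to the batch size, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
Here, inference time is measured with the benchmark script, using the test-time frame sampling strategy. It covers only model inference and excludes IO and pre-processing time. For each config, MMAction2 uses 1 GPU and a batch size (videos per GPU) of 1 to measure inference time.
The Kinetics400 validation set used here contains 19796 videos; users can download these videos from the validation videos link. The corresponding data list (each line in the format: video ID, number of frames, class index) and the label map (class index to class name) are also provided.

For details on dataset preparation, refer to the Prepare audio part of the dataset preparation documentation.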

Train¶

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the ResNet model on the Kinetics400 audio dataset deterministically, with periodic validation.

python tools/train.py configs/audio_recognition/tsn_r18_64x1x1_100e_kinetics400_audio_feature.py \
    --work-dir work_dirs/tsn_r18_64x1x1_100e_kinetics400_audio_feature \
    --validate --seed 0 --deterministic

For more training details, refer to the Training setting part of the Getting Started guide.

Test¶

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the ResNet model on the Kinetics400 audio dataset and dump the results to a json file.

python tools/test.py configs/audio_recognition/tsn_r18_64x1x1_100e_kinetics400_audio_feature.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

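For more testing details, refer to the Test a dataset part of the Getting Started guide.

Fusion¶

For multi-modality fusion, users can use this script; the command looks roughly like:

python tools/analysis/report_accuracy.py --scores ${AUDIO_RESULT_PKL} ${VISUAL_RESULT_PKL} --datalist data/kinetics400/kinetics400_val_list_rawframes.txt --coefficient 1 1

AUDIO_RESULT_PKL: the output file of the audio model, saved by tools/test.py via the --out option.
VISUAL_RESULT_PKL: the output file of the visual model, saved by tools/test.py via the --out option.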
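
Conceptually, the fusion command above combines the per-video scores of the two modalities with the given coefficients and then evaluates the result. The following sketch illustrates that idea on made-up scores; it does not reproduce `report_accuracy.py`, and all function names and numbers are assumptions for illustration.

```python
def weighted_fusion(score_lists, coefficients):
    """Fuse per-video class scores from several modalities.

    score_lists:  one entry per modality; each entry is a list of per-video
                  class-score lists, in the same video order.
    coefficients: one weight per modality.
    """
    fused = []
    for per_video in zip(*score_lists):      # scores of one video, per modality
        fused.append([
            sum(c * s for c, s in zip(coefficients, class_scores))
            for class_scores in zip(*per_video)  # scores of one class, per modality
        ])
    return fused


def top_k_accuracy(scores, labels, k=1):
    """Fraction of videos whose true label is among the k highest scores."""
    hits = 0
    for video_scores, label in zip(scores, labels):
        ranked = sorted(range(len(video_scores)),
                        key=video_scores.__getitem__, reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Made-up scores for two videos and two classes.
audio = [[0.2, 0.8], [0.6, 0.4]]
visual = [[0.9, 0.1], [0.3, 0.7]]
labels = [0, 1]

fused = weighted_fusion([audio, visual], coefficients=[1, 1])
print(top_k_accuracy(audio, labels), top_k_accuracy(fused, labels))
```

In this toy example the audio scores alone misclassify both videos, while the 1: 1 fusion with the visual scores classifies both correctly.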