mmaction.apis
- mmaction.apis.detection_inference(det_config: str | Path | Config | Module, det_checkpoint: str, frame_paths: List[str], det_score_thr: float = 0.9, det_cat_id: int = 0, device: str | device = 'cuda:0', with_score: bool = False) -> tuple [source]
Detect human boxes given frame paths.
- Parameters:
det_config (Union[str, Path, mmengine.Config, torch.nn.Module]) – Det config file path or detection model object. It can be a Path, a config object, or a module object.
det_checkpoint (str) – Checkpoint path/url.
frame_paths (List[str]) – The paths of frames to do detection inference.
det_score_thr (float) – The threshold of human detection score. Defaults to 0.9.
det_cat_id (int) – The category id for human detection. Defaults to 0.
device (Union[str, torch.device]) – The desired device of returned tensor. Defaults to 'cuda:0'.
with_score (bool) – Whether to append detection score after box. Defaults to False.
- Returns:
A tuple of two items: the list of detected human boxes (List[np.ndarray]), and a list of data samples (List[DetDataSample]), generally used to visualize data.
- Return type:
tuple
- mmaction.apis.inference_recognizer(model: Module, video: str | dict, test_pipeline: Compose | None = None) -> ActionDataSample [source]
Run inference on a video with the recognizer.
- Parameters:
model (nn.Module) – The loaded recognizer.
video (Union[str, dict]) – The video file path or the results dictionary (the input of pipeline).
test_pipeline (Compose, optional) – The test pipeline. If not specified, the test pipeline in the config will be used. Defaults to None.
- Returns:
The inference results. Specifically, the predicted scores are saved at result.pred_score.
- Return type:
ActionDataSample
- mmaction.apis.inference_skeleton(model: Module, pose_results: List[dict], img_shape: Tuple[int], test_pipeline: Compose | None = None) -> ActionDataSample [source]
Run inference on pose results with the skeleton recognizer.
- Parameters:
model (nn.Module) – The loaded recognizer.
pose_results (List[dict]) – The pose estimation results dictionary (the results of pose_inference).
img_shape (Tuple[int]) – The original image shape used for inference with the skeleton recognizer.
test_pipeline (Compose, optional) – The test pipeline. If not specified, the test pipeline in the config will be used. Defaults to None.
- Returns:
The inference results. Specifically, the predicted scores are saved at result.pred_score.
- Return type:
ActionDataSample
- mmaction.apis.init_recognizer(config: str | Path | Config, checkpoint: str | None = None, device: str | device = 'cuda:0') -> Module [source]
Initialize a recognizer from config file.
- Parameters:
config (str or Path or mmengine.Config) – Config file path, Path or the config object.
checkpoint (str, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Defaults to None.
device (str | torch.device) – The desired device of returned tensor. Defaults to 'cuda:0'.
- Returns:
The constructed recognizer.
- Return type:
nn.Module
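A minimal usage sketch that chains init_recognizer and inference_recognizer with the signatures documented above. The helper name classify_video and all paths are hypothetical placeholders. mmaction is imported lazily, so defining the helper does not require the package to be installed.

```python
def classify_video(config_path, checkpoint_path, video_path, device="cuda:0"):
    """Hypothetical helper: build a recognizer and run it on one video."""
    # Imported lazily so this sketch can be loaded without mmaction installed.
    from mmaction.apis import inference_recognizer, init_recognizer

    # Passing checkpoint=None would build the model without loading weights.
    model = init_recognizer(config_path, checkpoint_path, device=device)
    # With test_pipeline left as None, the test pipeline in the config is used.
    result = inference_recognizer(model, video_path)
    # Per the reference above, predicted scores are saved at result.pred_score.
    return result.pred_score


# Placeholder paths for illustration only; replace them with real files.
EXAMPLE_ARGS = dict(
    config_path="path/to/recognizer_config.py",
    checkpoint_path="path/to/checkpoint.pth",
    video_path="path/to/video.mp4",
)
# scores = classify_video(**EXAMPLE_ARGS)  # needs mmaction, weights and a device
```

Pass device="cpu" to the helper when no GPU is available.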
- mmaction.apis.pose_inference(pose_config: str | Path | Config | Module, pose_checkpoint: str, frame_paths: List[str], det_results: List[ndarray], device: str | device = 'cuda:0') -> tuple [source]
Perform Top-Down pose estimation.
- Parameters:
pose_config (Union[str, Path, mmengine.Config, torch.nn.Module]) – Pose config file path or pose model object. It can be a Path, a config object, or a module object.
pose_checkpoint (str) – Checkpoint path/url.
frame_paths (List[str]) – The paths of frames to do pose inference.
det_results (List[np.ndarray]) – List of detected human boxes.
device (Union[str, torch.device]) – The desired device of returned tensor. Defaults to 'cuda:0'.
- Returns:
A tuple of two items: the list of pose estimation results (List[List[Dict[str, np.ndarray]]]), and a list of data samples (List[PoseDataSample]), generally used to visualize data.
- Return type:
tuple
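The three skeleton-related functions in this module are typically used in sequence: detect people, estimate their poses, then classify the action. The sketch below wires them together using only the documented signatures. The helper name recognize_from_frames is hypothetical, and it assumes that each returned tuple holds (results, data samples) in that order and that img_shape is given as (height, width).

```python
def recognize_from_frames(det_config, det_checkpoint, pose_config,
                          pose_checkpoint, skeleton_model, frame_paths,
                          img_shape, device="cuda:0"):
    """Hypothetical helper: detection -> pose estimation -> skeleton action."""
    # Imported lazily so this sketch can be loaded without mmaction installed.
    from mmaction.apis import (detection_inference, inference_skeleton,
                               pose_inference)

    # Assumption: the returned tuple is (results, data samples for drawing).
    det_results, _ = detection_inference(
        det_config, det_checkpoint, frame_paths,
        det_score_thr=0.9, det_cat_id=0, device=device)
    pose_results, _ = pose_inference(
        pose_config, pose_checkpoint, frame_paths, det_results, device=device)

    # Assumption: img_shape is the original frame shape as (height, width).
    return inference_skeleton(skeleton_model, pose_results, img_shape)
```

The skeleton_model argument is expected to be a recognizer already built with init_recognizer.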
mmaction.datasets
datasets
- class mmaction.datasets.AVADataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], exclude_file: str | None = None, label_file: str | None = None, filename_tmpl: str = 'img_{:05}.jpg', start_index: int = 1, proposal_file: str | None = None, person_det_score_thr: float = 0.9, num_classes: int = 81, custom_classes: List[int] | None = None, data_prefix: ConfigDict | dict = {'img': ''}, modality: str = 'RGB', test_mode: bool = False, num_max_proposals: int = 1000, timestamp_start: int = 900, timestamp_end: int = 1800, use_frames: bool = True, fps: int = 30, multilabel: bool = True, **kwargs) [source]
STAD dataset for spatial temporal action detection.
The dataset loads raw frames/video files, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.
This dataset can load information from the following files:
ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt / ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl
In particular, the proposal_file is a pickle file keyed by img_key (in the format {video_id},{timestamp}). Example of a pickle file:
{
    ...
    '0f39OWEqJ24,0902': array([[0.011 , 0.157 , 0.655 , 0.983 , 0.998163]]),
    '0f39OWEqJ24,0912': array([[0.054 , 0.088 , 0.91 , 0.998 , 0.068273],
                               [0.016 , 0.161 , 0.519 , 0.974 , 0.984025],
                               [0.493 , 0.283 , 0.981 , 0.984 , 0.983621]]),
    ...
}
- Parameters:
ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.
exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.
pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.
label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Defaults to None.
filename_tmpl (str) – Template for each filename. Defaults to 'img_{:05}.jpg'.
start_index (int) – Specify a start index for frames in consideration of different filename format. It should be set to 1 for AVA, since frame indices start from 1 in the AVA dataset. Defaults to 1.
proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Defaults to None.
person_det_score_thr (float) – The threshold of person detection scores; bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has a detection score larger than the threshold, the one with the largest detection score will be used. Defaults to 0.9.
num_classes (int) – The number of classes of the dataset. Defaults to 81. (AVA has 80 action classes; one extra dimension is added for potential usage.)
custom_classes (List[int], optional) – A subset of class ids from the original dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.
data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
modality (str) – Modality of data. Supports RGB and Flow. Defaults to RGB.
num_max_proposals (int) – Max proposals number to store. Defaults to 1000.
timestamp_start (int) – The start point of included timestamps. The default value follows the official website. Defaults to 900.
timestamp_end (int) – The end point of included timestamps. The default value follows the official website. Defaults to 1800.
use_frames (bool) – Whether to use rawframes as input. Defaults to True.
fps (int) – Overrides the default FPS for the dataset. If set to 1, timestamps are counted by frame (e.g. the MultiSports dataset); otherwise they are counted by second. Defaults to 30.
multilabel (bool) – Determines whether it is a multilabel recognition task. Defaults to True.
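The proposal handling described above (img_key format, person_det_score_thr with a fallback to the best box) can be illustrated with a small standalone sketch. This is not the dataset's own code: the helper names are hypothetical, plain lists stand in for NumPy arrays, the last column is assumed to be the detection score, and a score equal to the threshold is assumed to pass.

```python
def split_img_key(img_key):
    """Split an img_key of the form '{video_id},{timestamp}'."""
    video_id, timestamp = img_key.rsplit(",", 1)
    return video_id, int(timestamp)


def select_proposals(boxes, score_thr=0.9):
    """Keep boxes whose score (last column) reaches score_thr.

    If no box qualifies, fall back to the single highest-scoring box.
    """
    if not boxes:
        return []
    kept = [box for box in boxes if box[4] >= score_thr]
    if not kept:
        kept = [max(boxes, key=lambda box: box[4])]
    return kept


# Values copied from the pickle example above.
proposals = {
    "0f39OWEqJ24,0902": [[0.011, 0.157, 0.655, 0.983, 0.998163]],
    "0f39OWEqJ24,0912": [
        [0.054, 0.088, 0.91, 0.998, 0.068273],
        [0.016, 0.161, 0.519, 0.974, 0.984025],
        [0.493, 0.283, 0.981, 0.984, 0.983621],
    ],
}

for img_key, boxes in proposals.items():
    video_id, timestamp = split_img_key(img_key)
    print(video_id, timestamp, len(select_proposals(boxes, score_thr=0.9)))
```

With the default threshold of 0.9, the low-scoring first box of the second key is dropped while the other two are kept.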
- class mmaction.datasets.AVAKineticsDataset(ann_file: str, exclude_file: str, pipeline: List[ConfigDict | dict | Callable], label_file: str, filename_tmpl: str = 'img_{:05}.jpg', start_index: int = 0, proposal_file: str | None = None, person_det_score_thr: float = 0.9, num_classes: int = 81, custom_classes: List[int] | None = None, data_prefix: ConfigDict | dict = {'img': ''}, modality: str = 'RGB', test_mode: bool = False, num_max_proposals: int = 1000, timestamp_start: int = 900, timestamp_end: int = 1800, fps: int = 30, **kwargs) [source]
AVA-Kinetics dataset for spatial temporal detection.
Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.
This dataset can load information from the following files:
ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt / ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl
In particular, the proposal_file is a pickle file keyed by img_key (in the format {video_id},{timestamp}). Example of a pickle file:
{
    ...
    '0f39OWEqJ24,0902': array([[0.011 , 0.157 , 0.655 , 0.983 , 0.998163]]),
    '0f39OWEqJ24,0912': array([[0.054 , 0.088 , 0.91 , 0.998 , 0.068273],
                               [0.016 , 0.161 , 0.519 , 0.974 , 0.984025],
                               [0.493 , 0.283 , 0.981 , 0.984 , 0.983621]]),
    ...
}
- Parameters:
ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.
exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.
pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.
label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt.
filename_tmpl (str) – Template for each filename. Defaults to 'img_{:05}.jpg'.
start_index (int) – Specify a start index for frames in consideration of different filename format. When taking frames as input, it should be set to 0, since frames are indexed from 0. Defaults to 0.
proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Defaults to None.
person_det_score_thr (float) – The threshold of person detection scores; bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has a detection score larger than the threshold, the one with the largest detection score will be used. Defaults to 0.9.
num_classes (int) – The number of classes of the dataset. Defaults to 81. (AVA has 80 action classes; one extra dimension is added for potential usage.)
custom_classes (List[int], optional) – A subset of class ids from the original dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.
data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
modality (str) – Modality of data. Supports RGB and Flow. Defaults to RGB.
num_max_proposals (int) – Max proposals number to store. Defaults to 1000.
timestamp_start (int) – The start point of included timestamps. The default value follows the official website. Defaults to 900.
timestamp_end (int) – The end point of included timestamps. The default value follows the official website. Defaults to 1800.
fps (int) – Overrides the default FPS for the dataset. Defaults to 30.
- class mmaction.datasets.ActivityNetDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict | None = {'video': ''}, test_mode: bool = False, **kwargs) [source]
ActivityNet dataset for temporal action localization.
The dataset loads raw features and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a json file with multiple objects. Each object uses the name of a video as its key, and its value holds the total frames of the video, the total seconds of the video, the annotations of the video, the feature frames (frames covered by features) of the video, fps and rfps.
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
- class mmaction.datasets.AudioDataset(ann_file: str, pipeline: List[Dict | Callable], data_prefix: Dict = {'audio': ''}, multi_class: bool = False, num_classes: int | None = None, **kwargs) [source]
Audio dataset for action recognition.
The ann_file is a text file with multiple lines, and each line indicates a sample audio or extracted audio feature with the filepath, total frames of the raw video and label, which are split with a whitespace.
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (dict) – Path to a directory where audios are held. Defaults to dict(audio='').
multi_class (bool) – Determines whether it is a multi-class recognition dataset. Defaults to False.
num_classes (int, optional) – Number of classes in the dataset. Defaults to None.
- class mmaction.datasets.BaseActionDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]
Base class for datasets.
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.
data_prefix (dict or ConfigDict, optional) – Path to a directory where videos are held. Defaults to dict(prefix='').
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.
num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.
start_index (int) – Specify a start index for frames in consideration of different filename format. When taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.
modality (str) – Modality of data. Supports RGB, Flow, Pose and Audio. Defaults to RGB.
- class mmaction.datasets.CharadesSTADataset(ann_file: str, pipeline: List[dict | Callable], word2id_file: str, fps_file: str, duration_file: str, num_frames_file: str, window_size: int, ft_overlap: float, data_prefix: ConfigDict | dict | None = {'video': ''}, test_mode: bool = False, **kwargs) [source]
- class mmaction.datasets.MSRVTTRetrieval(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]
MSR-VTT Retrieval dataset.
- class mmaction.datasets.MSRVTTVQA(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]
MSR-VTT Video Question Answering dataset.
- class mmaction.datasets.MSRVTTVQAMC(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]
MSR-VTT VQA multiple choices dataset.
- class mmaction.datasets.PoseDataset(ann_file: str, pipeline: List[Dict | Callable], split: str | None = None, valid_ratio: float | None = None, box_thr: float = 0.5, **kwargs) [source]
Pose dataset for action recognition.
The dataset loads pose data and applies specified transforms to return a dict containing pose information.
The ann_file is a pickle file that contains a list of annotations. The fields of an annotation include frame_dir (video_id), total_frames, label, kp and kpscore.
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
split (str, optional) – The dataset split used. For UCF101 and HMDB51, allowed choices are ‘train1’, ‘test1’, ‘train2’, ‘test2’, ‘train3’, ‘test3’. For NTURGB+D, allowed choices are ‘xsub_train’, ‘xsub_val’, ‘xview_train’, ‘xview_val’. For NTURGB+D 120, allowed choices are ‘xsub_train’, ‘xsub_val’, ‘xset_train’, ‘xset_val’. For FineGYM, allowed choices are ‘train’, ‘val’. Defaults to None.
valid_ratio (float, optional) – The valid_ratio for videos in KineticsPose. A video with n frames is a valid training sample only if at least n * valid_ratio frames have human pose. None means not applicable (only applicable to Kinetics Pose). Defaults to None.
box_thr (float) – The threshold for human proposals. Only boxes with a confidence score larger than box_thr are kept. None means not applicable (only applicable to Kinetics). Allowed choices are 0.5, 0.6, 0.7, 0.8, 0.9. Defaults to 0.5.
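The valid_ratio rule can be made concrete with a short sketch. The frames_with_pose field and the helper name are hypothetical and used only for illustration; the actual annotation layout may record this count differently.

```python
def filter_by_valid_ratio(annotations, valid_ratio=None):
    """Keep samples where at least total_frames * valid_ratio frames have a pose."""
    if valid_ratio is None:
        # None means the filter is not applied.
        return list(annotations)
    return [
        ann for ann in annotations
        if ann["frames_with_pose"] >= ann["total_frames"] * valid_ratio
    ]


# 'frames_with_pose' is a hypothetical field added for this illustration.
annotations = [
    {"frame_dir": "video_a", "total_frames": 100, "label": 0, "frames_with_pose": 90},
    {"frame_dir": "video_b", "total_frames": 100, "label": 1, "frames_with_pose": 10},
]

kept = filter_by_valid_ratio(annotations, valid_ratio=0.5)
print([ann["frame_dir"] for ann in kept])
```

Here only video_a survives, since 90 of its 100 frames contain a pose while video_b has just 10.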
- class mmaction.datasets.RawframeDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict = {'img': ''}, filename_tmpl: str = 'img_{:05}.jpg', with_offset: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 1, modality: str = 'RGB', test_mode: bool = False, **kwargs) [source]
Rawframe dataset for action recognition.
The dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines. Each line gives the directory holding the frames of a video, the total frames of the video and the label of the video, separated by whitespace. Example of an annotation file:
some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
Example of a multi-class annotation file:
some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3
Example of a with_offset annotation file (clips from long videos). Each line gives the directory holding the frames of a video, the index of the start frame, the total frames of the video clip and the label of the video clip, separated by whitespace:
some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.
data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').
filename_tmpl (str) – Template for each filename. Defaults to img_{:05}.jpg.
with_offset (bool) – Determines whether the offset information is in ann_file. Defaults to False.
multi_class (bool) – Determines whether it is a multi-class recognition dataset. Defaults to False.
num_classes (int, optional) – Number of classes in the dataset. Defaults to None.
start_index (int) – Specify a start index for frames in consideration of different filename format. When taking frames as input, it should be set to 1, since raw frames count from 1. Defaults to 1.
modality (str) – Modality of data. Supports RGB and Flow. Defaults to RGB.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
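The three annotation layouts above differ only in whether an offset column is present and how many labels follow. The parser below is a minimal sketch of that reading, not the dataset's own loader; the function name and the output keys (frame_dir, offset, total_frames, label) are assumptions chosen for illustration.

```python
def parse_rawframe_line(line, with_offset=False, multi_class=False):
    """Parse one whitespace-separated annotation line into a dict."""
    fields = line.split()
    info = {"frame_dir": fields[0]}
    idx = 1
    if with_offset:
        # Layout: frame_dir, start-frame index, total_frames, label(s).
        info["offset"] = int(fields[idx])
        idx += 1
    info["total_frames"] = int(fields[idx])
    idx += 1
    labels = [int(x) for x in fields[idx:]]
    if not labels:
        raise ValueError(f"missing label in line: {line!r}")
    # Multi-class lines keep every label; otherwise the first label is used.
    info["label"] = labels if multi_class else labels[0]
    return info


print(parse_rawframe_line("some/directory-1 163 1"))
print(parse_rawframe_line("some/directory-4 234 2 4 6 8", multi_class=True))
print(parse_rawframe_line("some/directory-1 12 163 3", with_offset=True))
```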
- class mmaction.datasets.RepeatAugDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict = {'video': ''}, num_repeats: int = 4, sample_once: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]
Video dataset for action recognition using repeated augmentation. See https://arxiv.org/pdf/1901.09335.pdf.
The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines. Each line gives a sample video with its filepath and label, separated by whitespace. Example of an annotation file:
some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.
data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').
num_repeats (int) – Number of times one video is repeated in a batch. Defaults to 4.
sample_once (bool) – Determines whether to use the same frame indices for repeated samples. Defaults to False.
multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.
num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.
start_index (int) – Specify a start index for frames in consideration of different filename format. When taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.
modality (str) – Modality of data. Supports RGB and Flow. Defaults to RGB.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
- class mmaction.datasets.VideoDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict = {'video': ''}, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', test_mode: bool = False, delimiter: str = ' ', **kwargs) [source]
Video dataset for action recognition.
The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines. Each line gives a sample video with its filepath and label, separated by whitespace. Example of an annotation file:
some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
- Parameters:
ann_file (str) – Path to the annotation file.
pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.
data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').
multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.
num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.
start_index (int) – Specify a start index for frames in consideration of different filename format. When taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.
modality (str) – Modality of data. Supports 'RGB' and 'Flow'. Defaults to 'RGB'.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
delimiter (str) – Delimiter for the annotation file. Defaults to ' ' (whitespace).
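As a rough illustration of how these arguments fit together, the following config-style dict describes a VideoDataset with a short pipeline built from transforms listed in the transforms part of this reference. It assumes the usual OpenMMLab convention of selecting a class with a type key; all paths and numeric values are placeholders, and the pipeline is deliberately partial rather than a complete training or test recipe.

```python
# Placeholder paths and values for illustration; replace with real ones.
video_dataset_cfg = dict(
    type="VideoDataset",
    ann_file="data/example/train_list.txt",
    data_prefix=dict(video="data/example/videos"),
    delimiter=" ",
    test_mode=False,
    pipeline=[
        # Open the video, pick frame indices, then decode those frames.
        dict(type="DecordInit"),
        dict(type="DenseSampleFrames", clip_len=8, frame_interval=8, num_clips=1),
        dict(type="DecordDecode"),
        # Spatial crop and final layout.
        dict(type="CenterCrop", crop_size=224),
        dict(type="FormatShape", input_format="NCTHW"),
    ],
)

print([step["type"] for step in video_dataset_cfg["pipeline"]])
```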
- class mmaction.datasets.VideoTextDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]
Video dataset for video-text task like video retrieval.
transforms
- class mmaction.datasets.transforms.ArrayDecode [source]
Load and decode frames with given indices from a 4D array.
Required keys are “array” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- class mmaction.datasets.transforms.AudioFeatureSelector(fixed_length: int = 128) [source]
Sample the audio feature w.r.t. the frames selected.
Required Keys:
audios
frame_inds
num_clips
length
total_frames
Modified Keys:
audios
Added Keys:
audios_shape
- Parameters:
fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Defaults to 128.
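The truncate-or-pad behaviour of fixed_length can be sketched on plain lists. This is an illustration only: the helper name is hypothetical, and padding with zeros at the end is an assumption, since the text above does not say which value or side is used.

```python
def fix_length(features, fixed_length=128, pad_value=0.0):
    """Truncate or pad a list of feature rows to exactly fixed_length rows."""
    if len(features) >= fixed_length:
        return features[:fixed_length]
    width = len(features[0]) if features else 0
    # Assumption: missing rows are appended at the end and filled with zeros.
    padding = [[pad_value] * width for _ in range(fixed_length - len(features))]
    return features + padding


short = [[1.0, 2.0]] * 3
long = [[1.0, 2.0]] * 200
print(len(fix_length(short)), len(fix_length(long)))
```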
- class mmaction.datasets.transforms.BuildPseudoClip(clip_len) [source]
Build pseudo clips with one single image by repeating it n times.
Required key is “imgs”; added or modified keys are “imgs”, “num_clips” and “clip_len”.
- Parameters:
clip_len (int) – Frames of the generated pseudo clips.
- class mmaction.datasets.transforms.CLIPTokenize [source]
Tokenize text and convert to tensor.
- transform(results: Dict) -> Dict [source]
The transform function of CLIPTokenize.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
- class mmaction.datasets.transforms.CenterCrop(crop_size, lazy=False) [source]
Crop the center area from images.
Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox”, “lazy” and “img_shape”. The required key in “lazy” is “crop_bbox”; the added or modified key is “crop_bbox”.
- Parameters:
crop_size (int | tuple[int]) – (w, h) of crop size.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
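To show how a centered crop box follows from crop_size, here is a small arithmetic sketch. It is not the transform's implementation: the helper name is hypothetical, img_shape is assumed to be (height, width), and the box is assumed to be laid out as (left, top, right, bottom).

```python
def center_crop_bbox(img_shape, crop_size):
    """Return (left, top, right, bottom) of a crop centered in the image.

    img_shape is (height, width); crop_size is an int or a (w, h) tuple.
    """
    img_h, img_w = img_shape
    crop_w, crop_h = (crop_size, crop_size) if isinstance(crop_size, int) else crop_size
    left = (img_w - crop_w) // 2
    top = (img_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


print(center_crop_bbox((256, 340), 224))
```

For a 256 x 340 (height x width) frame and crop_size=224 this prints (58, 16, 282, 240).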
- class mmaction.datasets.transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1) [source]
Apply color jitter to each image.
Required keys are “imgs”, added or modified keys are “imgs”.
- Parameters:
brightness (float | tuple[float]) – The jitter range for brightness, if set as a float, the range will be (1 - brightness, 1 + brightness). Default: 0.5.
contrast (float | tuple[float]) – The jitter range for contrast, if set as a float, the range will be (1 - contrast, 1 + contrast). Default: 0.5.
saturation (float | tuple[float]) – The jitter range for saturation, if set as a float, the range will be (1 - saturation, 1 + saturation). Default: 0.5.
hue (float | tuple[float]) – The jitter range for hue, if set as a float, the range will be (-hue, hue). Default: 0.1.
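The float-to-range rule above can be written out directly. In this sketch the helper name is hypothetical, and drawing the factor uniformly from the range is an assumption made for illustration rather than a statement about the transform's internals.

```python
import random


def jitter_range(value, center=1.0):
    """Turn a float into (center - value, center + value); pass tuples through."""
    if isinstance(value, (int, float)):
        return (center - value, center + value)
    low, high = value
    return (low, high)


# Brightness, contrast and saturation are centered on 1; hue is centered on 0.
brightness_range = jitter_range(0.5)
hue_range = jitter_range(0.1, center=0.0)

# Assumption: one factor is drawn uniformly from the range.
factor = random.uniform(*brightness_range)
print(brightness_range, hue_range)
```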
- class mmaction.datasets.transforms.DecompressPose(squeeze: bool = True, max_person: int = 10) [source]
Load Compressed Pose.
Required Keys:
frame_inds
total_frames
keypoint
anno_inds (optional)
Modified Keys:
keypoint
frame_inds
Added Keys:
keypoint_score
num_person
- Parameters:
squeeze (bool) – Whether to remove frames with no human pose. Defaults to True.
max_person (int) – The max number of persons in a frame. Defaults to 10.
- class mmaction.datasets.transforms.DecordDecode(mode: str = 'accurate') [source]
Use decord to decode the video.
Decord: https://github.com/dmlc/decord
Required Keys:
video_reader
frame_inds
Added Keys:
imgs
original_shape
img_shape
- Parameters:
mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return key frames, which may be duplicated and inaccurate; this mode is more suitable for large scene-based video datasets. Defaults to 'accurate'.
- class mmaction.datasets.transforms.DecordInit(io_backend: str = 'disk', num_threads: int = 1, **kwargs) [source]
Use decord to initialize the video_reader.
Decord: https://github.com/dmlc/decord
Required Keys:
filename
Added Keys:
video_reader
total_frames
fps
- Parameters:
io_backend (str) – The io backend where frames are stored. Defaults to 'disk'.
num_threads (int) – Number of threads used to decode the video. Defaults to 1.
kwargs (dict) – Args for file client.
- class mmaction.datasets.transforms.DenseSampleFrames(*args, sample_range: int = 64, num_sample_positions: int = 10, **kwargs) [source]
Select frames from the video by dense sample strategy.
Required keys:
total_frames
start_index
Added keys:
frame_inds
clip_len
frame_interval
num_clips
- Parameters:
clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.
num_clips (int) – Number of clips to be sampled. Defaults to 1.
sample_range (int) – Total sample range for dense sample. Defaults to 64.
num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Defaults to 10. That is to say, by default, there are at least 10 clips for one input sample in test mode.
temporal_jitter (bool) – Whether to apply temporal jittering. Defaults to False.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
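One plausible reading of the dense sample strategy in training mode is sketched below: a window of sample_range frames is placed at a random position, clip starts are spread evenly inside it, and each clip takes clip_len frames spaced by frame_interval. This is an illustrative approximation with a hypothetical function name, not the transform's actual code, and it does not model test mode or temporal jittering.

```python
import random


def dense_sample_frame_inds(total_frames, clip_len, frame_interval=1,
                            num_clips=1, sample_range=64, start_index=0,
                            rng=random):
    """Sketch of dense frame sampling in training mode."""
    # Candidate start positions of the dense window inside the video.
    num_positions = max(1, 1 + total_frames - sample_range)
    window_start = rng.randrange(num_positions)
    # Clip starts are spread evenly across the window.
    step = sample_range // num_clips
    frame_inds = []
    for clip in range(num_clips):
        clip_start = window_start + clip * step
        for i in range(clip_len):
            # Wrap around so short videos still yield valid indices.
            ind = (clip_start + i * frame_interval) % total_frames
            frame_inds.append(ind + start_index)
    return frame_inds


inds = dense_sample_frame_inds(total_frames=300, clip_len=4, frame_interval=2,
                               num_clips=2, rng=random.Random(0))
print(inds)
```

The result holds num_clips * clip_len indices, all clustered inside one 64-frame window instead of being spread over the whole video.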
- class mmaction.datasets.transforms.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, left_kp=None, right_kp=None, lazy=False) [source]
Flip the input images with a probability.
Reverse the order of elements in the given imgs with a specific direction. The shape of the imgs is preserved, but the elements are reordered.
Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “lazy” and “flip_direction”. There are no required keys in “lazy”; added or modified keys are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations, to make sure crop_quadruple is calculated properly.
- Parameters:
flip_ratio (float) – Probability of implementing flip. Default: 0.5.
direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.
flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.
left_kp (list[int]) – Indexes of left keypoints, used to flip keypoints. Default: None.
right_kp (list[int]) – Indexes of right keypoints, used to flip keypoints. Default: None.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
- class mmaction.datasets.transforms.FormatAudioShape(input_format: str) [source]
Format final audio shape to the given input_format.
Required Keys:
audios
Modified Keys:
audios
Added Keys:
input_shape
- Parameters:
input_format (str) – Define the final audio format.
- class mmaction.datasets.transforms.FormatGCNInput(num_person: int = 2, mode: str = 'zero') [source]
Format final skeleton shape.
Required Keys:
keypoint
keypoint_score (optional)
num_clips (optional)
Modified Key:
keypoint
- Parameters:
num_person (int) – The maximum number of people. Defaults to 2.
mode (str) – The padding mode. Defaults to 'zero'.
- transform(results: Dict) -> Dict [source]
The transform function of FormatGCNInput.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
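The interplay of num_person and the 'zero' padding mode can be pictured with nested lists. The sketch below is an assumption-laden illustration, not the transform's code: the helper name is hypothetical, the keypoint layout is assumed to be [person][frame][joint][coordinate], and extra persons beyond num_person are assumed to be dropped.

```python
def pad_persons(keypoint, num_person=2):
    """Pad with all-zero persons, or truncate, to exactly num_person entries."""
    if not keypoint:
        raise ValueError("expected at least one person")
    if len(keypoint) >= num_person:
        return keypoint[:num_person]
    num_frames = len(keypoint[0])
    num_joints = len(keypoint[0][0])
    num_coords = len(keypoint[0][0][0])
    padded = list(keypoint)
    while len(padded) < num_person:
        # 'zero' mode: a missing person is filled with zeros of the same shape.
        padded.append([[[0.0] * num_coords for _ in range(num_joints)]
                       for _ in range(num_frames)])
    return padded


# One person, two frames, three joints, (x, y) coordinates.
one_person = [[[[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]] for _ in range(2)]]
padded = pad_persons(one_person, num_person=2)
print(len(padded), len(padded[1]), len(padded[1][0]), padded[1][0][0])
```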
- class mmaction.datasets.transforms.FormatShape(input_format: str, collapse: bool = False) [source]
Format final imgs shape to the given input_format.
Required keys:
imgs (optional)
heatmap_imgs (optional)
modality (optional)
num_clips
clip_len
Modified Keys:
imgs
Added Keys:
input_shape
heatmap_input_shape (optional)
- Parameters:
input_format (str) – Define the final data format.
collapse (bool) – To collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set as True when training and testing detectors. Defaults to False.
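A shape-only sketch of the NCTHW layout and the collapse option follows. It assumes that N counts clips (times crops, if any), that frames arrive as a flat sequence of H x W x C images, and that collapsing simply drops a leading axis of size 1; the helper name is hypothetical and no pixel data is touched.

```python
def ncthw_shape(num_frames, height, width, channels, clip_len, collapse=False):
    """Compute the output shape for input_format='NCTHW' (shape arithmetic only)."""
    if num_frames % clip_len != 0:
        raise ValueError("num_frames must be a multiple of clip_len")
    # Assumption: N is the number of clips (times crops) packed in the sample.
    n = num_frames // clip_len
    shape = (n, channels, clip_len, height, width)
    if collapse and n == 1:
        # NCTHW -> CTHW when N is 1.
        shape = shape[1:]
    return shape


print(ncthw_shape(8, 224, 224, 3, clip_len=8, collapse=True))
print(ncthw_shape(24, 224, 224, 3, clip_len=8))
```

The first call yields a 4D CTHW shape, which matches the note above that collapse should be set to True when training and testing detectors.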
- class mmaction.datasets.transforms.Fuse [source]
Fuse lazy operations.
Fusion order: crop -> resize -> flip
Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.
- class mmaction.datasets.transforms.GenSkeFeat(dataset: str = 'nturgb+d', feats: List[str] = ['j'], axis: int = -1)[source]¶
Unified interface for generating multi-stream skeleton features.
Required Keys:
keypoint
keypoint_score (optional)
- Parameters:
dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.
feats (list[str]) – The list of the keys of features. Defaults to ['j'].
axis (int) – The axis along which the features will be joined. Defaults to -1.
- transform(results: Dict) Dict [source]¶
The transform function of GenSkeFeat.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
- class mmaction.datasets.transforms.GenerateLocalizationLabels[source]¶
Load video label for localizer with given video_name list.
Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.
- class mmaction.datasets.transforms.GeneratePoseTarget(sigma: float = 0.6, use_score: bool = True, with_kp: bool = True, with_limb: bool = False, skeletons: Tuple[Tuple[int]] = ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), double: bool = False, left_kp: Tuple[int] = (1, 3, 5, 7, 9, 11, 13, 15), right_kp: Tuple[int] = (2, 4, 6, 8, 10, 12, 14, 16), left_limb: Tuple[int] = (0, 2, 4, 5, 6, 10, 11, 12), right_limb: Tuple[int] = (1, 3, 7, 8, 9, 13, 14, 15), scaling: float = 1.0)[source]¶
Generate pseudo heatmaps based on joint coordinates and confidence.
Required Keys:
keypoint
keypoint_score (optional)
img_shape
Added Keys:
imgs (optional)
heatmap_imgs (optional)
- Parameters:
sigma (float) – The sigma of the generated gaussian map. Defaults to 0.6.
use_score (bool) – Use the confidence score of keypoints as the maximum of the gaussian maps. Defaults to True.
with_kp (bool) – Generate pseudo heatmaps for keypoints. Defaults to True.
with_limb (bool) – Generate pseudo heatmaps for limbs. At least one of ‘with_kp’ and ‘with_limb’ should be True. Defaults to False.
skeletons (tuple[tuple]) – The definition of human skeletons. Defaults to ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), which is the definition of COCO-17p skeletons.
double (bool) – Output both original heatmaps and flipped heatmaps. Defaults to False.
left_kp (tuple[int]) – Indexes of left keypoints, which are used when flipping heatmaps. Defaults to (1, 3, 5, 7, 9, 11, 13, 15), which are the left keypoints in COCO-17p.
right_kp (tuple[int]) – Indexes of right keypoints, which are used when flipping heatmaps. Defaults to (2, 4, 6, 8, 10, 12, 14, 16), which are the right keypoints in COCO-17p.
left_limb (tuple[int]) – Indexes of left limbs, which are used when flipping heatmaps. Defaults to (0, 2, 4, 5, 6, 10, 11, 12), which are the left limbs of the skeletons we defined for COCO-17p.
right_limb (tuple[int]) – Indexes of right limbs, which are used when flipping heatmaps. Defaults to (1, 3, 7, 8, 9, 13, 14, 15), which are the right limbs of the skeletons we defined for COCO-17p.
scaling (float) – The ratio to scale the heatmaps. Defaults to 1.0.
- gen_an_aug(results: Dict) ndarray [source]¶
Generate pseudo heatmaps for all frames.
- Parameters:
results (dict) – The dictionary that contains all info of a sample.
- Returns:
The generated pseudo heatmaps.
- Return type:
np.ndarray
- generate_a_heatmap(arr: ndarray, centers: ndarray, max_values: ndarray) None [source]¶
Generate pseudo heatmap for one keypoint in one frame.
- Parameters:
arr (np.ndarray) – The array to store the generated heatmaps. Shape: img_h * img_w.
centers (np.ndarray) – The coordinates of corresponding keypoints (of multiple persons). Shape: M * 2.
max_values (np.ndarray) – The max values of each keypoint. Shape: M.
- generate_a_limb_heatmap(arr: ndarray, starts: ndarray, ends: ndarray, start_values: ndarray, end_values: ndarray) None [source]¶
Generate pseudo heatmap for one limb in one frame.
- Parameters:
arr (np.ndarray) – The array to store the generated heatmaps. Shape: img_h * img_w.
starts (np.ndarray) – The coordinates of one keypoint in the corresponding limbs. Shape: M * 2.
ends (np.ndarray) – The coordinates of the other keypoint in the corresponding limbs. Shape: M * 2.
start_values (np.ndarray) – The max values of one keypoint in the corresponding limbs. Shape: M.
end_values (np.ndarray) – The max values of the other keypoint in the corresponding limbs. Shape: M.
- generate_heatmap(arr: ndarray, kps: ndarray, max_values: ndarray) None [source]¶
Generate pseudo heatmap for all keypoints and limbs in one frame (if needed).
- Parameters:
arr (np.ndarray) – The array to store the generated heatmaps. Shape: V * img_h * img_w.
kps (np.ndarray) – The coordinates of keypoints in this frame. Shape: M * V * 2.
max_values (np.ndarray) – The confidence score of each keypoint. Shape: M * V.
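The keypoint heatmap described above can be illustrated with a small plain-Python sketch. It is not the library's implementation: the helper name `gaussian_heatmap`, the full-image loop, and the choice to merge persons with an element-wise maximum are assumptions made for illustration.

```python
import math


def gaussian_heatmap(img_h, img_w, centers, max_values, sigma=0.6):
    """Return an img_h x img_w heatmap (list of rows) for one keypoint type.

    centers: list of (x, y) coordinates, one per person.
    max_values: peak value (e.g. the confidence score) for each person.
    """
    heatmap = [[0.0] * img_w for _ in range(img_h)]
    for (cx, cy), peak in zip(centers, max_values):
        if peak <= 0:
            continue  # skip keypoints that carry no confidence
        for y in range(img_h):
            for x in range(img_w):
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                value = peak * math.exp(-dist_sq / (2 * sigma ** 2))
                # Assumption: overlapping persons are merged by maximum.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


heatmap = gaussian_heatmap(5, 5, centers=[(2, 2)], max_values=[0.9])
```

With `use_score=True` the peak of each bump equals the keypoint confidence, so the value at the keypoint location here is 0.9 and it decays with distance.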
- class mmaction.datasets.transforms.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]¶
Load and decode images.
Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- Parameters:
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.
kwargs (dict, optional) – Arguments for FileClient.
- class mmaction.datasets.transforms.ImgAug(transforms)[source]¶
Imgaug augmentation.
Adds custom transformations from imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html to get more information. Two demo configs could be found in tsn and i3d config folder.
It’s better to use uint8 images as inputs since imgaug works best with numpy dtype uint8 and isn’t well tested with other dtypes. It should be noted that not all of the augmenters have the same input and output dtype, which may cause unexpected results.
Required keys are “imgs”, “img_shape”(if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.
It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc. So when using Imgaug along with other mmaction2 pipelines, we should pay more attention to required keys.
Two steps to use Imgaug pipeline:
1. Create initialization parameter transforms. There are three ways to create transforms.
1) string: only support default for now.
e.g. transforms='default'
2) list[dict]: create a list of augmenters by a list of dicts, each dict corresponds to one augmenter. Every dict MUST contain a key named type. type should be a string (iaa.Augmenter’s name) or an iaa.Augmenter subclass.
e.g. transforms=[dict(type='Rotate', rotate=(-20, 20))]
e.g. transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))]
3) iaa.Augmenter: create an imgaug.Augmenter object.
e.g. transforms=iaa.Rotate(rotate=(-20, 20))
2. Add Imgaug in dataset pipeline. It is recommended to insert imgaug pipeline before Normalize. A demo pipeline is listed as follows.
```
pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,
    ),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```
- Parameters:
transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create imgaug augmenter.
- static default_transforms()[source]¶
Default transforms for imgaug.
Implement RandAugment by imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.
Augmenters and hyper parameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py
One augmenter, SolarizeAdd, is missing since imgaug doesn’t support it.
- Returns:
The constructed RandAugment transforms.
- Return type:
dict
- class mmaction.datasets.transforms.JointToBone(dataset: str = 'nturgb+d', target: str = 'keypoint')[source]¶
Convert the joint information to bone information.
Required Keys:
keypoint
Modified Keys:
keypoint
- Parameters:
dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.
target (str) – The target key for the bone information. Defaults to 'keypoint'.
- transform(results: Dict) Dict [source]¶
The transform function of JointToBone.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
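As a rough illustration of what "bone information" means here, the sketch below represents each bone as the offset of a joint from its parent joint. The helper `joints_to_bones` and the three-joint `pairs` table are hypothetical; the real transform uses dataset-specific joint pairs selected by `dataset`.

```python
def joints_to_bones(joints, pairs):
    """Convert joint coordinates to bone vectors for one person in one frame.

    joints: list of (x, y) coordinates, one per joint.
    pairs: (child, parent) index pairs; the bone stored at `child` is
        joints[child] - joints[parent].
    """
    bones = [(0.0, 0.0)] * len(joints)
    for child, parent in pairs:
        bones[child] = tuple(
            c - p for c, p in zip(joints[child], joints[parent]))
    return bones


# Hypothetical 3-joint chain: joint 0 is the root, 1 hangs off 0, 2 off 1.
joints = [(0.0, 0.0), (1.0, 2.0), (1.5, 4.0)]
pairs = [(0, 0), (1, 0), (2, 1)]
bones = joints_to_bones(joints, pairs)
```

The root joint is paired with itself, so its bone vector is zero.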
- class mmaction.datasets.transforms.LoadAudioFeature(pad_method: str = 'zero')[source]¶
Load offline extracted audio features.
Required Keys:
audio_path
Added Keys:
length
audios
- Parameters:
pad_method (str) – Padding method. Defaults to 'zero'.
- class mmaction.datasets.transforms.LoadHVULabel(**kwargs)[source]¶
Convert the HVU label from dictionaries to torch tensors.
Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.
- class mmaction.datasets.transforms.LoadLocalizationFeature[source]¶
Load video features for localizer with given video_name list.
The required key is “feature_path”, added or modified keys are “raw_feature”.
- Parameters:
raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.
- class mmaction.datasets.transforms.LoadProposals(top_k, pgm_proposals_dir, pgm_features_dir, proposal_ext='.csv', feature_ext='.npy')[source]¶
Loading proposals with given proposal results.
Required keys are “video_name”, added or modified keys are ‘bsp_feature’, ‘tmin’, ‘tmax’, ‘tmin_score’, ‘tmax_score’ and ‘reference_temporal_iou’.
- Parameters:
top_k (int) – The top k proposals to be loaded.
pgm_proposals_dir (str) – Directory to load proposals.
pgm_features_dir (str) – Directory to load proposal features.
proposal_ext (str) – Proposal file extension. Default: ‘.csv’.
feature_ext (str) – Feature file extension. Default: ‘.npy’.
- class mmaction.datasets.transforms.LoadRGBFromFile(to_float32: bool = False, color_type: str = 'color', imdecode_backend: str = 'cv2', io_backend: str = 'disk', ignore_empty: bool = False, **kwargs)[source]¶
Load an RGB image from file.
Required Keys:
img_path
Modified Keys:
img
img_shape
ori_shape
- Parameters:
to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is a uint8 array. Defaults to False.
color_type (str) – The flag argument for mmcv.imfrombytes. Defaults to ‘color’.
imdecode_backend (str) – The image decoding backend type. The backend argument for mmcv.imfrombytes. See mmcv.imfrombytes for details. Defaults to ‘cv2’.
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
ignore_empty (bool) – Whether to allow loading an empty image or a nonexistent file path. Defaults to False.
kwargs (dict) – Args for file client.
- class mmaction.datasets.transforms.MMCompact(padding: float = 0.25, threshold: int = 10, hw_ratio: float | Tuple[float] = 1, allow_imgpad: bool = True)[source]¶
Convert the coordinates of keypoints and crop the images to make them more compact.
Required Keys:
imgs
keypoint
img_shape
Modified Keys:
imgs
keypoint
img_shape
- Parameters:
padding (float) – The padding size. Defaults to 0.25.
threshold (int) – The threshold for the tight bounding box. If the width or height of the tight bounding box is smaller than the threshold, we do not perform the compact operation. Defaults to 10.
hw_ratio (float | tuple[float]) – The hw_ratio of the expanded box. Float indicates the specific ratio and tuple indicates a ratio range. If set as None, it means there is no requirement on hw_ratio. Defaults to 1.
allow_imgpad (bool) – Whether to allow expanding the box outside the image to meet the hw_ratio requirement. Defaults to True.
- class mmaction.datasets.transforms.MMDecode(io_backend: str = 'disk', **kwargs)[source]¶
Decode RGB videos and skeletons.
- class mmaction.datasets.transforms.MMUniformSampleFrames(clip_len: int, num_clips: int = 1, test_mode: bool = False, seed: int = 255)[source]¶
Uniformly sample frames from the multi-modal data.
- transform(results: Dict) Dict [source]¶
The transform function of MMUniformSampleFrames.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
- class mmaction.datasets.transforms.MergeSkeFeat(feat_list: List[str] = ['keypoint'], target: str = 'keypoint', axis: int = -1)[source]¶
Merge multi-stream features.
- Parameters:
feat_list (list[str]) – The list of the keys of features. Defaults to ['keypoint'].
target (str) – The target key for the merged multi-stream information. Defaults to 'keypoint'.
axis (int) – The axis along which the features will be joined. Defaults to -1.
- transform(results: Dict) Dict [source]¶
The transform function of MergeSkeFeat.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
- class mmaction.datasets.transforms.MultiScaleCrop(input_size, scales=(1,), max_wh_scale_gap=1, random_crop=False, num_fixed_crops=5, lazy=False)[source]¶
Crop images with a list of randomly selected scales.
Randomly select the w and h scales from a list of scales. A scale of 1 means the base size, which is the minimum of the image width and height. The gap between the scale levels of w and h is kept below a certain value to prevent an aspect ratio that is too large or too small.
Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “crop_bbox”, “img_shape”, “lazy” and “scales”. Required keys in “lazy” are “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters:
input_size (int | tuple[int]) – (w, h) of network input.
scales (tuple[float]) – Width and height scales to be selected.
max_wh_scale_gap (int) – Maximum gap of w and h scale levels. Default: 1.
random_crop (bool) – If set to True, the cropping bbox will be randomly sampled, otherwise it will be sampled from fixed regions. Default: False.
num_fixed_crops (int) – If set to 5, the cropping bbox will keep 5 basic fixed regions: “upper left”, “upper right”, “lower left”, “lower right”, “center”. If set to 13, the cropping bbox will append another 8 fixed regions: “center left”, “center right”, “lower center”, “upper center”, “upper left quarter”, “upper right quarter”, “lower left quarter”, “lower right quarter”. Default: 5.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
- class mmaction.datasets.transforms.OpenCVDecode[source]¶
Using OpenCV to decode the video.
Required keys are 'video_reader', 'filename' and 'frame_inds', added or modified keys are 'imgs', 'img_shape' and 'original_shape'.
- class mmaction.datasets.transforms.OpenCVInit(io_backend: str = 'disk', **kwargs)[source]¶
Using OpenCV to initialize the video_reader.
Required keys are 'filename', added or modified keys are 'new_path', 'video_reader' and 'total_frames'.
- Parameters:
io_backend (str) – IO backend where frames are stored. Defaults to 'disk'.
- class mmaction.datasets.transforms.PIMSDecode[source]¶
Using PIMS to decode the videos.
PIMS: https://github.com/soft-matter/pims
Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- class mmaction.datasets.transforms.PIMSInit(io_backend='disk', mode='accurate', **kwargs)[source]¶
Use PIMS to initialize the video.
PIMS: https://github.com/soft-matter/pims
- Parameters:
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will always use pims.PyAVReaderIndexed to decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking by using pims.PyAVReaderTimed. Both will return the accurate frames in most cases. Default: ‘accurate’.
kwargs (dict) – Args for file client.
- class mmaction.datasets.transforms.PackActionInputs(collect_keys: Tuple[str] | None = None, meta_keys: Sequence[str] = ('img_shape', 'img_key', 'video_id', 'timestamp'), algorithm_keys: Sequence[str] = ())[source]¶
Pack the inputs data.
- Parameters:
collect_keys (tuple[str], optional) – The keys to be collected to packed_results['inputs']. Defaults to None.
meta_keys (Sequence[str]) – The meta keys to be saved in the metainfo of the data_sample. Defaults to ('img_shape', 'img_key', 'video_id', 'timestamp').
algorithm_keys (Sequence[str]) – The keys of custom elements to be used in the algorithm. Defaults to an empty tuple.
- transform(results: Dict) Dict [source]¶
The transform function of PackActionInputs.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
- class mmaction.datasets.transforms.PadTo(length: int, mode: str = 'loop')[source]¶
Sample frames from the video.
To sample an n-frame clip from the video, PadTo samples the frames from index zero, and loops or zero-pads the frames if the number of video frames is less than the value of length.
Required Keys:
keypoint
total_frames
start_index (optional)
Modified Keys:
keypoint
total_frames
- Parameters:
length (int) – The maximum length of the sampled output clip.
mode (str) – The padding mode. Defaults to 'loop'.
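The difference between the two padding modes can be sketched on frame indices alone. This is an illustrative simplification, not the library code: the helper `pad_indices` is hypothetical, the mode name 'zero' is inferred from the description above, and `None` stands in for a frame that would be filled with zeros.

```python
def pad_indices(total_frames, length, mode="loop"):
    """Return `length` frame indices; None marks a zero-padded frame."""
    if mode == "loop":
        # Wrap around to the start once the video runs out of frames.
        return [i % total_frames for i in range(length)]
    if mode == "zero":
        # Keep real frames, then pad the tail with empty (all-zero) frames.
        return [i if i < total_frames else None for i in range(length)]
    raise ValueError(f"unsupported mode: {mode}")


loop_inds = pad_indices(3, 7, mode="loop")
zero_inds = pad_indices(3, 5, mode="zero")
```

For a 3-frame video, looping to length 7 repeats the sequence 0, 1, 2, while zero padding to length 5 keeps the three real frames and appends two empty ones.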
- class mmaction.datasets.transforms.PoseCompact(padding: float = 0.25, threshold: int = 10, hw_ratio: float | Tuple[float] | None = None, allow_imgpad: bool = True)[source]¶
Convert the coordinates of keypoints to make them more compact. Specifically, it first finds a tight bounding box that surrounds all joints in each frame, then expands the tight box by a given padding ratio. For example, if ‘padding == 0.25’, then the expanded box has an unchanged center, and 1.25x width and height.
Required Keys:
keypoint
img_shape
Modified Keys:
img_shape
keypoint
Added Keys:
crop_quadruple
- Parameters:
padding (float) – The padding size. Defaults to 0.25.
threshold (int) – The threshold for the tight bounding box. If the width or height of the tight bounding box is smaller than the threshold, we do not perform the compact operation. Defaults to 10.
hw_ratio (float | tuple[float] | None) – The hw_ratio of the expanded box. Float indicates the specific ratio and tuple indicates a ratio range. If set as None, it means there is no requirement on hw_ratio. Defaults to None.
allow_imgpad (bool) – Whether to allow expanding the box outside the image to meet the hw_ratio requirement. Defaults to True.
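The box expansion described for PoseCompact can be sketched as follows. The helper `expand_tight_box` is hypothetical and covers only the padding step; the `threshold`, `hw_ratio` and `allow_imgpad` handling is deliberately left out.

```python
def expand_tight_box(keypoints, padding=0.25):
    """Return (min_x, min_y, max_x, max_y) of the padded box around keypoints."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    center_x = (min(xs) + max(xs)) / 2
    center_y = (min(ys) + max(ys)) / 2
    # Scale the half extents so the box becomes (1 + padding) times as large
    # while its center stays where it was.
    half_w = (max(xs) - min(xs)) / 2 * (1 + padding)
    half_h = (max(ys) - min(ys)) / 2 * (1 + padding)
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)


box = expand_tight_box([(10, 20), (30, 60), (20, 40)])
```

Here the tight box is 20 wide and 40 high around the center (20, 40); with `padding=0.25` it grows to 25 by 50 around the same center.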
- class mmaction.datasets.transforms.PoseDecode[source]¶
Load and decode pose with given indices.
Required Keys:
keypoint
total_frames (optional)
frame_inds (optional)
offset (optional)
keypoint_score (optional)
Modified Keys:
keypoint
keypoint_score (optional)
- transform(results: Dict) Dict [source]¶
The transform function of PoseDecode.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
- class mmaction.datasets.transforms.PreNormalize2D(img_shape: Tuple[int, int] = (1080, 1920))[source]¶
Normalize the range of keypoint values.
Required Keys:
keypoint
img_shape (optional)
Modified Keys:
keypoint
- Parameters:
img_shape (tuple[int, int]) – The resolution of the original video. Defaults to (1080, 1920).
- transform(results: Dict) Dict [source]¶
The transform function of PreNormalize2D.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
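One plausible reading of "normalize the range of keypoint values" is to map pixel coordinates into [-1, 1] around the image center. The sketch below shows that mapping; the target range, the (height, width) order of `img_shape`, and the helper name `pre_normalize_2d` are assumptions, not statements about the library's behaviour.

```python
def pre_normalize_2d(keypoints, img_shape=(1080, 1920)):
    """Map (x, y) pixel coordinates into [-1, 1] around the image center.

    img_shape is assumed to be (height, width).
    """
    h, w = img_shape
    return [((x - w / 2) / (w / 2), (y - h / 2) / (h / 2))
            for x, y in keypoints]


normalized = pre_normalize_2d([(0, 0), (960, 540), (1920, 1080)])
```

Under these assumptions the top-left corner maps to (-1, -1), the image center to (0, 0), and the bottom-right corner to (1, 1).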
- class mmaction.datasets.transforms.PreNormalize3D(zaxis: List[int] = [0, 1], xaxis: List[int] = [8, 4], align_spine: bool = True, align_shoulder: bool = True, align_center: bool = True)[source]¶
PreNormalize for NTURGB+D 3D keypoints (x, y, z).
PreNormalize3D first subtracts the coordinates of the ‘spine’ (joint #1 in ntu) of the first person in the first frame from the coordinates of each joint. Subsequently, it performs a 3D rotation to make the Z axis parallel to the 3D vector from the ‘hip’ (joint #0) to the ‘spine’ (joint #1) and the X axis parallel to the 3D vector from the ‘right shoulder’ (joint #8) to the ‘left shoulder’ (joint #4). Codes adapted from https://github.com/lshiwjx/2s-AGCN.
Required Keys:
keypoint
total_frames (optional)
Modified Keys:
keypoint
Added Keys:
body_center
- Parameters:
zaxis (list[int]) – The target Z axis for the 3D rotation. Defaults to [0, 1].
xaxis (list[int]) – The target X axis for the 3D rotation. Defaults to [8, 4].
align_spine (bool) – Whether to perform a 3D rotation to align the spine. Defaults to True.
align_shoulder (bool) – Whether to perform a 3D rotation to align the shoulder. Defaults to True.
align_center (bool) – Whether to align the body center. Defaults to True.
- angle_between(v1: ndarray, v2: ndarray) float [source]¶
Returns the angle in radians between vectors ‘v1’ and ‘v2’.
- rotation_matrix(axis: ndarray, theta: float) ndarray [source]¶
Returns the rotation matrix associated with counterclockwise rotation about the given axis by theta radians.
- transform(results: Dict) Dict [source]¶
The transform function of PreNormalize3D.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
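For readers unfamiliar with axis-angle rotations, the matrix that rotation_matrix describes can be written with Rodrigues' formula. The plain-Python sketch below is an illustration of that formula, not the library's code (which may use a different but equivalent parameterisation); the helper `apply` is hypothetical.

```python
import math


def rotation_matrix(axis, theta):
    """Counterclockwise rotation by theta radians about `axis` (Rodrigues)."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)  # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    t = 1 - c
    return [
        [c + kx * kx * t, kx * ky * t - kz * s, kx * kz * t + ky * s],
        [ky * kx * t + kz * s, c + ky * ky * t, ky * kz * t - kx * s],
        [kz * kx * t - ky * s, kz * ky * t + kx * s, c + kz * kz * t],
    ]


def apply(matrix, vec):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Rotating the X unit vector by 90 degrees about Z gives the Y unit vector.
rotated = apply(rotation_matrix((0, 0, 1), math.pi / 2), (1, 0, 0))
```

In PreNormalize3D terms, the axis would be the cross product of the current bone direction and the target axis, and theta the value returned by angle_between.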
- class mmaction.datasets.transforms.PyAVDecode(multi_thread=False, mode='accurate')[source]¶
Using PyAV to decode the video.
PyAV: https://github.com/mikeboers/PyAV
Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- Parameters:
multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.
mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return the nearest key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Default: ‘accurate’.
- class mmaction.datasets.transforms.PyAVDecodeMotionVector(multi_thread=False, mode='accurate')[source]¶
Using pyav to decode the motion vectors from video.
Reference: https://github.com/PyAV-Org/PyAV/blob/main/tests/test_decode.py
Required keys are “video_reader” and “frame_inds”, added or modified keys are “motion_vectors”, “frame_inds”.
- class mmaction.datasets.transforms.PyAVInit(io_backend='disk', **kwargs)[source]¶
Using pyav to initialize the video.
PyAV: https://github.com/mikeboers/PyAV
Required keys are “filename”, added or modified keys are “video_reader”, and “total_frames”.
- Parameters:
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
kwargs (dict) – Args for file client.
- class mmaction.datasets.transforms.PytorchVideoWrapper(op, **kwargs)[source]¶
PytorchVideoTrans Augmentations, under pytorchvideo.transforms.
- Parameters:
op (str) – The name of the pytorchvideo transformation.
- class mmaction.datasets.transforms.RandomCrop(size, lazy=False)[source]¶
Vanilla square random crop that specifies the output size.
Required keys in results are “img_shape”, “keypoint” (optional), “imgs” (optional), added or modified keys are “keypoint”, “imgs”, “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters:
size (int) – The output size of the images.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
- class mmaction.datasets.transforms.RandomRescale(scale_range, interpolation='bilinear')[source]¶
Randomly resize images so that the short_edge is resized to a specific size in a given range. The aspect ratio is unchanged after resizing.
Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “resize_size”, “short_edge”.
- Parameters:
scale_range (tuple[int]) – The range of short edge length. A closed interval.
interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.
- class mmaction.datasets.transforms.RandomResizedCrop(area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.3333333333333333), lazy=False)[source]¶
Random crop that specifies the area and height-width ratio range.
Required keys in results are “img_shape”, “crop_bbox”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox” and “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters:
area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).
aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).
lazy (bool) – Determine whether to apply lazy operation. Default: False.
- static get_crop_bbox(img_shape, area_range, aspect_ratio_range, max_attempts=10)[source]¶
Get a crop bbox given the area range and aspect ratio range.
- Parameters:
img_shape (Tuple[int]) – Image shape.
area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).
aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).
max_attempts (int) – The maximum number of attempts to generate a random candidate bounding box. If no qualified candidate is found, the center bounding box will be used. Default: 10.
- Returns:
(list[int]) A random crop bbox within the area range and aspect ratio range.
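The sampling loop that get_crop_bbox describes can be sketched in plain Python. Several details here are assumptions made for illustration rather than facts about the library: the aspect ratio is taken as width / height and sampled log-uniformly, `img_shape` is (height, width), the fallback is a centered square, and the `rng` parameter is a hypothetical addition for reproducibility.

```python
import math
import random


def get_crop_bbox(img_shape, area_range=(0.08, 1.0),
                  aspect_ratio_range=(3 / 4, 4 / 3),
                  max_attempts=10, rng=random):
    """Return (x1, y1, x2, y2) of a random crop inside the image."""
    img_h, img_w = img_shape
    area = img_h * img_w
    log_lo = math.log(aspect_ratio_range[0])
    log_hi = math.log(aspect_ratio_range[1])
    for _ in range(max_attempts):
        target_area = rng.uniform(*area_range) * area
        ratio = math.exp(rng.uniform(log_lo, log_hi))  # width / height
        new_w = int(round(math.sqrt(target_area * ratio)))
        new_h = int(round(math.sqrt(target_area / ratio)))
        if 0 < new_w <= img_w and 0 < new_h <= img_h:
            x = rng.randint(0, img_w - new_w)
            y = rng.randint(0, img_h - new_h)
            return x, y, x + new_w, y + new_h
    # No candidate fitted: fall back to a centered square crop.
    size = min(img_h, img_w)
    x, y = (img_w - size) // 2, (img_h - size) // 2
    return x, y, x + size, y + size


bbox = get_crop_bbox((240, 320), rng=random.Random(0))
```

Whichever branch is taken, the returned box lies inside the image.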
- class mmaction.datasets.transforms.RawFrameDecode(io_backend: str = 'disk', decoding_backend: str = 'cv2', **kwargs)[source]¶
Load and decode frames with given indices.
Required Keys:
frame_dir
filename_tmpl
frame_inds
modality
offset (optional)
Added Keys:
img
img_shape
original_shape
- Parameters:
io_backend (str) – IO backend where frames are stored. Defaults to 'disk'.
decoding_backend (str) – Backend used for image decoding. Defaults to 'cv2'.
- class mmaction.datasets.transforms.Resize(scale, keep_ratio=True, interpolation='bilinear', lazy=False)[source]¶
Resize images to a specific size.
Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy”, “resize_size”. No keys are required in “lazy”; the added or modified key is “interpolation”.
- Parameters:
scale (float | Tuple[int]) – If keep_ratio is True, it serves as scaling factor or maximum size: If it is a float number, the image will be rescaled by this factor, else if it is a tuple of 2 integers, the image will be rescaled as large as possible within the scale. Otherwise, it serves as (w, h) of output size.
keep_ratio (bool) – If set to True, Images will be resized without changing the aspect ratio. Otherwise, it will resize images to a given size. Default: True.
interpolation (str) – Algorithm used for interpolation, accepted values are “nearest”, “bilinear”, “bicubic”, “area”, “lanczos”. Default: “bilinear”.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
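How `scale` is interpreted when `keep_ratio` is True can be made concrete with a small sketch. The helper `rescaled_size` is hypothetical, and the rounding rule and the use of an infinite bound to mean "unconstrained edge" are assumptions chosen for illustration.

```python
def rescaled_size(old_size, scale):
    """Compute the (w, h) after an aspect-ratio-preserving resize.

    old_size: (w, h) of the input image.
    scale: a float factor, or a (long_edge, short_edge) style bound in any
        order; the image is made as large as possible within that bound.
    """
    w, h = old_size
    if isinstance(scale, (int, float)):
        factor = float(scale)
    else:
        max_long_edge, max_short_edge = max(scale), min(scale)
        factor = min(max_long_edge / max(w, h), max_short_edge / min(w, h))
    return int(w * factor + 0.5), int(h * factor + 0.5)


half = rescaled_size((320, 240), 0.5)
bounded = rescaled_size((320, 240), (340, 256))
short_edge_256 = rescaled_size((320, 240), (float("inf"), 256))
```

With a float the image is simply scaled; with a tuple the tighter of the two edge limits decides the factor, so a 320x240 image bounded by (340, 256) becomes 340x255.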
- class mmaction.datasets.transforms.SampleAVAFrames(clip_len, frame_interval=2, test_mode=False)[source]¶
- class mmaction.datasets.transforms.SampleFrames(clip_len: int, frame_interval: int = 1, num_clips: int = 1, temporal_jitter: bool = False, twice_sample: bool = False, out_of_bound_opt: str = 'loop', test_mode: bool = False, keep_tail_frames: bool = False, target_fps: int | None = None, **kwargs)[source]¶
Sample frames from the video.
Required Keys:
total_frames
start_index
Added Keys:
frame_inds
frame_interval
num_clips
- Parameters:
clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.
num_clips (int) – Number of clips to be sampled. Default: 1.
temporal_jitter (bool) – Whether to apply temporal jittering. Defaults to False.
twice_sample (bool) – Whether to use twice sample when testing. If set to True, it will sample frames with and without fixed shift, which is commonly used for testing in TSM model. Defaults to False.
out_of_bound_opt (str) – The way to deal with out of bounds frame indexes. Available options are ‘loop’, ‘repeat_last’. Defaults to ‘loop’.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
keep_tail_frames (bool) – Whether to keep tail frames when sampling. Defaults to False.
target_fps (optional, int) – Convert input videos with arbitrary frame rates to the unified target FPS before sampling frames. If None, the frame rate will not be adjusted. Defaults to None.
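The interaction of `clip_len`, `frame_interval` and `out_of_bound_opt` can be illustrated for a single clip. This is a simplified sketch, not the library code: the helper `clip_frame_inds` is hypothetical, the clip offset is passed in rather than sampled, and it assumes the first index of the clip is inside the video.

```python
def clip_frame_inds(clip_offset, clip_len, frame_interval, total_frames,
                    out_of_bound_opt="loop"):
    """Frame indices of one clip, with out-of-range indices handled."""
    inds = [clip_offset + i * frame_interval for i in range(clip_len)]
    if out_of_bound_opt == "loop":
        # Wrap indices past the end back to the start of the video.
        return [i % total_frames for i in inds]
    if out_of_bound_opt == "repeat_last":
        # Replace indices past the end with the last valid sampled index.
        last_valid = max(i for i in inds if i < total_frames)
        return [i if i < total_frames else last_valid for i in inds]
    raise ValueError(f"unsupported option: {out_of_bound_opt}")


looped = clip_frame_inds(6, 4, 2, total_frames=10)
repeated = clip_frame_inds(6, 4, 2, total_frames=10,
                           out_of_bound_opt="repeat_last")
```

The raw indices 6, 8, 10, 12 run past a 10-frame video; 'loop' wraps them to 6, 8, 0, 2, while 'repeat_last' holds the last valid frame.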
- class mmaction.datasets.transforms.TenCrop(crop_size)[source]¶
Crop the images into 10 crops (corner + center + flip).
Crop the four corners and the center part of the image with the same given crop_size, and flip it horizontally. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.
- Parameters:
crop_size (int | tuple[int]) – (w, h) of crop size.
- class mmaction.datasets.transforms.ThreeCrop(crop_size)[source]¶
Crop images into three crops.
Crop the images equally into three crops with equal intervals along the shorter side. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.
- Parameters:
crop_size (int | tuple[int]) – (w, h) of crop size.
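A rough sketch of how three equally spaced crop boxes could be laid out is given below. The helper `three_crop_boxes` is hypothetical, and it assumes that the crop spans the full image along one dimension and slides along the other; the library may impose different constraints.

```python
def three_crop_boxes(img_shape, crop_size):
    """Return three (x1, y1, x2, y2) boxes spaced evenly across the image.

    img_shape: (height, width); crop_size: (w, h).
    """
    img_h, img_w = img_shape
    crop_w, crop_h = crop_size
    if crop_h == img_h:
        step = (img_w - crop_w) // 2
        offsets = [(0, 0), (step, 0), (2 * step, 0)]  # slide horizontally
    elif crop_w == img_w:
        step = (img_h - crop_h) // 2
        offsets = [(0, 0), (0, step), (0, 2 * step)]  # slide vertically
    else:
        raise ValueError("crop must match the image in one dimension")
    return [(x, y, x + crop_w, y + crop_h) for x, y in offsets]


boxes = three_crop_boxes((256, 340), (256, 256))
```

For a 340x256 image and a 256x256 crop this yields a left, a middle and a right box.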
- class mmaction.datasets.transforms.ToMotion(dataset: str = 'nturgb+d', source: str = 'keypoint', target: str = 'motion')[source]¶
Convert the joint information or bone information to corresponding motion information.
Required Keys:
keypoint
Added Keys:
motion
- Parameters:
dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.
source (str) – The source key for the joint or bone information. Defaults to 'keypoint'.
target (str) – The target key for the motion information. Defaults to 'motion'.
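"Motion information" is commonly the frame-to-frame difference of the joint (or bone) coordinates. The sketch below shows that idea under the assumption that the last frame, which has no successor, receives zero motion; the helper `to_motion` is hypothetical.

```python
def to_motion(frames):
    """Temporal difference of keypoints.

    frames: list over time; each item is a list of (x, y) joints.
    Returns a list of the same shape holding frame[t + 1] - frame[t].
    """
    motion = []
    for cur, nxt in zip(frames, frames[1:]):
        motion.append([(nx - cx, ny - cy)
                       for (cx, cy), (nx, ny) in zip(cur, nxt)])
    # Assumption: the final frame has no successor, so its motion is zero.
    motion.append([(0.0, 0.0)] * len(frames[-1]))
    return motion


frames = [[(0.0, 0.0)], [(1.0, 2.0)], [(3.0, 3.0)]]
motion = to_motion(frames)
```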
- class mmaction.datasets.transforms.TorchVisionWrapper(op, **kwargs)[source]¶
Torchvision Augmentations, under torchvision.transforms.
- Parameters:
op (str) – The name of the torchvision transformation.
- class mmaction.datasets.transforms.Transpose(keys, order)[source]¶
Transpose image channels to a given order.
- Parameters:
keys (Sequence[str]) – Required keys to be converted.
order (Sequence[int]) – Image channel order.
- class mmaction.datasets.transforms.UniformSample(clip_len: int, num_clips: int = 1, test_mode: bool = False)[source]¶
Uniformly sample frames from the video.
Modified from https://github.com/facebookresearch/SlowFast/blob/64abcc90ccfdcbb11cf91d6e525bed60e92a8796/slowfast/datasets/ssv2.py#L159.
To sample an n-frame clip from the video, UniformSample divides the video into n segments of equal length and randomly samples one frame from each segment.
Required keys:
total_frames
start_index
Added keys:
frame_inds
clip_len
frame_interval
num_clips
- Parameters:
clip_len (int) – Frames of each sampled output clip.
num_clips (int) – Number of clips to be sampled. Defaults to 1.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
- class mmaction.datasets.transforms.UniformSampleFrames(clip_len: int, num_clips: int = 1, test_mode: bool = False, seed: int = 255)[source]¶
Uniformly sample frames from the video.
To sample an n-frame clip from the video, UniformSampleFrames divides the video into n segments of equal length and randomly samples one frame from each segment. To make the testing results reproducible, a random seed is set during testing so that the sampling results are deterministic.
Required Keys:
total_frames
start_index (optional)
Added Keys:
frame_inds
frame_interval
num_clips
clip_len
- Parameters:
clip_len (int) – Frames of each sampled output clip.
num_clips (int) – Number of clips to be sampled. Defaults to 1.
test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
seed (int) – The random seed used during test time. Defaults to 255.
- transform(results: Dict) Dict [source]¶
The transform function of UniformSampleFrames.
- Parameters:
results (dict) – The result dict.
- Returns:
The result dict.
- Return type:
dict
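The segment-based sampling and the role of the test-time seed can be sketched as follows. The helper `uniform_sample_inds` is hypothetical and assumes the video has at least `clip_len` frames; the handling of shorter videos is left out.

```python
import random


def uniform_sample_inds(num_frames, clip_len, rng):
    """Pick one frame index from each of clip_len equal segments.

    Assumes num_frames >= clip_len so that every segment is non-empty.
    """
    bounds = [i * num_frames // clip_len for i in range(clip_len + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(clip_len)]


# Training: an unseeded generator gives a different clip on each call.
train_inds = uniform_sample_inds(100, 8, random.Random())
# Testing: a fixed seed (255 mirrors the documented default) is reproducible.
test_inds_a = uniform_sample_inds(100, 8, random.Random(255))
test_inds_b = uniform_sample_inds(100, 8, random.Random(255))
```

Because each index comes from its own segment, the sampled indices are always increasing and spread over the whole video.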
- class mmaction.datasets.transforms.UntrimmedSampleFrames(clip_len=1, clip_interval=16, frame_interval=1)[源代码]¶
Sample frames from the untrimmed video.
Required keys are “filename”, “total_frames”, added or modified keys are “frame_inds”, “clip_interval” and “num_clips”.
- 参数:
clip_len (int) – The length of sampled clips. Defaults to 1.
clip_interval (int) – Clip interval of adjacent center of sampled clips. Defaults to 16.
frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.
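The relationship between clip_len, clip_interval and frame_interval can be sketched as follows. This is an illustrative sketch, not the library's code: the helper name `untrimmed_frame_indices`, placing the first clip center at `clip_interval // 2`, and clamping indices to the valid frame range are all assumptions.

```python
def untrimmed_frame_indices(total_frames, clip_len=1, clip_interval=16,
                            frame_interval=1):
    """Return one list of frame indices per clip (hypothetical helper).

    Clip centers are spaced clip_interval frames apart; each clip takes
    clip_len frames around its center, frame_interval frames apart.
    """
    half = clip_len // 2
    offsets = [k * frame_interval for k in range(-half, clip_len - half)]
    clips = []
    for center in range(clip_interval // 2, total_frames, clip_interval):
        # Clamp so that clips near the borders stay inside the video (assumption).
        clips.append([min(max(center + off, 0), total_frames - 1)
                      for off in offsets])
    return clips


print(untrimmed_frame_indices(40, clip_len=3, clip_interval=16))
```

In this sketch num_clips is simply the number of centers that fit into the video, here `len(clips)`.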
mmaction.engine¶
hooks¶
- class mmaction.engine.hooks.OutputHook(module, outputs=None, as_tensor=False)[源代码]¶
Output feature map of some layers.
- 参数:
module (nn.Module) – The whole module to get layers.
outputs (tuple[str] | list[str]) – Layer name to output. Default: None.
as_tensor (bool) – Determine to return a tensor or a numpy array. Default: False.
- class mmaction.engine.hooks.VisualizationHook(enable=False, interval: int = 5000, show: bool = False, out_dir: str | None = None, **kwargs)[源代码]¶
Classification Visualization Hook. Used to visualize validation and testing prediction results.
If out_dir is specified, all storage backends are ignored and the image is saved to out_dir.
If show is True, the result image is plotted in a window; please confirm that you can access the graphical interface.
- Parameters:
enable (bool) – Whether to enable this hook. Defaults to False.
interval (int) – The interval of samples to visualize. Defaults to 5000.
show (bool) – Whether to display the drawn image. Defaults to False.
out_dir (str, optional) – Directory where painted images will be saved in the testing process. If None, handle with the backends of the visualizer. Defaults to None.
**kwargs – Other keyword arguments of mmcls.visualization.ClsVisualizer.add_datasample().
- after_test_iter(runner: Runner, batch_idx: int, data_batch: dict, outputs: Sequence[ActionDataSample]) None [源代码]¶
Visualize every self.interval samples during test.
- Parameters:
runner (Runner) – The runner of the testing process.
batch_idx (int) – The index of the current batch in the test loop.
data_batch (dict) – Data from dataloader.
outputs (Sequence[ActionDataSample]) – Outputs from model.
- after_val_iter(runner: Runner, batch_idx: int, data_batch: dict, outputs: Sequence[ActionDataSample]) None [源代码]¶
Visualize every self.interval samples during validation.
- Parameters:
runner (Runner) – The runner of the validation process.
batch_idx (int) – The index of the current batch in the val loop.
data_batch (dict) – Data from dataloader.
outputs (Sequence[ActionDataSample]) – Outputs from model.
optimizers¶
- class mmaction.engine.optimizers.LearningRateDecayOptimizerConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[源代码]¶
Different learning rates are set for different layers of backbone. Note: Currently, this optimizer constructor is built for MViT.
Inspired by the implementations in PySlowFast and MMDetection (https://github.com/open-mmlab/mmdetection/tree/dev-3.x).
- add_params(params: List[dict], module: Module, **kwargs) None [源代码]¶
Add all parameters of module to the params list.
The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.
- 参数:
params (list[dict]) – A list of param groups, it will be modified in place.
module (nn.Module) – The module to be added.
- class mmaction.engine.optimizers.SwinOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[源代码]¶
- add_params(params: List[dict], module: Module, prefix: str = 'base', **kwargs) None [源代码]¶
Add all parameters of module to the params list.
The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.
- 参数:
params (list[dict]) – A list of param groups, it will be modified in place.
module (nn.Module) – The module to be added.
prefix (str) – The prefix of the module. Defaults to 'base'.
- class mmaction.engine.optimizers.TSMOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[源代码]¶
Optimizer constructor in TSM model.
This constructor builds optimizer in different ways from the default one.
Parameters of the first conv layer have default lr and weight decay.
Parameters of BN layers have default lr and zero weight decay.
If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.
Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.
runner¶
- class mmaction.engine.runner.MultiLoaderEpochBasedTrainLoop(runner, dataloader: Dict | DataLoader, other_loaders: List[Dict | DataLoader], max_epochs: int, val_begin: int = 1, val_interval: int = 1)[源代码]¶
EpochBasedTrainLoop with multiple dataloaders.
- 参数:
runner (Runner) – A reference of runner.
dataloader (Dataloader or Dict) – A dataloader object or a dict to build a dataloader for training the model.
other_loaders (List of Dataloader or Dict) – A list of other loaders. Each item in the list is a dataloader object or a dict to build a dataloader.
max_epochs (int) – Total training epochs.
val_begin (int) – The epoch that begins validating. Defaults to 1.
val_interval (int) – Validation interval. Defaults to 1.
- class mmaction.engine.runner.RetrievalTestLoop(runner, dataloader: DataLoader | Dict, evaluator: Evaluator | Dict | List, fp16: bool = False)[源代码]¶
Loop for multimodal retrieval test.
- 参数:
runner (Runner) – A reference of runner.
dataloader (Dataloader or dict) – A dataloader object or a dict to build a dataloader.
evaluator (Evaluator or dict or list) – Used for computing metrics.
fp16 (bool) – Whether to enable fp16 testing. Defaults to False.
- class mmaction.engine.runner.RetrievalValLoop(runner, dataloader: DataLoader | Dict, evaluator: Evaluator | Dict | List, fp16: bool = False)[源代码]¶
Loop for multimodal retrieval val.
- 参数:
runner (Runner) – A reference of runner.
dataloader (Dataloader or dict) – A dataloader object or a dict to build a dataloader.
evaluator (Evaluator or dict or list) – Used for computing metrics.
fp16 (bool) – Whether to enable fp16 validation. Defaults to False.
mmaction.evaluation¶
functional¶
- class mmaction.evaluation.functional.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[源代码]¶
Class to evaluate detection results on ActivityNet.
- 参数:
ground_truth_filename (str | None) – The filename of groundtruth. Default: None.
prediction_filename (str | None) – The filename of action detection results. Default: None.
tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).
verbose (bool) – Whether to print verbose logs. Default: False.
- mmaction.evaluation.functional.ava_eval(result_file, result_type, label_file, ann_file, exclude_file, verbose=True, ignore_empty_frames=True, custom_classes=None)[源代码]¶
Perform ava evaluation.
- mmaction.evaluation.functional.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[源代码]¶
Compute average precision (in detection task) between ground truth and predicted data frames. If multiple predictions match the same ground-truth segment, only the one with the highest score is counted as a true positive. This code is greatly inspired by the Pascal VOC devkit.
- 参数:
ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.
prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).
- 返回:
1D array of average precision score.
- 返回类型:
np.ndarray
- mmaction.evaluation.functional.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[源代码]¶
Computes the average recall given an average number (percentile) of proposals per video.
- 参数:
ground_truth (dict) – Dict containing the ground truth instances.
proposals (dict) – Dict containing the proposal instances.
total_num_proposals (int) – Total number of proposals in the proposal dict.
max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).
- Returns:
(recall, average_recall, proposals_per_video, auc). In recall, recall[i,j] is the recall at the i-th temporal_iou threshold and the j-th average number (percentile) of proposals per video. The average_recall is the recall averaged over the list of temporal_iou thresholds (1D array); this is equivalent to recall.mean(axis=0). The proposals_per_video is the average number of proposals per video. The auc is the area under the AR@AN curve.
- Return type:
tuple([np.ndarray, np.ndarray, np.ndarray, float])
- mmaction.evaluation.functional.confusion_matrix(y_pred, y_real, normalize=None)[源代码]¶
Compute confusion matrix.
- 参数:
y_pred (list[int] | np.ndarray[int]) – Prediction labels.
y_real (list[int] | np.ndarray[int]) – Ground truth labels.
normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.
- 返回:
Confusion matrix.
- 返回类型:
np.ndarray
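The three normalization options can be made concrete with a small pure-Python sketch. It follows the convention stated above (true labels index rows, predictions index columns); the function name `confusion_matrix_sketch` and the use of nested lists instead of np.ndarray are assumptions for illustration, not the library's implementation.

```python
def confusion_matrix_sketch(y_pred, y_real, normalize=None):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    if normalize not in ("true", "pred", "all", None):
        raise ValueError("normalize must be 'true', 'pred', 'all' or None")
    labels = sorted(set(y_pred) | set(y_real))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    mat = [[0.0] * n for _ in range(n)]
    for pred, real in zip(y_pred, y_real):
        mat[index[real]][index[pred]] += 1
    if normalize == "true":  # each row sums to 1
        mat = [[v / (sum(row) or 1) for v in row] for row in mat]
    elif normalize == "pred":  # each column sums to 1
        cols = [sum(mat[i][j] for i in range(n)) or 1 for j in range(n)]
        mat = [[mat[i][j] / cols[j] for j in range(n)] for i in range(n)]
    elif normalize == "all":  # the whole matrix sums to 1
        total = sum(map(sum, mat)) or 1
        mat = [[v / total for v in row] for row in mat]
    return mat


print(confusion_matrix_sketch([0, 1, 1, 2], [0, 1, 2, 2], normalize="true"))
```

Row-wise normalization turns the diagonal into per-class recall, which is why it is the usual choice when classes are imbalanced.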
- mmaction.evaluation.functional.get_weighted_score(score_list, coeff_list)[源代码]¶
Get weighted score with given scores and coefficients.
Given n predictions by different classifier: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n
- 参数:
score_list (list[list[np.ndarray]]) – List of list of scores, with shape n(number of predictions) X num_samples X num_classes
coeff_list (list[float]) – List of coefficients, with shape n.
- 返回:
List of weighted scores.
- 返回类型:
list[np.ndarray]
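The weighted sum above can be written out as a short pure-Python sketch, for example to fuse two streams of per-sample class scores. The function name `weighted_score` and the variable names `rgb` and `flow` are hypothetical, and nested lists stand in for the np.ndarray scores of the real function.

```python
def weighted_score(score_list, coeff_list):
    """Fuse n score sets of shape num_samples x num_classes with n coefficients."""
    assert len(score_list) == len(coeff_list)
    num_samples = len(score_list[0])
    assert all(len(scores) == num_samples for scores in score_list)
    fused = []
    for i in range(num_samples):
        num_classes = len(score_list[0][i])
        fused.append([
            sum(coeff * scores[i][k]
                for scores, coeff in zip(score_list, coeff_list))
            for k in range(num_classes)
        ])
    return fused


# Two hypothetical classifiers, two samples, two classes.
rgb = [[0.2, 0.8], [0.6, 0.4]]
flow = [[0.5, 0.5], [0.1, 0.9]]
print(weighted_score([rgb, flow], [1.0, 2.0]))
```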
- mmaction.evaluation.functional.interpolated_precision_recall(precision, recall)[源代码]¶
Interpolated AP - VOCdevkit from VOC 2011.
- 参数:
precision (np.ndarray) – The precision of different thresholds.
recall (np.ndarray) – The recall of different thresholds.
- Returns:
Average precision score.
- Return type:
float
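As a rough illustration of VOC-style interpolated average precision, the sketch below replaces each precision value with the maximum precision at any higher recall and then sums the area under the resulting step curve. The name `interpolated_ap` and the list-based inputs are assumptions; this is a sketch of the general technique rather than a copy of the library function.

```python
def interpolated_ap(precision, recall):
    """Area under the monotonically decreasing precision envelope."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mprec = [0.0] + list(precision) + [0.0]
    mrec = [0.0] + list(recall) + [1.0]
    # Precision envelope: running maximum taken from right to left.
    for i in range(len(mprec) - 2, -1, -1):
        mprec[i] = max(mprec[i], mprec[i + 1])
    # Sum rectangle areas wherever recall changes.
    ap = 0.0
    for i in range(1, len(mrec)):
        if mrec[i] != mrec[i - 1]:
            ap += (mrec[i] - mrec[i - 1]) * mprec[i]
    return ap


print(interpolated_ap([1.0, 0.5, 2 / 3], [0.5, 0.5, 1.0]))
```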
- mmaction.evaluation.functional.mean_average_precision(scores, labels)[源代码]¶
Mean average precision for multi-label recognition.
- 参数:
scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.
- 返回:
The mean average precision.
- 返回类型:
np.float64
- mmaction.evaluation.functional.mean_class_accuracy(scores, labels)[源代码]¶
Calculate mean class accuracy.
- 参数:
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
- 返回:
Mean class accuracy.
- 返回类型:
np.ndarray
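Mean class accuracy averages the per-class accuracies instead of pooling all samples, so rare classes weigh as much as frequent ones. A minimal pure-Python sketch follows; the name `mean_class_accuracy_sketch` is hypothetical, and averaging only over classes that occur in the ground truth is a simplifying assumption that may differ from the library's behavior.

```python
def mean_class_accuracy_sketch(scores, labels):
    """Average of per-class accuracies over the classes present in labels."""
    # Predicted class = index of the highest score for each sample.
    preds = [max(range(len(s)), key=s.__getitem__) for s in scores]
    per_class = []
    for cls in sorted(set(labels)):
        members = [p for p, y in zip(preds, labels) if y == cls]
        per_class.append(sum(p == cls for p in members) / len(members))
    return sum(per_class) / len(per_class)


scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 1, 1]
print(mean_class_accuracy_sketch(scores, labels))
```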
- mmaction.evaluation.functional.mmit_mean_average_precision(scores, labels)[源代码]¶
Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Times. The difference is that this method calculates average-precision for each sample and averages them among samples.
- 参数:
scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.
- 返回:
The MMIT style mean average precision.
- 返回类型:
np.float64
- mmaction.evaluation.functional.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[源代码]¶
Compute intersection over union between segments.
- 参数:
candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end]/[m x 2:=[init, end]].
target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].
calculate_overlap_self (bool) – Whether to calculate overlap_self (union / candidate_length) or not. Default: False.
- Returns:
t_iou (np.ndarray) – 1-dim array [n] / 2-dim array [n x m] with IoU ratio.
t_overlap_self (np.ndarray, optional) – 1-dim array [n] / 2-dim array [n x m] with overlap_self, returned when calculate_overlap_self is True.
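Temporal IoU between two [init, end] segments is the length of their intersection divided by the length of their union. The sketch below computes it for single segments and for all pairs, producing an n x m nested list (n targets, m candidates) in line with the shapes described above. The helper names `temporal_iou` and `pairwise_tiou` are hypothetical, and the optional overlap_self output is left out.

```python
def temporal_iou(candidate, target):
    """IoU of two 1-D segments given as [init, end]."""
    inter = max(0.0, min(candidate[1], target[1]) - max(candidate[0], target[0]))
    union = (candidate[1] - candidate[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_tiou(candidates, targets):
    """Rows follow targets, columns follow candidates."""
    return [[temporal_iou(c, t) for c in candidates] for t in targets]


print(pairwise_tiou([[0, 10], [20, 30]], [[5, 15]]))
```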
- mmaction.evaluation.functional.read_labelmap(labelmap_file)[源代码]¶
Reads a labelmap without the dependency on protocol buffers.
- 参数:
labelmap_file – A file object containing a label map protocol buffer.
- Returns:
labelmap – The label map in the form used by the object_detection_evaluation module: a list of {“id”: integer, “name”: classname} dicts.
class_ids – A set containing all of the valid class id integers.
- mmaction.evaluation.functional.results2csv(results, out_file, custom_classes=None)[源代码]¶
Convert detection results to csv file.
- mmaction.evaluation.functional.softmax(x, dim=1)[源代码]¶
Compute softmax values for each set of scores in x.
- mmaction.evaluation.functional.top_k_accuracy(scores, labels, topk=(1,))[源代码]¶
Calculate top k accuracy score.
- 参数:
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).
- 返回:
Top k accuracy score for each k.
- 返回类型:
list[float]
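Top-k accuracy counts a sample as correct when its ground truth label is among the k highest-scoring classes. The following pure-Python sketch shows the idea with plain lists; the name `top_k_accuracy_sketch` is hypothetical and the code is not taken from the library.

```python
def top_k_accuracy_sketch(scores, labels, topk=(1,)):
    """Return one accuracy value per k in topk."""
    results = []
    for k in topk:
        hits = 0
        for sample_scores, label in zip(scores, labels):
            # Class indices sorted by descending score, truncated to the best k.
            best = sorted(range(len(sample_scores)),
                          key=lambda c: sample_scores[c], reverse=True)[:k]
            hits += label in best
        results.append(hits / len(labels))
    return results


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy_sketch(scores, labels, topk=(1, 2)))
```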
- mmaction.evaluation.functional.top_k_classes(scores, labels, k=10, mode='accurate')[源代码]¶
Calculate the K most accurate (inaccurate) classes.
Given the prediction scores, ground truth label and top-k value, compute the top K accurate (inaccurate) classes.
- 参数:
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int] | np.ndarray) – Ground truth labels.
k (int) – Top-k values. Default: 10.
mode (str) – Comparison mode for Top-k. Options are ‘accurate’ and ‘inaccurate’. Default: ‘accurate’.
- Returns:
List of sorted top K classes in the format (label_id, acc_ratio), ordered from high to low accuracy in ‘accurate’ mode and from low to high accuracy in ‘inaccurate’ mode.
- Return type:
list
metrics¶
- class mmaction.evaluation.metrics.ANetMetric(metric_type: str = 'TEM', collect_device: str = 'cpu', prefix: str | None = None, metric_options: dict = {}, dump_config: ConfigDict | dict = {'out': ''})[源代码]¶
ActivityNet dataset evaluation metric.
- compute_metrics(results: list) dict [源代码]¶
Compute the metrics from processed results.
If metric_type is ‘TEM’, only dump middle results and do not compute any metrics.
- Parameters:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- process(data_batch: Sequence[Tuple[Any, dict]], predictions: Sequence[dict]) None [源代码]¶
Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.
predictions (Sequence[dict]) – A batch of outputs from the model.
- static proposals2json(results, show_progress=False)[源代码]¶
Convert all proposals to a final dict (json) format.
- Parameters:
results (list[dict]) – All proposals.
show_progress (bool) – Whether to show the progress bar. Defaults to False.
- Returns:
The final result dict. E.g.
dict(video-1=[dict(segment=[1.1, 2.0], score=0.9),
              dict(segment=[50.1, 129.3], score=0.6)])
- Return type:
dict
- class mmaction.evaluation.metrics.AVAMetric(ann_file: str, exclude_file: str, label_file: str, options: Tuple[str] = ('mAP',), action_thr: float = 0.002, num_classes: int = 81, custom_classes: List[int] | None = None, collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
AVA evaluation metric.
- compute_metrics(results: list) dict [源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- process(data_batch: Sequence[Tuple[Any, dict]], data_samples: Sequence[dict]) None [源代码]¶
Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.AccMetric(metric_list: str | Tuple[str] | None = ('top_k_accuracy', 'mean_class_accuracy'), collect_device: str = 'cpu', metric_options: Dict | None = {'top_k_accuracy': {'topk': (1, 5)}}, prefix: str | None = None)[源代码]¶
Accuracy evaluation metric.
- calculate(preds: List[ndarray], labels: List[int | ndarray]) Dict [源代码]¶
Compute the metrics from processed results.
- 参数:
preds (list[np.ndarray]) – List of the prediction scores.
labels (list[int | np.ndarray]) – List of the labels.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- compute_metrics(results: List) Dict [源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- process(data_batch: Sequence[Tuple[Any, Dict]], data_samples: Sequence[Dict]) None [源代码]¶
Process one batch of data and data samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Sequence[dict]) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.ConfusionMatrix(num_classes: int | None = None, collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
A metric to calculate confusion matrix for single-label tasks.
- 参数:
num_classes (int, optional) – The number of classes. Defaults to None.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
Examples
The basic usage.
>>> import torch
>>> from mmaction.evaluation import ConfusionMatrix
>>> y_pred = [0, 1, 1, 3]
>>> y_true = [0, 2, 1, 3]
>>> ConfusionMatrix.calculate(y_pred, y_true, num_classes=4)
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]])
>>> # plot the confusion matrix
>>> import matplotlib.pyplot as plt
>>> y_score = torch.rand((1000, 10))
>>> y_true = torch.randint(10, (1000, ))
>>> matrix = ConfusionMatrix.calculate(y_score, y_true)
>>> ConfusionMatrix().plot(matrix)
>>> plt.show()
In the config file
val_evaluator = dict(type='ConfusionMatrix')
test_evaluator = dict(type='ConfusionMatrix')
- static calculate(pred, target, num_classes=None) dict [源代码]¶
Calculate the confusion matrix for single-label task.
- 参数:
pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. It can be labels (N, ), or scores of every class (N, C).
target (torch.Tensor | np.ndarray | Sequence) – The target of each prediction with shape (N, ).
num_classes (Optional, int) – The number of classes. If pred contains labels instead of scores, this argument is required. Defaults to None.
- 返回:
The confusion matrix.
- 返回类型:
torch.Tensor
- compute_metrics(results: list) dict [源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- static plot(confusion_matrix: Tensor, include_values: bool = False, cmap: str = 'viridis', classes: List[str] | None = None, colorbar: bool = True, show: bool = True)[源代码]¶
Draw a confusion matrix by matplotlib.
Modified from Scikit-Learn
- 参数:
confusion_matrix (torch.Tensor) – The confusion matrix to draw.
include_values (bool) – Whether to draw the values in the figure. Defaults to False.
cmap (str) – The color map to use. Defaults to use “viridis”.
classes (list[str], optional) – The names of categories. Defaults to None, which means to use index number.
colorbar (bool) – Whether to show the colorbar. Defaults to True.
show (bool) – Whether to show the figure immediately. Defaults to True.
- process(data_batch, data_samples: Sequence[dict]) None [源代码]¶
Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Any) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.MultiSportsMetric(ann_file: str, metric_options: dict | None = {'F_mAP': {'thr': 0.5}, 'V_mAP': {'all': True, 'thr': (0.2, 0.5), 'tube_thr': 15}}, collect_device: str = 'cpu', verbose: bool = True, prefix: str | None = None)[源代码]¶
MAP Metric for MultiSports dataset.
- compute_metrics(results: list) dict [源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- process(data_batch: Sequence[Tuple[Any, dict]], data_samples: Sequence[dict]) None [源代码]¶
Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.RecallatTopK(topK_list: Tuple[int] = (1, 5), threshold: float = 0.5, collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
ActivityNet dataset evaluation metric.
- compute_metrics(results: list) dict [源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- process(data_batch: Sequence[Tuple[Any, dict]], predictions: Sequence[dict]) None [源代码]¶
Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.
predictions (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.ReportVQA(file_path: str, collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
Dump VQA result to the standard json format for VQA evaluation.
- 参数:
file_path (str) – The file path to save the result file.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- class mmaction.evaluation.metrics.RetrievalMetric(metric_list: Tuple[str] | str = ('R1', 'R5', 'R10', 'MdR', 'MnR'), collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
Metric for video retrieval task.
- 参数:
metric_list (str | tuple[str]) – The list of the metrics to be computed. Defaults to ('R1', 'R5', 'R10', 'MdR', 'MnR').
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- compute_metrics(results: List) Dict [源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
dict
- process(data_batch: Dict | None, data_samples: Sequence[Dict]) None [源代码]¶
Process one batch of data and data samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (dict, optional) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.RetrievalRecall(topk: int | Sequence[int], collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
Recall evaluation metric for image retrieval.
- 参数:
topk (int | Sequence[int]) – If the ground truth label matches one of the best k predictions, the sample will be regarded as a positive prediction. If the parameter is a tuple, the recall for every k will be calculated and output together. Defaults to 1.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- static calculate(pred: ndarray | Tensor, target: ndarray | Tensor, topk: int | Sequence[int], pred_indices: bool = False, target_indices: bool = False) float [源代码]¶
Calculate the average recall.
- 参数:
pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. A torch.Tensor or np.ndarray with shape (N, M) or a sequence of index/onehot format labels.
target (torch.Tensor | np.ndarray | Sequence) – The target results. A torch.Tensor or np.ndarray with shape (N, M) or a sequence of index/onehot format labels.
topk (int, Sequence[int]) – Predictions with the k-th highest scores are considered as positive.
pred_indices (bool) – Whether the pred is a sequence of category index labels. Defaults to False.
target_indices (bool) – Whether the target is a sequence of category index labels. Defaults to False.
- 返回:
the average recalls.
- 返回类型:
List[float]
- compute_metrics(results: List)[源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
Dict
- process(data_batch: Sequence[dict], data_samples: Sequence[dict])[源代码]¶
Process one batch of data and predictions.
The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch (Sequence[dict]) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.VQAAcc(full_score_weight: float = 0.3, collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
VQA Acc metric.
- Parameters:
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- compute_metrics(results: List)[源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
Dict
- process(data_batch, data_samples)[源代码]¶
Process one batch of data samples.
The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
- class mmaction.evaluation.metrics.VQAMCACC(collect_device: str = 'cpu', prefix: str | None = None)[源代码]¶
VQA multiple choice Acc metric.
- Parameters:
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- compute_metrics(results: List)[源代码]¶
Compute the metrics from processed results.
- 参数:
results (list) – The processed results of each batch.
- 返回:
The computed metrics. The keys are the names of the metrics, and the values are corresponding results.
- 返回类型:
Dict
- process(data_batch, data_samples)[源代码]¶
Process one batch of data samples.
The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters:
data_batch – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from the model.
mmaction.models¶
backbones¶
- class mmaction.models.backbones.AAGCN(graph_cfg: Dict, in_channels: int = 3, base_channels: int = 64, data_bn_type: str = 'MVC', num_person: int = 2, num_stages: int = 10, inflate_stages: List[int] = [5, 8], down_stages: List[int] = [5, 8], init_cfg: Dict | List[Dict] | None = None, **kwargs)[源代码]¶
AAGCN backbone, the attention-enhanced version of 2s-AGCN.
Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks. More details can be found in the paper.
Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. More details can be found in the paper.
- 参数:
graph_cfg (dict) – Config for building the graph.
in_channels (int) – Number of input channels. Defaults to 3.
base_channels (int) – Number of base channels. Defaults to 64.
data_bn_type (str) – Type of the data bn layer. Defaults to 'MVC'.
num_person (int) – Maximum number of people. Only used when data_bn_type == ‘MVC’. Defaults to 2.
num_stages (int) – Total number of stages. Defaults to 10.
inflate_stages (list[int]) – Stages to inflate the number of channels. Defaults to [5, 8].
down_stages (list[int]) – Stages to perform downsampling in the time dimension. Defaults to [5, 8].
init_cfg (dict or list[dict], optional) – Config to control the initialization. Defaults to None.
Examples
>>> import torch
>>> from mmaction.models import AAGCN
>>> from mmaction.utils import register_all_modules
>>>
>>> register_all_modules()
>>> mode = 'stgcn_spatial'
>>> batch_size, num_person, num_frames = 2, 2, 150
>>>
>>> # openpose-18 layout
>>> num_joints = 18
>>> model = AAGCN(graph_cfg=dict(layout='openpose', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # nturgb+d layout
>>> num_joints = 25
>>> model = AAGCN(graph_cfg=dict(layout='nturgb+d', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # coco layout
>>> num_joints = 17
>>> model = AAGCN(graph_cfg=dict(layout='coco', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # custom settings
>>> # disable the attention module to degenerate AAGCN to AGCN
>>> model = AAGCN(graph_cfg=dict(layout='coco', mode=mode),
...               gcn_attention=False)
>>> model.init_weights()
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 18])
torch.Size([2, 2, 256, 38, 25])
torch.Size([2, 2, 256, 38, 17])
torch.Size([2, 2, 256, 38, 17])
- class mmaction.models.backbones.C2D(depth: int, pretrained: str | None = None, torchvision_pretrain: bool = True, in_channels: int = 3, num_stages: int = 4, out_indices: Sequence[int] = (3,), strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), style: str = 'pytorch', frozen_stages: int = -1, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, partial_bn: bool = False, with_cp: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': 'BatchNorm2d', 'val': 1.0}])[source]
C2D backbone.
Compared to ResNet-50, a temporal pooling layer is added after the first bottleneck. The detailed structure is kept the same as in the “video-nonlocal-net” repo; see https://github.com/facebookresearch/video-nonlocal-net/blob/main/scripts/run_c2d_baseline_400k.sh. Note that there are some improvements compared to the “Non-local Neural Networks” paper (https://arxiv.org/abs/1711.07971). The differences are listed at https://github.com/facebookresearch/video-nonlocal-net#modifications-for-improving-speed.
- class mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]
C3D backbone.
- Parameters:
pretrained (str | None) – Name of pretrained model.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
out_dim (int) – The dimension of the last layer feature (after flattening). Depends on the input shape. Default: 8192.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization of fc layers. Default: 0.005.
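Because out_dim depends on the input shape, it can help to work the arithmetic out before changing the clip size. The sketch below is only an illustration: the pooling layout (pool1 halving H and W only, pool2 to pool4 halving T, H and W, pool5 halving all three with a spatial padding of 1, and 512 output channels) is an assumption based on the usual C3D design and is not stated in this reference. The helper names pool_out and c3d_out_dim are hypothetical.

```python
def pool_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a pooling layer along one axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def c3d_out_dim(frames: int, height: int, width: int, channels: int = 512) -> int:
    """Flattened feature size after five pooling stages.

    Assumed layout (not taken from this reference): the convolutions keep
    the shape, pool1 halves H and W, pool2-pool4 halve T, H and W, and
    pool5 halves T, H and W with a spatial padding of 1.
    """
    # pool1: spatial only
    t, h, w = frames, pool_out(height, 2, 2), pool_out(width, 2, 2)
    # pool2, pool3, pool4: temporal and spatial
    for _ in range(3):
        t, h, w = pool_out(t, 2, 2), pool_out(h, 2, 2), pool_out(w, 2, 2)
    # pool5: no temporal padding, spatial padding of 1
    t, h, w = pool_out(t, 2, 2), pool_out(h, 2, 2, 1), pool_out(w, 2, 2, 1)
    return channels * t * h * w


# A 16-frame 112x112 clip gives the documented default of 8192.
print(c3d_out_dim(16, 112, 112))
```

Under these assumptions, a different clip length or crop size would call for a matching out_dim in the backbone config.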
- class mmaction.models.backbones.MViT(arch: str = 'base', spatial_size: int = 224, temporal_size: int = 16, in_channels: int = 3, pretrained: str | None = None, pretrained_type: str | None = None, out_scales: int | Sequence[int] = -1, drop_path_rate: float = 0.0, use_abs_pos_embed: bool = False, interpolate_mode: str = 'trilinear', pool_kernel: tuple = (3, 3, 3), dim_mul: int = 2, head_mul: int = 2, adaptive_kv_stride: tuple = (1, 8, 8), rel_pos_embed: bool = True, residual_pooling: bool = True, dim_mul_in_attention: bool = True, with_cls_token: bool = True, output_cls_token: bool = True, rel_pos_zero_init: bool = False, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_cfg: Dict = {'eps': 1e-06, 'type': 'LN'}, patch_cfg: Dict = {'kernel_size': (3, 7, 7), 'padding': (1, 3, 3), 'stride': (2, 4, 4)}, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': ['Conv2d', 'Conv3d'], 'std': 0.02}, {'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.02}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.02}])[source]
Multi-scale ViT v2.
A PyTorch implementation of MViTv2: Improved Multiscale Vision Transformers for Classification and Detection.
Inspired by the official implementation and the mmclassification implementation.
- Parameters:
arch (str | dict) –
MViT architecture. If a string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If a dict, it should have the following keys:
embed_dims (int): The dimensions of embedding.
num_layers (int): The number of layers.
num_heads (int): The number of heads in attention modules of the initial layer.
downscale_indices (List[int]): The layer indices to downscale the feature map.
Defaults to ‘base’.
spatial_size (int) – The expected spatial size of the input. Defaults to 224.
temporal_size (int) – The expected temporal size of the input. Defaults to 16.
in_channels (int) – The number of input channels. Defaults to 3.
pretrained (str, optional) – Name of pretrained model. Defaults to None.
pretrained_type (str, optional) – Type of pretrained model. Choose from ‘imagenet’, ‘maskfeat’ and None. Defaults to None, which means loading from the same architecture.
out_scales (int | Sequence[int]) – The output scale indices. They should not exceed the length of downscale_indices. Defaults to -1, which means the last scale.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.
use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.
interpolate_mode (str) – The interpolation mode used to resize the absolute position embedding vector. Defaults to “trilinear”.
pool_kernel (tuple) – Kernel size for qkv pooling layers. Defaults to (3, 3, 3).
dim_mul (int) – The magnification for embed_dims in the downscale layers. Defaults to 2.
head_mul (int) – The magnification for num_heads in the downscale layers. Defaults to 2.
adaptive_kv_stride (tuple) – The stride size for kv pooling in the initial layer. Defaults to (1, 8, 8).
rel_pos_embed (bool) – Whether to enable the spatial and temporal relative position embedding. Defaults to True.
residual_pooling (bool) – Whether to enable the residual connection after attention pooling. Defaults to True.
dim_mul_in_attention (bool) – Whether to multiply the embed_dims in attention layers. If False, multiply it in MLP layers. Defaults to True.
with_cls_token (bool) – Whether to concatenate the class token with the video tokens as transformer input. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set to True, with_cls_token must be True. Defaults to True.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters. Defaults to False.
mlp_ratio (float) – Ratio of hidden dimensions in MLP layers. Defaults to 4.0.
qkv_bias (bool) – Enable bias for qkv if True. Defaults to True.
norm_cfg (dict) – Config dict for normalization layer for all output features. Defaults to dict(type='LN', eps=1e-6).
patch_cfg (dict) – Config dict for the patch embedding layer. Defaults to dict(kernel_size=(3, 7, 7), stride=(2, 4, 4), padding=(1, 3, 3)).
init_cfg (dict, optional) – The Config for initialization. Defaults to
[ dict(type='TruncNormal', layer=['Conv2d', 'Conv3d'], std=0.02), dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.02), dict(type='Constant', layer='LayerNorm', val=1., bias=0.02), ]
Examples
>>> import torch
>>> from mmaction.registry import MODELS
>>> from mmaction.utils import register_all_modules
>>> register_all_modules()
>>>
>>> cfg = dict(type='MViT', arch='tiny', out_scales=[0, 1, 2, 3])
>>> model = MODELS.build(cfg)
>>> model.init_weights()
>>> inputs = torch.rand(1, 3, 16, 224, 224)
>>> outputs = model(inputs)
>>> for i, output in enumerate(outputs):
...     print(f'scale{i}: {output.shape}')
scale0: torch.Size([1, 96, 8, 56, 56])
scale1: torch.Size([1, 192, 8, 28, 28])
scale2: torch.Size([1, 384, 8, 14, 14])
scale3: torch.Size([1, 768, 8, 7, 7])
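The shapes printed in the example above can be reproduced with plain arithmetic from patch_cfg and dim_mul. The following is a rough sketch rather than the library's code: embed_dims=96 is read off the scale0 output of the example, and the halving of H and W at each downscale layer is an assumption inferred from the same output. The function names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def mvit_scale_shapes(temporal_size=16, spatial_size=224, embed_dims=96,
                      num_scales=4, dim_mul=2,
                      kernel_size=(3, 7, 7), stride=(2, 4, 4),
                      padding=(1, 3, 3)):
    """Return (channels, T, H, W) for each output scale."""
    # Patch embedding: a single strided 3D convolution (default patch_cfg).
    t = conv_out(temporal_size, kernel_size[0], stride[0], padding[0])
    s = conv_out(spatial_size, kernel_size[1], stride[1], padding[1])
    shapes = []
    channels = embed_dims
    for _ in range(num_scales):
        shapes.append((channels, t, s, s))
        channels *= dim_mul  # channels grow by dim_mul at each downscale
        s //= 2              # assumption: H and W are halved, T is kept
    return shapes


for i, shape in enumerate(mvit_scale_shapes()):
    print(f'scale{i}: {shape}')
```

With the defaults this yields (96, 8, 56, 56), (192, 8, 28, 28), (384, 8, 14, 14) and (768, 8, 7, 7), matching the example output apart from the batch dimension.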
- class mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7,), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': ['GroupNorm', '_BatchNorm'], 'val': 1.0}])[source]
MobileNetV2 backbone.
- Parameters:
pretrained (str | None) – Name of pretrained model. Defaults to None.
widen_factor (float) – Width multiplier; the number of channels in each layer is multiplied by this amount. Defaults to 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Defaults to (7, ).
frozen_stages (int) – Stages to be frozen (all param fixed). Note that the last stage in MobileNetV2 is conv2. Defaults to -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Defaults to dict(type='Conv').
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='BN2d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU6', inplace=True).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: this only affects Batch Norm and its variants. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='Kaiming', layer='Conv2d',), dict(type='Constant', layer=['GroupNorm', '_BatchNorm'], val=1.) ].
- forward(x)[source]
Defines the computation performed at every call.
- Parameters:
x (Tensor) – The input data.
- Returns:
The feature of the input samples extracted by the backbone.
- Return type:
Tensor or Tuple[Tensor]
- make_layer(out_channels, num_blocks, stride, expand_ratio)[source]
Stack InvertedResidual blocks to build a layer for MobileNetV2.
- Parameters:
out_channels (int) – Number of output channels of each block.
num_blocks (int) – Number of blocks.
stride (int) – Stride of the first block.
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio.
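The roles of stride and expand_ratio in make_layer can be sketched as a block plan. This is a simplified illustration, not the method itself: plan_layer is a hypothetical helper, it takes in_channels explicitly (the real method has no such argument), and it assumes that only the first block uses stride while the remaining blocks use a stride of 1.

```python
def plan_layer(in_channels, out_channels, num_blocks, stride, expand_ratio):
    """Describe the InvertedResidual blocks that would form one layer."""
    blocks = []
    for i in range(num_blocks):
        blocks.append(dict(
            in_channels=in_channels,
            # hidden width of the block is the input width times expand_ratio
            hidden_channels=in_channels * expand_ratio,
            out_channels=out_channels,
            # only the first block downsamples (assumption stated above)
            stride=stride if i == 0 else 1,
        ))
        in_channels = out_channels  # later blocks see the layer's output width
    return blocks


for block in plan_layer(16, 24, num_blocks=2, stride=2, expand_ratio=6):
    print(block)
```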
- class mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, pretrained2d=True, **kwargs)[source]
MobileNetV2 backbone for TSM.
- Parameters:
num_segments (int) – Number of frame segments. Defaults to 8.
is_shift (bool) – Whether to apply temporal shift in the residual blocks. Defaults to True.
shift_div (int) – Number of div for shift. Defaults to 8.
pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.
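To make shift_div concrete, here is a minimal pure-Python sketch of the temporal shift idea as it is commonly described for TSM: one 1/shift_div slice of the channels takes its values from the next frame, another slice from the previous frame, and the rest stay in place. This is an illustration under that assumption, not the library's module; temporal_shift is a hypothetical name, and frames is a plain list of per-frame channel lists standing in for a tensor.

```python
def temporal_shift(frames, shift_div=8):
    """Shift 2/shift_div of the channels along the time axis.

    frames[t][c] is the value of channel c at time step t.
    Positions with no source frame are filled with zeros.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div
    out = [[0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first slice: read from the next frame
            elif c < 2 * fold:
                src = t - 1      # second slice: read from the previous frame
            else:
                src = t          # remaining channels are not shifted
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# 3 frames, 8 channels; values encode (frame, channel) as 10 * t + c.
clip = [[10 * t + c for c in range(8)] for t in range(3)]
for row in temporal_shift(clip, shift_div=4):
    print(row)
```

A larger shift_div moves a smaller fraction of the channels, so more of each frame's own features are kept.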
- class mmaction.models.backbones.OmniResNet(layers: List[int] = [3, 4, 6, 3], pretrain_2d: str | None = None, init_cfg: ConfigDict | dict | None = None)[source]
Omni-ResNet that accepts both image and video inputs.
- Parameters:
layers (List[int]) – Number of layers in each residual stage. Defaults to [3, 4, 6, 3].
pretrain_2d (str, optional) – Path to the 2D pretraining checkpoint. Defaults to None.
init_cfg (dict or ConfigDict, optional) – The Config for initialization. Defaults to None.
- class mmaction.models.backbones.RGBPoseConv3D(pretrained: str | None = None, speed_ratio: int = 4, channel_ratio: int = 4, rgb_detach: bool = False, pose_detach: bool = False, rgb_drop_path: float = 0, pose_drop_path: float = 0, rgb_pathway: Dict = {'base_channels': 64, 'conv1_kernel': (1, 7, 7), 'fusion_kernel': 7, 'inflate': (0, 0, 1, 1), 'lateral': True, 'lateral_activate': (0, 0, 1, 1), 'lateral_infl': 1, 'num_stages': 4, 'with_pool2': False}, pose_pathway: Dict = {'base_channels': 32, 'conv1_kernel': (1, 7, 7), 'conv1_stride_s': 1, 'conv1_stride_t': 1, 'dilations': (1, 1, 1), 'fusion_kernel': 7, 'in_channels': 17, 'inflate': (0, 1, 1), 'lateral': True, 'lateral_activate': (0, 1, 1), 'lateral_infl': 16, 'lateral_inv': True, 'num_stages': 3, 'out_indices': (2,), 'pool1_stride_s': 1, 'pool1_stride_t': 1, 'spatial_strides': (2, 2, 2), 'stage_blocks': (4, 6, 3), 'temporal_strides': (1, 1, 1), 'with_pool2': False}, init_cfg: Dict | List[Dict] | None = None)[source]
RGBPoseConv3D backbone.
- Parameters:
pretrained (str) – The file path to a pretrained model. Defaults to None.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Defaults to 4.
channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Defaults to 4.
rgb_detach (bool) – Whether to detach the gradients from the pose path. Defaults to False.
pose_detach (bool) – Whether to detach the gradients from the rgb path. Defaults to False.
rgb_drop_path (float) – The drop rate for dropping the features from the pose path. Defaults to 0.
pose_drop_path (float) – The drop rate for dropping the features from the rgb path. Defaults to 0.
rgb_pathway (dict) – Configuration of rgb branch. Defaults to dict(num_stages=4, lateral=True, lateral_infl=1, lateral_activate=(0, 0, 1, 1), fusion_kernel=7, base_channels=64, conv1_kernel=(1, 7, 7), inflate=(0, 0, 1, 1), with_pool2=False).
pose_pathway (dict) – Configuration of pose branch. Defaults to dict(num_stages=3, stage_blocks=(4, 6, 3), lateral=True, lateral_inv=True, lateral_infl=16, lateral_activate=(0, 1, 1), fusion_kernel=7, in_channels=17, base_channels=32, out_indices=(2, ), conv1_kernel=(1, 7, 7), conv1_stride_s=1, conv1_stride_t=1, pool1_stride_s=1, pool1_stride_t=1, inflate=(0, 1, 1), spatial_strides=(2, 2, 2), temporal_strides=(1, 1, 1), dilations=(1, 1, 1), with_pool2=False).
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- class mmaction.models.backbones.ResNet(depth: int, pretrained: str | None = None, torchvision_pretrain: bool = True, in_channels: int = 3, num_stages: int = 4, out_indices: Sequence[int] = (3,), strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), style: str = 'pytorch', frozen_stages: int = -1, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, partial_bn: bool = False, with_cp: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': 'BatchNorm2d', 'val': 1.0}])[source]
ResNet backbone.
- Parameters:
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str, optional) – Name of pretrained model. Defaults to None.
torchvision_pretrain (bool) – Whether to load pretrained model from torchvision. Defaults to True.
in_channels (int) – Channel num of input features. Defaults to 3.
num_stages (int) – Resnet stages. Defaults to 4.
out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).
strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to pytorch.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.
conv_cfg (dict or ConfigDict) – Config for conv layers. Defaults to dict(type='Conv').
norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN2d', requires_grad=True).
act_cfg (Union[dict, ConfigDict]) – Config for activation layers. Defaults to dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.
partial_bn (bool) – Whether to use partial bn. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='Kaiming', layer='Conv2d',), dict(type='Constant', layer='BatchNorm2d', val=1.) ].
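As a rough guide to how strides and out_indices interact, the sketch below estimates the spatial size of each returned feature map. It assumes the conventional ResNet stem (a stride-2 convolution followed by a stride-2 max pooling, so an overall stem stride of 4), which this reference does not spell out, and it uses integer division, so it is exact only for input sizes divisible by the total stride. resnet_feature_sizes is a hypothetical helper.

```python
def resnet_feature_sizes(input_size=224, strides=(1, 2, 2, 2),
                         out_indices=(3,), stem_stride=4):
    """Spatial size of the feature map returned for each index in out_indices."""
    size = input_size // stem_stride  # assumed stem: stride-2 conv + stride-2 pool
    sizes = []
    for stride in strides:
        size //= stride               # the first block of each stage applies the stride
        sizes.append(size)
    return tuple(sizes[i] for i in out_indices)


print(resnet_feature_sizes())                          # last stage only
print(resnet_feature_sizes(out_indices=(0, 1, 2, 3)))  # every stage
```

For a 224x224 input and the default strides this gives 56, 28, 14 and 7 for stages 0 to 3.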
- class mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs)[source]
ResNet (2+1)d backbone.
This model is proposed in “A Closer Look at Spatiotemporal Convolutions for Action Recognition”.
- class mmaction.models.backbones.ResNet3d(depth: int = 50, pretrained: str | None = None, stage_blocks: Tuple | None = None, pretrained2d: bool = True, in_channels: int = 3, num_stages: int = 4, base_channels: int = 64, out_indices: Sequence[int] = (3,), spatial_strides: Sequence[int] = (1, 2, 2, 2), temporal_strides: Sequence[int] = (1, 1, 1, 1), dilations: Sequence[int] = (1, 1, 1, 1), conv1_kernel: Sequence[int] = (3, 7, 7), conv1_stride_s: int = 2, conv1_stride_t: int = 1, pool1_stride_s: int = 2, pool1_stride_t: int = 1, with_pool1: bool = True, with_pool2: bool = True, style: str = 'pytorch', frozen_stages: int = -1, inflate: Sequence[int] = (1, 1, 1, 1), inflate_style: str = '3x1x1', conv_cfg: Dict = {'type': 'Conv3d'}, norm_cfg: Dict = {'requires_grad': True, 'type': 'BN3d'}, act_cfg: Dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, with_cp: bool = False, non_local: Sequence[int] = (0, 0, 0, 0), non_local_cfg: Dict = {}, zero_init_residual: bool = True, init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]
ResNet 3d backbone.
- Parameters:
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}. Defaults to 50.
pretrained (str, optional) – Name of pretrained model. Defaults to None.
stage_blocks (tuple, optional) – Number of blocks in each res layer (stage). Defaults to None.
pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.
in_channels (int) – Channel num of input features. Defaults to 3.
num_stages (int) – Resnet stages. Defaults to 4.
base_channels (int) – Channel num of stem output features. Defaults to 64.
out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Defaults to (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Defaults to (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Defaults to (3, 7, 7).
conv1_stride_s (int) – Spatial stride of the first conv layer. Defaults to 2.
conv1_stride_t (int) – Temporal stride of the first conv layer. Defaults to 1.
pool1_stride_s (int) – Spatial stride of the first pooling layer. Defaults to 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Defaults to 1.
with_pool2 (bool) – Whether to use pool2. Defaults to True.
style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.
inflate (Sequence[int]) – Inflate dims of each block. Defaults to (1, 1, 1, 1).
inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to 3x1x1.
conv_cfg (dict) – Config for conv layers. The required key is type. Defaults to dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Defaults to (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Defaults to dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- forward(x: Tensor) Tensor | Tuple[Tensor] [source]
Defines the computation performed at every call.
- Parameters:
x (torch.Tensor) – The input data.
- Returns:
The feature of the input samples extracted by the backbone.
- Return type:
torch.Tensor or tuple[torch.Tensor]
- static make_res_layer(block: Module, inplanes: int, planes: int, blocks: int, spatial_stride: int | Sequence[int] = 1, temporal_stride: int | Sequence[int] = 1, dilation: int = 1, style: str = 'pytorch', inflate: int | Sequence[int] = 1, inflate_style: str = '3x1x1', non_local: int | Sequence[int] = 0, non_local_cfg: Dict = {}, norm_cfg: Dict | None = None, act_cfg: Dict | None = None, conv_cfg: Dict | None = None, with_cp: bool = False, **kwargs) Module [source]
Build residual layer for ResNet3D.
- Parameters:
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Defaults to 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Defaults to 1.
dilation (int) – Spacing between kernel elements. Defaults to 1.
style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Defaults to 1.
inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to '3x1x1'.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Defaults to 0.
non_local_cfg (dict) – Config for non-local module. Defaults to dict().
conv_cfg (dict, optional) – Config for conv layers. Defaults to None.
norm_cfg (dict, optional) – Config for norm layers. Defaults to None.
act_cfg (dict, optional) – Config for activation layers. Defaults to None.
with_cp (bool, optional) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
- Returns:
A residual layer for the given config.
- Return type:
nn.Module
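The inflate and inflate_style options described above decide where the temporal extent of a bottleneck block lives. The mapping below is a hedged sketch of one plausible reading: a non-inflated block keeps purely spatial kernels, '3x1x1' puts the temporal kernel in conv1, and '3x3x3' puts a full 3D kernel in conv2. The exact kernel tuples are an assumption, not taken from this reference, and bottleneck_kernels is a hypothetical helper.

```python
def bottleneck_kernels(inflate: int, inflate_style: str = '3x1x1') -> dict:
    """Assumed (T, H, W) kernel sizes of conv1 and conv2 in a bottleneck block."""
    if not inflate:
        # 2D-like block: no temporal extent at all
        return dict(conv1=(1, 1, 1), conv2=(1, 3, 3))
    if inflate_style == '3x1x1':
        # temporal kernel in conv1, spatial kernel in conv2
        return dict(conv1=(3, 1, 1), conv2=(1, 3, 3))
    if inflate_style == '3x3x3':
        # full spatiotemporal kernel in conv2
        return dict(conv1=(1, 1, 1), conv2=(3, 3, 3))
    raise ValueError(f'unsupported inflate_style: {inflate_style!r}')


# Per-stage view for inflate=(0, 0, 1, 1), as used by the SlowOnly defaults.
for stage, flag in enumerate((0, 0, 1, 1)):
    print(stage, bottleneck_kernels(flag))
```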
- class mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]
ResNet backbone for CSN.
- Parameters:
depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).
inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.
bottleneck_mode (str) –
Determines how a 3D bottleneck block is factorized using channel-separated convolutions.
If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., an interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from the interaction-preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., an interaction-reduced channel-separated bottleneck block.
Default: ‘ir’.
kwargs (dict, optional) – Keyword arguments for “make_res_layer”.
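The difference between the two bottleneck_mode values is easiest to see in parameter counts. The sketch below compares a standard 3x3x3 conv2 with the ‘ip’ and ‘ir’ replacements described above, using the usual weight-count formula for grouped convolutions and ignoring biases. The function names and the ‘standard’ label are hypothetical and only serve the comparison.

```python
def conv3d_params(in_channels, out_channels, kernel=(3, 3, 3), groups=1):
    """Number of weights in a 3D convolution without bias."""
    kt, kh, kw = kernel
    return (in_channels // groups) * out_channels * kt * kh * kw


def conv2_params(channels, mode):
    """Weights used in place of the bottleneck's conv2 layer."""
    # depthwise convolution: one 3x3x3 filter per channel
    depthwise = conv3d_params(channels, channels, (3, 3, 3), groups=channels)
    if mode == 'ir':
        return depthwise
    if mode == 'ip':
        # extra 1x1x1 convolution keeps the channel interactions
        return conv3d_params(channels, channels, (1, 1, 1)) + depthwise
    if mode == 'standard':
        return conv3d_params(channels, channels, (3, 3, 3))
    raise ValueError(f'unknown mode: {mode!r}')


for mode in ('standard', 'ip', 'ir'):
    print(mode, conv2_params(64, mode))
```

For 64 channels this gives 110592 weights for the standard convolution, 5824 for ‘ip’ and 1728 for ‘ir’.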
- class mmaction.models.backbones.ResNet3dLayer(depth: int, pretrained: str | None = None, pretrained2d: bool = True, stage: int = 3, base_channels: int = 64, spatial_stride: int = 2, temporal_stride: int = 1, dilation: int = 1, style: str = 'pytorch', all_frozen: bool = False, inflate: int = 1, inflate_style: str = '3x1x1', conv_cfg: Dict = {'type': 'Conv3d'}, norm_cfg: Dict = {'requires_grad': True, 'type': 'BN3d'}, act_cfg: Dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = True, init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]
ResNet 3d Layer.
- Parameters:
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str, optional) – Name of pretrained model. Defaults to None.
pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.
stage (int) – The index of Resnet stage. Defaults to 3.
base_channels (int) – Channel num of stem output features. Defaults to 64.
spatial_stride (int) – The 1st res block’s spatial stride. Defaults to 2.
temporal_stride (int) – The 1st res block’s temporal stride. Defaults to 1.
dilation (int) – The dilation. Defaults to 1.
style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.
all_frozen (bool) – Freeze all modules in the layer. Defaults to False.
inflate (int) – Inflate dims of each block. Defaults to 1.
inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to '3x1x1'.
conv_cfg (dict) – Config for conv layers. The required key is type. Defaults to dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
- class mmaction.models.backbones.ResNet3dSlowFast(pretrained: str | None = None, resample_rate: int = 8, speed_ratio: int = 8, channel_ratio: int = 8, slow_pathway: Dict = {'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway: Dict = {'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, init_cfg: Dict | List[Dict] | None = None)[source]
SlowFast backbone.
This module is proposed in “SlowFast Networks for Video Recognition”.
- Parameters:
pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline by resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Defaults to 8.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Defaults to 8.
channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Defaults to 8.
slow_pathway (dict) – Configuration of slow branch. Defaults to dict(type='resnet3d', depth=50, pretrained=None, lateral=True, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1)).
fast_pathway (dict) – Configuration of fast branch. Defaults to dict(type='resnet3d', depth=50, pretrained=None, lateral=False, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1).
init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
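The relationship between resample_rate and speed_ratio can be illustrated with frame indices. In this sketch the slow pathway keeps one frame out of every resample_rate, and the fast pathway keeps one out of every resample_rate // speed_ratio, so the fast pathway sees speed_ratio times as many frames. Strided index selection is used here as a stand-in; how the library actually resamples the clip is not described in this reference, and pathway_frame_indices is a hypothetical helper.

```python
def pathway_frame_indices(num_frames, resample_rate=8, speed_ratio=8):
    """Frame indices seen by the slow and fast pathways of a sampled clip."""
    slow_step = resample_rate
    # the fast pathway samples speed_ratio times more densely (at least every frame)
    fast_step = max(resample_rate // speed_ratio, 1)
    slow = list(range(0, num_frames, slow_step))
    fast = list(range(0, num_frames, fast_step))
    return slow, fast


slow, fast = pathway_frame_indices(32)
print(len(slow), len(fast))   # 4 slow frames, 32 fast frames
print(slow)

# channel_ratio plays the mirrored role for width: with a slow pathway
# that starts at 64 channels, 64 // 8 gives the fast pathway's
# default base_channels of 8.
print(64 // 8)
```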
- class mmaction.models.backbones.ResNet3dSlowOnly(conv1_kernel: Sequence[int] = (1, 7, 7), conv1_stride_t: int = 1, pool1_stride_t: int = 1, inflate: Sequence[int] = (0, 0, 1, 1), with_pool2: bool = False, **kwargs)[source]
SlowOnly backbone based on ResNet3dPathway.
- Parameters:
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Defaults to (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Defaults to 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Defaults to 1.
inflate (Sequence[int]) – Inflate dims of each block. Defaults to (0, 0, 1, 1).
with_pool2 (bool) – Whether to use pool2. Defaults to False.
- class mmaction.models.backbones.ResNetAudio(depth: int, pretrained: str | None = None, in_channels: int = 1, num_stages: int = 4, base_channels: int = 32, strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), conv1_kernel: int = 9, conv1_stride: int = 1, frozen_stages: int = -1, factorize: Sequence[int] = (1, 1, 0, 0), norm_eval: bool = False, with_cp: bool = False, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, zero_init_residual: bool = True)[source]
ResNet 2d audio backbone. Reference:
- Parameters:
depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str, optional) – Name of pretrained model. Defaults to None.
in_channels (int) – Channel num of input features. Defaults to 1.
base_channels (int) – Channel num of stem output features. Defaults to 32.
num_stages (int) – Resnet stages. Defaults to 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Defaults to (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Defaults to 9.
conv1_stride (Union[int, Tuple[int]]) – Stride of the first conv layer. Defaults to 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.
factorize (Sequence[int]) – Factorize dims of each block for audio. Defaults to (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
conv_cfg (Union[dict, ConfigDict]) – Config for conv layers. Defaults to dict(type='Conv').
norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN2d', requires_grad=True).
act_cfg (Union[dict, ConfigDict]) – Config for activation layers. Defaults to dict(type='ReLU', inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.
- forward(x: Tensor) Tensor [source]
Defines the computation performed at every call.
- Parameters:
x (torch.Tensor) – The input data.
- Returns:
The feature of the input samples extracted by the backbone.
- Return type:
torch.Tensor
- static make_res_layer(block: Module, inplanes: int, planes: int, blocks: int, stride: int = 1, dilation: int = 1, factorize: int = 1, norm_cfg: ConfigDict | dict | None = None, with_cp: bool = False) Module [source]
Build residual layer for ResNetAudio.
- Parameters:
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Strides of residual blocks of each stage. Defaults to 1.
dilation (int) – Spacing between kernel elements. Defaults to 1.
factorize (Union[int, Sequence[int]]) – Determine whether to factorize for each block. Defaults to 1.
norm_cfg (Union[dict, ConfigDict], optional) – Config for norm layers. Defaults to None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
- Returns:
A residual layer for the given config.
- Return type:
nn.Module
- class mmaction.models.backbones.ResNetTIN(depth, is_tin=True, **kwargs)[source]
ResNet backbone for TIN.
- Parameters:
depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.
- class mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, pretrained2d=True, **kwargs)[source]¶
ResNet backbone for TSM.
- Parameters:
num_segments (int) – Number of frame segments. Defaults to 8.
is_shift (bool) – Whether to make temporal shift in resnet layers. Defaults to True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Defaults to (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Defaults to dict().
shift_div (int) – Number of div for shift. Defaults to 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Defaults to ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Defaults to False.
pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.
**kwargs (keyword arguments, optional) – Arguments for ResNet.
- load_original_weights(logger)[source]¶
Load weights from the original checkpoint, which requires converting keys.
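The shift_div parameter is easiest to read through the temporal shift idea itself: roughly 1/shift_div of the channels take their values from the next frame, another 1/shift_div from the previous frame, and the rest stay in place. The following is a minimal pure-Python sketch of that idea, assuming zero padding at the clip boundaries and ignoring batch and spatial dimensions. temporal_shift is a hypothetical helper written for illustration, not the function used by ResNetTSM.

```python
def temporal_shift(frames, shift_div=8):
    """Shift a fraction of channels along time (illustrative only).

    frames: list of T frames, each a list of C channel values.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div  # channels moved in each direction
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                # First fold: take the value from the next frame.
                if t + 1 < num_frames:
                    out[t][c] = frames[t + 1][c]
            elif c < 2 * fold:
                # Second fold: take the value from the previous frame.
                if t - 1 >= 0:
                    out[t][c] = frames[t - 1][c]
            else:
                # Remaining channels are left untouched.
                out[t][c] = frames[t][c]
    return out


# 3 frames, 8 channels; frame t holds the value t + 1 in every channel.
clip = [[float(t + 1)] * 8 for t in range(3)]
shifted = temporal_shift(clip, shift_div=4)
```

With shift_div=4 and 8 channels, two channels move in each direction and four stay put, so a larger shift_div mixes less temporal information into each frame.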
- class mmaction.models.backbones.STGCN(graph_cfg: Dict, in_channels: int = 3, base_channels: int = 64, data_bn_type: str = 'VC', ch_ratio: int = 2, num_person: int = 2, num_stages: int = 10, inflate_stages: List[int] = [5, 8], down_stages: List[int] = [5, 8], init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]¶
STGCN backbone.
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. More details can be found in the paper.
- Parameters:
graph_cfg (dict) – Config for building the graph.
in_channels (int) – Number of input channels. Defaults to 3.
base_channels (int) – Number of base channels. Defaults to 64.
data_bn_type (str) – Type of the data bn layer. Defaults to 'VC'.
ch_ratio (int) – Inflation ratio of the number of channels. Defaults to 2.
num_person (int) – Maximum number of people. Only used when data_bn_type == 'MVC'. Defaults to 2.
num_stages (int) – Total number of stages. Defaults to 10.
inflate_stages (list[int]) – Stages to inflate the number of channels. Defaults to [5, 8].
down_stages (list[int]) – Stages to perform downsampling in the time dimension. Defaults to [5, 8].
stage_cfgs (dict) – Extra config dict for each stage. Defaults to dict().
init_cfg (dict or list[dict], optional) – Config to control the initialization. Defaults to None.
Examples:
>>> import torch
>>> from mmaction.models import STGCN
>>>
>>> mode = 'stgcn_spatial'
>>> batch_size, num_person, num_frames = 2, 2, 150
>>>
>>> # openpose-18 layout
>>> num_joints = 18
>>> model = STGCN(graph_cfg=dict(layout='openpose', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # nturgb+d layout
>>> num_joints = 25
>>> model = STGCN(graph_cfg=dict(layout='nturgb+d', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # coco layout
>>> num_joints = 17
>>> model = STGCN(graph_cfg=dict(layout='coco', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # custom settings
>>> # instantiate STGCN++
>>> model = STGCN(graph_cfg=dict(layout='coco', mode='spatial'),
...               gcn_adaptive='init', gcn_with_res=True,
...               tcn_type='mstcn')
>>> model.init_weights()
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 18])
torch.Size([2, 2, 256, 38, 25])
torch.Size([2, 2, 256, 38, 17])
torch.Size([2, 2, 256, 38, 17])
- class mmaction.models.backbones.SwinTransformer3D(arch: str | Dict, pretrained: str | None = None, pretrained2d: bool = True, patch_size: int | Sequence[int] = (2, 4, 4), in_channels: int = 3, window_size: Sequence[int] = (8, 7, 7), mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, act_cfg: Dict = {'type': 'GELU'}, norm_cfg: Dict = {'type': 'LN'}, patch_norm: bool = True, frozen_stages: int = -1, with_cp: bool = False, out_indices: Sequence[int] = (3,), out_after_downsample: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])[source]¶
Video Swin Transformer backbone.
A PyTorch implementation of: Video Swin Transformer
- Parameters:
arch (str or dict) – Video Swin Transformer architecture. If a string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If a dict, it should have the following keys:
- embed_dims (int): The dimensions of embedding.
- depths (Sequence[int]): The number of blocks in each stage.
- num_heads (Sequence[int]): The number of heads in attention modules of each stage.
pretrained (str, optional) – Name of pretrained model. Defaults to None.
pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.
patch_size (int or Sequence[int]) – Patch size. Defaults to (2, 4, 4).
in_channels (int) – Number of input image channels. Defaults to 3.
window_size (Sequence[int]) – Window size. Defaults to (8, 7, 7).
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Defaults to 4.
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Defaults to True.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
drop_rate (float) – Dropout rate. Defaults to 0.0.
attn_drop_rate (float) – Attention dropout rate. Defaults to 0.0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.
act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='GELU').
norm_cfg (dict) – Config dict for norm layer. Defaults to dict(type='LN').
patch_norm (bool) – If True, add normalization after patch embedding. Defaults to True.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.
out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).
out_after_downsample (bool) – Whether to output the feature map of a stage after the following downsample layer. Defaults to False.
init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.)].
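As an illustration of the dict form of arch, the sketch below writes a backbone config in the usual dict(type=...) style. The embed_dims, depths and num_heads values are the commonly cited Swin-T setting and are assumed here to match the ‘tiny’ preset; the other values are copied from the signature defaults above.

```python
# Hypothetical backbone config fragment for illustration.
backbone = dict(
    type='SwinTransformer3D',
    # Dict form of `arch`; values assumed to mirror the 'tiny' preset.
    arch=dict(embed_dims=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24]),
    pretrained=None,
    patch_size=(2, 4, 4),
    window_size=(8, 7, 7),
    drop_path_rate=0.1,
    out_indices=(3, ),
)

# One entry per stage is expected in both per-stage lists.
num_stages = len(backbone['arch']['depths'])
```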
- inflate_weights(logger: MMLogger) → None [source]¶
Inflate the swin2d parameters to swin3d.
The differences between swin3d and swin2d mainly lie in an extra axis. To utilize the pretrained parameters of the 2d model, the weights of swin2d models should be inflated to fit the shapes of their 3d counterparts.
- Parameters:
logger (MMLogger) – The logger used to print debugging information.
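One common way to inflate a 2D kernel into a 3D one is to repeat it along the new temporal axis and divide by the temporal size, so that a static input produces the same response as before. The sketch below shows that scheme on nested lists. It is an assumption about the general technique, not a description of what inflate_weights does for every parameter, and inflate_kernel is a hypothetical helper.

```python
def inflate_kernel(kernel_2d, temporal_size):
    """Repeat a 2D kernel along a new leading time axis and rescale it."""
    return [
        [[w / temporal_size for w in row] for row in kernel_2d]
        for _ in range(temporal_size)
    ]


kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
kernel_3d = inflate_kernel(kernel_2d, temporal_size=2)

# Summing over the time axis recovers the original 2D kernel.
recovered = [
    [sum(kernel_3d[t][i][j] for t in range(2)) for j in range(2)]
    for i in range(2)
]
```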
- class mmaction.models.backbones.TANet(depth: int, num_segments: int, tam_cfg: dict | None = None, **kwargs)[source]¶
Temporal Adaptive Network (TANet) backbone.
This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.
- Parameters:
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict, optional) – Config for temporal adaptive module (TAM). Defaults to None.
- class mmaction.models.backbones.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs)[source]¶
TimeSformer. A PyTorch implementation of Is Space-Time Attention All You Need for Video Understanding?
- Parameters:
num_frames (int) – Number of frames in the video.
img_size (int | tuple) – Size of input image.
patch_size (int) – Size of one patch.
pretrained (str | None) – Name of pretrained model. Default: None.
embed_dims (int) – Dimensions of embedding. Defaults to 768.
num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.
num_transformer_layers (int) – Number of transformer layers. Defaults to 12.
in_channels (int) – Channel num of input features. Defaults to 3.
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.
transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of transformerlayer in TransformerCoder. If it is a mmcv.ConfigDict, it would be repeated num_transformer_layers times to form a list[mmcv.ConfigDict]. Defaults to None.
attention_type (str) – Type of attentions in TransformerCoder. Choices are ‘divided_space_time’, ‘space_only’ and ‘joint_space_time’. Defaults to ‘divided_space_time’.
norm_cfg (dict) – Config for norm layers. Defaults to dict(type='LN', eps=1e-6).
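A backbone config fragment built only from the parameters documented above might look like the following. The values for num_frames, img_size and patch_size are illustrative assumptions, since these three arguments have no defaults; the remaining values repeat the signature defaults.

```python
# Hypothetical config fragment; num_frames, img_size and patch_size are
# example values chosen for illustration.
backbone = dict(
    type='TimeSformer',
    num_frames=8,
    img_size=224,
    patch_size=16,
    embed_dims=768,
    num_heads=12,
    num_transformer_layers=12,
    in_channels=3,
    dropout_ratio=0.0,
    transformer_layers=None,
    attention_type='divided_space_time',
    norm_cfg=dict(type='LN', eps=1e-6),
)

# The documented choices for attention_type.
valid_attention_types = ('divided_space_time', 'space_only', 'joint_space_time')
```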
- class mmaction.models.backbones.UniFormer(depth: List[int] = [5, 8, 20, 7], img_size: int = 224, in_chans: int = 3, embed_dim: List[int] = [64, 128, 320, 512], head_dim: int = 64, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, pretrained2d: bool = True, pretrained: str | None = None, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])[source]¶
UniFormer.
A PyTorch implementation of: UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning <https://arxiv.org/abs/2201.04676>
- Parameters:
depth (List[int]) – List of depth in each stage. Defaults to [5, 8, 20, 7].
img_size (int) – Size of the input image. Defaults to 224.
in_chans (int) – Number of input channels. Defaults to 3.
head_dim (int) – Dimension of attention head. Defaults to 64.
embed_dim (List[int]) – List of embedding dimension in each layer. Defaults to [64, 128, 320, 512].
mlp_ratio (float) – Ratio of mlp hidden dimension to embedding dimension. Defaults to 4.
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Defaults to True.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
drop_rate (float) – Dropout rate. Defaults to 0.0.
attn_drop_rate (float) – Attention dropout rate. Defaults to 0.0.
drop_path_rate (float) – Stochastic depth rates. Defaults to 0.0.
pretrained2d (bool) – Whether to load pretrained from 2D model. Defaults to True.
pretrained (str) – Name of pretrained model. Defaults to None.
init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.)].
- forward(x: Tensor) → Tensor [source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.backbones.UniFormerV2(input_resolution: int = 224, patch_size: int = 16, width: int = 768, layers: int = 12, heads: int = 12, backbone_drop_path_rate: float = 0.0, t_size: int = 8, kernel_size: int = 3, dw_reduction: float = 1.5, temporal_downsample: bool = False, no_lmhra: bool = True, double_lmhra: bool = False, return_list: List[int] = [8, 9, 10, 11], n_layers: int = 4, n_dim: int = 768, n_head: int = 12, mlp_factor: float = 4.0, drop_path_rate: float = 0.0, mlp_dropout: List[float] = [0.5, 0.5, 0.5, 0.5], clip_pretrained: bool = True, pretrained: str | None = None, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])[source]¶
UniFormerV2:
A PyTorch implementation of: UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer <https://arxiv.org/abs/2211.09552>
- Parameters:
input_resolution (int) – Input resolution. Defaults to 224.
patch_size (int) – Patch size. Defaults to 16.
width (int) – Number of input channels in local UniBlock. Defaults to 768.
layers (int) – Number of layers of local UniBlock. Defaults to 12.
heads (int) – Number of attention head in local UniBlock. Defaults to 12.
backbone_drop_path_rate (float) – Stochastic depth rate in local UniBlock. Defaults to 0.0.
t_size (int) – Number of temporal dimension after patch embedding. Defaults to 8.
temporal_downsample (bool) – Whether to downsample the temporal dimension. Defaults to False.
dw_reduction (float) – Downsample ratio of input channels in local MHRA. Defaults to 1.5.
no_lmhra (bool) – Whether to remove local MHRA in local UniBlock. Defaults to True.
double_lmhra (bool) – Whether to use double local MHRA in local UniBlock. Defaults to False.
return_list (List[int]) – Layer index of input features for global UniBlock. Defaults to [8, 9, 10, 11].
n_layers (int) – Number of layers of global UniBlock. Defaults to 4.
n_dim (int) – Number of input channels in global UniBlock. Defaults to 768.
n_head (int) – Number of attention head in global UniBlock. Defaults to 12.
mlp_factor (float) – Ratio of hidden dimensions in MLP layers in global UniBlock. Defaults to 4.0.
drop_path_rate (float) – Stochastic depth rate in global UniBlock. Defaults to 0.0.
mlp_dropout (List[float]) – Stochastic dropout rate in each MLP layer in global UniBlock. Defaults to [0.5, 0.5, 0.5, 0.5].
clip_pretrained (bool) – Whether to load pretrained CLIP visual encoder. Defaults to True.
pretrained (str) – Name of pretrained model. Defaults to None.
init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.)].
- forward(x: Tensor) → Tensor [source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.backbones.VisionTransformer(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dims: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: int = 4.0, qkv_bias: bool = True, qk_scale: int | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: ConfigDict | dict = {'eps': 1e-06, 'type': 'LN'}, init_values: int = 0.0, use_learnable_pos_emb: bool = False, num_frames: int = 16, tubelet_size: int = 2, use_mean_pooling: int = True, pretrained: str | None = None, return_feat_map: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}], **kwargs)[source]¶
Vision Transformer with support for patch or hybrid CNN input stage. An implementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- Parameters:
img_size (int or tuple) – Size of input image. Defaults to 224.
patch_size (int) – Spatial size of one patch. Defaults to 16.
in_channels (int) – The number of channels of the input. Defaults to 3.
embed_dims (int) – Dimensions of embedding. Defaults to 768.
depth (int) – Number of blocks in the transformer. Defaults to 12.
num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.
mlp_ratio (int) – The ratio between the hidden layer and the input layer in the FFN. Defaults to 4.
qkv_bias (bool) – If True, add a learnable bias to q and v. Defaults to True.
qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.
drop_rate (float) – Dropout ratio of output. Defaults to 0.
attn_drop_rate (float) – Dropout ratio of attention weight. Defaults to 0.
drop_path_rate (float) – Dropout ratio of the residual branch. Defaults to 0.
norm_cfg (dict or ConfigDict) – Config for norm layers. Defaults to dict(type='LN', eps=1e-6).
init_values (float) – Value to init the multiplier of the residual branch. Defaults to 0.
use_learnable_pos_emb (bool) – If True, use learnable positional embedding, otherwise use sinusoid encoding. Defaults to False.
num_frames (int) – Number of frames in the video. Defaults to 16.
tubelet_size (int) – Temporal size of one patch. Defaults to 2.
use_mean_pooling (bool) – If True, take the mean pooling over all positions. Defaults to True.
pretrained (str, optional) – Name of pretrained model. Default: None.
return_feat_map (bool) – If True, return the feature in the shape of [B, C, T, H, W]. Defaults to False.
init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.)].
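The interaction of img_size, patch_size, num_frames and tubelet_size can be made concrete by counting tokens. The sketch below assumes non-overlapping patches of size tubelet_size x patch_size x patch_size and no extra class token; num_tokens is a hypothetical helper, not part of the class.

```python
def num_tokens(img_size=224, patch_size=16, num_frames=16, tubelet_size=2):
    """Count spatio-temporal patches, assuming non-overlapping tubelets."""
    patches_per_side = img_size // patch_size    # spatial patches per axis
    temporal_patches = num_frames // tubelet_size
    return temporal_patches * patches_per_side * patches_per_side


# Defaults taken from the signature above: 8 temporal x 14 x 14 spatial.
tokens = num_tokens()
```

Halving tubelet_size doubles the token count under these assumptions, which is the main cost to keep in mind since attention scales with the square of the sequence length.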
- class mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=-1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶
X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.
- Parameters:
gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters:
x (torch.Tensor) – The input data.
- Returns:
The feature of the input samples extracted by the backbone.
- Return type:
torch.Tensor
- make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶
Build residual layer for ResNet3D.
- Parameters:
block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals to base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns:
A residual layer for the given config.
- Return type:
nn.Module
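The relation between the expansion factors and the channel counts passed to make_res_layer can be checked with a few lines of arithmetic. In the sketch below, base_channels = 24 and gamma_b = 2.25 are assumed example values that do not appear in this section, and any rounding the backbone may apply to the resulting widths is left out.

```python
base_channels = 24   # assumption: example base width, not stated in this section
gamma_w = 1.0        # global channel width expansion factor
gamma_b = 2.25       # assumption: example bottleneck expansion factor

# Input width of each block, as described for `inplanes`.
inplanes = base_channels * gamma_w
# Bottleneck width of each block, as described for `planes`.
planes = base_channels * gamma_w * gamma_b
```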
common¶
- class mmaction.models.common.Conv2plus1d(in_channels: int, out_channels: int, kernel_size: int | Tuple[int], stride: int | Tuple[int] = 1, padding: int | Tuple[int] = 0, dilation: int | Tuple[int] = 1, groups: int = 1, bias: bool | str = True, norm_cfg: ConfigDict | dict = {'type': 'BN3d'})[source]¶
(2+1)d Conv module for R(2+1)d backbone.
https://arxiv.org/pdf/1711.11248.pdf.
- Parameters:
in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (Union[int, Tuple[int]]) – Same as nn.Conv3d.
stride (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 1.
padding (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 0.
dilation (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 1.
groups (int) – Same as nn.Conv3d. Defaults to 1.
bias (Union[bool, str]) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Defaults to dict(type='BN3d').
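The R(2+1)D paper linked above factorizes a t x d x d convolution into a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal one, and chooses the intermediate channel count so that the parameter count stays close to that of the full 3D convolution. The sketch below works through that formula in plain Python. Whether Conv2plus1d applies exactly this rule is an assumption, and mid_channels is a hypothetical helper.

```python
def mid_channels(in_channels, out_channels, t=3, d=3):
    """Intermediate width that roughly matches a full 3D conv's parameters."""
    return (t * d * d * in_channels * out_channels) // (
        d * d * in_channels + t * out_channels)


n_in, n_out, t, d = 64, 64, 3, 3
m = mid_channels(n_in, n_out, t, d)

# Weights of a full t x d x d convolution (bias ignored).
params_3d = t * d * d * n_in * n_out
# Weights of the spatial 1 x d x d conv plus the temporal t x 1 x 1 conv.
params_2plus1d = d * d * n_in * m + t * m * n_out
```

For this 64-to-64 channel case the two counts coincide exactly; in general the floor division makes the factorized block slightly smaller.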
- class mmaction.models.common.ConvAudio(in_channels: int, out_channels: int, kernel_size: int | Tuple[int], op: str = 'concat', stride: int | Tuple[int] = 1, padding: int | Tuple[int] = 0, dilation: int | Tuple[int] = 1, groups: int = 1, bias: bool | str = False)[source]¶
Conv2d module for AudioResNet backbone.
- Parameters:
in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (Union[int, Tuple[int]]) – Same as nn.Conv2d.
op (str) – Operation to merge the output of freq and time feature map. Choices are sum and concat. Defaults to concat.
stride (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 1.
padding (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 0.
dilation (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 1.
groups (int) – Same as nn.Conv2d. Defaults to 1.
bias (Union[bool, str]) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False. Defaults to False.
- class mmaction.models.common.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]¶
Spatial Attention in Divided Space Time Attention.
- Parameters:
embed_dims (int) – Dimensions of embedding.
num_heads (int) – Number of parallel attention heads in TransformerCoder.
num_frames (int) – Number of frames in the video.
attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.
proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.
dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type='DropPath', drop_prob=0.1).
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
init_cfg (dict | None) – The Config for initialization. Defaults to None.
- class mmaction.models.common.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]¶
Temporal Attention in Divided Space Time Attention.
- Parameters:
embed_dims (int) – Dimensions of embedding.
num_heads (int) – Number of parallel attention heads in TransformerCoder.
num_frames (int) – Number of frames in the video.
attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.
proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.
dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type='DropPath', drop_prob=0.1).
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
init_cfg (dict | None) – The Config for initialization. Defaults to None.
- class mmaction.models.common.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs)[source]¶
FFN with pre normalization layer.
FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.
FFNWithNorm has one main difference with FFN:
- It applies one normalization layer before forwarding the input data to feed-forward networks.
- Parameters:
embed_dims (int) – Dimensions of embedding. Defaults to 256.
feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.
num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.
act_cfg (dict) – Config for activation layers. Defaults to dict(type='ReLU').
ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0.
add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.
dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.
init_cfg (dict) – The Config for initialization. Defaults to None.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
- class mmaction.models.common.SubBatchNorm3D(num_features, **cfg)[source]¶
Sub BatchNorm3d splits the batch dimension into N splits and runs BN on each of them separately, so that the stats are computed on each subset of examples (1/N of the batch) independently. During evaluation, it aggregates the stats from all splits into one BN.
- Parameters:
num_features (int) – Dimensions of BatchNorm.
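The aggregation step at evaluation time can be illustrated with the standard identity for merging statistics of equally sized groups: the overall mean is the mean of the split means, and the overall variance is the mean of the split variances plus the variance of the split means. The sketch below checks this in plain Python. It assumes equally sized splits and population variance, and aggregate_stats is a hypothetical helper, not the method of SubBatchNorm3D.

```python
from statistics import fmean, pvariance


def aggregate_stats(means, variances):
    """Merge per-split mean/variance, assuming equally sized splits."""
    mean = fmean(means)
    # Within-split variance plus the spread of the split means.
    var = fmean(variances) + fmean([(m - mean) ** 2 for m in means])
    return mean, var


# One channel, a batch of 8 values divided into N = 2 splits.
splits = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
means = [fmean(s) for s in splits]
variances = [pvariance(s) for s in splits]
mean, var = aggregate_stats(means, variances)

# Reference: statistics computed over the whole batch at once.
full_batch = [v for s in splits for v in s]
```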
- class mmaction.models.common.TAM(in_channels: int, num_segments: int, alpha: int = 2, adaptive_kernel_size: int = 3, beta: int = 4, conv1d_kernel_size: int = 3, adaptive_convolution_stride: int = 1, adaptive_convolution_padding: int = 1, init_std: float = 0.001)[source]¶
Temporal Adaptive Module (TAM) for TANet.
This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
- Parameters:
in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – alpha in the paper, which is the ratio of the intermediate channel number to the initial channel number in the global branch. Defaults to 2.
adaptive_kernel_size (int) – K in the paper, which is the size of the adaptive kernel in the global branch. Defaults to 3.
beta (int) – beta in the paper, which is set to control the model complexity in the local branch. Defaults to 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Defaults to 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of Temporal Adaptive Aggregation. Defaults to 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of Temporal Adaptive Aggregation. Defaults to 1.
init_std (float) – Std value for initiation of nn.Linear. Defaults to 0.001.
data_preprocessors¶
- class mmaction.models.data_preprocessors.ActionDataPreprocessor(mean: Sequence[float | int] | None = None, std: Sequence[float | int] | None = None, to_rgb: bool = False, to_float32: bool = True, blending: dict | None = None, format_shape: str = 'NCHW')[source]¶
Data pre-processor for action recognition tasks.
- Parameters:
mean (Sequence[float or int], optional) – The pixel mean of channels of images or stacked optical flow. Defaults to None.
std (Sequence[float or int], optional) – The pixel standard deviation of channels of images or stacked optical flow. Defaults to None.
to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
to_float32 (bool) – Whether to convert data to float32. Defaults to True.
blending (dict, optional) – Config for batch blending. Defaults to None.
format_shape (str) – Format shape of input data. Defaults to 'NCHW'.
- forward(data: dict | Tuple[dict], training: bool = False) → dict | Tuple[dict] [source]¶
Perform normalization, padding, bgr2rgb conversion and batch augmentation based on BaseDataPreprocessor.
- Parameters:
data (dict or Tuple[dict]) – Data sampled from dataloader.
training (bool) – Whether to enable training time augmentation.
- Returns:
Data in the same format as the model input.
- Return type:
dict or Tuple[dict]
- forward_onesample(data, training: bool = False) → dict [source]¶
Perform normalization, padding, bgr2rgb conversion and batch augmentation on one data sample.
- Parameters:
data (dict) – Data sampled from dataloader.
training (bool) – Whether to enable training time augmentation.
- Returns:
Data in the same format as the model input.
- Return type:
dict
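The effect of mean, std and to_rgb on a single pixel can be sketched as follows. This assumes the channel flip happens before normalization and that mean and std are given in RGB order; the statistics used are the widely used ImageNet values, chosen here only as an example. normalize_pixel is a hypothetical helper, not part of ActionDataPreprocessor.

```python
def normalize_pixel(pixel, mean, std, to_rgb=False):
    """Normalize one 3-channel pixel; optionally flip BGR to RGB first."""
    if to_rgb:
        pixel = pixel[::-1]
    return [(p - m) / s for p, m, s in zip(pixel, mean, std)]


mean = [123.675, 116.28, 103.53]   # assumption: ImageNet RGB means
std = [58.395, 57.12, 57.375]      # assumption: ImageNet RGB stds

# A BGR pixel whose RGB values equal the channel means maps to zeros.
bgr_pixel = [103.53, 116.28, 123.675]
normalized = normalize_pixel(bgr_pixel, mean, std, to_rgb=True)
```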
heads¶
- class mmaction.models.heads.BaseHead(num_classes: int, in_channels: int, loss_cls: Dict = {'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class: bool = False, label_smooth_eps: float = 0.0, topk: int | Tuple[int] = (1, 5), average_clips: Dict | None = None, init_cfg: Dict | None = None)[source]¶
Base class for head.
All heads should subclass it. All subclasses should overwrite:
- forward(), which supports forwarding for both training and testing.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss', loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Defaults to False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Defaults to 0.
topk (int or tuple) – Top-k accuracy. Defaults to (1, 5).
average_clips (dict, optional) – Config for averaging class scores over multiple clips. Defaults to None.
init_cfg (dict, optional) – Config to control the initialization. Defaults to None.
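To make label_smooth_eps concrete, the sketch below applies uniform label smoothing to a one-hot target: each class receives eps / num_classes and the true class keeps the remaining 1 - eps on top of that. This is the textbook form of the technique and is assumed, not confirmed, to be what the head applies; smooth_one_hot is a hypothetical helper.

```python
def smooth_one_hot(label, num_classes, eps):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    return [
        (1.0 - eps) * (1.0 if c == label else 0.0) + eps / num_classes
        for c in range(num_classes)
    ]


# Class 2 out of 4 classes with eps = 0.1.
target = smooth_one_hot(label=2, num_classes=4, eps=0.1)
```

The smoothed target still sums to 1, and with eps = 0 it reduces to the plain one-hot vector.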
- average_clip(cls_scores: Tensor, num_segs: int = 1) Tensor [source]
Average class scores over multiple clips.
Uses the configured averaging type ('score', 'prob', or None, as defined in test_cfg) to compute the final averaged class score. Only called in test mode.
- Parameters:
cls_scores (torch.Tensor) – Class scores to be averaged.
num_segs (int) – Number of clips for each input sample.
- Returns:
Averaged class scores.
- Return type:
torch.Tensor
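The difference between the 'score' and 'prob' averaging types can be sketched in plain Python. This is an illustrative approximation that uses nested lists instead of tensors: the function names are hypothetical, and it assumes that 'prob' applies a softmax to each clip before averaging while 'score' averages the raw scores.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of class scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_clip_sketch(cls_scores, num_segs=1, average_clips=None):
    """Average per-clip scores; cls_scores has batch_size * num_segs rows."""
    if average_clips not in ('score', 'prob', None):
        raise ValueError(f'unsupported average_clips: {average_clips}')
    if average_clips is None:
        # No averaging requested: return the scores untouched.
        return cls_scores
    batch_size = len(cls_scores) // num_segs
    averaged = []
    for b in range(batch_size):
        clips = cls_scores[b * num_segs:(b + 1) * num_segs]
        if average_clips == 'prob':
            # Convert every clip to probabilities before averaging.
            clips = [softmax(c) for c in clips]
        num_classes = len(clips[0])
        averaged.append(
            [sum(c[k] for c in clips) / num_segs for k in range(num_classes)])
    return averaged
```

Averaging probabilities limits the influence of a single clip with very large logits, whereas averaging raw scores lets such a clip dominate.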
- abstract forward(x, **kwargs) Dict[str, Tensor] | List[ActionDataSample] | Tuple[Tensor] | Tensor [source]
Defines the computation performed at every call.
- loss(feats: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) Dict [source]
Perform forward propagation of the head and calculate the loss on the features of the upstream network.
- Parameters:
feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
A dictionary of loss components.
- Return type:
dict
- loss_by_feat(cls_scores: Tensor, data_samples: List[ActionDataSample]) Dict [source]
Calculate the loss based on the features extracted by the head.
- Parameters:
cls_scores (torch.Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
A dictionary of loss components.
- Return type:
dict
- predict(feats: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) List[ActionDataSample] [source]
Perform forward propagation of the head and predict recognition results on the features of the upstream network.
- Parameters:
feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
Recognition results wrapped by ActionDataSample.
- Return type:
list[ActionDataSample]
- predict_by_feat(cls_scores: Tensor, data_samples: List[ActionDataSample]) List[ActionDataSample] [source]
Transform a batch of output features extracted from the head into prediction results.
- Parameters:
cls_scores (torch.Tensor) – Classification scores, with shape (B*num_segs, num_classes).
data_samples (list[ActionDataSample]) – The annotation data of every sample. It usually includes information such as gt_label.
- Returns:
Recognition results wrapped by ActionDataSample.
- Return type:
List[ActionDataSample]
- class mmaction.models.heads.FeatureHead(spatial_type: str = 'avg', temporal_type: str = 'avg', backbone_name: str | None = None, num_segments: str | None = None, **kwargs)[source]
General head for feature extraction.
- Parameters:
spatial_type (str, optional) – Pooling type in the spatial dimension. Default: 'avg'. If set to None, the spatial dimension is kept; for a GCN backbone, the last two dimensions (T, V) are kept.
temporal_type (str, optional) – Pooling type in the temporal dimension. Default: 'avg'. If set to None, the temporal dimension is kept; for a GCN backbone, dimension M is kept. Note that the channel order stays the same as the backbone output: [N, T, C, H, W] for a 2D recognizer and [N, M, C, T, V] for a GCN recognizer.
backbone_name (str, optional) – Backbone name used to select special operations. Currently supports 'tsm', 'slowfast', and 'gcn'. Defaults to None, which treats the input as a normal feature.
num_segments (int, optional) – Number of frame segments for TSM backbone. Defaults to None.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x: Tensor, num_segs: int | None = None, **kwargs) Tensor [source]
Defines the computation performed at every call.
- Parameters:
x (Tensor) – The input data.
num_segs (int) – For 2D backbone. Number of segments into which a video is divided. Defaults to None.
- Returns:
The output features after pooling.
- Return type:
Tensor
- predict_by_feat(feats: Tensor | Tuple[Tensor], data_samples) Tensor [source]
Integrate multi-view features into one tensor.
- Parameters:
feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
The integrated multi-view features.
- Return type:
Tensor
- class mmaction.models.heads.GCNHead(num_classes: int, in_channels: int, loss_cls: Dict = {'type': 'CrossEntropyLoss'}, dropout: float = 0.0, average_clips: str = 'prob', init_cfg: Dict | List[Dict] = {'layer': 'Linear', 'std': 0.01, 'type': 'Normal'}, **kwargs)[source]
The classification head for GCN.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').
dropout (float) – Probability of dropout layer. Defaults to 0.
init_cfg (dict or list[dict]) – Config to control the initialization. Defaults to dict(type='Normal', layer='Linear', std=0.01).
- class mmaction.models.heads.I3DHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, **kwargs)[source]
Classification head for I3D.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.MViTHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, dropout_ratio: float = 0.5, init_std: float = 0.02, init_scale: float = 1.0, with_cls_token: bool = True, **kwargs)[source]
Classification head for Multi-scale ViT.
A PyTorch implementation of: MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.5.
init_std (float) – Std value for initialization. Defaults to 0.02.
init_scale (float) – Scale factor for initialization parameters. Defaults to 1.
with_cls_token (bool) – Whether the backbone output feature includes a cls_token. Defaults to True.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.OmniHead(image_classes: int, video_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, image_dropout_ratio: float = 0.2, video_dropout_ratio: float = 0.5, video_nl_head: bool = True, **kwargs)[source]
Classification head for OmniResNet that accepts both image and video inputs.
- Parameters:
image_classes (int) – Number of image classes to be classified.
video_classes (int) – Number of video classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
image_dropout_ratio (float) – Probability of dropout layer for the image head. Defaults to 0.2.
video_dropout_ratio (float) – Probability of dropout layer for the video head. Defaults to 0.5.
video_nl_head (bool) – If True, use a non-linear head for the video head. Defaults to True.
- forward(x: Tensor, **kwargs) Tensor [source]
Defines the computation performed at every call.
- Parameters:
x (Tensor) – The input data.
- Returns:
The classification scores for input samples.
- Return type:
Tensor
- loss_by_feat(cls_scores: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample]) dict [source]
Calculate the loss based on the features extracted by the head.
- Parameters:
cls_scores (Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).
data_samples (List[ActionDataSample]) – The batch data samples.
- Returns:
A dictionary of loss components.
- Return type:
dict
- class mmaction.models.heads.RGBPoseHead(num_classes: int, in_channels: Tuple[int], loss_cls: Dict = {'type': 'CrossEntropyLoss'}, loss_components: List[str] = ['rgb', 'pose'], loss_weights: float | Tuple[float] = 1.0, dropout: float = 0.5, init_std: float = 0.01, **kwargs)[source]
The classification head for RGBPoseConv3D.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (tuple[int]) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').
loss_components (list[str]) – The components of the loss. Defaults to ['rgb', 'pose'].
loss_weights (float or tuple[float]) – The weights of the losses. Defaults to 1.
dropout (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
- loss(feats: Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) Dict [source]
Perform forward propagation of the head and calculate the loss on the features of the upstream network.
- Parameters:
feats (tuple[torch.Tensor]) – Features from upstream network.
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
A dictionary of loss components.
- Return type:
dict
- loss_by_feat(cls_scores: Dict[str, Tensor], data_samples: List[ActionDataSample]) Dict [source]
Calculate the loss based on the features extracted by the head.
- Parameters:
cls_scores (dict[str, torch.Tensor]) – The dict of classification scores.
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
A dictionary of loss components.
- Return type:
dict
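One way to read loss_components and loss_weights together is as a per-stream weighted loss. The sketch below is a pure-Python illustration under stated assumptions: the helper names, the output keys of the form `<component>_loss_cls`, and the broadcasting of a scalar weight to every component are hypothetical and not confirmed by this page.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def weighted_component_losses(cls_scores, labels,
                              loss_components=('rgb', 'pose'),
                              loss_weights=1.0):
    """Compute one weighted classification loss per component.

    cls_scores maps a component name to a list of per-sample logits.
    The key format '<name>_loss_cls' is an assumption for illustration.
    """
    if isinstance(loss_weights, (int, float)):
        # Broadcast a scalar weight to every component.
        loss_weights = (float(loss_weights),) * len(loss_components)
    if len(loss_weights) != len(loss_components):
        raise ValueError('need one weight per loss component')
    losses = {}
    for name, weight in zip(loss_components, loss_weights):
        per_sample = [
            cross_entropy(row, y) for row, y in zip(cls_scores[name], labels)
        ]
        losses[f'{name}_loss_cls'] = weight * sum(per_sample) / len(per_sample)
    return losses
```

Keeping one entry per component makes it possible to log the RGB and pose losses separately while still optimizing their weighted sum.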
- loss_by_scores(cls_scores: Tensor, labels: Tensor) Dict [source]
Calculate the loss based on the features extracted by the head.
- Parameters:
cls_scores (torch.Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).
labels (torch.Tensor) – The labels used to calculate the loss.
- Returns:
A dictionary of loss components.
- Return type:
dict
- predict(feats: Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) List[ActionDataSample] [source]
Perform forward propagation of the head and predict recognition results on the features of the upstream network.
- Parameters:
feats (tuple[torch.Tensor]) – Features from upstream network.
data_samples (list[ActionDataSample]) – The batch data samples.
- Returns:
Recognition results wrapped by ActionDataSample.
- Return type:
list[ActionDataSample]
- predict_by_feat(cls_scores: Dict[str, Tensor], data_samples: List[ActionDataSample]) List[ActionDataSample] [source]
Transform a batch of output features extracted from the head into prediction results.
- Parameters:
cls_scores (dict[str, torch.Tensor]) – The dict of classification scores.
data_samples (list[ActionDataSample]) – The annotation data of every sample. It usually includes information such as gt_label.
- Returns:
Recognition results wrapped by ActionDataSample.
- Return type:
list[ActionDataSample]
- predict_by_scores(cls_scores: Tensor, data_samples: List[ActionDataSample]) Tensor [source]
Transform a batch of output features extracted from the head into prediction results.
- Parameters:
cls_scores (torch.Tensor) – Classification scores, with shape (B*num_segs, num_classes).
data_samples (list[ActionDataSample]) – The annotation data of every sample.
- Returns:
The averaged classification scores.
- Return type:
torch.Tensor
- class mmaction.models.heads.SlowFastHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.8, init_std: float = 0.01, **kwargs)[source]
The classification head for SlowFast.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.TPNHead(*args, **kwargs)[source]
Class head for TPN.
- forward(x, num_segs: int | None = None, fcn_test: bool = False, **kwargs) Tensor [source]
Defines the computation performed at every call.
- Parameters:
x (Tensor) – The input data.
num_segs (int, optional) – Number of segments into which a video is divided. Defaults to None.
fcn_test (bool) – Whether to apply fully convolutional (FCN) testing. Defaults to False.
- Returns:
The classification scores for input samples.
- Return type:
Tensor
- class mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]
Class head for TRN.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
relation_type (str) – The relation module type. Choices are 'TRN' or 'TRNMultiScale'. Default: 'TRNMultiScale'.
hidden_dim (int) – The dimension of the hidden layer of the MLP in the relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x, num_segs, **kwargs)[source]
Defines the computation performed at every call.
- Parameters:
x (torch.Tensor) – The input data.
num_segs (int) – Unused in TRNHead. By default, num_segs equals clip_len * num_clips * num_crops; it is generated automatically in the recognizer's forward phase and is not used by TRN models. TRN models instead rely on self.num_segments, a hyperparameter set when building the model.
- Returns:
The classification scores for input samples.
- Return type:
torch.Tensor
- class mmaction.models.heads.TSMHead(num_classes: int, in_channels: int, num_segments: int = 8, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', consensus: ConfigDict | dict = {'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio: float = 0.8, init_std: float = 0.001, is_shift: bool = True, temporal_pool: bool = False, **kwargs)[source]
Class head for TSM.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
consensus (dict or ConfigDict) – Consensus config dict. Default: dict(type='AvgConsensus', dim=1).
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.001.
is_shift (bool) – Indicates whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicates whether the feature is temporally pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x: Tensor, num_segs: int, **kwargs) Tensor [source]
Defines the computation performed at every call.
- Parameters:
x (Tensor) – The input data.
num_segs (int) – Unused in TSMHead. By default, num_segs equals clip_len * num_clips * num_crops; it is generated automatically in the recognizer's forward phase and is not used by TSM models. TSM models instead rely on self.num_segments, a hyperparameter set when building the model.
- Returns:
The classification scores for input samples.
- Return type:
Tensor
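The role of num_segments together with an averaging consensus along dimension 1 can be sketched without tensors. The example below is an assumption-laden illustration: the function name is hypothetical, and it assumes per-frame scores arrive as num_videos * num_segments rows that are grouped by video before being averaged.

```python
def avg_consensus(frame_scores, num_segments):
    """Average per-segment class scores into one score row per video.

    frame_scores: list of num_videos * num_segments rows of class scores,
    ordered so that the segments of one video are contiguous.
    """
    if len(frame_scores) % num_segments != 0:
        raise ValueError('row count must be a multiple of num_segments')
    num_videos = len(frame_scores) // num_segments
    video_scores = []
    for i in range(num_videos):
        # Rows belonging to video i, i.e. the "segment" axis (dim 1).
        segs = frame_scores[i * num_segments:(i + 1) * num_segments]
        num_classes = len(segs[0])
        video_scores.append(
            [sum(s[c] for s in segs) / num_segments
             for c in range(num_classes)])
    return video_scores
```

This is why the head needs its own num_segments hyperparameter: the grouping depends on how the model was built, not on the num_segs value passed at call time.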
- class mmaction.models.heads.TSNAudioHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.4, init_std: float = 0.01, **kwargs)[source]
Classification head for TSN on audio.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (Union[dict, ConfigDict]) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Defaults to 'avg'.
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.4.
init_std (float) – Std value for initialization. Defaults to 0.01.
- class mmaction.models.heads.TSNHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', consensus: ConfigDict | dict = {'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio: float = 0.4, init_std: float = 0.01, **kwargs)[source]
Class head for TSN.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
consensus (dict or ConfigDict) – Consensus config dict. Default: dict(type='AvgConsensus', dim=1).
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
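Heads like this are usually described by a config dict rather than constructed by hand. The fragment below is a sketch of such a config using the defaults from the signature above; the values for num_classes and in_channels are illustrative assumptions (they depend on the dataset and backbone), and the `type` key assumes the head is looked up by class name.

```python
# Illustrative head config; num_classes and in_channels are assumptions.
tsn_head_cfg = dict(
    type='TSNHead',            # class name, assumed to be resolved by a registry
    num_classes=400,           # assumption: a 400-class dataset
    in_channels=2048,          # assumption: backbone emits 2048 channels
    loss_cls=dict(type='CrossEntropyLoss'),
    spatial_type='avg',
    consensus=dict(type='AvgConsensus', dim=1),
    dropout_ratio=0.4,
    init_std=0.01,
)
```

Only num_classes and in_channels are required by the signature; the remaining keys restate the documented defaults and can be omitted.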
- class mmaction.models.heads.TimeSformerHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, init_std: float = 0.02, dropout_ratio: float = 0.0, **kwargs)[source]
Classification head for TimeSformer.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').
init_std (float) – Std value for initialization. Defaults to 0.02.
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.UniFormerHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, dropout_ratio: float = 0.0, channel_map: str | None = None, init_cfg: dict | None = {'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'}, **kwargs)[source]
Classification head for UniFormer. Supports loading a pretrained Kinetics-710 checkpoint for fine-tuning on other Kinetics datasets.
A PyTorch implementation of: UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer <https://arxiv.org/abs/2211.09552>
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.
channel_map (str, optional) – Channel map file used to select channels from a pretrained head with extra channels. Defaults to None.
init_cfg (dict or ConfigDict, optional) – Config to control the initialization. Defaults to dict(type='TruncNormal', layer='Linear', std=0.02).
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.X3DHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, fc1_bias: bool = False, **kwargs)[source]
Classification head for X3D.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
fc1_bias (bool) – Whether the first fc layer has a bias. Default: False.
localizers
- class mmaction.models.localizers.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]
Boundary Matching Network for temporal action proposal generation.
Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network
- Parameters:
temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
- forward(inputs, data_samples, mode, **kwargs)[source]
The unified entry for a forward process in both training and test.
The method should accept three modes:
- tensor: Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.
- predict: Forward and return the predictions, which are fully processed to a list of ActionDataSample.
- loss: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters:
inputs (Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[ActionDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Return what kind of value. Defaults to tensor.
- Returns:
The return type depends on mode.
If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of ActionDataSample.
If mode="loss", return a dict of tensors.
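The three-mode contract described above amounts to a small dispatcher. The sketch below illustrates it with a toy class; ToyLocalizer, its doubling "network", and the dictionary-shaped predictions are hypothetical stand-ins, not the localizer's actual implementation.

```python
class ToyLocalizer:
    """Hypothetical model illustrating the tensor/predict/loss dispatch."""

    def _forward(self, inputs):
        # Stand-in for the network: simply doubles every input value.
        return [2 * x for x in inputs]

    def loss(self, inputs, data_samples):
        # Mean squared error against the targets in data_samples.
        preds = self._forward(inputs)
        mse = sum((p - t) ** 2 for p, t in zip(preds, data_samples)) / len(preds)
        return {'loss': mse}

    def predict(self, inputs, data_samples=None):
        # Wrap raw outputs into post-processed result objects.
        return [{'pred': p} for p in self._forward(inputs)]

    def forward(self, inputs, data_samples=None, mode='tensor'):
        if mode == 'tensor':
            return self._forward(inputs)
        if mode == 'predict':
            return self.predict(inputs, data_samples)
        if mode == 'loss':
            return self.loss(inputs, data_samples)
        raise RuntimeError(f'Invalid mode: {mode}')
```

Note that none of the branches calls an optimizer or runs a backward pass, which matches the statement that those steps belong to train_step().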
- loss(batch_inputs, batch_data_samples, **kwargs)[source]
Calculate losses from a batch of inputs and data samples.
- Parameters:
batch_inputs (Tensor) – Raw inputs of the recognizer. These should usually be mean centered and std scaled.
batch_data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_labels.
- Returns:
A dictionary of loss components.
- Return type:
dict
- class mmaction.models.localizers.DRN(vocab_size: int = 1301, hidden_dim: int = 512, embed_dim: int = 300, bidirection: bool = True, first_output_dim: int = 256, fpn_feature_dim: int = 512, feature_dim: int = 4096, lstm_layers: int = 1, fcos_pre_nms_top_n: int = 32, fcos_inference_thr: float = 0.05, fcos_prior_prob: float = 0.01, focal_alpha: float = 0.25, focal_gamma: float = 2.0, fpn_stride: Sequence[int] = [1, 2, 4], fcos_nms_thr: float = 0.6, fcos_conv_layers: int = 1, fcos_num_class: int = 2, is_first_stage: bool = False, is_second_stage: bool = False, init_cfg: ConfigDict | dict | None = None, **kwargs)[source]
Dense Regression Network for Video Grounding.
Please refer to Dense Regression Network for Video Grounding. Code reference: https://github.com/Alvin-Zeng/DRN
- Parameters:
vocab_size (int) – Number of all possible words in the query. Defaults to 1301.
hidden_dim (int) – The hidden dimension of the LSTM in the language model. Defaults to 512.
embed_dim (int) – The embedding dimension of the query. Defaults to 300.
bidirection (bool) – If True, use a bidirectional LSTM in the language model. Defaults to True.
first_output_dim (int) – The output dimension of the first layer in the backbone. Defaults to 256.
fpn_feature_dim (int) – The output dimension of the FPN. Defaults to 512.
feature_dim (int) – The dimension of the video clip feature. Defaults to 4096.
lstm_layers (int) – The number of LSTM layers in the language model. Defaults to 1.
fcos_pre_nms_top_n (int) – Value of Top-N in the FCOS module before NMS. Defaults to 32.
fcos_inference_thr (float) – Threshold in FCOS inference. BBoxes with scores higher than this threshold are regarded as positive. Defaults to 0.05.
fcos_prior_prob (float) – A prior probability of the positive bboxes. Used to initialize the bias of the classification head. Defaults to 0.01.
focal_alpha (float) – Focal loss hyper-parameter alpha. Defaults to 0.25.
focal_gamma (float) – Focal loss hyper-parameter gamma. Defaults to 2.0.
fpn_stride (Sequence[int]) – The strides in the FPN. Defaults to [1, 2, 4].
fcos_nms_thr (float) – NMS threshold in the FCOS module. Defaults to 0.6.
fcos_conv_layers (int) – Number of convolution layers in FCOS. Defaults to 1.
fcos_num_class (int) – Number of classes in FCOS. Defaults to 2.
is_first_stage (bool) – If True, the model is in the first stage of training. Defaults to False.
is_second_stage (bool) – If True, the model is in the second stage of training. Defaults to False.
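The parameters focal_alpha, focal_gamma and fcos_prior_prob correspond to the standard focal-loss formulation and its usual bias initialization. The sketch below shows those textbook formulas for a single binary prediction; the function names are hypothetical, and whether DRN applies them in exactly this form is an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.

    FL = -alpha_t * (1 - p_t) ** gamma * log(p_t), where p_t is the
    predicted probability of the true class.
    """
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def prior_prob_bias(prior_prob=0.01):
    """Bias b such that sigmoid(b) equals prior_prob."""
    return -math.log((1.0 - prior_prob) / prior_prob)
```

Initializing the classification bias this way makes the head predict roughly prior_prob for every location at the start of training, so the many easy negatives do not dominate the early loss.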
- forward(inputs, data_samples, mode, **kwargs)[source]
Returns losses or predictions of training, validation, testing, and simple inference process.
The forward method of BaseModel is an abstract method; its subclasses must implement this method.
Accepts batch_inputs and data_sample processed by data_preprocessor, and returns results according to the mode argument.
During non-distributed training, validation, and testing process, forward will be called by BaseModel.train_step, BaseModel.val_step and BaseModel.test_step directly.
During distributed data parallel training process, MMSeparateDistributedDataParallel.train_step will first call DistributedDataParallel.forward to enable automatic gradient synchronization, and then call forward to get the training loss.
- Parameters:
inputs (torch.Tensor) – Batch input tensor collated by data_preprocessor.
data_samples (list, optional) – Data samples collated by data_preprocessor.
mode (str) – Should be one of loss, predict and tensor:
- loss: Called by train_step; returns a loss dict used for logging.
- predict: Called by val_step and test_step; returns a list of results used for computing metrics.
- tensor: Called by custom use to get Tensor type results.
- Returns:
If mode == loss, return a dict of loss tensors used for backward and logging.
If mode == predict, return a list of inference results.
If mode == tensor, return a tensor, a tuple of tensors, or a dict of tensors for custom use.
- Return type:
dict or list
- class mmaction.models.localizers.PEM(pem_feat_dim: int, pem_hidden_dim: int, pem_u_ratio_m: float, pem_u_ratio_l: float, pem_high_temporal_iou_threshold: float, pem_low_temporal_iou_threshold: float, soft_nms_alpha: float, soft_nms_low_threshold: float, soft_nms_high_threshold: float, post_process_top_k: int, feature_extraction_interval: int = 16, fc1_ratio: float = 0.1, fc2_ratio: float = 0.1, output_dim: int = 1)[source]
Proposals Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters:
pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.
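The soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold and post_process_top_k parameters suggest a Gaussian soft-NMS over temporal proposals. The following pure-Python sketch shows one plausible reading: the function names are hypothetical, proposals are assumed to be (start, end, score) tuples normalized to [0, 1], and the width-dependent threshold that interpolates between the low and high values is an assumption rather than the documented rule.

```python
import math


def temporal_iou(a, b):
    """IoU of two 1-D segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_1d(proposals, alpha, low_threshold, high_threshold, top_k):
    """Gaussian soft-NMS over (start, end, score) proposals.

    Assumption: the IoU threshold grows with the width of the selected
    proposal, interpolating between low_threshold and high_threshold.
    """
    remaining = [list(p) for p in proposals]  # copy so input is not mutated
    kept = []
    while remaining and len(kept) < top_k:
        # Pick the highest-scoring proposal that is still available.
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        cur = remaining.pop(best)
        kept.append(tuple(cur))
        width = cur[1] - cur[0]
        threshold = low_threshold + (high_threshold - low_threshold) * width
        for other in remaining:
            iou = temporal_iou(cur, other)
            if iou > threshold:
                # Decay, rather than delete, heavily overlapping proposals.
                other[2] *= math.exp(-(iou ** 2) / alpha)
    return kept
```

Unlike hard NMS, overlapping proposals are kept with a reduced score, so a smaller alpha suppresses them more aggressively while top_k bounds how many proposals are returned.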
- forward(inputs, data_samples, mode, **kwargs)[source]
The unified entry for a forward process in both training and test.
The method should accept three modes:
- tensor: Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.
- predict: Forward and return the predictions, which are fully processed to a list of ActionDataSample.
- loss: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters:
inputs (Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[ActionDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Return what kind of value. Defaults to tensor.
- Returns:
The return type depends on mode.
If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of ActionDataSample.
If mode="loss", return a dict of tensors.
- loss(batch_inputs, batch_data_samples, **kwargs)[source]
Calculate losses from a batch of inputs and data samples.
- Parameters:
batch_inputs (Tensor) – Raw inputs of the recognizer. These should usually be mean centered and std scaled.
batch_data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_labels.
- Returns:
A dictionary of loss components.
- Return type:
dict
- class mmaction.models.localizers.TCANet(feat_dim: int = 2304, se_sample_num: int = 32, action_sample_num: int = 64, temporal_dim: int = 100, window_size: int = 9, lgte_num: int = 2, soft_nms_alpha: float = 0.4, soft_nms_low_threshold: float = 0.0, soft_nms_high_threshold: float = 0.0, post_process_top_k: int = 100, feature_extraction_interval: int = 16, init_cfg: ConfigDict | dict | None = None, **kwargs)[source]
Temporal Context Aggregation Network.
Please refer to Temporal Context Aggregation Network for Temporal Action Proposal Refinement. Code reference: https://github.com/qinzhi-0110/Temporal-Context-Aggregation-Network-Pytorch
- forward(inputs, data_samples, mode, **kwargs)[source]
The unified entry for a forward process in both training and test.
The method should accept three modes:
- tensor: Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.
- predict: Forward and return the predictions, which are fully processed to a list of ActionDataSample.
- loss: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters:
inputs (Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[ActionDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Return what kind of value. Defaults to tensor.
- Returns:
The return type depends on mode.
If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of ActionDataSample.
If mode="loss", return a dict of tensors.
- class mmaction.models.localizers.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]
Temporal Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters:
temporal_dim (int) – Total frames selected for each video.
tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.
- forward(inputs, data_samples, mode, **kwargs)[source]¶
The unified entry for a forward process in both training and test.
The method should accept three modes:
- tensor: Forward the whole network and return a tensor or a tuple of tensors without any post-processing, same as a common nn.Module.
- predict: Forward and return the predictions, which are fully processed to a list of ActionDataSample.
- loss: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters:
inputs (Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[ActionDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Return what kind of value. Defaults to tensor.
- Returns:
The return type depends on mode.
If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of ActionDataSample.
If mode="loss", return a dict of tensors.
- loss(batch_inputs, batch_data_samples, **kwargs)[source]¶
Calculate losses from a batch of inputs and data samples.
- Parameters:
batch_inputs (Tensor) – Raw inputs of the recognizer. These should usually be mean centered and std scaled.
batch_data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_labels.
- Returns:
A dictionary of loss components.
- Return type:
dict
losses¶
- class mmaction.models.losses.BCELossWithLogits(loss_weight: float = 1.0, class_weight: List[float] | None = None)[source]¶
Binary Cross Entropy Loss with logits.
- Parameters:
loss_weight (float) – Factor scalar multiplied on the loss. Defaults to 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Defaults to None.
- class mmaction.models.losses.BMNLoss(*args, **kwargs)[source]¶
BMN Loss.
From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It will calculate loss for BMN Model. This loss is a weighted sum of:
1) temporal evaluation loss based on confidence score of start and end positions.
2) proposal evaluation regression loss based on confidence scores of candidate proposals.
3) proposal evaluation classification loss based on classification results of candidate proposals.
- forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶
Calculate Boundary Matching Network Loss.
- Parameters:
pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.
- Returns:
(loss, tem_loss, pem_reg_loss, pem_cls_loss). loss is the BMN loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, and pem_cls_loss is the proposal evaluation classification loss.
- Return type:
tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
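The "weighted sum" described for BMNLoss can be sketched with plain floats. The helper name `bmn_total_loss` is hypothetical, and the three component values are assumed to have been computed already; only the combination step and the returned 4-tuple follow the parameter and return descriptions above.

```python
from typing import Tuple


def bmn_total_loss(
    tem_loss: float,
    pem_reg_loss: float,
    pem_cls_loss: float,
    weight_tem: float = 1.0,
    weight_pem_reg: float = 10.0,
    weight_pem_cls: float = 1.0,
) -> Tuple[float, float, float, float]:
    """Combine the three component losses into one weighted total (sketch)."""
    loss = (
        weight_tem * tem_loss
        + weight_pem_reg * pem_reg_loss
        + weight_pem_cls * pem_cls_loss
    )
    # Mirror the documented return order: total first, then the components.
    return loss, tem_loss, pem_reg_loss, pem_cls_loss
```

With the default weights, the regression term counts ten times as much as the other two, which is why a small regression error can dominate the total.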
- static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶
Calculate Proposal Evaluation Module Classification Loss.
- Parameters:
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21).
eps (float) – Epsilon for small value. Default: 1e-5.
- Returns:
Proposal evaluation classification loss.
- Return type:
torch.Tensor
- static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶
Calculate Proposal Evaluation Module Regression Loss.
- Parameters:
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.
- Returns:
Proposal evaluation regression loss.
- Return type:
torch.Tensor
- static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶
Calculate Temporal Evaluation Module Loss.
This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.
- Parameters:
pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.
- Returns:
Returned binary logistic loss.
- Return type:
torch.Tensor
- class mmaction.models.losses.BaseWeightedLoss(loss_weight=1.0)[source]¶
Base class for loss.
All subclasses should override the _forward() method, which returns the normal loss without loss weights.
- Parameters:
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
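The division of labor between `_forward()` and the loss weight can be sketched as a template-method pattern. `WeightedLossSketch` and `AbsErrorLoss` are hypothetical names, and scalar floats stand in for tensors; this illustrates the idea under those assumptions rather than reproducing the library's base class.

```python
from abc import ABC, abstractmethod


class WeightedLossSketch(ABC):
    """Hypothetical base class: subclasses supply the unweighted loss."""

    def __init__(self, loss_weight: float = 1.0) -> None:
        self.loss_weight = loss_weight

    @abstractmethod
    def _forward(self, *args, **kwargs) -> float:
        """Return the plain loss, without any loss weight applied."""

    def forward(self, *args, **kwargs) -> float:
        # The base class applies the weight so subclasses never have to.
        return self._forward(*args, **kwargs) * self.loss_weight


class AbsErrorLoss(WeightedLossSketch):
    """Toy subclass: absolute error between two scalars."""

    def _forward(self, pred: float, target: float) -> float:
        return abs(pred - target)
```

For example, `AbsErrorLoss(loss_weight=0.5).forward(3.0, 1.0)` computes an unweighted error of 2.0 and scales it by 0.5.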
- class mmaction.models.losses.BinaryLogisticRegressionLoss(*args, **kwargs)[source]¶
Binary Logistic Regression Loss.
It will calculate binary logistic regression loss given reg_score and label.
- forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶
Calculate Binary Logistic Regression Loss.
- Parameters:
reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21).
eps (float) – Epsilon for small value. Default: 1e-5.
- Returns:
Returned binary logistic loss.
- Return type:
torch.Tensor
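To make the roles of `threshold`, `ratio_range`, and `eps` concrete, here is a pure-Python sketch of a class-balanced binary logistic loss in the style used by BSN-like code. The exact weighting (coefficients derived from the ratio of all entries to positive entries) is an assumption on my part, since this page does not spell out the formula; treat it as one plausible formulation, not the library's definitive one.

```python
import math
from typing import Sequence, Tuple


def binary_logistic_regression_loss(
    reg_score: Sequence[float],
    label: Sequence[float],
    threshold: float = 0.5,
    ratio_range: Tuple[float, float] = (1.05, 21),
    eps: float = 1e-5,
) -> float:
    """Balanced binary logistic loss over flat score/label sequences (sketch)."""
    # Entries whose label exceeds the threshold count as positives.
    pmask = [1.0 if y > threshold else 0.0 for y in label]
    num_positive = max(sum(pmask), 1.0)
    # Ratio of all entries to positives, clamped to ratio_range.
    ratio = len(label) / num_positive
    ratio = min(max(ratio, ratio_range[0]), ratio_range[1])
    # Assumed balancing coefficients for positive and negative terms.
    coef_pos = 0.5 * ratio
    coef_neg = 0.5 * ratio / (ratio - 1.0)
    total = 0.0
    for score, m in zip(reg_score, pmask):
        total += coef_pos * m * math.log(score + eps)
        total += coef_neg * (1.0 - m) * math.log(1.0 - score + eps)
    return -total / len(label)
```

The lower bound of `ratio_range` (1.05 by default) keeps `ratio - 1` strictly positive, so the negative-class coefficient never divides by zero, and `eps` guards the logarithms against scores of exactly 0 or 1.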
- class mmaction.models.losses.CBFocalLoss(loss_weight: float = 1.0, samples_per_cls: List[int] = [], beta: float = 0.9999, gamma: float = 2.0)[source]¶
Class Balanced Focal Loss. Adapted from https://github.com/abhinanda-punnakkal/BABEL/. This loss is used in the skeleton-based action recognition baseline for BABEL.
- Parameters:
loss_weight (float) – Factor scalar multiplied on the loss. Defaults to 1.0.
samples_per_cls (list[int]) – The number of samples per class. Defaults to [].
beta (float) – Hyperparameter that controls the per class loss weight. Defaults to 0.9999.
gamma (float) – Hyperparameter of the focal loss. Defaults to 2.0.
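How `beta` and `samples_per_cls` might turn into per-class weights can be sketched with the "effective number of samples" formulation from the class-balanced loss literature. That this class uses exactly this weighting and normalization is an assumption; `class_balanced_weights` is a hypothetical helper, and the focal term controlled by `gamma` is left out.

```python
from typing import List, Sequence


def class_balanced_weights(
    samples_per_cls: Sequence[int], beta: float = 0.9999
) -> List[float]:
    """Per-class weights from the effective number of samples (sketch).

    Assumes every class has at least one sample.
    """
    # Effective number of samples for class i is (1 - beta**n_i) / (1 - beta);
    # the weight is its inverse.
    weights = [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_cls]
    # Normalize so the weights sum to the number of classes.
    total = sum(weights)
    return [w / total * len(weights) for w in weights]
```

With `beta` close to 1 the weights approach inverse class frequency, while `beta = 0` would give every class the same weight.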
- class mmaction.models.losses.CrossEntropyLoss(loss_weight: float = 1.0, class_weight: List[float] | None = None)[source]¶
Cross Entropy Loss.
Support two kinds of labels and their corresponding loss type. It’s worth mentioning that loss type will be detected by the shape of cls_score and label.
1) Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft label is supported.
- Parameters:
loss_weight (float) – Factor scalar multiplied on the loss. Defaults to 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Defaults to None.
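The difference between hard and soft labels can be shown with a small pure-Python sketch. The helper names are hypothetical, nested lists stand in for tensors, and the label kind is detected from the element type here (an integer versus a row of probabilities) as a simplification of the shape-based detection described above. Class weights are omitted.

```python
import math
from typing import List, Sequence, Union


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(
    cls_score: Sequence[Sequence[float]],
    label: Sequence[Union[int, Sequence[float]]],
) -> float:
    """Mean cross entropy for hard (int) or soft (distribution) labels."""
    losses = []
    for logits, target in zip(cls_score, label):
        logp = log_softmax(logits)
        if isinstance(target, int):
            # Hard label: pick the log-probability of the true class.
            losses.append(-logp[target])
        else:
            # Soft label: expectation of -log p under the target distribution.
            losses.append(-sum(t * lp for t, lp in zip(target, logp)))
    return sum(losses) / len(losses)
```

A one-hot soft label gives the same value as the corresponding hard label, which is a quick sanity check for either code path.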
- class mmaction.models.losses.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶
Calculate the BCELoss for HVU.
- Parameters:
categories (tuple[str]) – Names of tag categories; tags are organized in this order. Default: [‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’].
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories. It applies only if loss_type == ‘individual’. The loss weights will be normalized so that they sum to 1, so you can give any positive number as a loss weight. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type to calculate: either the BCELoss over all tags, or the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, we will not calculate loss for these missing categories. Otherwise, these missing categories are treated as negative samples.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
- class mmaction.models.losses.NLLLoss(loss_weight=1.0)[source]¶
NLL Loss.
It will calculate NLL loss given cls_score and label.
- class mmaction.models.losses.OHEMHingeLoss(*args, **kwargs)[source]¶
This class is the core implementation for the completeness loss in paper.
It computes class-wise hinge loss and performs online hard example mining (OHEM).
- static backward(ctx, grad_output)[source]¶
Defines a formula for differentiating the operation with backward mode automatic differentiation.
- static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶
Calculate OHEM hinge loss.
- Parameters:
pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.
- Returns:
Returned class-wise hinge loss.
- Return type:
torch.Tensor
- class mmaction.models.losses.SSNLoss(*args, **kwargs)[source]¶
- static activity_loss(activity_score, labels, activity_indexer)[source]¶
Activity Loss.
It will calculate activity loss given activity_score and label.
- Parameters:
activity_score (torch.Tensor) – Predicted activity score.
labels (torch.Tensor) – Groundtruth class label.
activity_indexer (torch.Tensor) – Index slices of proposals.
- Returns:
Returned cross entropy loss.
- Return type:
torch.Tensor
- static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶
Classwise Regression Loss.
It will calculate classwise_regression loss given class_reg_pred and targets.
- Parameters:
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor) – Index slices of positive proposals.
- Returns:
Returned class-wise regression loss.
- Return type:
torch.Tensor
- static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶
Completeness Loss.
It will calculate completeness loss given completeness_score and label.
- Parameters:
completeness_score (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.
positive_per_video (int) – Number of positive proposals sampled per video.
incomplete_per_video (int) – Number of incomplete proposals sampled per video.
ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.
- Returns:
Returned class-wise completeness loss.
- Return type:
torch.Tensor
- forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶
Calculate SSN Loss.
- Parameters:
activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.
- Returns:
(loss_activity, loss_completeness, loss_reg). loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, and loss_reg is the class-wise regression loss.
- Return type:
dict([torch.Tensor, torch.Tensor, torch.Tensor])
necks¶
- class mmaction.models.necks.TPN(in_channels: Tuple[int], out_channels: int, spatial_modulation_cfg: ConfigDict | dict | None = None, temporal_modulation_cfg: ConfigDict | dict | None = None, upsample_cfg: ConfigDict | dict | None = None, downsample_cfg: ConfigDict | dict | None = None, level_fusion_cfg: ConfigDict | dict | None = None, aux_head_cfg: ConfigDict | dict | None = None, flow_type: str = 'cascade')[source]¶
TPN neck.
This module is proposed in Temporal Pyramid Network for Action Recognition.
- Parameters:
in_channels (Tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict or ConfigDict, optional) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Defaults to None.
temporal_modulation_cfg (dict or ConfigDict, optional) – Config for temporal modulation layers. Defaults to None.
upsample_cfg (dict or ConfigDict, optional) – Config for upsample layers. The keys are the same as those in nn.Upsample. Defaults to None.
downsample_cfg (dict or ConfigDict, optional) – Config for downsample layers. Defaults to None.
level_fusion_cfg (dict or ConfigDict, optional) – Config for level fusion layers. Required keys are in_channels, mid_channels, out_channels. Defaults to None.
aux_head_cfg (dict or ConfigDict, optional) – Config for aux head layers. The required key is out_channels. Defaults to None.
flow_type (str) – Flow type to combine the features. Options are cascade and parallel. Defaults to cascade.
- forward(x: Tuple[Tensor], data_samples: List[ActionDataSample] | None = None) tuple [source]¶
Defines the computation performed at every call.
roi_heads¶
recognizers¶
task_modules¶
utils¶
- class mmaction.models.utils.BaseMiniBatchBlending(num_classes: int)[source]¶
Base class for Image Aliasing.
- Parameters:
num_classes (int) – Number of classes.
- class mmaction.models.utils.CutmixBlending(num_classes: int, alpha: float = 0.2)[source]¶
Implementing Cutmix in a mini-batch.
This module is proposed in CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Code reference: https://github.com/clovaai/CutMix-PyTorch
- Parameters:
num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.
- do_blending(imgs: Tensor, label: Tensor, **kwargs) Tuple [source]¶
Blending images with cutmix.
- Parameters:
imgs (torch.Tensor) – Model input images, float tensor with the shape of (B, N, C, H, W) or (B, N, C, T, H, W).
label (torch.Tensor) – One hot labels, integer tensor with the shape of (B, num_classes).
- Returns:
A tuple of blended images and labels.
- Return type:
tuple
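The core of CutMix is choosing a rectangular patch whose area is roughly a `1 - lam` fraction of the image and then recomputing the mixing ratio from the clipped patch. The sketch below follows the recipe of the CutMix reference code in pure Python; `rand_bbox` and `adjusted_lambda` are illustrative names, and how this class samples and pastes the patch internally is not stated on this page.

```python
import math
import random
from typing import Tuple

Box = Tuple[int, int, int, int]  # (y1, y2, x1, x2)


def rand_bbox(height: int, width: int, lam: float, rng: random.Random) -> Box:
    """Sample a patch covering about (1 - lam) of the image area."""
    cut_ratio = math.sqrt(1.0 - lam)
    cut_h = int(height * cut_ratio)
    cut_w = int(width * cut_ratio)
    # Uniformly sampled patch center.
    cy = rng.randrange(height)
    cx = rng.randrange(width)
    # Clip the patch to the image borders.
    y1 = max(cy - cut_h // 2, 0)
    y2 = min(cy + cut_h // 2, height)
    x1 = max(cx - cut_w // 2, 0)
    x2 = min(cx + cut_w // 2, width)
    return y1, y2, x1, x2


def adjusted_lambda(box: Box, height: int, width: int) -> float:
    """Recompute lam from the actual (clipped) patch area."""
    y1, y2, x1, x2 = box
    return 1.0 - (y2 - y1) * (x2 - x1) / (height * width)
```

In a full implementation, `lam` would be drawn from `Beta(alpha, alpha)`, the patch would be copied from a shuffled copy of the batch, and the labels would be mixed with the adjusted ratio.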
- class mmaction.models.utils.Graph(layout: str | dict = 'coco', mode: str = 'spatial', max_hop: int = 1)[source]¶
The Graph to model the skeletons.
- Parameters:
layout (str or dict) – Must be one of the following candidates: ‘openpose’, ‘nturgb+d’, ‘coco’, or a dict with the following keys: ‘num_node’, ‘inward’, and ‘center’. Defaults to 'coco'.
mode (str) – Must be one of the following candidates: ‘stgcn_spatial’, ‘spatial’. Defaults to 'spatial'.
max_hop (int) – The maximal distance between two connected nodes. Defaults to 1.
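To illustrate what a custom layout dict and `max_hop` express, the following sketch computes hop distances on a skeleton graph with breadth-first search. It assumes that 'inward' is a list of (child, parent) joint index pairs, which this page does not define; `hop_distance` is a hypothetical helper, and the 'center' key is not used here.

```python
from collections import deque
from typing import List, Sequence, Tuple


def hop_distance(
    num_node: int, inward: Sequence[Tuple[int, int]], max_hop: int = 1
) -> List[List[float]]:
    """Pairwise hop distances, with inf for pairs farther than max_hop."""
    # Treat every edge as undirected when measuring distance.
    neighbors = [set() for _ in range(num_node)]
    for i, j in inward:
        neighbors[i].add(j)
        neighbors[j].add(i)

    inf = float("inf")
    dist = [[inf] * num_node for _ in range(num_node)]
    for src in range(num_node):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[src][u] == max_hop:
                continue  # do not expand beyond max_hop
            for v in neighbors[u]:
                if dist[src][v] == inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist
```

For a three-joint chain such as `dict(num_node=3, inward=[(1, 0), (2, 1)], center=0)`, joints 0 and 2 are connected only when `max_hop` is at least 2.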
- class mmaction.models.utils.MixupBlending(num_classes: int, alpha: float = 0.2)[source]¶
Implementing Mixup in a mini-batch.
This module is proposed in mixup: Beyond Empirical Risk Minimization. Code reference: https://github.com/open-mmlab/mmclassification/blob/master/mmcls/models/utils/mixup.py
- Parameters:
num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.
- do_blending(imgs: Tensor, label: Tensor, **kwargs) Tuple [source]¶
Blending images with mixup.
- Parameters:
imgs (torch.Tensor) – Model input images, float tensor with the shape of (B, N, C, H, W) or (B, N, C, T, H, W).
label (torch.Tensor) – One hot labels, integer tensor with the shape of (B, num_classes).
- Returns:
A tuple of blended images and labels.
- Return type:
tuple
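Mixup itself is a convex combination of each sample with a randomly chosen partner from the same batch. The sketch below uses flat Python lists in place of tensors and the standard library's `random.betavariate` for the Beta draw; the function name `mixup` and the returned `lam` are illustrative additions, not this class's interface.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixup(
    imgs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[float]],
    alpha: float = 0.2,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[List[float]], float]:
    """Blend every sample with a shuffled partner using one shared lam."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # lam ~ Beta(alpha, alpha)
    perm = list(range(len(imgs)))
    rng.shuffle(perm)

    def blend(a: Sequence[float], b: Sequence[float]) -> List[float]:
        return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

    mixed_imgs = [blend(imgs[i], imgs[p]) for i, p in enumerate(perm)]
    mixed_labels = [blend(labels[i], labels[p]) for i, p in enumerate(perm)]
    return mixed_imgs, mixed_labels, lam
```

Because the inputs are one-hot rows and the two blend weights sum to 1, every mixed label row still sums to 1, so it remains a valid soft label.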
- class mmaction.models.utils.RandomBatchAugment(augments: dict | list, probs: float | List[float] | None = None)[source]¶
Randomly choose one batch augmentation to apply.
- Parameters:
augments (dict | list) – Configs of batch augmentations.
probs (float | List[float] | None) – The probabilities of each batch augmentation. If None, choose evenly. Defaults to None.
Examples
>>> augments_cfg = [
...     dict(type='CutmixBlending', alpha=1., num_classes=10),
...     dict(type='MixupBlending', alpha=1., num_classes=10)
... ]
>>> batch_augment = RandomBatchAugment(augments_cfg, probs=[0.5, 0.3])
>>> imgs = torch.randn(16, 3, 8, 32, 32)
>>> label = torch.randint(0, 10, (16, ))
>>> imgs, label = batch_augment(imgs, label)
Note
To decide which batch augmentation will be used, it picks one of augments based on the probabilities. In the example above, the probability to use CutmixBlending is 0.5, to use MixupBlending is 0.3, and to do nothing is 0.2.
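The selection rule in the note, including the leftover "do nothing" probability, can be sketched with the standard library. `pick_augment` is a hypothetical helper that returns the chosen augmentation name, or `None` for the no-op case; the real class works with built augmentation objects rather than names.

```python
import random
from typing import List, Optional, Sequence


def pick_augment(
    names: Sequence[str], probs: Sequence[float], rng: random.Random
) -> Optional[str]:
    """Pick one augmentation by probability; None means 'apply nothing'."""
    total = sum(probs)
    if total > 1.0 + 1e-9:
        raise ValueError("The probabilities must not sum to more than 1.")
    # Whatever probability mass is left over goes to the no-op choice.
    choices: List[Optional[str]] = list(names) + [None]
    weights = list(probs) + [max(1.0 - total, 0.0)]
    return rng.choices(choices, weights=weights, k=1)[0]
```

With `names=['CutmixBlending', 'MixupBlending']` and `probs=[0.5, 0.3]`, roughly half of the batches get CutMix, about 30% get Mixup, and the remaining 20% pass through unchanged.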
mmaction.structures¶
structures¶
- class mmaction.structures.ActionDataSample(*, metainfo: dict | None = None, **kwargs)[source]¶
- property features¶
Setter of features
- property gt_instances¶
Property of gt_instances
- property proposals¶
Property of proposals
- set_gt_label(value: Tensor | ndarray | Sequence | int) ActionDataSample [source]¶
Set gt_label.
- set_pred_label(value: Tensor | ndarray | Sequence | int) ActionDataSample [source]¶
Set pred_label.
- set_pred_score(value: Tensor | ndarray | Sequence | Dict) ActionDataSample [source]¶
Set score of pred_label.
- mmaction.structures.bbox2result(bboxes: Tensor, labels: Tensor, num_classes: int, thr: float = 0.01) list [source]¶
Convert detection results to a list of numpy arrays.
This identifies single-label classification (as opposed to multi-label) through the thr parameter, which is set to a negative value.
ToDo: The ideal way would be for this to be automatically set when the model cfg uses multilabel=False; however, this could be a breaking change and is left as a future exercise. Currently, the way to set this is to set test_cfg.rcnn.action_thr=-1.0. NB - this should not interfere with the evaluation in any case.
- Parameters:
bboxes (torch.Tensor) – shape (n, 4).
labels (torch.Tensor) – shape (n, num_classes).
num_classes (int) – class number, including background class.
thr (float) – The score threshold used when converting predictions to detection results. If a single negative value, uses single-label classification.
- Returns:
bbox results of each class.
- Return type:
List(ndarray)
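The effect of `thr` can be illustrated with a list-based sketch. It assumes that the background class sits at index 0 and is skipped in the output, and that each result row is the four box coordinates followed by the class score; neither detail is stated on this page. `bbox2result_sketch` is a hypothetical name, and nested lists replace tensors and arrays.

```python
from typing import List, Sequence


def bbox2result_sketch(
    bboxes: Sequence[Sequence[float]],
    scores: Sequence[Sequence[float]],
    num_classes: int,
    thr: float = 0.01,
) -> List[List[List[float]]]:
    """Group boxes per foreground class as rows of [x1, y1, x2, y2, score]."""
    result = []
    for cls in range(1, num_classes):  # assumption: index 0 is background
        rows = []
        for box, score_row in zip(bboxes, scores):
            if thr < 0:
                # Single-label mode: keep a box only for its top-scoring class.
                best = max(range(num_classes), key=lambda c: score_row[c])
                keep = best == cls
            else:
                # Multi-label mode: keep a box for every class above thr.
                keep = score_row[cls] > thr
            if keep:
                rows.append(list(box[:4]) + [score_row[cls]])
        result.append(rows)
    return result
```

In multi-label mode the same box can appear under several classes, while a negative `thr` assigns each box to exactly one class.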
- mmaction.structures.bbox_target(pos_bboxes_list: List[Tensor], neg_bboxes_list: List[Tensor], gt_labels: List[Tensor], cfg: dict | ConfigDict) tuple [source]¶
Generate classification targets for bboxes.
- Parameters:
pos_bboxes_list (List[torch.Tensor]) – Positive bboxes list.
neg_bboxes_list (List[torch.Tensor]) – Negative bboxes list.
gt_labels (List[torch.Tensor]) – Groundtruth classification label list.
cfg (dict | mmengine.ConfigDict) – RCNN config.
- Returns:
Label and label_weight for bboxes.
- Return type:
tuple
bbox¶
- mmaction.structures.bbox.bbox2result(bboxes: Tensor, labels: Tensor, num_classes: int, thr: float = 0.01) list [source]¶
Convert detection results to a list of numpy arrays.
This identifies single-label classification (as opposed to multi-label) through the thr parameter, which is set to a negative value.
ToDo: The ideal way would be for this to be automatically set when the model cfg uses multilabel=False; however, this could be a breaking change and is left as a future exercise. Currently, the way to set this is to set test_cfg.rcnn.action_thr=-1.0. NB - this should not interfere with the evaluation in any case.
- Parameters:
bboxes (torch.Tensor) – shape (n, 4).
labels (torch.Tensor) – shape (n, num_classes).
num_classes (int) – class number, including background class.
thr (float) – The score threshold used when converting predictions to detection results. If a single negative value, uses single-label classification.
- Returns:
bbox results of each class.
- Return type:
List(ndarray)
- mmaction.structures.bbox.bbox_target(pos_bboxes_list: List[Tensor], neg_bboxes_list: List[Tensor], gt_labels: List[Tensor], cfg: dict | ConfigDict) tuple [source]¶
Generate classification targets for bboxes.
- Parameters:
pos_bboxes_list (List[torch.Tensor]) – Positive bboxes list.
neg_bboxes_list (List[torch.Tensor]) – Negative bboxes list.
gt_labels (List[torch.Tensor]) – Groundtruth classification label list.
cfg (dict | mmengine.ConfigDict) – RCNN config.
- Returns:
Label and label_weight for bboxes.
- Return type:
tuple
mmaction.testing¶
- mmaction.testing.check_norm_state(modules, train_state)[source]¶
Check if norm layer is in correct train state.
mmaction.visualization¶
- class mmaction.visualization.ActionVisualizer(name='visualizer', vis_backends: List[Dict] | None = None, save_dir: str | None = None, fig_save_cfg={'frameon': False}, fig_show_cfg={'frameon': False})[source]¶
Universal Visualizer for classification task.
- Parameters:
name (str) – Name of the instance. Defaults to ‘visualizer’.
vis_backends (list, optional) – Visual backend config list. Defaults to None.
save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.
fig_save_cfg (dict) – Keyword parameters of figure for saving. Defaults to dict(frameon=False).
fig_show_cfg (dict) – Keyword parameters of figure for showing. Defaults to dict(frameon=False).
Examples
>>> import torch
>>> import decord
>>> from pathlib import Path
>>> from mmaction.structures import ActionDataSample
>>> from mmaction.visualization import ActionVisualizer
>>> from mmengine.structures import LabelData
>>> # Example frame
>>> video = decord.VideoReader('./demo/demo.mp4')
>>> video = video.get_batch(range(32)).asnumpy()
>>> # Example annotation
>>> data_sample = ActionDataSample()
>>> data_sample.gt_label = LabelData(item=torch.tensor([2]))
>>> # Setup the visualizer
>>> vis = ActionVisualizer(
...     save_dir="./outputs",
...     vis_backends=[dict(type='LocalVisBackend')])
>>> # Set classes names
>>> vis.dataset_meta = {'classes': ['running', 'standing', 'sitting']}
>>> # Save the visualization result by the specified storage backends.
>>> vis.add_datasample('demo', video)
>>> assert Path('outputs/vis_data/demo/frames_0/1.png').exists()
>>> assert Path('outputs/vis_data/demo/frames_0/2.png').exists()
>>> # Save another visualization result with the same name.
>>> vis.add_datasample('demo', video, step=1)
>>> assert Path('outputs/vis_data/demo/frames_1/2.png').exists()
- add_datasample(name: str, video: ndarray | Sequence[ndarray] | str, data_sample: ActionDataSample | None = None, draw_gt: bool = True, draw_pred: bool = True, draw_score: bool = True, rescale_factor: float | None = None, show_frames: bool = False, text_cfg: dict = {}, wait_time: float = 0.1, out_path: str | None = None, out_type: str = 'img', target_resolution: Tuple[int] | None = None, step: int = 0, fps: int = 4) None [source]¶
Draw datasample and save to all backends.
If out_path is specified, all storage backends are ignored and the videos are saved to out_path.
If show_frames is True, the frames are plotted in a window sequentially. Please confirm you are able to access the graphical interface.
- Parameters:
name (str) – The frame identifier.
video (np.ndarray, str) – The video to draw. Supports decoded np.ndarray, video file path, and rawframes folder path.
data_sample (ActionDataSample, optional) – The annotation of the frame. Defaults to None.
draw_gt (bool) – Whether to draw ground truth labels. Defaults to True.
draw_pred (bool) – Whether to draw prediction labels. Defaults to True.
draw_score (bool) – Whether to draw the prediction scores of prediction categories. Defaults to True.
rescale_factor (float, optional) – Rescale the frame by the rescale factor before visualization. Defaults to None.
show_frames (bool) – Whether to display the frames of the video. Defaults to False.
text_cfg (dict) – Extra text setting, which accepts arguments of mmengine.Visualizer.draw_texts. Defaults to an empty dict.
wait_time (float) – Delay in seconds. 0 is the special value that means “forever”. Defaults to 0.1.
out_path (str, optional) – Extra folder to save the visualization result. If specified, the visualizer will only save the result frame to the out_path and ignore its storage backends. Defaults to None.
out_type (str) – Output format type, choose from ‘img’, ‘gif’, ‘video’. Defaults to 'img'.
target_resolution (Tuple[int], optional) – Set to (desired_width, desired_height) to have resized frames. If either dimension is None, the frames are resized by keeping the existing aspect ratio. Defaults to None.
step (int) – Global step value to record. Defaults to 0.
fps (int) – Frames per second for saving video. Defaults to 4.
- add_video(name: str, image: ndarray, step: int = 0, fps: int = 4, out_type: str = 'img') None [source]¶
Record the image.
- Parameters:
name (str) – The image identifier.
image (np.ndarray, optional) – The image to be saved. The format should be RGB. Defaults to None.
step (int) – Global step value to record. Defaults to 0.
fps (int) – Frames per second for saving video. Defaults to 4.
out_type (str) – Output format type, choose from ‘img’, ‘gif’, ‘video’. Defaults to 'img'.
- class mmaction.visualization.LocalVisBackend(save_dir: str, img_save_dir: str = 'vis_image', config_save_file: str = 'config.py', scalar_save_file: str = 'scalars.json')[source]¶
Local visualization backend class with video support.
See mmengine.visualization.LocalVisBackend for more details.
- add_video(name: str, frames: ndarray, step: int = 0, fps: int | None = 4, out_type: int | None = 'img', **kwargs) None [source]¶
Record the frames of a video to disk.
- Parameters:
name (str) – The video identifier (frame folder).
frames (np.ndarray) – The frames to be saved. The format should be RGB. The shape should be (T, H, W, C).
step (int) – Global step value to record. Defaults to 0.
out_type (str) – Output format type, choose from ‘img’, ‘gif’, ‘video’. Defaults to 'img'.
fps (int) – Frames per second for saving video. Defaults to 4.
- class mmaction.visualization.TensorboardVisBackend(save_dir: str)[source]¶
Tensorboard visualization backend class with video support. See mmengine.visualization.TensorboardVisBackend for more details.
Note that this requires the future and tensorboard packages.
- add_video(name: str, frames: ndarray, step: int = 0, fps: int = 4, **kwargs) None [source]¶
Record the frames of a video to tensorboard.
Note that this requires the moviepy package.
- Parameters:
name (str) – The video identifier (frame folder).
frames (np.ndarray) – The frames to be saved. The format should be RGB. The shape should be (T, H, W, C).
step (int) – Global step value to record. Defaults to 0.
fps (int) – Frames per second. Defaults to 4.
- class mmaction.visualization.WandbVisBackend(save_dir: str, init_kwargs: dict | None = None, define_metric_cfg: dict | list | None = None, commit: bool | None = True, log_code_name: str | None = None, watch_kwargs: dict | None = None)[source]¶
Wandb visualization backend class with video support. See mmengine.visualization.WandbVisBackend for more details.
Note that this requires the wandb and moviepy packages. A wandb account login is also required at https://wandb.ai/authorize.
- add_video(name: str, frames: ndarray, fps: int = 4, **kwargs) None [source]¶
Record the frames of a video to wandb.
Note that this requires the moviepy package.
- Parameters:
name (str) – The video identifier (frame folder).
frames (np.ndarray) – The frames to be saved. The format should be RGB. The shape should be (T, H, W, C).
step – Unused parameter that Wandb does not need.
fps (int) – Frames per second. Defaults to 4.
mmaction.utils¶
- class mmaction.utils.GradCAM(model: Module, target_layer_name: str, colormap: str = 'viridis')[source]¶
GradCAM class helps create visualization results.
Visualization results are a blend of heatmaps and input images. This class is modified from https://github.com/facebookresearch/SlowFast/blob/master/slowfast/visualization/gradcam_utils.py. For more information about GradCAM, please visit: https://arxiv.org/pdf/1610.02391.pdf
- Parameters:
model (nn.Module) – The recognizer model to be used.
target_layer_name (str) – Name of the convolutional layer from which to get gradients and feature maps for creating localization maps.
colormap (str) – Matplotlib colormap used to create the heatmap. Defaults to ‘viridis’. For more information, please visit https://matplotlib.org/3.3.0/tutorials/colors/colormaps.html
- mmaction.utils.frame_extract(video_path: str, short_side: int | None = None, out_dir: str = './tmp')[source]¶
Extract frames given video_path.
- Parameters:
video_path (str) – The video path.
short_side (int) – Target short side of the output image. Defaults to None, which means keeping the original shape.
out_dir (str) – The output directory. Defaults to './tmp'.
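What a short_side target implies for the output frame size can be sketched as follows. `rescale_to_short_side` is a hypothetical helper, and rounding to the nearest integer is an assumption; the page only says the short side is resized to the target while `None` keeps the original shape.

```python
from typing import Optional, Tuple


def rescale_to_short_side(
    width: int, height: int, short_side: Optional[int] = None
) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter edge set to short_side."""
    if short_side is None:
        # No target given: keep the original shape.
        return width, height
    scale = short_side / min(width, height)
    # Round to the nearest pixel while keeping the aspect ratio.
    return int(width * scale + 0.5), int(height * scale + 0.5)
```

For a 1920x1080 video and a short-side target of 256, the height becomes 256 and the width is scaled by the same factor.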
- mmaction.utils.get_random_string(length: int = 15) str [source]¶
Get random string with letters and digits.
- Parameters:
length (int) – Length of random string. Defaults to 15.
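One straightforward way to produce such a string with the standard library is shown below. This is a sketch with a hypothetical name (`random_alnum_string`), not necessarily how the library function is written.

```python
import random
import string


def random_alnum_string(length: int = 15) -> str:
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    # random.choices samples with replacement, so characters may repeat.
    return "".join(random.choices(alphabet, k=length))
```

Strings like this are handy for temporary file or directory names, but `random` is not suitable for security-sensitive tokens; the `secrets` module is the usual choice there.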
- mmaction.utils.get_str_type(module: str | module | function) str [source]¶
Return the string type name of module.
- Parameters:
module (str | ModuleType | FunctionType) – The target module class.
- Returns:
Class name of the module.
- mmaction.utils.register_all_modules(init_default_scope: bool = True) None [source]¶
Register all modules in mmaction into the registries.
- Parameters:
init_default_scope (bool) – Whether to initialize the mmaction default scope. If True, the global default scope will be set to mmaction, and all registries will build modules from mmaction’s registry node. To understand more about the registry, please refer to https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/registry.md. Defaults to True.