
mmaction.apis

mmaction.apis.detection_inference(det_config: str | Path | Config | Module, det_checkpoint: str, frame_paths: List[str], det_score_thr: float = 0.9, det_cat_id: int = 0, device: str | device = 'cuda:0', with_score: bool = False) tuple[source]

Detect human boxes given frame paths.

Parameters:
  • det_config (Union[str, Path, mmengine.Config, torch.nn.Module]) – Det config file path or detection model object. It can be a Path, a config object, or a module object.

  • det_checkpoint (str) – Checkpoint path/url.

  • frame_paths (List[str]) – The paths of frames to do detection inference.

  • det_score_thr (float) – The threshold of human detection score. Defaults to 0.9.

  • det_cat_id (int) – The category id for human detection. Defaults to 0.

  • device (Union[str, torch.device]) – The desired device of returned tensor. Defaults to 'cuda:0'.

  • with_score (bool) – Whether to append the detection score after each box. Defaults to False.

Returns:

A tuple of two lists: the detected human boxes (List[np.ndarray]) and the data samples (List[DetDataSample]), the latter generally used to visualize data.

Return type:

tuple
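The roles of det_score_thr, det_cat_id and with_score can be illustrated with a minimal sketch. This is not the function's actual implementation: the helper filter_human_boxes, the flat (x1, y1, x2, y2, score, category id) tuple layout and the strict "greater than" comparison are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical detection layout: x1, y1, x2, y2, score, category id.
Detection = Tuple[float, float, float, float, float, int]


def filter_human_boxes(detections: Sequence[Detection],
                       det_score_thr: float = 0.9,
                       det_cat_id: int = 0,
                       with_score: bool = False) -> List[List[float]]:
    """Keep boxes of one category whose score exceeds the threshold."""
    kept = []
    for x1, y1, x2, y2, score, cat_id in detections:
        # Assumption: a box is kept only if its score is strictly above the threshold.
        if cat_id != det_cat_id or score <= det_score_thr:
            continue
        box = [x1, y1, x2, y2]
        if with_score:
            box.append(score)  # append the detection score after the box
        kept.append(box)
    return kept


frame_detections = [
    (10.0, 20.0, 110.0, 220.0, 0.97, 0),   # confident detection of category 0
    (15.0, 25.0, 90.0, 200.0, 0.42, 0),    # low score, dropped
    (200.0, 50.0, 260.0, 120.0, 0.99, 2),  # other category, dropped
]
boxes = filter_human_boxes(frame_detections, with_score=True)
```

With with_score left at its default of False, each kept box holds only the four coordinates.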

mmaction.apis.inference_recognizer(model: Module, video: str | dict, test_pipeline: Compose | None = None) ActionDataSample[source]

Run inference on a video with the recognizer.

Parameters:
  • model (nn.Module) – The loaded recognizer.

  • video (Union[str, dict]) – The video file path or the results dictionary (the input of pipeline).

  • test_pipeline (Compose, optional) – The test pipeline. If not specified, the test pipeline in the config will be used. Defaults to None.

Returns:

The inference results. Specifically, the predicted scores are saved at result.pred_score.

Return type:

ActionDataSample

mmaction.apis.inference_skeleton(model: Module, pose_results: List[dict], img_shape: Tuple[int], test_pipeline: Compose | None = None) ActionDataSample[source]

Run inference on pose results with the skeleton recognizer.

Parameters:
  • model (nn.Module) – The loaded recognizer.

  • pose_results (List[dict]) – The pose estimation results dictionary (the results of pose_inference)

  • img_shape (Tuple[int]) – The original image shape used for inference skeleton recognizer.

  • test_pipeline (Compose, optional) – The test pipeline. If not specified, the test pipeline in the config will be used. Defaults to None.

Returns:

The inference results. Specifically, the predicted scores are saved at result.pred_score.

Return type:

ActionDataSample

mmaction.apis.init_recognizer(config: str | Path | Config, checkpoint: str | None = None, device: str | device = 'cuda:0') Module[source]

Initialize a recognizer from config file.

Parameters:
  • config (str or Path or mmengine.Config) – Config file path, Path or the config object.

  • checkpoint (str, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Defaults to None.

  • device (str | torch.device) – The desired device of returned tensor. Defaults to 'cuda:0'.

Returns:

The constructed recognizer.

Return type:

nn.Module
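A minimal usage sketch that combines init_recognizer and inference_recognizer, based only on the signatures documented here. The config, checkpoint and video arguments are placeholders to be supplied by the caller. The helper top_k is hypothetical, and it assumes that result.pred_score can be iterated as one score per class.

```python
from typing import List, Sequence, Tuple


def recognize(config: str, checkpoint: str, video: str, device: str = "cpu"):
    """Build a recognizer and run it on one video (requires mmaction2)."""
    # Imported lazily so that top_k below can be used without mmaction2 installed.
    from mmaction.apis import inference_recognizer, init_recognizer

    model = init_recognizer(config, checkpoint, device=device)
    result = inference_recognizer(model, video)
    return result.pred_score  # the predicted scores are saved here


def top_k(scores: Sequence[float], k: int = 5) -> List[Tuple[int, float]]:
    """Return the k highest (class index, score) pairs, best first."""
    indexed = [(index, float(score)) for index, score in enumerate(scores)]
    indexed.sort(key=lambda pair: pair[1], reverse=True)
    return indexed[:k]


# Stand-in scores; in practice pass the value returned by recognize(...).
example_scores = [0.05, 0.70, 0.10, 0.15]
best = top_k(example_scores, k=2)
```

Passing device="cpu" overrides the documented default of 'cuda:0' for machines without a GPU.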

mmaction.apis.pose_inference(pose_config: str | Path | Config | Module, pose_checkpoint: str, frame_paths: List[str], det_results: List[ndarray], device: str | device = 'cuda:0') tuple[source]

Perform Top-Down pose estimation.

Parameters:
  • pose_config (Union[str, Path, mmengine.Config, torch.nn.Module]) – Pose config file path or pose model object. It can be a Path, a config object, or a module object.

  • pose_checkpoint (str) – Checkpoint path/url.

  • frame_paths (List[str]) – The paths of frames to do pose inference.

  • det_results (List[np.ndarray]) – List of detected human boxes.

  • device (Union[str, torch.device]) – The desired device of returned tensor. Defaults to 'cuda:0'.

Returns:

A tuple of two lists: the pose estimation results (List[List[Dict[str, np.ndarray]]]) and the data samples (List[PoseDataSample]), the latter generally used to visualize data.

Return type:

tuple

mmaction.datasets

datasets

class mmaction.datasets.AVADataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], exclude_file: str | None = None, label_file: str | None = None, filename_tmpl: str = 'img_{:05}.jpg', start_index: int = 1, proposal_file: str | None = None, person_det_score_thr: float = 0.9, num_classes: int = 81, custom_classes: List[int] | None = None, data_prefix: ConfigDict | dict = {'img': ''}, modality: str = 'RGB', test_mode: bool = False, num_max_proposals: int = 1000, timestamp_start: int = 900, timestamp_end: int = 1800, use_frames: bool = True, fps: int = 30, multilabel: bool = True, **kwargs)[source]

STAD dataset for spatial temporal action detection.

The dataset loads raw frames/video files, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.

This dataset can load information from the following files:

ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt /
              ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl

Particularly, the proposal_file is a pickle file which contains img_key (in format of {video_id},{timestamp}). Example of a pickle file:

{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011   , 0.157   , 0.655   , 0.983   , 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054   , 0.088   , 0.91    , 0.998   , 0.068273],
               [0.016   , 0.161   , 0.519   , 0.974   , 0.984025],
               [0.493   , 0.283   , 0.981   , 0.984   , 0.983621]]),
    ...
}
Parameters:
  • ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.

  • exclude_file (str, optional) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv. Defaults to None.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Defaults to None.

  • filename_tmpl (str) – Template for each filename. Defaults to ‘img_{:05}.jpg’.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. It should be set to 1 for AVA, since frame index start from 1 in AVA dataset. Defaults to 1.

  • proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Defaults to None.

  • person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used. Default: 0.9.

  • num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)

  • custom_classes (List[int], optional) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.

  • data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • num_max_proposals (int) – Max proposals number to store. Defaults to 1000.

  • timestamp_start (int) – The start point of included timestamps. The default value is taken from the official website. Defaults to 900.

  • timestamp_end (int) – The end point of included timestamps. The default value is taken from the official website. Defaults to 1800.

  • use_frames (bool) – Whether to use rawframes as input. Defaults to True.

  • fps (int) – Overrides the default FPS for the dataset. If set to 1, means counting timestamp by frame, e.g. MultiSports dataset. Otherwise by second. Defaults to 30.

  • multilabel (bool) – Determines whether it is a multilabel recognition task. Defaults to True.
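The img_key format and the person_det_score_thr rule described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's own code: split_img_key and select_proposals are hypothetical helpers, and each proposal is assumed to be a plain list of four box coordinates followed by the score, as in the pickle example.

```python
from typing import List, Tuple

Proposal = List[float]  # x1, y1, x2, y2, score


def split_img_key(img_key: str) -> Tuple[str, int]:
    """Split '{video_id},{timestamp}' into the video id and an integer timestamp."""
    video_id, timestamp = img_key.rsplit(",", 1)
    return video_id, int(timestamp)


def select_proposals(proposals: List[Proposal],
                     person_det_score_thr: float = 0.9) -> List[Proposal]:
    """Keep proposals scoring above the threshold, or the best one if none do."""
    kept = [p for p in proposals if p[4] > person_det_score_thr]
    if not kept and proposals:
        # Fallback: use the proposal with the largest detection score.
        kept = [max(proposals, key=lambda p: p[4])]
    return kept


proposal_dict = {
    "0f39OWEqJ24,0902": [[0.011, 0.157, 0.655, 0.983, 0.998163]],
    "0f39OWEqJ24,0912": [[0.054, 0.088, 0.91, 0.998, 0.068273],
                         [0.016, 0.161, 0.519, 0.974, 0.984025],
                         [0.493, 0.283, 0.981, 0.984, 0.983621]],
}
selected = select_proposals(proposal_dict["0f39OWEqJ24,0912"])
video_id, timestamp = split_img_key("0f39OWEqJ24,0902")
```

For the second key, the low-scoring first proposal is dropped and the two remaining boxes are kept.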

filter_data() List[dict][source]

Filter out records in the exclude_file.

get_data_info(idx: int) dict[source]

Get annotation by index.

load_data_list() List[dict][source]

Load AVA annotations.

parse_img_record(img_records: List[dict]) tuple[source]

Merge image records of the same entity at the same time.

Parameters:

img_records (List[dict]) – List of img_records (lines in AVA annotations).

Returns:

A tuple consisting of lists of bboxes, action labels and entity_ids.

Return type:

Tuple(list)
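The merging step can be pictured with a small sketch. It is only an approximation of the idea: the record keys entity_id, entity_box and label are assumed field names, and the labels are returned as sorted id lists, which may differ from the label encoding the real method produces.

```python
from typing import Dict, List, Tuple


def merge_img_records(img_records: List[Dict]) -> Tuple[list, list, list]:
    """Group records by entity and collect every action label of that entity."""
    merged: Dict[int, Dict] = {}
    for record in img_records:
        # Assumed keys: 'entity_id', 'entity_box' and 'label'.
        entity = merged.setdefault(
            record["entity_id"], {"bbox": record["entity_box"], "labels": []})
        entity["labels"].append(record["label"])
    entity_ids = list(merged)  # dicts keep insertion order
    bboxes = [merged[e]["bbox"] for e in entity_ids]
    labels = [sorted(merged[e]["labels"]) for e in entity_ids]
    return bboxes, labels, entity_ids


records = [
    {"entity_id": 1, "entity_box": [0.1, 0.2, 0.5, 0.9], "label": 12},
    {"entity_id": 2, "entity_box": [0.6, 0.1, 0.9, 0.8], "label": 80},
    {"entity_id": 1, "entity_box": [0.1, 0.2, 0.5, 0.9], "label": 11},
]
bboxes, labels, entity_ids = merge_img_records(records)
```

The three returned lists are aligned by position, so bboxes[i], labels[i] and entity_ids[i] describe the same entity.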

class mmaction.datasets.AVAKineticsDataset(ann_file: str, exclude_file: str, pipeline: List[ConfigDict | dict | Callable], label_file: str, filename_tmpl: str = 'img_{:05}.jpg', start_index: int = 0, proposal_file: str | None = None, person_det_score_thr: float = 0.9, num_classes: int = 81, custom_classes: List[int] | None = None, data_prefix: ConfigDict | dict = {'img': ''}, modality: str = 'RGB', test_mode: bool = False, num_max_proposals: int = 1000, timestamp_start: int = 900, timestamp_end: int = 1800, fps: int = 30, **kwargs)[source]

AVA-Kinetics dataset for spatial temporal detection.

Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.

This dataset can load information from the following files:

ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt /
              ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl

Particularly, the proposal_file is a pickle file which contains img_key (in format of {video_id},{timestamp}). Example of a pickle file:

{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011   , 0.157   , 0.655   , 0.983   , 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054   , 0.088   , 0.91    , 0.998   , 0.068273],
               [0.016   , 0.161   , 0.519   , 0.974   , 0.984025],
               [0.493   , 0.283   , 0.981   , 0.984   , 0.983621]]),
    ...
}
Parameters:
  • ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.

  • exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt.

  • filename_tmpl (str) – Template for each filename. Defaults to ‘img_{:05}.jpg’.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking frames as input, it should be set to 0, since frames count from 0. Defaults to 0.

  • proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Defaults to None.

  • person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used. Default: 0.9.

  • num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)

  • custom_classes (List[int], optional) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.

  • data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • num_max_proposals (int) – Max proposals number to store. Defaults to 1000.

  • timestamp_start (int) – The start point of included timestamps. The default value is taken from the official website. Defaults to 900.

  • timestamp_end (int) – The end point of included timestamps. The default value is taken from the official website. Defaults to 1800.

  • fps (int) – Overrides the default FPS for the dataset. Defaults to 30.
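The constraints stated for custom_classes (0 must not be selected, and num_classes must equal len(custom_classes) + 1) can be checked up front. The sketch below uses a hypothetical helper check_custom_classes; the returned mapping from original class ids to new indices starting at 1 is an assumption about how the extra slot might be used, not documented behavior.

```python
from typing import Dict, List


def check_custom_classes(custom_classes: List[int], num_classes: int) -> Dict[int, int]:
    """Validate custom_classes and map each selected id to a new index."""
    if 0 in custom_classes:
        raise ValueError("class id 0 should not be selected")
    if num_classes != len(custom_classes) + 1:
        raise ValueError(
            f"num_classes should be {len(custom_classes) + 1}, got {num_classes}")
    # Assumption: index 0 stays reserved, selected ids map to 1..N in the given order.
    return {class_id: index for index, class_id in enumerate(custom_classes, start=1)}


mapping = check_custom_classes([11, 12, 14], num_classes=4)
```

Running the check before building the dataset turns a mismatch into an immediate, readable error.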

filter_data() List[dict][source]

Filter out records in the exclude_file.

get_data_info(idx: int) dict[source]

Get annotation by index.

load_data_list() List[dict][source]

Load AVA annotations.

parse_img_record(img_records: List[dict]) tuple[source]

Merge image records of the same entity at the same time.

Parameters:

img_records (List[dict]) – List of img_records (lines in AVA annotations).

Returns:

A tuple consisting of lists of bboxes, action labels and entity_ids.

Return type:

Tuple(list)

class mmaction.datasets.ActivityNetDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict | None = {'video': ''}, test_mode: bool = False, **kwargs)[source]

ActivityNet dataset for temporal action localization.

The dataset loads raw features and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a JSON file with multiple objects. Each object is keyed by the name of a video, and its value holds the total frames of the video, the total seconds of the video, the annotations of the video, the feature frames (frames covered by features) of the video, fps and rfps. Example of an annotation file:

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

load_data_list() List[dict][source]

Load annotation file to get video information.

class mmaction.datasets.AudioDataset(ann_file: str, pipeline: List[Dict | Callable], data_prefix: Dict = {'audio': ''}, multi_class: bool = False, num_classes: int | None = None, **kwargs)[source]

Audio dataset for action recognition.

The ann_file is a text file with multiple lines. Each line indicates a sample audio or extracted audio feature with the filepath, the total frames of the raw video and the label, separated by whitespace. Example of an annotation file:

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (dict) – Path to a directory where audios are held. Defaults to dict(audio='').

  • multi_class (bool) – Determines whether it is a multi-class recognition dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes in the dataset. Defaults to None.

load_data_list() List[Dict][source]

Load annotation file to get audio information.

class mmaction.datasets.BaseActionDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs)[source]

Base class for datasets.

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict, optional) – Path to a directory where videos are held. Defaults to dict(prefix='').

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.

  • modality (str) – Modality of data. Support RGB, Flow, Pose, Audio. Defaults to RGB.

get_data_info(idx: int) dict[source]

Get annotation by index.

class mmaction.datasets.CharadesSTADataset(ann_file: str, pipeline: List[dict | Callable], word2id_file: str, fps_file: str, duration_file: str, num_frames_file: str, window_size: int, ft_overlap: float, data_prefix: ConfigDict | dict | None = {'video': ''}, test_mode: bool = False, **kwargs)[source]
get_data_info(idx: int) dict[source]

Get annotation by index.

load_data_list() List[dict][source]

Load annotation file to get video information.

class mmaction.datasets.MSRVTTRetrieval(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs)[source]

MSR-VTT Retrieval dataset.

load_data_list() List[Dict][source]

Load annotation file to get video information.

class mmaction.datasets.MSRVTTVQA(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs)[source]

MSR-VTT Video Question Answering dataset.

load_data_list() List[Dict][source]

Load annotation file to get video information.

class mmaction.datasets.MSRVTTVQAMC(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs)[source]

MSR-VTT VQA multiple choices dataset.

load_data_list() List[Dict][source]

Load annotation file to get video information.

class mmaction.datasets.PoseDataset(ann_file: str, pipeline: List[Dict | Callable], split: str | None = None, valid_ratio: float | None = None, box_thr: float = 0.5, **kwargs)[source]

Pose dataset for action recognition.

The dataset loads pose and applies specified transforms to return a dict containing pose information.

The ann_file is a pickle file that contains a list of annotations. The fields of an annotation include frame_dir (video_id), total_frames, label, kp and kpscore.

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • split (str, optional) – The dataset split used. For UCF101 and HMDB51, allowed choices are ‘train1’, ‘test1’, ‘train2’, ‘test2’, ‘train3’, ‘test3’. For NTURGB+D, allowed choices are ‘xsub_train’, ‘xsub_val’, ‘xview_train’, ‘xview_val’. For NTURGB+D 120, allowed choices are ‘xsub_train’, ‘xsub_val’, ‘xset_train’, ‘xset_val’. For FineGYM, allowed choices are ‘train’, ‘val’. Defaults to None.

  • valid_ratio (float, optional) – The valid_ratio for videos in KineticsPose. For a video with n frames, it is a valid training sample only if n * valid_ratio frames have human pose. None means not applicable (only applicable to Kinetics Pose). Defaults to None.

  • box_thr (float) – The threshold for human proposals. Only boxes with a confidence score larger than box_thr are kept. None means not applicable (only applicable to Kinetics). Allowed choices are 0.5, 0.6, 0.7, 0.8, 0.9. Defaults to 0.5.
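The valid_ratio rule can be expressed as a one-line check. The sketch below is illustrative only: is_valid_sample and the frames_with_pose count are hypothetical names, and treating "n * valid_ratio frames have human pose" as "at least that many frames" is an assumption.

```python
from typing import Optional


def is_valid_sample(total_frames: int,
                    frames_with_pose: int,
                    valid_ratio: Optional[float]) -> bool:
    """Return True if enough frames of the video contain a human pose."""
    if valid_ratio is None:
        return True  # None means the filter is not applicable
    # Assumption: "n * valid_ratio frames" is read as a lower bound.
    return frames_with_pose >= total_frames * valid_ratio


keep_dense = is_valid_sample(total_frames=100, frames_with_pose=50, valid_ratio=0.5)
keep_sparse = is_valid_sample(total_frames=100, frames_with_pose=40, valid_ratio=0.5)
```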

filter_data() List[Dict][source]

Filter out invalid samples.

get_data_info(idx: int) Dict[source]

Get annotation by index.

load_data_list() List[Dict][source]

Load annotation file to get skeleton information.

class mmaction.datasets.RawframeDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict = {'img': ''}, filename_tmpl: str = 'img_{:05}.jpg', with_offset: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 1, modality: str = 'RGB', test_mode: bool = False, **kwargs)[source]

Rawframe dataset for action recognition.

The dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines. Each line indicates the directory of the frames of a video, the total frames of the video and the label of the video, separated by whitespace. Example of an annotation file:

some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3

Example of a multi-class annotation file:

some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3

Example of a with_offset annotation file (clips from long videos), each line indicates the directory to frames of a video, the index of the start frame, total frames of the video clip and the label of a video clip, which are split with a whitespace.

some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3
Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').

  • filename_tmpl (str) – Template for each filename. Defaults to img_{:05}.jpg.

  • with_offset (bool) – Determines whether the offset information is in ann_file. Defaults to False.

  • multi_class (bool) – Determines whether it is a multi-class recognition dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes in the dataset. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking frames as input, it should be set to 1, since raw frames count from 1. Defaults to 1.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
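The three annotation layouts shown above (plain, multi-class and with_offset) differ only in which whitespace-separated fields are present. A minimal parsing sketch follows. It assumes nothing beyond the line formats in the examples; parse_rawframe_line and the keys of the returned dict are hypothetical and not the dataset's actual output.

```python
from typing import Dict


def parse_rawframe_line(line: str,
                        with_offset: bool = False,
                        multi_class: bool = False) -> Dict:
    """Parse one annotation line into a dict of its fields."""
    parts = line.split()
    info: Dict = {"frame_dir": parts[0]}
    index = 1
    if with_offset:
        info["offset"] = int(parts[index])  # index of the start frame
        index += 1
    info["total_frames"] = int(parts[index])
    index += 1
    labels = [int(value) for value in parts[index:]]
    info["label"] = labels if multi_class else labels[0]
    return info


plain = parse_rawframe_line("some/directory-1 163 1")
multi = parse_rawframe_line("some/directory-4 234 2 4 6 8", multi_class=True)
clip = parse_rawframe_line("some/directory-1 12 163 3", with_offset=True)

# The default filename_tmpl zero-pads the frame index to five digits.
first_frame = "img_{:05}.jpg".format(1)
```

With start_index left at 1, the first frame of a video therefore resolves to img_00001.jpg inside its directory.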

get_data_info(idx: int) dict[source]

Get annotation by index.

load_data_list() List[dict][source]

Load annotation file to get video information.

class mmaction.datasets.RepeatAugDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict = {'video': ''}, num_repeats: int = 4, sample_once: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs)[source]

Video dataset for action recognition using repeated augmentation. See https://arxiv.org/pdf/1901.09335.pdf.

The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines. Each line indicates a sample video with the filepath and label, separated by whitespace. Example of an annotation file:

some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').

  • num_repeats (int) – Number of times each video is repeated in a batch. Defaults to 4.

  • sample_once (bool) – Determines whether to use the same frame indices for the repeated samples. Defaults to False.

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

prepare_data(idx) List[dict][source]

Get data processed by self.pipeline.

Reduces video loading and decompressing.

Parameters:

idx (int) – The index of data_info.

Returns:

A list of length num_repeats.

Return type:

List[dict]

class mmaction.datasets.VideoDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict = {'video': ''}, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', test_mode: bool = False, delimiter: str = ' ', **kwargs)[source]

Video dataset for action recognition.

The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines. Each line indicates a sample video with the filepath and label, separated by the delimiter (whitespace by default). Example of an annotation file:

some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.

  • modality (str) – Modality of data. Support 'RGB', 'Flow'. Defaults to 'RGB'.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • delimiter (str) – Delimiter for the annotation file. Defaults to ' ' (whitespace).
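A short sketch of how the delimiter parameter affects parsing of an annotation line. The helper parse_video_line and its returned keys are hypothetical. The motivating case, a file path that itself contains spaces, is an assumption about why a non-default delimiter can be useful.

```python
from typing import Dict


def parse_video_line(line: str,
                     delimiter: str = " ",
                     multi_class: bool = False) -> Dict:
    """Split one annotation line into a filename and its label(s)."""
    parts = line.strip().split(delimiter)
    labels = [int(value) for value in parts[1:]]
    return {
        "filename": parts[0],
        "label": labels if multi_class else labels[0],
    }


default_sample = parse_video_line("some/path/000.mp4 1")
# A comma delimiter keeps a path with spaces in one piece.
comma_sample = parse_video_line("some path/with spaces.mp4,2", delimiter=",")
```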

load_data_list() List[dict][source]

Load annotation file to get video information.

class mmaction.datasets.VideoTextDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs)[source]

Video dataset for video-text task like video retrieval.

load_data_list() List[Dict][source]

Load annotation file to get video information.

transforms

class mmaction.datasets.transforms.ArrayDecode[source]

Load and decode frames with given indices from a 4D array.

Required keys are “array” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

transform(results)[source]

Perform the ArrayDecode to pick frames given indices.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.AudioFeatureSelector(fixed_length: int = 128)[source]

Sample the audio feature w.r.t. the frames selected.

Required Keys:

  • audios

  • frame_inds

  • num_clips

  • length

  • total_frames

Modified Keys:

  • audios

Added Keys:

  • audios_shape

Parameters:

fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Defaults to 128.
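The truncate-or-pad behavior described for fixed_length can be sketched on a plain list. This is an illustration, not the transform's code: fix_length is a hypothetical helper, and padding with zeros at the end is an assumption.

```python
from typing import List, Sequence


def fix_length(features: Sequence[float],
               fixed_length: int = 128,
               pad_value: float = 0.0) -> List[float]:
    """Truncate or pad a feature sequence to exactly fixed_length items."""
    fixed = list(features[:fixed_length])  # truncate if too long
    # Assumption: short sequences are padded at the end with pad_value.
    fixed.extend([pad_value] * (fixed_length - len(fixed)))
    return fixed


padded = fix_length([1.0, 2.0, 3.0], fixed_length=5)
truncated = fix_length([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], fixed_length=4)
```

Either way the output has the same length, which lets features from different clips be stacked into one batch.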

transform(results: Dict) Dict[source]

Perform the AudioFeatureSelector to pick audio feature clips.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.BuildPseudoClip(clip_len)[source]

Build pseudo clips with one single image by repeating it n times.

Required key is “imgs”, added or modified keys are “imgs”, “num_clips” and “clip_len”.

Parameters:

clip_len (int) – Frames of the generated pseudo clips.
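A minimal sketch of the idea, using a plain dict in place of the pipeline's results. The function build_pseudo_clip is hypothetical, and setting num_clips to 1 is an assumption; only the repetition of a single image clip_len times comes from the description above.

```python
from typing import Dict


def build_pseudo_clip(results: Dict, clip_len: int) -> Dict:
    """Repeat the single image in results['imgs'] to form a clip."""
    if len(results["imgs"]) != 1:
        raise ValueError("expected exactly one image")
    image = results["imgs"][0]
    results["imgs"] = [image for _ in range(clip_len)]
    results["num_clips"] = 1  # assumption: one pseudo clip per image
    results["clip_len"] = clip_len
    return results


clip = build_pseudo_clip({"imgs": ["frame"]}, clip_len=8)
```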

transform(results)[source]

Perform the building of pseudo clips.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.CLIPTokenize[source]

Tokenize text and convert to tensor.

transform(results: Dict) Dict[source]

The transform function of CLIPTokenize.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.CenterCrop(crop_size, lazy=False)[source]

Crop the center area from images.

Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox”, “lazy” and “img_shape”. The required key in “lazy” is “crop_bbox”, added or modified key is “crop_bbox”.

Parameters:
  • crop_size (int | tuple[int]) – (w, h) of crop size.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.
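The geometry of a center crop can be worked out directly from the image shape and crop_size. The sketch below is an assumption-laden illustration: center_crop_bbox is a hypothetical helper, img_shape is taken to be (height, width), and the box is returned as (left, top, right, bottom).

```python
from typing import Tuple, Union


def center_crop_bbox(img_shape: Tuple[int, int],
                     crop_size: Union[int, Tuple[int, int]]) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered crop."""
    img_h, img_w = img_shape  # assumption: shape is (height, width)
    if isinstance(crop_size, int):
        crop_w = crop_h = crop_size
    else:
        crop_w, crop_h = crop_size  # crop_size is given as (w, h)
    left = (img_w - crop_w) // 2
    top = (img_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


square = center_crop_bbox((256, 340), 224)
wide = center_crop_bbox((256, 340), (320, 240))
```

For a 340 x 256 image and a crop size of 224, the margins are 58 pixels horizontally and 16 pixels vertically.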

transform(results)[source]

Performs the CenterCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1)[source]

Perform ColorJitter to each img.

Required keys are “imgs”, added or modified keys are “imgs”.

Parameters:
  • brightness (float | tuple[float]) – The jitter range for brightness, if set as a float, the range will be (1 - brightness, 1 + brightness). Default: 0.5.

  • contrast (float | tuple[float]) – The jitter range for contrast, if set as a float, the range will be (1 - contrast, 1 + contrast). Default: 0.5.

  • saturation (float | tuple[float]) – The jitter range for saturation, if set as a float, the range will be (1 - saturation, 1 + saturation). Default: 0.5.

  • hue (float | tuple[float]) – The jitter range for hue, if set as a float, the range will be (-hue, hue). Default: 0.1.
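The conversion from a single float to a jitter range, as described for each parameter above, can be sketched as follows. The helper jitter_range is hypothetical, and drawing the factor with a uniform distribution is an assumption about how a value inside the range might be chosen.

```python
import random
from typing import Tuple, Union

Range = Tuple[float, float]


def jitter_range(value: Union[float, Range], center: float) -> Range:
    """Expand a float into (center - value, center + value); pass tuples through."""
    if isinstance(value, (int, float)):
        return (center - value, center + value)
    low, high = value
    return (low, high)


brightness_range = jitter_range(0.5, center=1.0)  # same rule for contrast and saturation
hue_range = jitter_range(0.1, center=0.0)         # hue is centered on zero

# Assumption: a factor is drawn uniformly from the range for each call.
brightness_factor = random.uniform(*brightness_range)
```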

transform(results)[source]

Perform ColorJitter.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.DecompressPose(squeeze: bool = True, max_person: int = 10)[source]

Load Compressed Pose.

Required Keys:

  • frame_inds

  • total_frames

  • keypoint

  • anno_inds (optional)

Modified Keys:

  • keypoint

  • frame_inds

Added Keys:

  • keypoint_score

  • num_person

Parameters:
  • squeeze (bool) – Whether to remove frames with no human pose. Defaults to True.

  • max_person (int) – The max number of persons in a frame. Defaults to 10.

transform(results: Dict) Dict[source]

Perform the pose decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.DecordDecode(mode: str = 'accurate')[source]

Using decord to decode the video.

Decord: https://github.com/dmlc/decord

Required Keys:

  • video_reader

  • frame_inds

Added Keys:

  • imgs

  • original_shape

  • img_shape

Parameters:

mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Defaults to 'accurate'.

transform(results: Dict) Dict[source]

Perform the Decord decoding.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.DecordInit(io_backend: str = 'disk', num_threads: int = 1, **kwargs)[source]

Using decord to initialize the video_reader.

Decord: https://github.com/dmlc/decord

Required Keys:

  • filename

Added Keys:

  • video_reader

  • total_frames

  • fps

Parameters:
  • io_backend (str) – IO backend where frames are stored. Defaults to 'disk'.

  • num_threads (int) – Number of threads used to decode the video. Defaults to 1.

  • kwargs (dict) – Args for file client.

transform(results: Dict) Dict[source]

Perform the Decord initialization.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
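
DecordInit adds video_reader and total_frames, and DecordDecode later consumes video_reader together with frame_inds, so a sampling transform normally sits between the two. The fragment below is a minimal config sketch of that ordering; the argument values are illustrative assumptions, and DenseSampleFrames is used only because it is documented here as adding frame_inds.

```python
# Hypothetical pipeline fragment; argument values are illustrative only.
decord_pipeline = [
    # Opens the file named by `filename`; adds `video_reader`, `total_frames`, `fps`.
    dict(type='DecordInit', io_backend='disk', num_threads=1),
    # Adds `frame_inds` (requires `total_frames` and `start_index`).
    dict(type='DenseSampleFrames', clip_len=8, frame_interval=1, num_clips=1),
    # Reads the frames selected by `frame_inds`; adds `imgs` and `img_shape`.
    dict(type='DecordDecode', mode='accurate'),
]

step_types = [step['type'] for step in decord_pipeline]
```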

class mmaction.datasets.transforms.DenseSampleFrames(*args, sample_range: int = 64, num_sample_positions: int = 10, **kwargs)[source]

Select frames from the video by dense sample strategy.

Required keys:

  • total_frames

  • start_index

Added keys:

  • frame_inds

  • clip_len

  • frame_interval

  • num_clips

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.

  • num_clips (int) – Number of clips to be sampled. Defaults to 1.

  • sample_range (int) – Total sample range for dense sample. Defaults to 64.

  • num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Defaults to 10. That is, by default there are at least 10 clips for one input sample in test mode.

  • temporal_jitter (bool) – Whether to apply temporal jittering. Defaults to False.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

transform(results: dict) dict[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
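
To make the roles of sample_range and num_sample_positions more concrete, here is a rough sketch of how test-mode clip offsets could be derived. It is an illustration under stated assumptions (evenly spaced start positions, offsets wrapped modulo the video length), not a copy of the library code, and the function name dense_test_offsets is hypothetical.

```python
def dense_test_offsets(total_frames, num_clips=1, sample_range=64,
                       num_sample_positions=10):
    """Return clip start offsets for evenly spaced test-mode start positions.

    Assumption-laden sketch: start positions are spread evenly over the
    frames that still leave room for `sample_range`, and each position
    contributes `num_clips` offsets spaced across the sample range.
    """
    last_start = max(1, 1 + total_frames - sample_range) - 1
    interval = sample_range // num_clips
    if num_sample_positions == 1:
        starts = [0]
    else:
        step = last_start / (num_sample_positions - 1)
        starts = [int(i * step) for i in range(num_sample_positions)]
    offsets = []
    for start in starts:
        for clip in range(num_clips):
            # Wrap around so short videos still yield valid indices.
            offsets.append((start + clip * interval) % total_frames)
    return offsets


offsets = dense_test_offsets(100, num_clips=2, sample_range=64,
                             num_sample_positions=3)
```

Under these assumptions the number of offsets is num_sample_positions * num_clips, which matches the documented statement that the defaults give at least 10 clips per sample in test mode.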

class mmaction.datasets.transforms.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, left_kp=None, right_kp=None, lazy=False)[source]

Flip the input images with a probability.

Reverse the order of elements in the given imgs with a specific direction. The shape of the imgs is preserved, but the elements are reordered.

Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional); added or modified keys are “imgs”, “keypoint”, “lazy” and “flip_direction”. No keys are required in “lazy”; added or modified keys in “lazy” are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations, to make sure crop_quadruple is calculated properly.

Parameters:
  • flip_ratio (float) – Probability of implementing flip. Default: 0.5.

  • direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.

  • flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.

  • left_kp (list[int]) – Indexes of left keypoints, used to flip keypoints. Default: None.

  • right_kp (list[int]) – Indexes of right keypoints, used to flip keypoints. Default: None.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the Flip augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
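
The left_kp and right_kp arguments exist because a horizontal flip also exchanges the meaning of left and right joints. The sketch below illustrates that idea on plain (x, y) tuples; the function name and the mirroring formula img_width - 1 - x are assumptions made for illustration, not the library's code.

```python
def flip_keypoints_horizontally(keypoints, img_width, left_kp, right_kp):
    """Mirror (x, y) keypoints and swap paired left/right joint indices."""
    # Assumed mirroring rule: pixel column x maps to img_width - 1 - x.
    flipped = [(img_width - 1 - x, y) for x, y in keypoints]
    for left, right in zip(left_kp, right_kp):
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


# Toy skeleton: index 0 is a center joint, 1 a left joint, 2 a right joint.
toy_keypoints = [(10, 5), (2, 7), (30, 7)]
flipped_keypoints = flip_keypoints_horizontally(
    toy_keypoints, img_width=40, left_kp=[1], right_kp=[2])
```

After the swap, index 1 still refers to the joint that appears on the subject's left side in the flipped image.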

class mmaction.datasets.transforms.FormatAudioShape(input_format: str)[source]

Format final audio shape to the given input_format.

Required Keys:

  • audios

Modified Keys:

  • audios

Added Keys:

  • input_shape

Parameters:

input_format (str) – Define the final audio format.

transform(results: Dict) Dict[source]

Performs the FormatShape formatting.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.FormatGCNInput(num_person: int = 2, mode: str = 'zero')[source]

Format final skeleton shape.

Required Keys:

  • keypoint

  • keypoint_score (optional)

  • num_clips (optional)

Modified Key:

  • keypoint

Parameters:
  • num_person (int) – The maximum number of people. Defaults to 2.

  • mode (str) – The padding mode. Defaults to 'zero'.

transform(results: Dict) Dict[source]

The transform function of FormatGCNInput.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.FormatShape(input_format: str, collapse: bool = False)[source]

Format final imgs shape to the given input_format.

Required keys:

  • imgs (optional)

  • heatmap_imgs (optional)

  • modality (optional)

  • num_clips

  • clip_len

Modified Keys:

  • imgs

Added Keys:

  • input_shape

  • heatmap_input_shape (optional)

Parameters:
  • input_format (str) – Define the final data format.

  • collapse (bool) – To collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set as True when training and testing detectors. Defaults to False.

transform(results: Dict) Dict[source]

Performs the FormatShape formatting.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
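
The effect of input_format and collapse is easiest to see on shapes alone. The following sketch computes the output shape for the 'NCTHW' case, assuming the incoming imgs are stacked as (M, H, W, C) with M = num_crops * num_clips * clip_len; that input layout and the helper name ncthw_shape are assumptions made for this example.

```python
def ncthw_shape(imgs_shape, num_clips, clip_len, collapse=False):
    """Return the NCTHW output shape for imgs stacked as (M, H, W, C)."""
    m, h, w, c = imgs_shape
    num_crops, remainder = divmod(m, num_clips * clip_len)
    if remainder:
        raise ValueError('M must be a multiple of num_clips * clip_len')
    # N merges the crop and clip dimensions.
    shape = (num_crops * num_clips, c, clip_len, h, w)
    if collapse:
        if shape[0] != 1:
            raise ValueError('collapse requires N == 1')
        shape = shape[1:]  # NCTHW -> CTHW
    return shape


batched = ncthw_shape((32, 224, 224, 3), num_clips=4, clip_len=8)
collapsed = ncthw_shape((8, 224, 224, 3), num_clips=1, clip_len=8,
                        collapse=True)
```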

class mmaction.datasets.transforms.Fuse[source]

Fuse lazy operations.

Fusion order:

crop -> resize -> flip

Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.

transform(results)[source]

Fuse lazy operations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
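
A lazy pipeline defers the image operations and lets Fuse apply them in one pass, in the order crop -> resize -> flip. The fragment below is a minimal config sketch of that pattern; the argument values are illustrative, and passing lazy=True to Resize is an assumption, since Resize is not documented in this section.

```python
# Hypothetical lazy pipeline fragment; argument values are illustrative only.
lazy_pipeline = [
    # Records `crop_bbox` in results['lazy'] instead of cropping immediately.
    dict(type='MultiScaleCrop', input_size=224, scales=(1, 0.875), lazy=True),
    # Assumed to accept `lazy` as well; records the target size and interpolation.
    dict(type='Resize', scale=(224, 224), keep_ratio=False, lazy=True),
    # Records `flip` and `flip_direction` in results['lazy'].
    dict(type='Flip', flip_ratio=0.5, lazy=True),
    # Applies the recorded crop, resize and flip to `imgs`.
    dict(type='Fuse'),
]

lazy_steps = [step['type'] for step in lazy_pipeline if step.get('lazy')]
```

Fuse has to come after every lazy transform, because it is the step that actually modifies imgs.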

class mmaction.datasets.transforms.GenSkeFeat(dataset: str = 'nturgb+d', feats: List[str] = ['j'], axis: int = -1)[source]

Unified interface for generating multi-stream skeleton features.

Required Keys:

  • keypoint

  • keypoint_score (optional)

Parameters:
  • dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.

  • feats (list[str]) – The list of the keys of features. Defaults to ['j'].

  • axis (int) – The axis along which the features will be joined. Defaults to -1.

transform(results: Dict) Dict[source]

The transform function of GenSkeFeat.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.GenerateLocalizationLabels[source]

Load video label for localizer with given video_name list.

Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.

transform(results)[source]

Perform the GenerateLocalizationLabels loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.GeneratePoseTarget(sigma: float = 0.6, use_score: bool = True, with_kp: bool = True, with_limb: bool = False, skeletons: Tuple[Tuple[int]] = ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), double: bool = False, left_kp: Tuple[int] = (1, 3, 5, 7, 9, 11, 13, 15), right_kp: Tuple[int] = (2, 4, 6, 8, 10, 12, 14, 16), left_limb: Tuple[int] = (0, 2, 4, 5, 6, 10, 11, 12), right_limb: Tuple[int] = (1, 3, 7, 8, 9, 13, 14, 15), scaling: float = 1.0)[source]

Generate pseudo heatmaps based on joint coordinates and confidence.

Required Keys:

  • keypoint

  • keypoint_score (optional)

  • img_shape

Added Keys:

  • imgs (optional)

  • heatmap_imgs (optional)

Parameters:
  • sigma (float) – The sigma of the generated gaussian map. Defaults to 0.6.

  • use_score (bool) – Use the confidence score of keypoints as the maximum of the gaussian maps. Defaults to True.

  • with_kp (bool) – Generate pseudo heatmaps for keypoints. Defaults to True.

  • with_limb (bool) – Generate pseudo heatmaps for limbs. At least one of ‘with_kp’ and ‘with_limb’ should be True. Defaults to False.

  • skeletons (tuple[tuple]) – The definition of human skeletons. Defaults to ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), which is the definition of COCO-17p skeletons.

  • double (bool) – Output both original heatmaps and flipped heatmaps. Defaults to False.

  • left_kp (tuple[int]) – Indexes of left keypoints, which is used when flipping heatmaps. Defaults to (1, 3, 5, 7, 9, 11, 13, 15), which is left keypoints in COCO-17p.

  • right_kp (tuple[int]) – Indexes of right keypoints, which is used when flipping heatmaps. Defaults to (2, 4, 6, 8, 10, 12, 14, 16), which is right keypoints in COCO-17p.

  • left_limb (tuple[int]) – Indexes of left limbs, which is used when flipping heatmaps. Defaults to (0, 2, 4, 5, 6, 10, 11, 12), which is left limbs of skeletons we defined for COCO-17p.

  • right_limb (tuple[int]) – Indexes of right limbs, which is used when flipping heatmaps. Defaults to (1, 3, 7, 8, 9, 13, 14, 15), which is right limbs of skeletons we defined for COCO-17p.

  • scaling (float) – The ratio to scale the heatmaps. Defaults to 1.

gen_an_aug(results: Dict) ndarray[source]

Generate pseudo heatmaps for all frames.

Parameters:

results (dict) – The dictionary that contains all info of a sample.

Returns:

The generated pseudo heatmaps.

Return type:

np.ndarray

generate_a_heatmap(arr: ndarray, centers: ndarray, max_values: ndarray) None[source]

Generate pseudo heatmap for one keypoint in one frame.

Parameters:
  • arr (np.ndarray) – The array to store the generated heatmaps. Shape: img_h * img_w.

  • centers (np.ndarray) – The coordinates of corresponding keypoints (of multiple persons). Shape: M * 2.

  • max_values (np.ndarray) – The max values of each keypoint. Shape: M.
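
The idea behind a pseudo heatmap for one keypoint can be sketched in a few lines: each person contributes a Gaussian bump centered on the keypoint and scaled by its max value, and overlapping bumps are merged with an element-wise maximum. This is a plain-Python illustration under those assumptions (the merge rule in particular is assumed), not the library's implementation; the function name gaussian_heatmap is hypothetical.

```python
import math


def gaussian_heatmap(img_h, img_w, centers, max_values, sigma=0.6):
    """Build an img_h x img_w heatmap from (x, y) centers and peak values."""
    heatmap = [[0.0] * img_w for _ in range(img_h)]
    for (center_x, center_y), peak in zip(centers, max_values):
        for y in range(img_h):
            for x in range(img_w):
                squared_dist = (x - center_x) ** 2 + (y - center_y) ** 2
                value = peak * math.exp(-squared_dist / (2 * sigma ** 2))
                # Assumed merge rule: keep the strongest response per pixel.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


toy_heatmap = gaussian_heatmap(5, 5, centers=[(2, 2)], max_values=[0.9])
```

With use_score=True the peak of each bump is the keypoint's confidence score, so uncertain joints produce fainter responses.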

generate_a_limb_heatmap(arr: ndarray, starts: ndarray, ends: ndarray, start_values: ndarray, end_values: ndarray) None[source]

Generate pseudo heatmap for one limb in one frame.

Parameters:
  • arr (np.ndarray) – The array to store the generated heatmaps. Shape: img_h * img_w.

  • starts (np.ndarray) – The coordinates of one keypoint in the corresponding limbs. Shape: M * 2.

  • ends (np.ndarray) – The coordinates of the other keypoint in the corresponding limbs. Shape: M * 2.

  • start_values (np.ndarray) – The max values of one keypoint in the corresponding limbs. Shape: M.

  • end_values (np.ndarray) – The max values of the other keypoint in the corresponding limbs. Shape: M.

generate_heatmap(arr: ndarray, kps: ndarray, max_values: ndarray) None[source]

Generate pseudo heatmap for all keypoints and limbs in one frame (if needed).

Parameters:
  • arr (np.ndarray) – The array to store the generated heatmaps. Shape: V * img_h * img_w.

  • kps (np.ndarray) – The coordinates of keypoints in this frame. Shape: M * V * 2.

  • max_values (np.ndarray) – The confidence score of each keypoint. Shape: M * V.

transform(results: Dict) Dict[source]

Generate pseudo heatmaps based on joint coordinates and confidence.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]

Load and decode images.

Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters:
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.

  • kwargs (dict, optional) – Arguments for FileClient.

transform(results)[source]

Perform the ImageDecode to load image given the file path.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ImgAug(transforms)[source]

Imgaug augmentation.

Adds custom transformations from imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html to get more information. Two demo configs could be found in tsn and i3d config folder.

It’s better to use uint8 images as inputs since imgaug works best with numpy dtype uint8 and isn’t well tested with other dtypes. It should be noted that not all of the augmenters have the same input and output dtype, which may cause unexpected results.

Required keys are “imgs”, “img_shape”(if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.

It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc. So when using Imgaug along with other mmaction2 pipelines, we should pay more attention to required keys.

Two steps to use Imgaug pipeline:

  1. Create initialization parameter transforms. There are three ways to create transforms.

    1) string: only support default for now. e.g. transforms='default'

    2) list[dict]: create a list of augmenters by a list of dicts, each dict corresponds to one augmenter. Every dict MUST contain a key named type. type should be a string (iaa.Augmenter's name) or an iaa.Augmenter subclass. e.g. transforms=[dict(type='Rotate', rotate=(-20, 20))] e.g. transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))]

    3) iaa.Augmenter: create an imgaug.Augmenter object. e.g. transforms=iaa.Rotate(rotate=(-20, 20))

  2. Add Imgaug in dataset pipeline. It is recommended to insert imgaug pipeline before Normalize. A demo pipeline is listed as follows.

```
pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,
    ),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```

Parameters:

transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create imgaug augmenter.

static default_transforms()[source]

Default transforms for imgaug.

Implement RandAugment by imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.

Augmenters and hyper parameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py

Miss one augmenter SolarizeAdd since imgaug doesn’t support this.

Returns:

The constructed RandAugment transforms.

Return type:

dict

imgaug_builder(cfg)[source]

Import a module from imgaug.

It follows the logic of build_from_cfg(). Use a dict object to create an iaa.Augmenter object.

Parameters:

cfg (dict) – Config dict. It should at least contain the key “type”.

Returns:

iaa.Augmenter: The constructed imgaug augmenter.

Return type:

obj

transform(results)[source]

Perform Imgaug augmentations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.JointToBone(dataset: str = 'nturgb+d', target: str = 'keypoint')[source]

Convert the joint information to bone information.

Required Keys:

  • keypoint

Modified Keys:

  • keypoint

Parameters:
  • dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.

  • target (str) – The target key for the bone information. Defaults to 'keypoint'.

transform(results: Dict) Dict[source]

The transform function of JointToBone.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
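
Conceptually, a bone feature is the vector from one joint to the joint it is connected to. The sketch below shows that subtraction on a toy skeleton; the (child, parent) pairs are hypothetical and do not reproduce the pair tables that the transform uses for 'nturgb+d', 'openpose' or 'coco'.

```python
def joints_to_bones(joints, pairs):
    """Return one bone vector per joint: the joint minus its paired joint."""
    bones = [tuple(0.0 for _ in joint) for joint in joints]
    for child, parent in pairs:
        bones[child] = tuple(
            c - p for c, p in zip(joints[child], joints[parent]))
    return bones


toy_joints = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.5)]
toy_pairs = [(0, 0), (1, 0), (2, 1)]  # hypothetical (child, parent) pairs
toy_bones = joints_to_bones(toy_joints, toy_pairs)
```

A joint paired with itself, such as the root here, yields a zero vector.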

class mmaction.datasets.transforms.LoadAudioFeature(pad_method: str = 'zero')[source]

Load offline extracted audio features.

Required Keys:

  • audio_path

Added Keys:

  • length

  • audios

Parameters:

pad_method (str) – Padding method. Defaults to 'zero'.

transform(results: Dict) Dict[source]

Perform the numpy loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadHVULabel(**kwargs)[source]

Convert the HVU label from dictionaries to torch tensors.

Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.

init_hvu_info(categories, category_nums)[source]

Initialize hvu information.

transform(results)[source]

Convert the label dictionary to 3 tensors: “label”, “mask” and “category_mask”.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadLocalizationFeature[source]

Load Video features for localizer with given video_name list.

The required key is “feature_path”, added or modified keys are “raw_feature”.

Parameters:

raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.

transform(results)[source]

Perform the LoadLocalizationFeature loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadProposals(top_k, pgm_proposals_dir, pgm_features_dir, proposal_ext='.csv', feature_ext='.npy')[source]

Loading proposals with given proposal results.

Required keys are “video_name”, added or modified keys are ‘bsp_feature’, ‘tmin’, ‘tmax’, ‘tmin_score’, ‘tmax_score’ and ‘reference_temporal_iou’.

Parameters:
  • top_k (int) – The top k proposals to be loaded.

  • pgm_proposals_dir (str) – Directory to load proposals.

  • pgm_features_dir (str) – Directory to load proposal features.

  • proposal_ext (str) – Proposal file extension. Default: ‘.csv’.

  • feature_ext (str) – Feature file extension. Default: ‘.npy’.

transform(results)[source]

Perform the LoadProposals loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadRGBFromFile(to_float32: bool = False, color_type: str = 'color', imdecode_backend: str = 'cv2', io_backend: str = 'disk', ignore_empty: bool = False, **kwargs)[source]

Load an RGB image from file.

Required Keys:

  • img_path

Modified Keys:

  • img

  • img_shape

  • ori_shape

Parameters:
  • to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is a uint8 array. Defaults to False.

  • color_type (str) – The flag argument for mmcv.imfrombytes. Defaults to ‘color’.

  • imdecode_backend (str) – The image decoding backend type. The backend argument for mmcv.imfrombytes. See mmcv.imfrombytes for details. Defaults to ‘cv2’.

  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • ignore_empty (bool) – Whether to allow loading empty image or file path not existent. Defaults to False.

  • kwargs (dict) – Args for file client.

transform(results: dict) dict[source]

Functions to load image.

Parameters:

results (dict) – Result dict from mmcv.BaseDataset.

Returns:

The dict contains loaded image and meta information.

Return type:

dict

class mmaction.datasets.transforms.MMCompact(padding: float = 0.25, threshold: int = 10, hw_ratio: float | Tuple[float] = 1, allow_imgpad: bool = True)[source]

Convert the coordinates of keypoints and crop the images to make them more compact.

Required Keys:

  • imgs

  • keypoint

  • img_shape

Modified Keys:

  • imgs

  • keypoint

  • img_shape

Parameters:
  • padding (float) – The padding size. Defaults to 0.25.

  • threshold (int) – The threshold for the tight bounding box. If the width or height of the tight bounding box is smaller than the threshold, we do not perform the compact operation. Defaults to 10.

  • hw_ratio (float | tuple[float]) – The hw_ratio of the expanded box. Float indicates the specific ratio and tuple indicates a ratio range. If set as None, it means there is no requirement on hw_ratio. Defaults to 1.

  • allow_imgpad (bool) – Whether to allow expanding the box outside the image to meet the hw_ratio requirement. Defaults to True.

transform(results: Dict) Dict[source]

The transform function of MMCompact.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MMDecode(io_backend: str = 'disk', **kwargs)[source]

Decode RGB videos and skeletons.

transform(results: Dict) Dict[source]

The transform function of MMDecode.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MMUniformSampleFrames(clip_len: int, num_clips: int = 1, test_mode: bool = False, seed: int = 255)[source]

Uniformly sample frames from the multi-modal data.

transform(results: Dict) Dict[source]

The transform function of MMUniformSampleFrames.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MergeSkeFeat(feat_list: List[str] = ['keypoint'], target: str = 'keypoint', axis: int = -1)[source]

Merge multi-stream features.

Parameters:
  • feat_list (list[str]) – The list of the keys of features. Defaults to ['keypoint'].

  • target (str) – The target key for the merged multi-stream information. Defaults to 'keypoint'.

  • axis (int) – The axis along which the features will be joined. Defaults to -1.

transform(results: Dict) Dict[source]

The transform function of MergeSkeFeat.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MultiScaleCrop(input_size, scales=(1,), max_wh_scale_gap=1, random_crop=False, num_fixed_crops=5, lazy=False)[source]

Crop images with a list of randomly selected scales.

Randomly select the w and h scales from a list of scales. A scale of 1 means the base size, which is the minimum of the image width and height. The gap between the scale levels of w and h is kept below a certain value to prevent aspect ratios that are too large or too small.

Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “crop_bbox”, “img_shape”, “lazy” and “scales”. Required keys in “lazy” are “crop_bbox”, added or modified key is “crop_bbox”.

Parameters:
  • input_size (int | tuple[int]) – (w, h) of network input.

  • scales (tuple[float]) – width and height scales to be selected.

  • max_wh_scale_gap (int) – Maximum gap of w and h scale levels. Default: 1.

  • random_crop (bool) – If set to True, the cropping bbox will be randomly sampled, otherwise it will be sampled from fixed regions. Default: False.

  • num_fixed_crops (int) – If set to 5, the cropping bbox will keep 5 basic fixed regions: “upper left”, “upper right”, “lower left”, “lower right”, “center”. If set to 13, the cropping bbox will append another 8 fix regions: “center left”, “center right”, “lower center”, “upper center”, “upper left quarter”, “upper right quarter”, “lower left quarter”, “lower right quarter”. Default: 5.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the MultiScaleCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
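
The interaction between scales and max_wh_scale_gap can be sketched as follows: every scale gives one candidate side length relative to the base size, and a (width, height) pair is allowed only if the two scale indexes differ by at most max_wh_scale_gap. This is an illustrative reading of the description above; the function name candidate_crop_sizes is hypothetical.

```python
def candidate_crop_sizes(img_w, img_h, scales=(1,), max_wh_scale_gap=1):
    """List the (crop_w, crop_h) pairs allowed by the scale-gap constraint."""
    base_size = min(img_w, img_h)
    sizes = [int(base_size * scale) for scale in scales]
    candidates = []
    for i, crop_h in enumerate(sizes):
        for j, crop_w in enumerate(sizes):
            # Reject pairs whose scale levels are too far apart.
            if abs(i - j) <= max_wh_scale_gap:
                candidates.append((crop_w, crop_h))
    return candidates


candidates = candidate_crop_sizes(
    340, 256, scales=(1, 0.875, 0.75, 0.66), max_wh_scale_gap=1)
```

One candidate is then picked at random and, per the description above, the crop is finally brought to input_size.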

class mmaction.datasets.transforms.OpenCVDecode[source]

Using OpenCV to decode the video.

Required keys are 'video_reader', 'filename' and 'frame_inds', added or modified keys are 'imgs', 'img_shape' and 'original_shape'.

transform(results: dict) dict[source]

Perform the OpenCV decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.OpenCVInit(io_backend: str = 'disk', **kwargs)[source]

Using OpenCV to initialize the video_reader.

Required keys are 'filename', added or modified keys are 'new_path', 'video_reader' and 'total_frames'.

Parameters:

io_backend (str) – IO backend where frames are stored. Defaults to 'disk'.

transform(results: dict) dict[source]

Perform the OpenCV initialization.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PIMSDecode[source]

Using PIMS to decode the videos.

PIMS: https://github.com/soft-matter/pims

Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

transform(results)[source]

Perform the PIMS decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PIMSInit(io_backend='disk', mode='accurate', **kwargs)[source]

Use PIMS to initialize the video.

PIMS: https://github.com/soft-matter/pims

Parameters:
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will always use pims.PyAVReaderIndexed to decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking by using pims.PyAVReaderTimed. Both will return the accurate frames in most cases. Default: ‘accurate’.

  • kwargs (dict) – Args for file client.

transform(results)[source]

Perform the PIMS initialization.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PackActionInputs(collect_keys: Tuple[str] | None = None, meta_keys: Sequence[str] = ('img_shape', 'img_key', 'video_id', 'timestamp'), algorithm_keys: Sequence[str] = ())[source]

Pack the inputs data.

Parameters:
  • collect_keys (tuple[str], optional) – The keys to be collected to packed_results['inputs']. Defaults to None.

  • meta_keys (Sequence[str]) – The meta keys to be saved in the metainfo of the data_sample. Defaults to ('img_shape', 'img_key', 'video_id', 'timestamp').

  • algorithm_keys (Sequence[str]) – The keys of custom elements to be used in the algorithm. Defaults to an empty tuple.

transform(results: Dict) Dict[source]

The transform function of PackActionInputs.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.PackLocalizationInputs(keys=(), meta_keys=('video_name',))[source]
transform(results)[source]

Method to pack the input data.

Parameters:

results (dict) – Result dict from the data pipeline.

Returns:

  • ‘inputs’ (torch.Tensor): The forward data of models.

  • ‘data_samples’ (DetDataSample): The annotation info of the sample.

Return type:

dict

class mmaction.datasets.transforms.PadTo(length: int, mode: str = 'loop')[source]

Sample frames from the video.

To sample an n-frame clip from the video, PadTo samples the frames starting from index zero, and loops or zero-pads the frames if the number of video frames is less than the value of length.

Required Keys:

  • keypoint

  • total_frames

  • start_index (optional)

Modified Keys:

  • keypoint

  • total_frames

Parameters:
  • length (int) – The maximum length of the sampled output clip.

  • mode (str) – The padding mode. Defaults to 'loop'.

transform(results: Dict) Dict[source]

The transform function of PadTo.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
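
The two padding modes can be illustrated on a plain list of frames. This is a minimal sketch of the behavior described above, assuming 'zero' is the only alternative to 'loop'; the function name pad_to is hypothetical and the real transform operates on keypoint arrays rather than lists.

```python
def pad_to(frames, length, mode='loop'):
    """Return `length` frames, looping or zero-padding a shorter sequence."""
    total = len(frames)
    if total > length:
        raise ValueError('sequence is longer than the target length')
    if mode not in ('loop', 'zero'):
        raise ValueError("mode must be 'loop' or 'zero'")
    # Start at index zero and wrap around once the sequence is exhausted.
    padded = [frames[i % total] for i in range(length)]
    if mode == 'zero':
        padded[total:] = [0] * (length - total)
    return padded


looped = pad_to([1, 2, 3], length=7, mode='loop')
zero_padded = pad_to([1, 2, 3], length=7, mode='zero')
```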

class mmaction.datasets.transforms.PoseCompact(padding: float = 0.25, threshold: int = 10, hw_ratio: float | Tuple[float] | None = None, allow_imgpad: bool = True)[source]

Convert the coordinates of keypoints to make them more compact. Specifically, it first finds a tight bounding box that surrounds all joints in each frame, then expands the tight box by a given padding ratio. For example, if ‘padding == 0.25’, then the expanded box has an unchanged center, and 1.25x width and height.

Required Keys:

  • keypoint

  • img_shape

Modified Keys:

  • img_shape

  • keypoint

Added Keys:

  • crop_quadruple

Parameters:
  • padding (float) – The padding size. Defaults to 0.25.

  • threshold (int) – The threshold for the tight bounding box. If the width or height of the tight bounding box is smaller than the threshold, we do not perform the compact operation. Defaults to 10.

  • hw_ratio (float | tuple[float] | None) – The hw_ratio of the expanded box. Float indicates the specific ratio and tuple indicates a ratio range. If set as None, it means there is no requirement on hw_ratio. Defaults to None.

  • allow_imgpad (bool) – Whether to allow expanding the box outside the image to meet the hw_ratio requirement. Defaults to True.

transform(results: Dict) Dict[source]

Convert the coordinates of keypoints to make it more compact.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
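
The box expansion described for PoseCompact (same center, width and height scaled by 1 + padding) can be written out directly. This sketch covers only that step and ignores threshold, hw_ratio and allow_imgpad; the function name expand_box is hypothetical.

```python
def expand_box(box, padding=0.25):
    """Scale a (min_x, min_y, max_x, max_y) box about its center."""
    min_x, min_y, max_x, max_y = box
    center_x = (min_x + max_x) / 2
    center_y = (min_y + max_y) / 2
    half_w = (max_x - min_x) / 2 * (1 + padding)
    half_h = (max_y - min_y) / 2 * (1 + padding)
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)


expanded = expand_box((10, 20, 30, 60), padding=0.25)
```

Here a 20 x 40 box centered at (20, 40) becomes a 25 x 50 box with the same center.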

class mmaction.datasets.transforms.PoseDecode[source]

Load and decode pose with given indices.

Required Keys:

  • keypoint

  • total_frames (optional)

  • frame_inds (optional)

  • offset (optional)

  • keypoint_score (optional)

Modified Keys:

  • keypoint

  • keypoint_score (optional)

transform(results: Dict) Dict[source]

The transform function of PoseDecode.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.PreNormalize2D(img_shape: Tuple[int, int] = (1080, 1920))[source]

Normalize the range of keypoint values.

Required Keys:

  • keypoint

  • img_shape (optional)

Modified Keys:

  • keypoint

Parameters:

img_shape (tuple[int, int]) – The resolution of the original video. Defaults to (1080, 1920).

transform(results: Dict) Dict[source]

The transform function of PreNormalize2D.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
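
One common way to normalize 2D keypoints by image resolution is to map pixel coordinates to the range [-1, 1] around the image center. The sketch below assumes that mapping and an img_shape given as (height, width); the description above does not state the exact formula, so treat this as an illustration rather than the transform's definition.

```python
def normalize_keypoint(x, y, img_shape=(1080, 1920)):
    """Map pixel coordinates to [-1, 1], assuming img_shape is (h, w)."""
    h, w = img_shape
    return ((x - w / 2) / (w / 2), (y - h / 2) / (h / 2))


center = normalize_keypoint(960, 540)
corner = normalize_keypoint(1920, 1080)
origin = normalize_keypoint(0, 0)
```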

class mmaction.datasets.transforms.PreNormalize3D(zaxis: List[int] = [0, 1], xaxis: List[int] = [8, 4], align_spine: bool = True, align_shoulder: bool = True, align_center: bool = True)[source]

PreNormalize for NTURGB+D 3D keypoints (x, y, z).

PreNormalize3D first subtracts the coordinates of the ‘spine’ (joint #1 in ntu) of the first person in the first frame from the coordinates of each joint. Subsequently, it performs a 3D rotation to fix the Z axis parallel to the 3D vector from the ‘hip’ (joint #0) to the ‘spine’ (joint #1) and the X axis toward the 3D vector from the ‘right shoulder’ (joint #8) to the ‘left shoulder’ (joint #4). Code adapted from https://github.com/lshiwjx/2s-AGCN.

Required Keys:

  • keypoint

  • total_frames (optional)

Modified Keys:

  • keypoint

Added Keys:

  • body_center

Parameters:
  • zaxis (list[int]) – The target Z axis for the 3D rotation. Defaults to [0, 1].

  • xaxis (list[int]) – The target X axis for the 3D rotation. Defaults to [8, 4].

  • align_spine (bool) – Whether to perform a 3D rotation to align the spine. Defaults to True.

  • align_shoulder (bool) – Whether to perform a 3D rotation to align the shoulder. Defaults to True.

  • align_center (bool) – Whether to align the body center. Defaults to True.

angle_between(v1: ndarray, v2: ndarray) float[source]

Returns the angle in radians between vectors ‘v1’ and ‘v2’.

rotation_matrix(axis: ndarray, theta: float) ndarray[source]

Returns the rotation matrix associated with counterclockwise rotation about the given axis by theta radians.

transform(results: Dict) Dict[source]

The transform function of PreNormalize3D.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

unit_vector(vector: ndarray) ndarray[source]

Returns the unit vector of the vector.
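The three geometric helpers above (unit_vector, angle_between, rotation_matrix) can be sketched in plain Python. This is a minimal illustration that assumes non-zero input vectors and uses Rodrigues' rotation formula; the apply_matrix helper is hypothetical, and the library's own handling of degenerate inputs may differ.

```python
import math
from typing import List, Sequence


def unit_vector(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def angle_between(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(unit_vector(v1), unit_vector(v2)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot)))


def rotation_matrix(axis: Sequence[float], theta: float) -> List[List[float]]:
    """Counterclockwise rotation about `axis` by `theta` radians (Rodrigues)."""
    x, y, z = unit_vector(axis)
    c, s = math.cos(theta), math.sin(theta)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def apply_matrix(matrix: List[List[float]], v: Sequence[float]) -> List[float]:
    """Hypothetical helper: multiply a 3x3 matrix by a 3-vector."""
    return [sum(m * c for m, c in zip(row, v)) for row in matrix]


# Rotating the X unit vector by 90 degrees about Z should land on the Y axis.
rotated = apply_matrix(rotation_matrix([0, 0, 1], math.pi / 2), [1.0, 0.0, 0.0])
right_angle = angle_between([1, 0, 0], [0, 1, 0])
```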

class mmaction.datasets.transforms.PyAVDecode(multi_thread=False, mode='accurate')[source]

Using PyAV to decode the video.

PyAV: https://github.com/mikeboers/PyAV

Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters:
  • multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.

  • mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return the nearest key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Default: ‘accurate’.

static frame_generator(container, stream)[source]

Frame generator for PyAV.

transform(results)[source]

Perform the PyAV decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PyAVDecodeMotionVector(multi_thread=False, mode='accurate')[source]

Using pyav to decode the motion vectors from video.

Reference: https://github.com/PyAV-Org/PyAV/blob/main/tests/test_decode.py

Required keys are “video_reader” and “frame_inds”, added or modified keys are “motion_vectors”, “frame_inds”.

transform(results)[source]

Perform the PyAV motion vector decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PyAVInit(io_backend='disk', **kwargs)[source]

Using pyav to initialize the video.

PyAV: https://github.com/mikeboers/PyAV

Required keys are “filename”, added or modified keys are “video_reader”, and “total_frames”.

Parameters:
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • kwargs (dict) – Args for file client.

transform(results)[source]

Perform the PyAV initialization.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PytorchVideoWrapper(op, **kwargs)[source]

PytorchVideoTrans Augmentations, under pytorchvideo.transforms.

Parameters:

op (str) – The name of the pytorchvideo transformation.

transform(results)[source]

Perform PytorchVideoTrans augmentations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RandomCrop(size, lazy=False)[source]

Vanilla square random crop that specifies the output size.

Required keys in results are “img_shape”, “keypoint” (optional), “imgs” (optional), added or modified keys are “keypoint”, “imgs”, “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.

Parameters:
  • size (int) – The output size of the images.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the RandomCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RandomRescale(scale_range, interpolation='bilinear')[source]

Randomly resize images so that the short edge takes a size sampled from a given range. The aspect ratio is unchanged after resizing.

Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “resize_size”, “short_edge”.

Parameters:
  • scale_range (tuple[int]) – The range of short edge length. A closed interval.

  • interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.

transform(results)[source]

Performs the Resize augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RandomResizedCrop(area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.3333333333333333), lazy=False)[source]

Random crop that specifies the area and height-width ratio range.

Required keys in results are “img_shape”, “crop_bbox”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox” and “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.

Parameters:
  • area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).

  • aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

static get_crop_bbox(img_shape, area_range, aspect_ratio_range, max_attempts=10)[source]

Get a crop bbox given the area range and aspect ratio range.

Parameters:
  • img_shape (Tuple[int]) – Image shape

  • area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).

  • aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).

  • max_attempts (int) – The maximum number of attempts to generate a random candidate bounding box. If no attempt yields a qualified box, the center bounding box will be used. Default: 10.

Returns:

(list[int]) A random crop bbox within the area range and aspect ratio range.
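The sampling described above can be sketched as follows. This is an illustrative reimplementation, not the library's code: it assumes img_shape is (height, width), that the aspect ratio is drawn log-uniformly, that the fallback "center bounding box" is a centered square, and that the box is returned as (x1, y1, x2, y2). The function name sample_crop_bbox and the rng argument are hypothetical.

```python
import math
import random
from typing import Optional, Tuple


def sample_crop_bbox(
    img_shape: Tuple[int, int],
    area_range: Tuple[float, float] = (0.08, 1.0),
    aspect_ratio_range: Tuple[float, float] = (3 / 4, 4 / 3),
    max_attempts: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int, int, int]:
    """Try random boxes within the area/ratio ranges, else use a center crop."""
    rng = rng or random.Random()
    img_h, img_w = img_shape  # assumption: (height, width)
    log_lo = math.log(aspect_ratio_range[0])
    log_hi = math.log(aspect_ratio_range[1])
    for _ in range(max_attempts):
        area = rng.uniform(*area_range) * img_h * img_w
        ratio = math.exp(rng.uniform(log_lo, log_hi))  # width / height
        crop_w = int(round(math.sqrt(area * ratio)))
        crop_h = int(round(math.sqrt(area / ratio)))
        if 0 < crop_w <= img_w and 0 < crop_h <= img_h:
            x = rng.randint(0, img_w - crop_w)
            y = rng.randint(0, img_h - crop_h)
            return x, y, x + crop_w, y + crop_h
    # Fallback (assumed): the largest centered square.
    side = min(img_h, img_w)
    x, y = (img_w - side) // 2, (img_h - side) // 2
    return x, y, x + side, y + side


bbox = sample_crop_bbox((240, 320), rng=random.Random(0))
fallback = sample_crop_bbox((240, 320), max_attempts=0)
```

Passing a seeded random.Random makes the sketch reproducible; with max_attempts=0 the loop is skipped and only the fallback path runs.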

transform(results)[source]

Performs the RandomResizedCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RawFrameDecode(io_backend: str = 'disk', decoding_backend: str = 'cv2', **kwargs)[source]

Load and decode frames with given indices.

Required Keys:

  • frame_dir

  • filename_tmpl

  • frame_inds

  • modality

  • offset (optional)

Added Keys:

  • imgs

  • img_shape

  • original_shape

Parameters:
  • io_backend (str) – IO backend where frames are stored. Defaults to 'disk'.

  • decoding_backend (str) – Backend used for image decoding. Defaults to 'cv2'.

transform(results: dict) dict[source]

Perform the RawFrameDecode to pick frames given indices.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
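Transforms such as this one are normally chained in a pipeline described by config dicts with a type key. The fragment below is a sketch of that convention using only transforms and parameters documented in this section; the variable name train_pipeline and all numeric values are illustrative assumptions, not recommended settings.

```python
# Illustrative pipeline: sample frame indices, decode the raw frames,
# then resize every frame to a fixed (w, h) without keeping the ratio.
train_pipeline = [
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8),
    dict(type='RawFrameDecode', io_backend='disk', decoding_backend='cv2'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
]

step_names = [step['type'] for step in train_pipeline]
```

The order matters here: SampleFrames adds frame_inds, which RawFrameDecode lists among its required keys.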

class mmaction.datasets.transforms.Resize(scale, keep_ratio=True, interpolation='bilinear', lazy=False)[source]

Resize images to a specific size.

Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy”, “resize_size”. There are no required keys in “lazy”; the added or modified key is “interpolation”.

Parameters:
  • scale (float | Tuple[int]) – If keep_ratio is True, it serves as scaling factor or maximum size: If it is a float number, the image will be rescaled by this factor, else if it is a tuple of 2 integers, the image will be rescaled as large as possible within the scale. Otherwise, it serves as (w, h) of output size.

  • keep_ratio (bool) – If set to True, Images will be resized without changing the aspect ratio. Otherwise, it will resize images to a given size. Default: True.

  • interpolation (str) – Algorithm used for interpolation, accepted values are “nearest”, “bilinear”, “bicubic”, “area”, “lanczos”. Default: “bilinear”.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.
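To make the keep_ratio=True behaviour of scale concrete, the sketch below computes the output size for both a float factor and a 2-tuple bound. It assumes that "as large as possible within the scale" means the long edge is bounded by the larger value and the short edge by the smaller one, and that sizes are rounded to the nearest integer; the helper name rescale_size_sketch is hypothetical.

```python
from typing import Tuple, Union


def rescale_size_sketch(
    old_size: Tuple[int, int],
    scale: Union[float, Tuple[int, int]],
) -> Tuple[int, int]:
    """Return the new (w, h) that keeps the aspect ratio of old_size."""
    w, h = old_size
    if isinstance(scale, (int, float)):
        factor = float(scale)  # plain scaling factor
    else:
        max_long_edge, max_short_edge = max(scale), min(scale)
        # Largest factor that keeps both edges within their bounds.
        factor = min(max_long_edge / max(h, w), max_short_edge / min(h, w))
    return int(w * factor + 0.5), int(h * factor + 0.5)


half = rescale_size_sketch((1920, 1080), 0.5)
bounded = rescale_size_sketch((1920, 1080), (256, 256))
```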

transform(results)[source]

Performs the Resize augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.SampleAVAFrames(clip_len, frame_interval=2, test_mode=False)[source]

transform(results)[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.SampleFrames(clip_len: int, frame_interval: int = 1, num_clips: int = 1, temporal_jitter: bool = False, twice_sample: bool = False, out_of_bound_opt: str = 'loop', test_mode: bool = False, keep_tail_frames: bool = False, target_fps: int | None = None, **kwargs)[source]

Sample frames from the video.

Required Keys:

  • total_frames

  • start_index

Added Keys:

  • frame_inds

  • frame_interval

  • num_clips

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.

  • num_clips (int) – Number of clips to be sampled. Default: 1.

  • temporal_jitter (bool) – Whether to apply temporal jittering. Defaults to False.

  • twice_sample (bool) – Whether to use twice sample when testing. If set to True, it will sample frames with and without fixed shift, which is commonly used for testing in TSM model. Defaults to False.

  • out_of_bound_opt (str) – The way to deal with out of bounds frame indexes. Available options are ‘loop’, ‘repeat_last’. Defaults to ‘loop’.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • keep_tail_frames (bool) – Whether to keep tail frames when sampling. Defaults to False.

  • target_fps (optional, int) – Convert input videos with arbitrary frame rates to the unified target FPS before sampling frames. If None, the frame rate will not be adjusted. Defaults to None.
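The interaction between clip_len, frame_interval and out_of_bound_opt can be sketched as below. This is a simplified illustration that takes the clip start offsets as given (the library chooses them itself, randomly or deterministically depending on test_mode); the helper name build_frame_inds is hypothetical.

```python
from typing import List, Sequence


def build_frame_inds(
    clip_offsets: Sequence[int],
    clip_len: int,
    frame_interval: int,
    total_frames: int,
    out_of_bound_opt: str = 'loop',
) -> List[int]:
    """Expand each clip offset into clip_len indices spaced by frame_interval."""
    frame_inds = []
    for offset in clip_offsets:
        for i in range(clip_len):
            idx = offset + i * frame_interval
            if out_of_bound_opt == 'loop':
                idx %= total_frames  # wrap around to the start
            elif out_of_bound_opt == 'repeat_last':
                idx = min(idx, total_frames - 1)  # clamp to the last frame
            else:
                raise ValueError(f'unknown option: {out_of_bound_opt}')
            frame_inds.append(idx)
    return frame_inds


looped = build_frame_inds([6], clip_len=4, frame_interval=2, total_frames=10)
clamped = build_frame_inds([6], 4, 2, 10, out_of_bound_opt='repeat_last')
```

With a 10-frame video, a clip starting at frame 6 runs past the end after two samples, so the two options diverge from the third index onward.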

transform(results: dict) dict[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.TenCrop(crop_size)[source]

Crop the images into 10 crops (corner + center + flip).

Crop the four corners and the center part of the image with the same given crop_size, and flip it horizontally. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters:

crop_size (int | tuple[int]) – (w, h) of crop size.
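The ten crops can be enumerated as five boxes (four corners plus center), each taken once as-is and once flipped. The sketch below only computes the boxes and flip flags; it assumes img_shape is (height, width), boxes are (x1, y1, x2, y2), and an ordering chosen for readability. The helper name ten_crop_boxes is hypothetical.

```python
from typing import List, Tuple, Union

Box = Tuple[int, int, int, int]


def ten_crop_boxes(
    img_shape: Tuple[int, int],
    crop_size: Union[int, Tuple[int, int]],
) -> List[Tuple[Box, bool]]:
    """Return (box, flipped) pairs for the corner and center crops."""
    img_h, img_w = img_shape  # assumption: (height, width)
    if isinstance(crop_size, int):
        crop_w = crop_h = crop_size
    else:
        crop_w, crop_h = crop_size  # documented as (w, h)
    offsets = [
        (0, 0),                                          # top-left
        (img_w - crop_w, 0),                             # top-right
        (0, img_h - crop_h),                             # bottom-left
        (img_w - crop_w, img_h - crop_h),                # bottom-right
        ((img_w - crop_w) // 2, (img_h - crop_h) // 2),  # center
    ]
    boxes = [(x, y, x + crop_w, y + crop_h) for x, y in offsets]
    return [(box, flip) for box in boxes for flip in (False, True)]


crops = ten_crop_boxes((256, 340), 224)
```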

transform(results)[source]

Performs the TenCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ThreeCrop(crop_size)[source]

Crop images into three crops.

Crop the images equally into three crops with equal intervals along the shorter side. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters:

crop_size (int | tuple[int]) – (w, h) of crop size.

transform(results)[source]

Performs the ThreeCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ToMotion(dataset: str = 'nturgb+d', source: str = 'keypoint', target: str = 'motion')[source]

Convert the joint information or bone information to corresponding motion information.

Required Keys:

  • keypoint

Added Keys:

  • motion

Parameters:
  • dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.

  • source (str) – The source key for the joint or bone information. Defaults to 'keypoint'.

  • target (str) – The target key for the motion information. Defaults to 'motion'.
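One common way to derive motion from joint or bone data is the frame-to-frame difference. The sketch below assumes that definition, with the last frame padded by zeros so the output keeps the input length; both the definition and the helper name to_motion_sketch are assumptions for illustration rather than a description of the library's implementation.

```python
from typing import List, Tuple

Frame = List[Tuple[float, ...]]  # one coordinate tuple per joint


def to_motion_sketch(keypoints: List[Frame]) -> List[Frame]:
    """Difference between consecutive frames; the last frame is all zeros."""
    motion = []
    for t, frame in enumerate(keypoints):
        if t + 1 < len(keypoints):
            nxt = keypoints[t + 1]
            motion.append([
                tuple(b - a for a, b in zip(cur_joint, nxt_joint))
                for cur_joint, nxt_joint in zip(frame, nxt)
            ])
        else:
            motion.append([tuple(0.0 for _ in joint) for joint in frame])
    return motion


# Three frames, one joint with (x, y) coordinates.
frames = [[(0.0, 0.0)], [(1.0, 2.0)], [(3.0, 3.0)]]
motion = to_motion_sketch(frames)
```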

transform(results: Dict) Dict[source]

The transform function of ToMotion.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.TorchVisionWrapper(op, **kwargs)[source]

Torchvision Augmentations, under torchvision.transforms.

Parameters:

op (str) – The name of the torchvision transformation.

transform(results)[source]

Perform Torchvision augmentations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.Transpose(keys, order)[source]

Transpose image channels to a given order.

Parameters:
  • keys (Sequence[str]) – Required keys to be converted.

  • order (Sequence[int]) – Image channel order.

transform(results)[source]

Performs the Transpose formatting.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.UniformSample(clip_len: int, num_clips: int = 1, test_mode: bool = False)[source]

Uniformly sample frames from the video.

Modified from https://github.com/facebookresearch/SlowFast/blob/64abcc90ccfdcbb11cf91d6e525bed60e92a8796/slowfast/datasets/ssv2.py#L159.

To sample an n-frame clip from the video, UniformSample divides the video into n segments of equal length and randomly samples one frame from each segment.

Required keys:

  • total_frames

  • start_index

Added keys:

  • frame_inds

  • clip_len

  • frame_interval

  • num_clips

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • num_clips (int) – Number of clips to be sampled. Defaults to 1.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

transform(results: Dict) Dict[source]

Perform the Uniform Sampling.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.UniformSampleFrames(clip_len: int, num_clips: int = 1, test_mode: bool = False, seed: int = 255)[source]

Uniformly sample frames from the video.

To sample an n-frame clip from the video, UniformSampleFrames divides the video into n segments of equal length and randomly samples one frame from each segment. To make testing results reproducible, a random seed is set during testing so that the sampling is deterministic.

Required Keys:

  • total_frames

  • start_index (optional)

Added Keys:

  • frame_inds

  • frame_interval

  • num_clips

  • clip_len

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • num_clips (int) – Number of clips to be sampled. Defaults to 1.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • seed (int) – The random seed used during test time. Defaults to 255.
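The segment-based sampling described above can be sketched in a few lines. This is a minimal illustration that assumes the video has at least clip_len frames (the handling of shorter videos is not covered here) and uses Python's random module with an explicit seed to mimic the deterministic test-time behaviour; the helper name uniform_sample_sketch is hypothetical.

```python
import random
from typing import List, Optional


def uniform_sample_sketch(
    total_frames: int,
    clip_len: int,
    seed: Optional[int] = None,
) -> List[int]:
    """Pick one random frame index from each of clip_len equal segments."""
    if total_frames < clip_len:
        raise ValueError('sketch only covers total_frames >= clip_len')
    rng = random.Random(seed)
    # Segment i covers the half-open index range [bounds[i], bounds[i + 1]).
    bounds = [i * total_frames // clip_len for i in range(clip_len + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(clip_len)]


inds = uniform_sample_sketch(total_frames=100, clip_len=8, seed=255)
inds_again = uniform_sample_sketch(total_frames=100, clip_len=8, seed=255)
```

Because each index comes from its own segment, the result is always sorted and spread across the whole video; reusing the seed reproduces the same indices.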

transform(results: Dict) Dict[source]

The transform function of UniformSampleFrames.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.UntrimmedSampleFrames(clip_len=1, clip_interval=16, frame_interval=1)[source]

Sample frames from the untrimmed video.

Required keys are “filename”, “total_frames”, added or modified keys are “frame_inds”, “clip_interval” and “num_clips”.

Parameters:
  • clip_len (int) – The length of sampled clips. Defaults to 1.

  • clip_interval (int) – Clip interval of adjacent center of sampled clips. Defaults to 16.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.
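For untrimmed videos, the parameters above suggest clips placed at regular intervals along the whole video. The sketch below assumes the first clip center sits at clip_interval // 2, that each clip is built around its center, and that indices are clamped to the valid range; these details and the helper name untrimmed_frame_inds are assumptions for illustration.

```python
from typing import List


def untrimmed_frame_inds(
    total_frames: int,
    clip_len: int = 1,
    clip_interval: int = 16,
    frame_interval: int = 1,
) -> List[List[int]]:
    """Return one list of frame indices per clip, centers clip_interval apart."""
    half = clip_len // 2
    clips = []
    for center in range(clip_interval // 2, total_frames, clip_interval):
        clip = [center + (i - half) * frame_interval for i in range(clip_len)]
        # Clamp indices that fall before the first or after the last frame.
        clips.append([min(max(idx, 0), total_frames - 1) for idx in clip])
    return clips


single = untrimmed_frame_inds(40)
triple = untrimmed_frame_inds(40, clip_len=3, clip_interval=16, frame_interval=2)
```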

transform(results)[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

mmaction.engine

hooks

class mmaction.engine.hooks.OutputHook(module, outputs=None, as_tensor=False)[source]

Output feature map of some layers.

Parameters:
  • module (nn.Module) – The whole module to get layers.

  • outputs (tuple[str] | list[str]) – Layer name to output. Default: None.

  • as_tensor (bool) – Determine to return a tensor or a numpy array. Default: False.

class mmaction.engine.hooks.VisualizationHook(enable=False, interval: int = 5000, show: bool = False, out_dir: str | None = None, **kwargs)[source]

Classification Visualization Hook. Used to visualize validation and testing prediction results.

  • If out_dir is specified, all storage backends are ignored and the image is saved to out_dir.

  • If show is True, the result image is plotted in a window. Make sure a graphical interface is accessible.

Parameters:
  • enable (bool) – Whether to enable this hook. Defaults to False.

  • interval (int) – The interval of samples to visualize. Defaults to 5000.

  • show (bool) – Whether to display the drawn image. Defaults to False.

  • out_dir (str, optional) – directory where painted images will be saved in the testing process. If None, handle with the backends of the visualizer. Defaults to None.

  • **kwargs – other keyword arguments of mmcls.visualization.ClsVisualizer.add_datasample().

after_test_iter(runner: Runner, batch_idx: int, data_batch: dict, outputs: Sequence[ActionDataSample]) None[source]

Visualize every self.interval samples during test.

Parameters:
  • runner (Runner) – The runner of the testing process.

  • batch_idx (int) – The index of the current batch in the test loop.

  • data_batch (dict) – Data from dataloader.

  • outputs (Sequence[ActionDataSample]) – Outputs from model.

after_val_iter(runner: Runner, batch_idx: int, data_batch: dict, outputs: Sequence[ActionDataSample]) None[source]

Visualize every self.interval samples during validation.

Parameters:
  • runner (Runner) – The runner of the validation process.

  • batch_idx (int) – The index of the current batch in the val loop.

  • data_batch (dict) – Data from dataloader.

  • outputs (Sequence[ActionDataSample]) – Outputs from model.

optimizers

class mmaction.engine.optimizers.LearningRateDecayOptimizerConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[source]

Different learning rates are set for different layers of backbone. Note: Currently, this optimizer constructor is built for MViT.

Inspired by the implementations in PySlowFast and MMDetection (https://github.com/open-mmlab/mmdetection/tree/dev-3.x).

add_params(params: List[dict], module: Module, **kwargs) None[source]

Add all parameters of module to the params list.

The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.

Parameters:
  • params (list[dict]) – A list of param groups, it will be modified in place.

  • module (nn.Module) – The module to be added.

class mmaction.engine.optimizers.SwinOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[source]

add_params(params: List[dict], module: Module, prefix: str = 'base', **kwargs) None[source]

Add all parameters of module to the params list.

The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.

Parameters:
  • params (list[dict]) – A list of param groups, it will be modified in place.

  • module (nn.Module) – The module to be added.

  • prefix (str) – The prefix of the module. Defaults to 'base'.

class mmaction.engine.optimizers.TSMOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[source]

Optimizer constructor in TSM model.

This constructor builds optimizer in different ways from the default one.

  1. Parameters of the first conv layer have default lr and weight decay.

  2. Parameters of BN layers have default lr and zero weight decay.

  3. If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.

  4. Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.
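A constructor like this is usually selected from the optimizer wrapper config. The fragment below is a sketch of that usage: the constructor name and the fc_lr5 field come from the description above, while the optimizer type and all numeric values are illustrative assumptions.

```python
# Illustrative config: select the TSM constructor and enable the fc_lr5 rule.
optim_wrapper = dict(
    constructor='TSMOptimWrapperConstructor',
    paramwise_cfg=dict(fc_lr5=True),
    # Assumed optimizer settings, shown only to make the fragment complete.
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=1e-4),
)
```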

add_params(params, model, **kwargs)[source]

Add parameters and their corresponding lr and wd to the params.

Parameters:
  • params (list) – The list to be modified, containing all parameter groups and their corresponding lr and wd configurations.

  • model (nn.Module) – The model to be trained with the optimizer.

runner

class mmaction.engine.runner.MultiLoaderEpochBasedTrainLoop(runner, dataloader: Dict | DataLoader, other_loaders: List[Dict | DataLoader], max_epochs: int, val_begin: int = 1, val_interval: int = 1)[source]

EpochBasedTrainLoop with multiple dataloaders.

Parameters:
  • runner (Runner) – A reference of runner.

  • dataloader (Dataloader or Dict) – A dataloader object or a dict to build a dataloader for training the model.

  • other_loaders (List of Dataloader or Dict) – A list of other loaders. Each item in the list is a dataloader object or a dict to build a dataloader.

  • max_epochs (int) – Total training epochs.

  • val_begin (int) – The epoch that begins validating. Defaults to 1.

  • val_interval (int) – Validation interval. Defaults to 1.

run_epoch() None[source]

Iterate one epoch.

class mmaction.engine.runner.RetrievalTestLoop(runner, dataloader: DataLoader | Dict, evaluator: Evaluator | Dict | List, fp16: bool = False)[source]

Loop for multimodal retrieval test.

Parameters:
  • runner (Runner) – A reference of runner.

  • dataloader (Dataloader or dict) – A dataloader object or a dict to build a dataloader.

  • evaluator (Evaluator or dict or list) – Used for computing metrics.

  • fp16 (bool) – Whether to enable fp16 testing. Defaults to False.

run() dict[source]

Launch test.

class mmaction.engine.runner.RetrievalValLoop(runner, dataloader: DataLoader | Dict, evaluator: Evaluator | Dict | List, fp16: bool = False)[source]

Loop for multimodal retrieval val.

Parameters:
  • runner (Runner) – A reference of runner.

  • dataloader (Dataloader or dict) – A dataloader object or a dict to build a dataloader.

  • evaluator (Evaluator or dict or list) – Used for computing metrics.

  • fp16 (bool) – Whether to enable fp16 validation. Defaults to False.

run() dict[source]

Launch val.

mmaction.evaluation

functional

class mmaction.evaluation.functional.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[source]

Class to evaluate detection results on ActivityNet.

Parameters:
  • ground_truth_filename (str | None) – The filename of groundtruth. Default: None.

  • prediction_filename (str | None) – The filename of action detection results. Default: None.

  • tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).

  • verbose (bool) – Whether to print verbose logs. Default: False.

evaluate()[source]

Evaluates a prediction file.

For the detection task, the performance of a method is measured by the interpolated mean average precision.

wrapper_compute_average_precision()[source]

Computes average precision for each class.

mmaction.evaluation.functional.ava_eval(result_file, result_type, label_file, ann_file, exclude_file, verbose=True, ignore_empty_frames=True, custom_classes=None)[source]

Perform ava evaluation.

mmaction.evaluation.functional.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]

Compute average precision (in detection task) between ground truth and predicted data frames. If multiple predictions match the same predicted segment, only the one with highest score is matched as true positive. This code is greatly inspired by Pascal VOC devkit.

Parameters:
  • ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.

  • prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.

  • temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns:

1D array of average precision score.

Return type:

np.ndarray

mmaction.evaluation.functional.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]

Computes the average recall given an average number (percentile) of proposals per video.

Parameters:
  • ground_truth (dict) – Dict containing the ground truth instances.

  • proposals (dict) – Dict containing the proposal instances.

  • total_num_proposals (int) – Total number of proposals in the proposal dict.

  • max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.

  • temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns:

(recall, average_recall, proposals_per_video, auc). In recall, recall[i, j] is the recall at the i-th temporal_iou threshold and the j-th average number (percentile) of proposals per video. average_recall is the recall averaged over the temporal_iou thresholds (1D array), which is equivalent to recall.mean(axis=0). proposals_per_video is the average number of proposals per video. auc is the area under the AR@AN curve.

Return type:

tuple([np.ndarray, np.ndarray, np.ndarray, float])

mmaction.evaluation.functional.confusion_matrix(y_pred, y_real, normalize=None)[source]

Compute confusion matrix.

Parameters:
  • y_pred (list[int] | np.ndarray[int]) – Prediction labels.

  • y_real (list[int] | np.ndarray[int]) – Ground truth labels.

  • normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.

Returns:

Confusion matrix.

Return type:

np.ndarray
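The normalize options can be illustrated with a small pure-Python sketch. It follows the convention stated above (rows are true labels, columns are predictions) and assumes the matrix covers the sorted union of labels seen in either input; the function name confusion_matrix_sketch is hypothetical and nested lists stand in for the returned array.

```python
from typing import List, Optional, Sequence


def confusion_matrix_sketch(
    y_pred: Sequence[int],
    y_real: Sequence[int],
    normalize: Optional[str] = None,
) -> List[List[float]]:
    """Rows index the true label, columns index the predicted label."""
    labels = sorted(set(y_pred) | set(y_real))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    mat = [[0.0] * n for _ in range(n)]
    for pred, real in zip(y_pred, y_real):
        mat[index[real]][index[pred]] += 1
    if normalize == 'true':  # each row sums to 1
        mat = [[v / (sum(row) or 1.0) for v in row] for row in mat]
    elif normalize == 'pred':  # each column sums to 1
        col_sums = [sum(row[j] for row in mat) or 1.0 for j in range(n)]
        mat = [[v / col_sums[j] for j, v in enumerate(row)] for row in mat]
    elif normalize == 'all':  # whole matrix sums to 1
        total = sum(map(sum, mat)) or 1.0
        mat = [[v / total for v in row] for row in mat]
    return mat


counts = confusion_matrix_sketch([0, 1, 1, 2], [0, 1, 2, 2])
by_true = confusion_matrix_sketch([0, 1, 1, 2], [0, 1, 2, 2], normalize='true')
```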

mmaction.evaluation.functional.get_weighted_score(score_list, coeff_list)[source]

Get weighted score with given scores and coefficients.

Given n predictions by different classifier: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n

Parameters:
  • score_list (list[list[np.ndarray]]) – List of list of scores, with shape n(number of predictions) X num_samples X num_classes

  • coeff_list (list[float]) – List of coefficients, with shape n.

Returns:

List of weighted scores.

Return type:

list[np.ndarray]
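The weighted sum in the formula above can be written out directly. The sketch below uses nested Python lists in place of arrays, and the two-classifier example values are made up for illustration; the function name weighted_score_sketch is hypothetical.

```python
from typing import List, Sequence


def weighted_score_sketch(
    score_list: Sequence[Sequence[Sequence[float]]],
    coeff_list: Sequence[float],
) -> List[List[float]]:
    """Sum score_i * coeff_i over classifiers, per sample and per class."""
    if len(score_list) != len(coeff_list):
        raise ValueError('need one coefficient per score set')
    num_samples = len(score_list[0])
    fused = []
    for s in range(num_samples):
        num_classes = len(score_list[0][s])
        fused.append([
            sum(coeff * scores[s][c]
                for scores, coeff in zip(score_list, coeff_list))
            for c in range(num_classes)
        ])
    return fused


# Two classifiers, two samples, two classes (made-up scores).
scores_a = [[0.2, 0.8], [0.6, 0.4]]
scores_b = [[0.4, 0.6], [0.1, 0.9]]
fused = weighted_score_sketch([scores_a, scores_b], [1.0, 0.5])
```

For the first sample this gives 0.2 + 0.5 * 0.4 = 0.4 for class 0 and 0.8 + 0.5 * 0.6 = 1.1 for class 1.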

mmaction.evaluation.functional.interpolated_precision_recall(precision, recall)[source]

Interpolated AP - VOCdevkit from VOC 2011.

Parameters:
  • precision (np.ndarray) – The precision of different thresholds.

  • recall (np.ndarray) – The recall of different thresholds.

Returns:

float: Average precision score.
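The VOC-style interpolation is commonly implemented by making precision monotonically non-increasing from right to left and then summing the area under the resulting step curve. The sketch below follows that common recipe on plain lists; it is an illustration with a hypothetical name (interpolated_ap_sketch), not a copy of the library function.

```python
from typing import Sequence


def interpolated_ap_sketch(
    precision: Sequence[float],
    recall: Sequence[float],
) -> float:
    """Area under the interpolated precision-recall curve."""
    mprecision = [0.0] + list(precision) + [0.0]
    mrecall = [0.0] + list(recall) + [1.0]
    # Replace each precision by the maximum precision to its right.
    for i in range(len(mprecision) - 2, -1, -1):
        mprecision[i] = max(mprecision[i], mprecision[i + 1])
    # Sum rectangle areas wherever recall changes.
    return sum(
        (mrecall[i + 1] - mrecall[i]) * mprecision[i + 1]
        for i in range(len(mrecall) - 1)
        if mrecall[i + 1] != mrecall[i]
    )


ap = interpolated_ap_sketch(precision=[1.0, 0.5], recall=[0.5, 1.0])
```

In this example the curve contributes 0.5 * 1.0 for recall up to 0.5 and 0.5 * 0.5 for the remainder.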

mmaction.evaluation.functional.mean_average_precision(scores, labels)[source]

Mean average precision for multi-label recognition.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores of different classes for each sample.

  • labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns:

The mean average precision.

Return type:

np.float64

mmaction.evaluation.functional.mean_class_accuracy(scores, labels)[source]

Calculate mean class accuracy.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int]) – Ground truth labels.

Returns:

Mean class accuracy.

Return type:

np.ndarray
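Mean class accuracy averages per-class recall, so rare classes count as much as frequent ones. The sketch below assumes the prediction is the arg-max of each score vector and averages only over classes that appear in the labels; the library may treat absent classes differently, and the name mean_class_accuracy_sketch is hypothetical.

```python
from typing import Dict, List, Sequence


def mean_class_accuracy_sketch(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Average of per-class accuracies over the classes present in labels."""
    hits: Dict[int, int] = {}
    totals: Dict[int, int] = {}
    for score, label in zip(scores, labels):
        pred = max(range(len(score)), key=score.__getitem__)  # arg-max
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + int(pred == label)
    per_class: List[float] = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


mca = mean_class_accuracy_sketch(
    scores=[[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]],
    labels=[0, 1, 1],
)
```

Here class 0 is always right (1/1) and class 1 is right half the time (1/2), so the mean is 0.75 even though plain accuracy is 2/3.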

mmaction.evaluation.functional.mmit_mean_average_precision(scores, labels)[source]

Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Times. The difference is that this method calculates average-precision for each sample and averages them among samples.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores of different classes for each sample.

  • labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns:

The MMIT style mean average precision.

Return type:

np.float64

mmaction.evaluation.functional.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[source]

Compute intersection over union between segments.

Parameters:
  • candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end]/[m x 2:=[init, end]].

  • target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].

  • calculate_overlap_self (bool) – Whether to calculate overlap_self (union / candidate_length) or not. Default: False.

Returns:

t_iou (np.ndarray): 1-dim array [n] / 2-dim array [n x m] with IoU ratio.

t_overlap_self (np.ndarray, optional): 1-dim array [n] / 2-dim array [n x m] with overlap_self, returned when calculate_overlap_self is True.
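For a single candidate segment, temporal IoU reduces to interval intersection over interval union. The sketch below covers only that 1-dim candidate case on plain tuples and leaves out overlap_self; the function name temporal_iou_sketch is hypothetical.

```python
from typing import List, Sequence, Tuple


def temporal_iou_sketch(
    candidate: Tuple[float, float],
    targets: Sequence[Tuple[float, float]],
) -> List[float]:
    """IoU between one [init, end] candidate and each target segment."""
    c_start, c_end = candidate
    ious = []
    for t_start, t_end in targets:
        # Negative overlap means the segments are disjoint.
        intersection = max(0.0, min(c_end, t_end) - max(c_start, t_start))
        union = (c_end - c_start) + (t_end - t_start) - intersection
        ious.append(intersection / union)
    return ious


ious = temporal_iou_sketch((0.0, 10.0), [(5.0, 15.0), (20.0, 30.0), (0.0, 10.0)])
```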

mmaction.evaluation.functional.read_labelmap(labelmap_file)[source]

Reads a labelmap without the dependency on protocol buffers.

Parameters:

labelmap_file – A file object containing a label map protocol buffer.

Returns:

labelmap: The label map in the form used by the object_detection_evaluation module, a list of {“id”: integer, “name”: classname} dicts.

class_ids: A set containing all of the valid class id integers.

mmaction.evaluation.functional.results2csv(results, out_file, custom_classes=None)[source]

Convert detection results to csv file.

mmaction.evaluation.functional.softmax(x, dim=1)[source]

Compute softmax values for each set of scores in x.
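For a single row of scores, a numerically stable softmax can be sketched as follows. The library function works on arrays along a chosen dim; this plain-list version, with the hypothetical name softmax_row, only illustrates the computation.

```python
import math
from typing import List, Sequence


def softmax_row(row: Sequence[float]) -> List[float]:
    """Softmax of one score vector, shifted by its max for stability."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_row([2.0, 1.0, 0.1])
uniform = softmax_row([0.0, 0.0])
```

Subtracting the maximum leaves the result unchanged but keeps math.exp from overflowing on large scores.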

mmaction.evaluation.functional.top_k_accuracy(scores, labels, topk=(1,))[source]

Calculate top k accuracy score.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int]) – Ground truth labels.

  • topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).

Returns:

Top k accuracy score for each k.

Return type:

list[float]
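Top-k accuracy counts a sample as correct when its label is among the k highest-scoring classes. The sketch below is a plain-Python illustration of that definition with made-up scores; the function name top_k_accuracy_sketch is hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_accuracy_sketch(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    topk: Tuple[int, ...] = (1,),
) -> List[float]:
    """Fraction of samples whose label is within the k best classes, per k."""
    results = []
    for k in topk:
        hits = 0
        for score, label in zip(scores, labels):
            ranked = sorted(range(len(score)), key=lambda c: score[c], reverse=True)
            hits += int(label in ranked[:k])
        results.append(hits / len(labels))
    return results


top1, top2 = top_k_accuracy_sketch(
    scores=[[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]],
    labels=[1, 1, 0],
    topk=(1, 2),
)
```

Only the first sample is right at k=1, while the second also becomes a hit at k=2, giving 1/3 and 2/3.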

mmaction.evaluation.functional.top_k_classes(scores, labels, k=10, mode='accurate')[source]

Calculate the K most accurate (or inaccurate) classes.

Given the prediction scores, ground truth label and top-k value, compute the top K accurate (inaccurate) classes.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int] | np.ndarray) – Ground truth labels.

  • k (int) – Top-k values. Default: 10.

  • mode (str) – Comparison mode for Top-k. Options are ‘accurate’ and ‘inaccurate’. Default: ‘accurate’.

Returns:

List of sorted (from high accuracy to low accuracy for ‘accurate’ mode, and from low accuracy to high accuracy for ‘inaccurate’ mode) top K classes in format of (label_id, acc_ratio).

Return type:

list

metrics

class mmaction.evaluation.metrics.ANetMetric(metric_type: str = 'TEM', collect_device: str = 'cpu', prefix: str | None = None, metric_options: dict = {}, dump_config: ConfigDict | dict = {'out': ''})[source]

ActivityNet dataset evaluation metric.

compute_ARAN(results: list) dict[source]

AR@AN evaluation metric.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

If metric_type is ‘TEM’, only dump middle results and do not compute any metrics.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

dump_results(results, version='VERSION 1.3')[source]

Save middle or final results to disk.

process(data_batch: Sequence[Tuple[Any, dict]], predictions: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • predictions (Sequence[dict]) – A batch of outputs from the model.

static proposals2json(results, show_progress=False)[source]

Convert all proposals to a final dict (JSON) format.

Parameters:
  • results (list[dict]) – All proposals.

  • show_progress (bool) – Whether to show the progress bar. Defaults to False.

Returns:

The final result dict. E.g.

dict(video-1=[dict(segment=[1.1, 2.0], score=0.9),
              dict(segment=[50.1, 129.3], score=0.6)])

Return type:

dict
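To make the returned structure concrete, here is a minimal sketch of such a conversion. The function name `proposals_to_dict_sketch` and the input field names `video_name` and `proposal_list` are assumptions for illustration only; the actual layout of each proposal record is not specified in this reference.

```python
def proposals_to_dict_sketch(results):
    """Group per-video proposals into {video_id: [dict(segment=..., score=...)]}."""
    result_dict = {}
    for result in results:
        # 'video_name' and 'proposal_list' are assumed (hypothetical) keys.
        video_id = result['video_name']
        result_dict[video_id] = [
            dict(segment=list(p['segment']), score=float(p['score']))
            for p in result['proposal_list']
        ]
    return result_dict


results = [
    dict(video_name='video-1',
         proposal_list=[dict(segment=(1.1, 2.0), score=0.9),
                        dict(segment=(50.1, 129.3), score=0.6)]),
]
final = proposals_to_dict_sketch(results)
```

Because the output holds only lists, floats, and strings, it can be passed directly to `json.dump`.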

class mmaction.evaluation.metrics.AVAMetric(ann_file: str, exclude_file: str, label_file: str, options: Tuple[str] = ('mAP',), action_thr: float = 0.002, num_classes: int = 81, custom_classes: List[int] | None = None, collect_device: str = 'cpu', prefix: str | None = None)[source]

AVA evaluation metric.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, dict]], data_samples: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.AccMetric(metric_list: str | Tuple[str] | None = ('top_k_accuracy', 'mean_class_accuracy'), collect_device: str = 'cpu', metric_options: Dict | None = {'top_k_accuracy': {'topk': (1, 5)}}, prefix: str | None = None)[source]

Accuracy evaluation metric.

calculate(preds: List[ndarray], labels: List[int | ndarray]) Dict[source]

Compute the metrics from processed results.

Parameters:
  • preds (list[np.ndarray]) – List of the prediction scores.

  • labels (list[int | np.ndarray]) – List of the labels.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict
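The default metric_list includes 'mean_class_accuracy'. As a hedged sketch of what that quantity usually means (the average of per-class accuracies, so rare classes weigh as much as frequent ones), consider the following. The helper name is hypothetical, and counting a class without samples as 0.0 is an assumption of this sketch rather than a documented behavior.

```python
def mean_class_accuracy_sketch(scores, labels):
    """Average of per-class accuracies computed from argmax predictions."""
    num_classes = len(scores[0])
    hits = [0] * num_classes
    counts = [0] * num_classes
    for sample_scores, label in zip(scores, labels):
        pred = max(range(num_classes), key=lambda c: sample_scores[c])
        counts[label] += 1
        hits[label] += int(pred == label)
    # Assumption: classes with no samples contribute 0.0.
    per_class = [h / n if n else 0.0 for h, n in zip(hits, counts)]
    return sum(per_class) / num_classes


# Toy data (hypothetical): class 0 has three samples, class 1 has one.
scores = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
labels = [0, 0, 0, 1]
mca = mean_class_accuracy_sketch(scores, labels)
```

Class 0 scores 2/3 and class 1 scores 1/1, so the mean class accuracy is 5/6, whereas the plain accuracy over samples would be 3/4.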

compute_metrics(results: List) Dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, Dict]], data_samples: Sequence[Dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[dict]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.
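A config fragment for this metric might look like the sketch below. The values simply restate the defaults from the AccMetric signature above; the `val_evaluator` / `test_evaluator` keys follow the pattern this reference uses for ConfusionMatrix, and reusing the same dict for both is an illustrative choice.

```python
# Evaluator config sketch: values mirror the documented AccMetric defaults.
val_evaluator = dict(
    type='AccMetric',
    metric_list=('top_k_accuracy', 'mean_class_accuracy'),
    metric_options=dict(top_k_accuracy=dict(topk=(1, 5))),
)
test_evaluator = val_evaluator
```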

class mmaction.evaluation.metrics.ConfusionMatrix(num_classes: int | None = None, collect_device: str = 'cpu', prefix: str | None = None)[source]

A metric to calculate confusion matrix for single-label tasks.

Parameters:
  • num_classes (int, optional) – The number of classes. Defaults to None.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

Examples

  1. The basic usage.

>>> import torch
>>> from mmaction.evaluation import ConfusionMatrix
>>> y_pred = [0, 1, 1, 3]
>>> y_true = [0, 2, 1, 3]
>>> ConfusionMatrix.calculate(y_pred, y_true, num_classes=4)
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]])
>>> # plot the confusion matrix
>>> import matplotlib.pyplot as plt
>>> y_score = torch.rand((1000, 10))
>>> y_true = torch.randint(10, (1000, ))
>>> matrix = ConfusionMatrix.calculate(y_score, y_true)
>>> ConfusionMatrix().plot(matrix)
>>> plt.show()
  2. In the config file

val_evaluator = dict(type='ConfusionMatrix')
test_evaluator = dict(type='ConfusionMatrix')

static calculate(pred, target, num_classes=None) dict[source]

Calculate the confusion matrix for single-label task.

Parameters:
  • pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. It can be labels (N, ), or scores of every class (N, C).

  • target (torch.Tensor | np.ndarray | Sequence) – The target of each prediction with shape (N, ).

  • num_classes (Optional, int) – The number of classes. If the pred is label instead of scores, this argument is required. Defaults to None.

Returns:

The confusion matrix.

Return type:

torch.Tensor
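For label inputs, the counting behind the matrix can be sketched without any dependencies. This is an illustration only (the helper name is hypothetical and it returns nested lists rather than a torch.Tensor); rows index the ground truth and columns index the prediction, which matches the output shown in the class example above.

```python
def confusion_matrix_sketch(pred, target, num_classes):
    """Count (target, pred) pairs; rows are ground truth, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1
    return matrix


# Same labels as in the class example above.
matrix = confusion_matrix_sketch([0, 1, 1, 3], [0, 2, 1, 3], num_classes=4)
```

The sample with target 2 and prediction 1 lands in row 2, column 1, which is why the third row reads [0, 1, 0, 0].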

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

static plot(confusion_matrix: Tensor, include_values: bool = False, cmap: str = 'viridis', classes: List[str] | None = None, colorbar: bool = True, show: bool = True)[source]

Draw a confusion matrix by matplotlib.

Modified from Scikit-Learn

Parameters:
  • confusion_matrix (torch.Tensor) – The confusion matrix to draw.

  • include_values (bool) – Whether to draw the values in the figure. Defaults to False.

  • cmap (str) – The color map to use. Defaults to use “viridis”.

  • classes (list[str], optional) – The names of categories. Defaults to None, which means to use index number.

  • colorbar (bool) – Whether to show the colorbar. Defaults to True.

  • show (bool) – Whether to show the figure immediately. Defaults to True.

process(data_batch, data_samples: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Any) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.MultiSportsMetric(ann_file: str, metric_options: dict | None = {'F_mAP': {'thr': 0.5}, 'V_mAP': {'all': True, 'thr': (0.2, 0.5), 'tube_thr': 15}}, collect_device: str = 'cpu', verbose: bool = True, prefix: str | None = None)[source]

MAP Metric for MultiSports dataset.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, dict]], data_samples: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.RecallatTopK(topK_list: Tuple[int] = (1, 5), threshold: float = 0.5, collect_device: str = 'cpu', prefix: str | None = None)[source]

ActivityNet dataset evaluation metric.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, dict]], predictions: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • predictions (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.ReportVQA(file_path: str, collect_device: str = 'cpu', prefix: str | None = None)[source]

Dump VQA result to the standard json format for VQA evaluation.

Parameters:
  • file_path (str) – The file path to save the result file.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Should be modified according to the retrieval_type for unambiguous results. Defaults to None.

compute_metrics(results: List)[source]

Dump the result to json file.

process(data_batch, data_samples) None[source]

Transfer tensors in predictions to CPU.

class mmaction.evaluation.metrics.RetrievalMetric(metric_list: Tuple[str] | str = ('R1', 'R5', 'R10', 'MdR', 'MnR'), collect_device: str = 'cpu', prefix: str | None = None)[source]

Metric for video retrieval task.

Parameters:
  • metric_list (str | tuple[str]) – The list of the metrics to be computed. Defaults to ('R1', 'R5', 'R10', 'MdR', 'MnR').

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

compute_metrics(results: List) Dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict
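The default metric names suggest recall at ranks 1, 5 and 10 plus the median and mean rank of the correct match. A minimal sketch of those quantities is given below. It assumes a square similarity matrix in which row i's correct match is column i, reports recalls as percentages, and uses 1-based ranks; all of these are assumptions of the sketch, as is the helper name.

```python
from statistics import mean, median


def retrieval_metrics_sketch(similarity):
    """Compute R1/R5/R10 (percent) and median/mean rank (1-based)."""
    ranks = []
    for i, row in enumerate(similarity):
        # 0-based rank of the correct item: number of candidates scoring higher.
        ranks.append(sum(1 for s in row if s > row[i]))
    n = len(ranks)
    metrics = {f'R{k}': 100.0 * sum(r < k for r in ranks) / n
               for k in (1, 5, 10)}
    metrics['MdR'] = median(ranks) + 1
    metrics['MnR'] = mean(ranks) + 1
    return metrics


# Toy similarity matrix (hypothetical): correct matches lie on the diagonal.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.2, 0.4, 0.1],
]
m = retrieval_metrics_sketch(similarity)
```

The three queries rank their correct match first, second and third respectively, so only one of three counts toward R1 while the median and mean rank are both 2.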

process(data_batch: Dict | None, data_samples: Sequence[Dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (dict, optional) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.RetrievalRecall(topk: int | Sequence[int], collect_device: str = 'cpu', prefix: str | None = None)[source]

Recall evaluation metric for image retrieval.

Parameters:
  • topk (int | Sequence[int]) – If the ground truth label matches one of the best k predictions, the sample will be regarded as a positive prediction. If the parameter is a tuple, all of top-k recall will be calculated and outputted together. Defaults to 1.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

static calculate(pred: ndarray | Tensor, target: ndarray | Tensor, topk: int | Sequence[int], pred_indices: bool = False, target_indices: bool = False) float[source]

Calculate the average recall.

Parameters:
  • pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. A torch.Tensor or np.ndarray with shape (N, M) or a sequence of index/onehot format labels.

  • target (torch.Tensor | np.ndarray | Sequence) – The target of each prediction. A torch.Tensor or np.ndarray with shape (N, M) or a sequence of index/onehot format labels.

  • topk (int, Sequence[int]) – Predictions with the k-th highest scores are considered as positive.

  • pred_indices (bool) – Whether the pred is a sequence of category index labels. Defaults to False.

  • target_indices (bool) – Whether the target is a sequence of category index labels. Defaults to False.

Returns:

The average recalls.

Return type:

List[float]
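The recall described above can be sketched as follows, under stated assumptions: predictions are given as score rows, targets as lists of relevant indices, and the result is a plain fraction per k. The helper name is hypothetical, and the real method may scale or format its output differently.

```python
def retrieval_recall_sketch(pred_scores, target_indices, topk):
    """Fraction of samples with at least one relevant item among the top-k scores."""
    ks = (topk,) if isinstance(topk, int) else tuple(topk)
    recalls = []
    for k in ks:
        positives = 0
        for scores, relevant in zip(pred_scores, target_indices):
            ranked = sorted(range(len(scores)),
                            key=lambda j: scores[j], reverse=True)
            # A sample is positive if any relevant index is in the k best.
            if set(ranked[:k]) & set(relevant):
                positives += 1
        recalls.append(positives / len(pred_scores))
    return recalls


# Toy data (hypothetical): 3 queries over 3 candidates.
pred_scores = [[0.1, 0.6, 0.3], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
target_indices = [[1], [2], [1]]
recalls = retrieval_recall_sketch(pred_scores, target_indices, topk=(1, 2))
```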

compute_metrics(results: List)[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

Dict

process(data_batch: Sequence[dict], data_samples: Sequence[dict])[source]

Process one batch of data and predictions.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[dict]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.VQAAcc(full_score_weight: float = 0.3, collect_device: str = 'cpu', prefix: str | None = None)[source]

VQA Acc metric.

Parameters:
  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Should be modified according to the retrieval_type for unambiguous results. Defaults to None.

compute_metrics(results: List)[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

Dict

process(data_batch, data_samples)[source]

Process one batch of data samples.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.VQAMCACC(collect_device: str = 'cpu', prefix: str | None = None)[source]

VQA multiple choice Acc metric.

Parameters:
  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Should be modified according to the retrieval_type for unambiguous results. Defaults to None.

compute_metrics(results: List)[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

Dict

process(data_batch, data_samples)[source]

Process one batch of data samples.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

mmaction.models

backbones

class mmaction.models.backbones.AAGCN(graph_cfg: Dict, in_channels: int = 3, base_channels: int = 64, data_bn_type: str = 'MVC', num_person: int = 2, num_stages: int = 10, inflate_stages: List[int] = [5, 8], down_stages: List[int] = [5, 8], init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]

AAGCN backbone, the attention-enhanced version of 2s-AGCN.

Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks. More details can be found in the paper.

Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. More details can be found in the paper.

Parameters:
  • graph_cfg (dict) – Config for building the graph.

  • in_channels (int) – Number of input channels. Defaults to 3.

  • base_channels (int) – Number of base channels. Defaults to 64.

  • data_bn_type (str) – Type of the data bn layer. Defaults to 'MVC'.

  • num_person (int) – Maximum number of people. Only used when data_bn_type == ‘MVC’. Defaults to 2.

  • num_stages (int) – Total number of stages. Defaults to 10.

  • inflate_stages (list[int]) – Stages to inflate the number of channels. Defaults to [5, 8].

  • down_stages (list[int]) – Stages to perform downsampling in the time dimension. Defaults to [5, 8].

  • init_cfg (dict or list[dict], optional) – Config to control the initialization. Defaults to None.

Examples

>>> import torch
>>> from mmaction.models import AAGCN
>>> from mmaction.utils import register_all_modules
>>>
>>> register_all_modules()
>>> mode = 'stgcn_spatial'
>>> batch_size, num_person, num_frames = 2, 2, 150
>>>
>>> # openpose-18 layout
>>> num_joints = 18
>>> model = AAGCN(graph_cfg=dict(layout='openpose', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 18])
>>>
>>> # nturgb+d layout
>>> num_joints = 25
>>> model = AAGCN(graph_cfg=dict(layout='nturgb+d', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 25])
>>>
>>> # coco layout
>>> num_joints = 17
>>> model = AAGCN(graph_cfg=dict(layout='coco', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 17])
>>>
>>> # custom settings
>>> # disable the attention module to degenerate AAGCN to AGCN
>>> model = AAGCN(graph_cfg=dict(layout='coco', mode=mode),
...               gcn_attention=False)
>>> model.init_weights()
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 17])

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

class mmaction.models.backbones.C2D(depth: int, pretrained: str | None = None, torchvision_pretrain: bool = True, in_channels: int = 3, num_stages: int = 4, out_indices: Sequence[int] = (3,), strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), style: str = 'pytorch', frozen_stages: int = -1, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, partial_bn: bool = False, with_cp: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': 'BatchNorm2d', 'val': 1.0}])[source]

C2D backbone.

Compared to ResNet-50, a temporal-pool is added after the first bottleneck. Detailed structure is kept same as “video-nonlocal-net” repo. Please refer to https://github.com/facebookresearch/video-nonlocal-net/blob/main/scripts/run_c2d_baseline_400k.sh. Please note that there are some improvements compared to “Non-local Neural Networks” paper (https://arxiv.org/abs/1711.07971). Differences are noted at https://github.com/facebookresearch/video-nonlocal-net#modifications-for-improving-speed.

forward(x: Tensor) Tensor | Tuple[Tensor][source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Union[torch.Tensor or Tuple[torch.Tensor]]

class mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]

C3D backbone.

Parameters:
  • pretrained (str | None) – Name of pretrained model.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.

  • act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.

  • out_dim (int) – The dimension of last layer feature (after flatten). Depends on the input shape. Default: 8192.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for initialization of fc layers. Default: 0.005.

forward(x)[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data. The size of x is (num_batches, 3, 16, 112, 112).

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor
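The out_dim default of 8192 is said to depend on the input shape. The arithmetic below is a sketch of why a (16, 112, 112) clip would flatten to 8192 features. It assumes the pooling schedule of the classic C3D network (five max-pool layers, the first spatial-only, the last padded by 1 in height and width) and 512 channels in the last convolution stage; these layer details are assumptions and are not stated in this reference.

```python
def pool_out(size, kernel, stride, padding=0):
    """Output length of a pooling layer along one dimension (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed C3D pooling schedule: temporal kernel/stride, spatial kernel/stride,
# and spatial padding for each of the five pooling layers.
pools = [
    dict(t=1, s=2, pad=0),
    dict(t=2, s=2, pad=0),
    dict(t=2, s=2, pad=0),
    dict(t=2, s=2, pad=0),
    dict(t=2, s=2, pad=1),  # last pool pads height and width by 1
]

t, h, w = 16, 112, 112  # input size documented for forward()
for p in pools:
    t = pool_out(t, p['t'], p['t'])
    h = pool_out(h, p['s'], p['s'], p['pad'])
    w = pool_out(w, p['s'], p['s'], p['pad'])

channels = 512  # assumed width of the last conv stage
out_dim = channels * t * h * w
```

Under these assumptions the feature map ends at 1 x 4 x 4, and 512 * 1 * 4 * 4 gives 8192. A different clip length or crop size would change the product, so out_dim would need to be adjusted accordingly.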

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.MViT(arch: str = 'base', spatial_size: int = 224, temporal_size: int = 16, in_channels: int = 3, pretrained: str | None = None, pretrained_type: str | None = None, out_scales: int | Sequence[int] = -1, drop_path_rate: float = 0.0, use_abs_pos_embed: bool = False, interpolate_mode: str = 'trilinear', pool_kernel: tuple = (3, 3, 3), dim_mul: int = 2, head_mul: int = 2, adaptive_kv_stride: tuple = (1, 8, 8), rel_pos_embed: bool = True, residual_pooling: bool = True, dim_mul_in_attention: bool = True, with_cls_token: bool = True, output_cls_token: bool = True, rel_pos_zero_init: bool = False, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_cfg: Dict = {'eps': 1e-06, 'type': 'LN'}, patch_cfg: Dict = {'kernel_size': (3, 7, 7), 'padding': (1, 3, 3), 'stride': (2, 4, 4)}, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': ['Conv2d', 'Conv3d'], 'std': 0.02}, {'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.02}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.02}])[source]

Multi-scale ViT v2.

A PyTorch implementation of: MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Inspiration from the official implementation and the mmclassification implementation

Parameters:
  • arch (str | dict) –

    MViT architecture. If use string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If use dict, it should have below keys:

    • embed_dims (int): The dimensions of embedding.

    • num_layers (int): The number of layers.

    • num_heads (int): The number of heads in attention modules of the initial layer.

    • downscale_indices (List[int]): The layer indices to downscale the feature map.

    Defaults to ‘base’.

  • spatial_size (int) – The expected input spatial_size shape. Defaults to 224.

  • temporal_size (int) – The expected input temporal_size shape. Defaults to 16.

  • in_channels (int) – The num of input channels. Defaults to 3.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • pretrained_type (str, optional) – Type of pretrained model. choose from ‘imagenet’, ‘maskfeat’, None. Defaults to None, which means load from same architecture.

  • out_scales (int | Sequence[int]) – The output scale indices. They should not exceed the length of downscale_indices. Defaults to -1, which means the last scale.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.

  • use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.

  • interpolate_mode (str) – Select the interpolate mode for absolute position embedding vector resize. Defaults to “trilinear”.

  • pool_kernel (tuple) – kernel size for qkv pooling layers. Defaults to (3, 3, 3).

  • dim_mul (int) – The magnification for embed_dims in the downscale layers. Defaults to 2.

  • head_mul (int) – The magnification for num_heads in the downscale layers. Defaults to 2.

  • adaptive_kv_stride (tuple) – The stride size for kv pooling in the initial layer. Defaults to (1, 8, 8).

  • rel_pos_embed (bool) – Whether to enable the spatial and temporal relative position embedding. Defaults to True.

  • residual_pooling (bool) – Whether to enable the residual connection after attention pooling. Defaults to True.

  • dim_mul_in_attention (bool) – Whether to multiply the embed_dims in attention layers. If False, multiply it in MLP layers. Defaults to True.

  • with_cls_token (bool) – Whether concatenating class token into video tokens as transformer input. Defaults to True.

  • output_cls_token (bool) – Whether output the cls_token. If set True, with_cls_token must be True. Defaults to True.

  • rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters. Defaults to False.

  • mlp_ratio (float) – Ratio of hidden dimensions in MLP layers. Defaults to 4.0.

  • qkv_bias (bool) – enable bias for qkv if True. Defaults to True.

  • norm_cfg (dict) – Config dict for normalization layer for all output features. Defaults to dict(type='LN', eps=1e-6).

  • patch_cfg (dict) –

    Config dict for the patch embedding layer. Defaults to dict(kernel_size=(3, 7, 7), stride=(2, 4, 4), padding=(1, 3, 3)).

  • init_cfg (dict or list[dict], optional) – The Config for initialization. Defaults to [ dict(type='TruncNormal', layer=['Conv2d', 'Conv3d'], std=0.02), dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.02), dict(type='Constant', layer='LayerNorm', val=1., bias=0.02), ]

Examples

>>> import torch
>>> from mmaction.registry import MODELS
>>> from mmaction.utils import register_all_modules
>>> register_all_modules()
>>>
>>> cfg = dict(type='MViT', arch='tiny', out_scales=[0, 1, 2, 3])
>>> model = MODELS.build(cfg)
>>> model.init_weights()
>>> inputs = torch.rand(1, 3, 16, 224, 224)
>>> outputs = model(inputs)
>>> for i, output in enumerate(outputs):
...     print(f'scale{i}: {output.shape}')
scale0: torch.Size([1, 96, 8, 56, 56])
scale1: torch.Size([1, 192, 8, 28, 28])
scale2: torch.Size([1, 384, 8, 14, 14])
scale3: torch.Size([1, 768, 8, 7, 7])

forward(x: Tensor) Tuple[Tensor | List[Tensor]][source]

Forward the MViT.

init_weights(pretrained: str | None = None) None[source]

Initialize the weights.
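The output shapes printed in the MViT example can be reproduced with simple arithmetic. The sketch below applies the documented default patch_cfg to a (16, 224, 224) input and then assumes that each later scale halves height and width, keeps the temporal length, and multiplies the channels by dim_mul. The starting width of 96 is read off scale0 in the example; both it and the halving rule are assumptions of this sketch, not statements about the implementation.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Documented default patch embedding config.
patch_cfg = dict(kernel_size=(3, 7, 7), stride=(2, 4, 4), padding=(1, 3, 3))
input_size = (16, 224, 224)  # (T, H, W) as in the example

t, h, w = (
    conv_out(n, k, s, p)
    for n, k, s, p in zip(input_size, patch_cfg['kernel_size'],
                          patch_cfg['stride'], patch_cfg['padding'])
)

embed_dims, dim_mul = 96, 2  # 96 is taken from scale0 in the example output
shapes = [
    (1, embed_dims * dim_mul ** i, t, h // 2 ** i, w // 2 ** i)
    for i in range(4)
]
```

With these assumptions `shapes` matches the four torch.Size lines of the example, which is a quick way to sanity-check a custom spatial_size or patch_cfg before building the model.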

class mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7,), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': ['GroupNorm', '_BatchNorm'], 'val': 1.0}])[source]

MobileNetV2 backbone.

Parameters:
  • pretrained (str | None) – Name of pretrained model. Defaults to None.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Defaults to 1.0.

  • out_indices (None or Sequence[int]) – Output from which stages. Defaults to (7, ).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Note that the last stage in MobileNetV2 is conv2. Defaults to -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Defaults to dict(type='Conv').

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='BN2d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU6', inplace=True).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='Kaiming', layer='Conv2d',), dict(type='Constant', layer=['GroupNorm', '_BatchNorm'], val=1.) ].

forward(x)[source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Tensor or Tuple[Tensor]

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters:
  • out_channels (int) – out_channels of block.

  • num_blocks (int) – number of blocks.

  • stride (int) – stride of the first block. Defaults to 1.

  • expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Defaults to 6.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, pretrained2d=True, **kwargs)[source]

MobileNetV2 backbone for TSM.

Parameters:
  • num_segments (int) – Number of frame segments. Defaults to 8.

  • is_shift (bool) – Whether to make temporal shift in reset layers. Defaults to True.

  • shift_div (int) – Number of div for shift. Defaults to 8.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • **kwargs (keyword arguments, optional) – Arguments for MobileNetV2.

init_structure()[source]

Initiate the parameters either from existing checkpoint or from scratch.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_shift()[source]

Make temporal shift for some layers.
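To illustrate what a temporal shift with shift_div does, here is a dependency-free sketch on a T x C grid of values. It follows the usual TSM idea: the first C // shift_div channels take their value from the next frame, the following C // shift_div channels take it from the previous frame, and the remaining channels are untouched, with zeros filling the borders. The helper name and the exact shift directions are assumptions of this sketch, not a description of the library code.

```python
def temporal_shift_sketch(frames, shift_div=8):
    """Shift a fraction of channels along time; frames is a T x C nested list."""
    num_frames, channels = len(frames), len(frames[0])
    fold = channels // shift_div
    out = [list(frame) for frame in frames]
    for t in range(num_frames):
        for c in range(fold):
            # First fold: read from the next frame, zero at the last frame.
            out[t][c] = frames[t + 1][c] if t + 1 < num_frames else 0
        for c in range(fold, 2 * fold):
            # Second fold: read from the previous frame, zero at the first frame.
            out[t][c] = frames[t - 1][c] if t > 0 else 0
    return out


# Toy input (hypothetical): 3 frames, 8 channels, value = 10 * t + c.
frames = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift_sketch(frames, shift_div=8)
```

With 8 channels and shift_div=8, exactly one channel moves in each direction and six stay in place, so most of the per-frame information is preserved while neighbouring frames exchange a small slice of features.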

class mmaction.models.backbones.OmniResNet(layers: List[int] = [3, 4, 6, 3], pretrain_2d: str | None = None, init_cfg: ConfigDict | dict | None = None)[source]

Omni-ResNet that accepts both image and video inputs.

Parameters:
  • layers (List[int]) – number of layers in each residual stages. Defaults to [3, 4, 6, 3].

  • pretrain_2d (str, optional) – path to the 2D pretraining checkpoints. Defaults to None.

  • init_cfg (dict or ConfigDict, optional) – The Config for initialization. Defaults to None.

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Accept both 3D (BCTHW for videos) and 2D (BCHW for images) tensors.

forward_2d(x: Tensor) Tensor[source]

Forward call for 2D tensors.

class mmaction.models.backbones.RGBPoseConv3D(pretrained: str | None = None, speed_ratio: int = 4, channel_ratio: int = 4, rgb_detach: bool = False, pose_detach: bool = False, rgb_drop_path: float = 0, pose_drop_path: float = 0, rgb_pathway: Dict = {'base_channels': 64, 'conv1_kernel': (1, 7, 7), 'fusion_kernel': 7, 'inflate': (0, 0, 1, 1), 'lateral': True, 'lateral_activate': (0, 0, 1, 1), 'lateral_infl': 1, 'num_stages': 4, 'with_pool2': False}, pose_pathway: Dict = {'base_channels': 32, 'conv1_kernel': (1, 7, 7), 'conv1_stride_s': 1, 'conv1_stride_t': 1, 'dilations': (1, 1, 1), 'fusion_kernel': 7, 'in_channels': 17, 'inflate': (0, 1, 1), 'lateral': True, 'lateral_activate': (0, 1, 1), 'lateral_infl': 16, 'lateral_inv': True, 'num_stages': 3, 'out_indices': (2,), 'pool1_stride_s': 1, 'pool1_stride_t': 1, 'spatial_strides': (2, 2, 2), 'stage_blocks': (4, 6, 3), 'temporal_strides': (1, 1, 1), 'with_pool2': False}, init_cfg: Dict | List[Dict] | None = None)[source]

RGBPoseConv3D backbone.

Parameters:
  • pretrained (str) – The file path to a pretrained model. Defaults to None.

  • speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Defaults to 4.

  • channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Defaults to 4.

  • rgb_detach (bool) – Whether to detach the gradients from the pose path. Defaults to False.

  • pose_detach (bool) – Whether to detach the gradients from the rgb path. Defaults to False.

  • rgb_drop_path (float) – The drop rate for dropping the features from the pose path. Defaults to 0.

  • pose_drop_path (float) – The drop rate for dropping the features from the rgb path. Defaults to 0.

  • rgb_pathway (dict) – Configuration of rgb branch. Defaults to dict(num_stages=4, lateral=True, lateral_infl=1, lateral_activate=(0, 0, 1, 1), fusion_kernel=7, base_channels=64, conv1_kernel=(1, 7, 7), inflate=(0, 0, 1, 1), with_pool2=False).

  • pose_pathway (dict) – Configuration of pose branch. Defaults to dict(num_stages=3, stage_blocks=(4, 6, 3), lateral=True, lateral_inv=True, lateral_infl=16, lateral_activate=(0, 1, 1), fusion_kernel=7, in_channels=17, base_channels=32, out_indices=(2, ), conv1_kernel=(1, 7, 7), conv1_stride_s=1, conv1_stride_t=1, pool1_stride_s=1, pool1_stride_t=1, inflate=(0, 1, 1), spatial_strides=(2, 2, 2), temporal_strides=(1, 1, 1), with_pool2=False).

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(imgs: Tensor, heatmap_imgs: Tensor) tuple[source]

Defines the computation performed at every call.

Parameters:
  • imgs (torch.Tensor) – The input data.

  • heatmap_imgs (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

tuple[torch.Tensor]

init_weights() None[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.ResNet(depth: int, pretrained: str | None = None, torchvision_pretrain: bool = True, in_channels: int = 3, num_stages: int = 4, out_indices: Sequence[int] = (3,), strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), style: str = 'pytorch', frozen_stages: int = -1, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, partial_bn: bool = False, with_cp: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': 'BatchNorm2d', 'val': 1.0}])[source]

ResNet backbone.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • torchvision_pretrain (bool) – Whether to load pretrained model from torchvision. Defaults to True.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • num_stages (int) – Resnet stages. Defaults to 4.

  • out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).

  • strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).

  • style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to pytorch.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.

  • conv_cfg (dict or ConfigDict) – Config for conv layers. Defaults to dict(type='Conv').

  • norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. required keys are type and requires_grad. Defaults to dict(type='BN2d', requires_grad=True).

  • act_cfg (Union[dict, ConfigDict]) – Config for activate layers. Defaults to dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • partial_bn (bool) – Whether to use partial bn. Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='Kaiming', layer='Conv2d'), dict(type='Constant', layer='BatchNorm2d', val=1.) ].
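The style option only changes which convolution inside a bottleneck block carries the stride. A minimal sketch of that rule follows, assuming a bottleneck with a 1x1 conv1 followed by a 3x3 conv2; `bottleneck_conv_strides` and the dictionary keys are hypothetical names, not part of mmaction.

```python
def bottleneck_conv_strides(style: str, stride: int = 2) -> dict:
    """Return the stride assigned to each conv for a given style.

    'pytorch': the 3x3 conv (conv2) carries the stride.
    'caffe':   the first 1x1 conv (conv1) carries the stride.
    """
    if style not in ("pytorch", "caffe"):
        raise ValueError(f"style must be 'pytorch' or 'caffe', got {style!r}")
    if style == "pytorch":
        return {"conv1_1x1": 1, "conv2_3x3": stride}
    return {"conv1_1x1": stride, "conv2_3x3": 1}


print(bottleneck_conv_strides("pytorch"))
print(bottleneck_conv_strides("caffe"))
```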

forward(x: Tensor) Tensor | Tuple[Tensor][source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Union[torch.Tensor, Tuple[torch.Tensor]]

init_weights() None[source]

Initiate the parameters either from existing checkpoint or from scratch.

train(mode: bool = True) None[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs)[source]

ResNet (2+1)d backbone.

This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition

forward(x)[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

class mmaction.models.backbones.ResNet3d(depth: int = 50, pretrained: str | None = None, stage_blocks: Tuple | None = None, pretrained2d: bool = True, in_channels: int = 3, num_stages: int = 4, base_channels: int = 64, out_indices: Sequence[int] = (3,), spatial_strides: Sequence[int] = (1, 2, 2, 2), temporal_strides: Sequence[int] = (1, 1, 1, 1), dilations: Sequence[int] = (1, 1, 1, 1), conv1_kernel: Sequence[int] = (3, 7, 7), conv1_stride_s: int = 2, conv1_stride_t: int = 1, pool1_stride_s: int = 2, pool1_stride_t: int = 1, with_pool1: bool = True, with_pool2: bool = True, style: str = 'pytorch', frozen_stages: int = -1, inflate: Sequence[int] = (1, 1, 1, 1), inflate_style: str = '3x1x1', conv_cfg: Dict = {'type': 'Conv3d'}, norm_cfg: Dict = {'requires_grad': True, 'type': 'BN3d'}, act_cfg: Dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, with_cp: bool = False, non_local: Sequence[int] = (0, 0, 0, 0), non_local_cfg: Dict = {}, zero_init_residual: bool = True, init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]

ResNet 3d backbone.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}. Defaults to 50.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • stage_blocks (tuple, optional) – Number of residual blocks in each res layer (stage). Defaults to None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • num_stages (int) – Resnet stages. Defaults to 4.

  • base_channels (int) – Channel num of stem output features. Defaults to 64.

  • out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Defaults to (1, 2, 2, 2).

  • temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Defaults to (1, 1, 1, 1).

  • dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).

  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Defaults to (3, 7, 7).

  • conv1_stride_s (int) – Spatial stride of the first conv layer. Defaults to 2.

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Defaults to 1.

  • pool1_stride_s (int) – Spatial stride of the first pooling layer. Defaults to 2.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Defaults to 1.

  • with_pool1 (bool) – Whether to use pool1. Defaults to True.

  • with_pool2 (bool) – Whether to use pool2. Defaults to True.

  • style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.

  • inflate (Sequence[int]) – Inflate dims of each block. Defaults to (1, 1, 1, 1).

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to 3x1x1.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Defaults to dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Defaults to (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Defaults to dict().

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
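The stride parameters above combine multiplicatively into the overall spatial downsampling of the backbone. The sketch below computes that nominal factor from the documented defaults; `nominal_spatial_stride` is a hypothetical helper, and it deliberately covers the spatial axis only, since the temporal effect of pool2 is not specified in this reference.

```python
from math import prod
from typing import Sequence


def nominal_spatial_stride(
    conv1_stride_s: int = 2,
    pool1_stride_s: int = 2,
    with_pool1: bool = True,
    spatial_strides: Sequence[int] = (1, 2, 2, 2),
    num_stages: int = 4,
) -> int:
    """Multiply the stem, pooling, and per-stage spatial strides."""
    total = conv1_stride_s
    if with_pool1:
        total *= pool1_stride_s
    # Only the first num_stages entries are used by the backbone.
    total *= prod(spatial_strides[:num_stages])
    return total


print(nominal_spatial_stride())  # 32 with the defaults listed above
```

With the defaults, a 224x224 frame would map to a 7x7 feature map, ignoring any rounding from padding.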

forward(x: Tensor) Tensor | Tuple[Tensor][source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor or tuple[torch.Tensor]

inflate_weights(logger: MMLogger) None[source]

Inflate weights.
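The reference does not spell out how weights are inflated. A common approach, assumed here and not confirmed by this page, is the I3D-style rule: repeat each 2D kernel along the new time axis and divide by the temporal kernel size so the response to a static clip is unchanged. `inflate_kernel_2d` is a hypothetical pure-Python illustration using nested lists in place of tensors.

```python
from typing import List

Kernel2D = List[List[float]]


def inflate_kernel_2d(kernel_2d: Kernel2D, kernel_t: int) -> List[Kernel2D]:
    """Copy a 2D kernel kernel_t times along time, scaled by 1 / kernel_t."""
    if kernel_t < 1:
        raise ValueError("kernel_t must be positive")
    return [
        [[w / kernel_t for w in row] for row in kernel_2d]
        for _ in range(kernel_t)
    ]


kernel = [[1.0, 2.0], [3.0, 4.0]]
inflated = inflate_kernel_2d(kernel, kernel_t=4)
# Summing over the time axis recovers the original 2D kernel.
summed = [
    [sum(inflated[t][i][j] for t in range(4)) for j in range(2)]
    for i in range(2)
]
print(summed)
```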

init_weights(pretrained: str | None = None) None[source]

Initialize weights.

static make_res_layer(block: Module, inplanes: int, planes: int, blocks: int, spatial_stride: int | Sequence[int] = 1, temporal_stride: int | Sequence[int] = 1, dilation: int = 1, style: str = 'pytorch', inflate: int | Sequence[int] = 1, inflate_style: str = '3x1x1', non_local: int | Sequence[int] = 0, non_local_cfg: Dict = {}, norm_cfg: Dict | None = None, act_cfg: Dict | None = None, conv_cfg: Dict | None = None, with_cp: bool = False, **kwargs) Module[source]

Build residual layer for ResNet3D.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Defaults to 1.

  • temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Defaults to 1.

  • dilation (int) – Spacing between kernel elements. Defaults to 1.

  • style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.

  • inflate (int | Sequence[int]) – Determine whether to inflate for each block. Defaults to 1.

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to '3x1x1'.

  • non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Defaults to 0.

  • non_local_cfg (dict) – Config for non-local module. Defaults to dict().

  • conv_cfg (dict, optional) – Config for conv layers. Defaults to None.

  • norm_cfg (dict, optional) – Config for norm layers. Defaults to None.

  • act_cfg (dict, optional) – Config for activate layers. Defaults to None.

  • with_cp (bool, optional) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

Returns:

A residual layer for the given config.

Return type:

nn.Module

train(mode: bool = True) None[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]

ResNet backbone for CSN.

Parameters:
  • depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True, eps=1e-3).

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.

  • bottleneck_mode (str) –

    Determine which ways to factorize a 3D bottleneck block using channel-separated convolutional networks.

    If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., the interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution only, which is derived from the interaction-preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., the interaction-reduced channel-separated bottleneck block.

    Default: ‘ir’.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.
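To see why the two bottleneck modes matter, it can help to count the weights that replace the conv2 layer. The sketch below compares a full 3x3x3 convolution with the ‘ip’ and ‘ir’ factorizations for a layer with equal input and output channels; `conv2_weight_count` is a hypothetical helper, biases are ignored, and "standard" is a label introduced here for the unfactorized case.

```python
def conv2_weight_count(channels: int, mode: str) -> int:
    """Count conv weights replacing conv2 (no bias, in_channels == out_channels)."""
    kernel_volume = 3 * 3 * 3
    depthwise = kernel_volume * channels  # one 3x3x3 filter per channel
    if mode == "standard":
        return kernel_volume * channels * channels  # full 3x3x3 conv
    if mode == "ip":
        return channels * channels + depthwise  # 1x1x1 conv, then depthwise
    if mode == "ir":
        return depthwise  # depthwise only
    raise ValueError(f"unknown mode {mode!r}")


for mode in ("standard", "ip", "ir"):
    print(mode, conv2_weight_count(64, mode))
```

For 64 channels this gives 110592, 5824, and 1728 weights respectively.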

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dLayer(depth: int, pretrained: str | None = None, pretrained2d: bool = True, stage: int = 3, base_channels: int = 64, spatial_stride: int = 2, temporal_stride: int = 1, dilation: int = 1, style: str = 'pytorch', all_frozen: bool = False, inflate: int = 1, inflate_style: str = '3x1x1', conv_cfg: Dict = {'type': 'Conv3d'}, norm_cfg: Dict = {'requires_grad': True, 'type': 'BN3d'}, act_cfg: Dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = True, init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]

ResNet 3d Layer.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • stage (int) – The index of Resnet stage. Defaults to 3.

  • base_channels (int) – Channel num of stem output features. Defaults to 64.

  • spatial_stride (int) – The 1st res block’s spatial stride. Defaults to 2.

  • temporal_stride (int) – The 1st res block’s temporal stride. Defaults to 1.

  • dilation (int) – The dilation. Defaults to 1.

  • style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.

  • all_frozen (bool) – Whether to freeze all modules in the layer. Defaults to False.

  • inflate (int) – Inflate dims of each block. Defaults to 1.

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to '3x1x1'.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Defaults to dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the residual layer.

Return type:

torch.Tensor

inflate_weights(logger: MMLogger) None[source]

Inflate weights.

init_weights(pretrained: str | None = None) None[source]

Initialize weights.

train(mode: bool = True) None[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dSlowFast(pretrained: str | None = None, resample_rate: int = 8, speed_ratio: int = 8, channel_ratio: int = 8, slow_pathway: Dict = {'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway: Dict = {'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, init_cfg: Dict | List[Dict] | None = None)[source]

Slowfast backbone.

This module is proposed in SlowFast Networks for Video Recognition

Parameters:
  • pretrained (str) – The file path to a pretrained model.

  • resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline with resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Defaults to 8.

  • speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Defaults to 8.

  • channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Defaults to 8.

  • slow_pathway (dict) – Configuration of slow branch. Defaults to dict(type='resnet3d', depth=50, pretrained=None, lateral=True, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1)).

  • fast_pathway (dict) – Configuration of fast branch. Defaults to dict(type='resnet3d', depth=50, pretrained=None, lateral=False, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1).

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
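The three ratios above can be read as simple arithmetic on the clip length and channel width. The sketch below assumes the clip length is divisible by resample_rate and that the slow pathway uses the ResNet3d default of 64 base channels; both helper functions are hypothetical names for illustration.

```python
from typing import Tuple


def slowfast_frame_counts(
    num_frames: int, resample_rate: int = 8, speed_ratio: int = 8
) -> Tuple[int, int]:
    """Return (slow_frames, fast_frames) seen by the two pathways."""
    slow = num_frames // resample_rate  # one out of resample_rate frames
    fast = slow * speed_ratio           # fast pathway is speed_ratio times denser
    return slow, fast


def fast_base_channels(slow_base_channels: int = 64, channel_ratio: int = 8) -> int:
    """Channel width of the fast pathway relative to the slow one."""
    return slow_base_channels // channel_ratio


print(slowfast_frame_counts(32))  # (4, 32)
print(fast_base_channels())       # 8
```

The second result agrees with base_channels=8 in the default fast_pathway config.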

forward(x: Tensor) tuple[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

tuple[torch.Tensor]

init_weights(pretrained: str | None = None) None[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.ResNet3dSlowOnly(conv1_kernel: Sequence[int] = (1, 7, 7), conv1_stride_t: int = 1, pool1_stride_t: int = 1, inflate: Sequence[int] = (0, 0, 1, 1), with_pool2: bool = False, **kwargs)[source]

SlowOnly backbone based on ResNet3dPathway.

Parameters:
  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Defaults to (1, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Defaults to 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Defaults to 1.

  • inflate (Sequence[int]) – Inflate dims of each block. Defaults to (0, 0, 1, 1).

  • with_pool2 (bool) – Whether to use pool2. Defaults to False.

class mmaction.models.backbones.ResNetAudio(depth: int, pretrained: str | None = None, in_channels: int = 1, num_stages: int = 4, base_channels: int = 32, strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), conv1_kernel: int = 9, conv1_stride: int = 1, frozen_stages: int = -1, factorize: Sequence[int] = (1, 1, 0, 0), norm_eval: bool = False, with_cp: bool = False, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, zero_init_residual: bool = True)[source]

ResNet 2d audio backbone.

Parameters:
  • depth (int) – Depth of resnet, from {50, 101, 152}.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • in_channels (int) – Channel num of input features. Defaults to 1.

  • base_channels (int) – Channel num of stem output features. Defaults to 32.

  • num_stages (int) – Resnet stages. Defaults to 4.

  • strides (Sequence[int]) – Strides of residual blocks of each stage. Defaults to (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).

  • conv1_kernel (int) – Kernel size of the first conv layer. Defaults to 9.

  • conv1_stride (Union[int, Tuple[int]]) – Stride of the first conv layer. Defaults to 1.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.

  • factorize (Sequence[int]) – Factorize dims of each block for audio. Defaults to (1, 1, 0, 0).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • conv_cfg (Union[dict, ConfigDict]) – Config for conv layers. Defaults to dict(type='Conv').

  • norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. required keys are type and requires_grad. Defaults to dict(type='BN2d', requires_grad=True).

  • act_cfg (Union[dict, ConfigDict]) – Config for activate layers. Defaults to dict(type='ReLU', inplace=True).

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

init_weights() None[source]

Initiate the parameters either from existing checkpoint or from scratch.

static make_res_layer(block: Module, inplanes: int, planes: int, blocks: int, stride: int = 1, dilation: int = 1, factorize: int = 1, norm_cfg: ConfigDict | dict | None = None, with_cp: bool = False) Module[source]

Build residual layer for ResNetAudio.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • stride (int) – Strides of residual blocks of each stage. Defaults to 1.

  • dilation (int) – Spacing between kernel elements. Defaults to 1.

  • factorize (Union[int, Sequence[int]]) – Determine whether to factorize for each block. Defaults to 1.

  • norm_cfg (Union[dict, ConfigDict], optional) – Config for norm layers. Defaults to None.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

Returns:

A residual layer for the given config.

Return type:

nn.Module

train(mode: bool = True) None[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNetTIN(depth, is_tin=True, **kwargs)[source]

ResNet backbone for TIN.

Parameters:
  • depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments. Default: 8.

  • is_tin (bool) – Whether to apply temporal interlace. Default: True.

  • shift_div (int) – Number of division parts for shift. Default: 4.

  • kwargs (dict, optional) – Arguments for ResNet.

init_structure()[source]

Initialize structure for tsm.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_interlace()[source]

Make temporal interlace for some layers.

class mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, pretrained2d=True, **kwargs)[source]

ResNet backbone for TSM.

Parameters:
  • num_segments (int) – Number of frame segments. Defaults to 8.

  • is_shift (bool) – Whether to make temporal shift in resnet layers. Defaults to True.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Defaults to (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Defaults to dict().

  • shift_div (int) – Number of div for shift. Defaults to 8.

  • shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Defaults to ‘blockres’.

  • temporal_pool (bool) – Whether to add temporal pooling. Defaults to False.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • **kwargs (keyword arguments, optional) – Arguments for ResNet.

init_structure()[source]

Initialize structure for tsm.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

load_original_weights(logger)[source]

Load weights from the original checkpoint, which requires converting keys.

make_non_local()[source]

Wrap resnet layer into non local wrapper.

make_temporal_pool()[source]

Make temporal pooling between layer1 and layer2, using a 3D max pooling layer.

make_temporal_shift()[source]

Make temporal shift for some layers.
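The temporal shift itself moves a fraction of the channels one step along the time axis. The sketch below follows the shift described in the TSM paper using plain lists, with each frame reduced to a vector of C channel values; it is an illustration under that assumption and not mmaction's implementation, and `temporal_shift` is a hypothetical name.

```python
from typing import List


def temporal_shift(frames: List[List[float]], shift_div: int = 8) -> List[List[float]]:
    """Shift 1/shift_div of the channels each way along time, zero-padded.

    frames: T per-frame channel vectors, each with C values.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first fold: take values from the next frame
            elif c < 2 * fold:
                src = t - 1  # second fold: take values from the previous frame
            else:
                src = t      # remaining channels are left in place
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


clip = [[10.0 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip, shift_div=4)
print(shifted[1])
```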

class mmaction.models.backbones.STGCN(graph_cfg: Dict, in_channels: int = 3, base_channels: int = 64, data_bn_type: str = 'VC', ch_ratio: int = 2, num_person: int = 2, num_stages: int = 10, inflate_stages: List[int] = [5, 8], down_stages: List[int] = [5, 8], init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]

STGCN backbone.

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. More details can be found in the paper.

Parameters:
  • graph_cfg (dict) – Config for building the graph.

  • in_channels (int) – Number of input channels. Defaults to 3.

  • base_channels (int) – Number of base channels. Defaults to 64.

  • data_bn_type (str) – Type of the data bn layer. Defaults to 'VC'.

  • ch_ratio (int) – Inflation ratio of the number of channels. Defaults to 2.

  • num_person (int) – Maximum number of people. Only used when data_bn_type == ‘MVC’. Defaults to 2.

  • num_stages (int) – Total number of stages. Defaults to 10.

  • inflate_stages (list[int]) – Stages to inflate the number of channels. Defaults to [5, 8].

  • down_stages (list[int]) – Stages to perform downsampling in the time dimension. Defaults to [5, 8].

  • stage_cfgs (dict) – Extra config dict for each stage. Defaults to dict().

  • init_cfg (dict or list[dict], optional) – Config to control the initialization. Defaults to None.

Examples:

>>> import torch
>>> from mmaction.models import STGCN
>>>
>>> mode = 'stgcn_spatial'
>>> batch_size, num_person, num_frames = 2, 2, 150
>>>
>>> # openpose-18 layout
>>> num_joints = 18
>>> model = STGCN(graph_cfg=dict(layout='openpose', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # nturgb+d layout
>>> num_joints = 25
>>> model = STGCN(graph_cfg=dict(layout='nturgb+d', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # coco layout
>>> num_joints = 17
>>> model = STGCN(graph_cfg=dict(layout='coco', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # custom settings
>>> # instantiate STGCN++
>>> model = STGCN(graph_cfg=dict(layout='coco', mode='spatial'),
...               gcn_adaptive='init', gcn_with_res=True,
...               tcn_type='mstcn')
>>> model.init_weights()
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 18])
torch.Size([2, 2, 256, 38, 25])
torch.Size([2, 2, 256, 38, 17])
torch.Size([2, 2, 256, 38, 17])
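The 256 channels and 38 frames in the example output shapes follow from the default stage settings. The sketch below reproduces that arithmetic, assuming each inflate stage multiplies the channel count by ch_ratio and each down stage halves the frame count with rounding up; `stgcn_output_dims` is a hypothetical helper and the 1-based stage numbering is an assumption.

```python
from math import ceil
from typing import Sequence, Tuple


def stgcn_output_dims(
    num_frames: int,
    base_channels: int = 64,
    ch_ratio: int = 2,
    num_stages: int = 10,
    inflate_stages: Sequence[int] = (5, 8),
    down_stages: Sequence[int] = (5, 8),
) -> Tuple[int, int]:
    """Return (channels, frames) after walking through all stages."""
    channels, frames = base_channels, num_frames
    for stage in range(1, num_stages + 1):
        if stage in inflate_stages:
            channels *= ch_ratio        # widen the feature maps
        if stage in down_stages:
            frames = ceil(frames / 2)   # stride-2 temporal downsampling
    return channels, frames


print(stgcn_output_dims(150))  # (256, 38)
```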

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

class mmaction.models.backbones.SwinTransformer3D(arch: str | Dict, pretrained: str | None = None, pretrained2d: bool = True, patch_size: int | Sequence[int] = (2, 4, 4), in_channels: int = 3, window_size: Sequence[int] = (8, 7, 7), mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, act_cfg: Dict = {'type': 'GELU'}, norm_cfg: Dict = {'type': 'LN'}, patch_norm: bool = True, frozen_stages: int = -1, with_cp: bool = False, out_indices: Sequence[int] = (3,), out_after_downsample: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])[source]

Video Swin Transformer backbone.

A PyTorch implementation of Video Swin Transformer.

Parameters:
  • arch (str or dict) – Video Swin Transformer architecture. If a string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If a dict, it should have the following keys:

    - embed_dims (int): The dimensions of embedding.
    - depths (Sequence[int]): The number of blocks in each stage.
    - num_heads (Sequence[int]): The number of heads in attention modules of each stage.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • patch_size (int or Sequence[int]) – Patch size. Defaults to (2, 4, 4).

  • in_channels (int) – Number of input image channels. Defaults to 3.

  • window_size (Sequence[int]) – Window size. Defaults to (8, 7, 7).

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Defaults to 4.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Defaults to True.

  • qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.

  • drop_rate (float) – Dropout rate. Defaults to 0.0.

  • attn_drop_rate (float) – Attention dropout rate. Defaults to 0.0.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='GELU').

  • norm_cfg (dict) – Config dict for norm layer. Defaults to dict(type='LN').

  • patch_norm (bool) – If True, add normalization after patch embedding. Defaults to True.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).

  • out_after_downsample (bool) – Whether to output the feature map of a stage after the following downsample layer. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].
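As a rough illustration of how patch_size and window_size interact, the sketch below computes the token grid produced by patch embedding and the number of attention windows that tile it. It assumes inputs are padded up to a multiple of the patch and window sizes (hence the ceil division). The helper names patch_grid and num_windows are hypothetical and are not part of mmaction.

```python
import math

def patch_grid(input_size, patch_size=(2, 4, 4)):
    """Token grid (T', H', W') after non-overlapping patch embedding.

    Assumes the input is padded up to a multiple of the patch size.
    """
    return tuple(math.ceil(s / p) for s, p in zip(input_size, patch_size))

def num_windows(grid, window_size=(8, 7, 7)):
    """Number of attention windows tiling the token grid (with padding)."""
    return math.prod(math.ceil(g / w) for g, w in zip(grid, window_size))

grid = patch_grid((32, 224, 224))  # 32 frames of 224x224 pixels
print(grid, num_windows(grid))     # (16, 56, 56) 128
```

With the default settings, a 32-frame 224x224 clip yields a 16x56x56 token grid, split into 2x8x8 windows.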

forward(x: Tensor) Tuple[Tensor] | Tensor[source]

Forward function for Swin3d Transformer.

inflate_weights(logger: MMLogger) None[source]

Inflate the swin2d parameters to swin3d.

The differences between swin3d and swin2d mainly lie in an extra axis. To utilize the pretrained parameters in 2d model, the weight of swin2d models should be inflated to fit in the shapes of the 3d counterpart.

Parameters:

logger (MMLogger) – The logger used to print debugging information.
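One common way to inflate a 2D kernel is to repeat it along the new temporal axis and divide by the temporal depth, so that the inflated kernel gives the same response on a static clip. The sketch below shows this idea on plain nested lists. Whether inflate_weights applies exactly this scheme to every parameter is an assumption, and inflate_kernel is a hypothetical helper.

```python
def inflate_kernel(kernel_2d, depth):
    """Inflate a 2D kernel [H][W] into a 3D kernel [depth][H][W].

    Each temporal slice is the 2D kernel divided by ``depth``, so the
    slices sum back to the original 2D kernel.
    """
    return [[[v / depth for v in row] for row in kernel_2d]
            for _ in range(depth)]

kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
kernel_3d = inflate_kernel(kernel_2d, depth=2)
print(kernel_3d[0])  # [[0.5, 1.0], [1.5, 2.0]]
```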

init_weights() None[source]

Initialize the weights in backbone.

train(mode: bool = True) None[source]

Convert the model into training mode while keeping layers frozen.

class mmaction.models.backbones.TANet(depth: int, num_segments: int, tam_cfg: dict | None = None, **kwargs)[source]

Temporal Adaptive Network (TANet) backbone.

This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments.

  • tam_cfg (dict, optional) – Config for temporal adaptive module (TAM). Defaults to None.

init_weights()[source]

Initialize weights.

make_tam_modeling()[source]

Replace ResNet-Block with TA-Block.

class mmaction.models.backbones.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs)[source]

TimeSformer. A PyTorch impl of Is Space-Time Attention All You Need for Video Understanding?

Parameters:
  • num_frames (int) – Number of frames in the video.

  • img_size (int | tuple) – Size of input image.

  • patch_size (int) – Size of one patch.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • embed_dims (int) – Dimensions of embedding. Defaults to 768.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.

  • num_transformer_layers (int) – Number of transformer layers. Defaults to 12.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.

  • transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of the transformer layer in TransformerCoder. If it is a single mmcv.ConfigDict, it is repeated num_transformer_layers times to form a list[mmcv.ConfigDict]. Defaults to None.

  • attention_type (str) – Type of attentions in TransformerCoder. Choices are ‘divided_space_time’, ‘space_only’ and ‘joint_space_time’. Defaults to ‘divided_space_time’.

  • norm_cfg (dict) – Config for norm layers. Defaults to dict(type=’LN’, eps=1e-6).
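To give a feel for what the three attention_type choices trade off, the sketch below counts patch tokens and a rough number of query-key pairs per layer. It assumes square inputs and ignores any class token. The function names are hypothetical, and the counts are a back-of-the-envelope estimate rather than a measurement of the actual module.

```python
def token_counts(num_frames, img_size, patch_size):
    """Return (patches per frame, total patch tokens) for square inputs."""
    per_frame = (img_size // patch_size) ** 2
    return per_frame, num_frames * per_frame

def attention_pairs(num_frames, per_frame, attention_type):
    """Rough count of query-key pairs per layer, ignoring any class token."""
    total = num_frames * per_frame
    if attention_type == 'joint_space_time':
        return total * total            # every token attends to every token
    if attention_type == 'space_only':
        return total * per_frame        # tokens attend within their frame
    if attention_type == 'divided_space_time':
        # temporal attention (T keys) followed by spatial attention (N keys)
        return total * (num_frames + per_frame)
    raise ValueError(f'unknown attention_type: {attention_type}')

per_frame, total = token_counts(num_frames=8, img_size=224, patch_size=16)
print(per_frame, total)  # 196 1568
```

Under these assumptions, divided attention needs roughly an eighth of the pairs that joint attention does for this input size.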

forward(x)[source]

Defines the computation performed at every call.

init_weights(pretrained=None)[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.UniFormer(depth: List[int] = [5, 8, 20, 7], img_size: int = 224, in_chans: int = 3, embed_dim: List[int] = [64, 128, 320, 512], head_dim: int = 64, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, pretrained2d: bool = True, pretrained: str | None = None, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])[source]

UniFormer.

A PyTorch implementation of: UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning <https://arxiv.org/abs/2201.04676>

Parameters:
  • depth (List[int]) – List of depth in each stage. Defaults to [5, 8, 20, 7].

  • img_size (int) – Size of the input image. Defaults to 224.

  • in_chans (int) – Number of input channels. Defaults to 3.

  • head_dim (int) – Dimension of attention head. Defaults to 64.

  • embed_dim (List[int]) – List of embedding dimension in each layer. Defaults to [64, 128, 320, 512].

  • mlp_ratio (float) – Ratio of mlp hidden dimension to embedding dimension. Defaults to 4.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Defaults to True.

  • qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.

  • drop_rate (float) – Dropout rate. Defaults to 0.0.

  • attn_drop_rate (float) – Attention dropout rate. Defaults to 0.0.

  • drop_path_rate (float) – Stochastic depth rates. Defaults to 0.0.

  • pretrained2d (bool) – Whether to load pretrained from 2D model. Defaults to True.

  • pretrained (str) – Name of pretrained model. Defaults to None.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights in backbone.

class mmaction.models.backbones.UniFormerV2(input_resolution: int = 224, patch_size: int = 16, width: int = 768, layers: int = 12, heads: int = 12, backbone_drop_path_rate: float = 0.0, t_size: int = 8, kernel_size: int = 3, dw_reduction: float = 1.5, temporal_downsample: bool = False, no_lmhra: bool = True, double_lmhra: bool = False, return_list: List[int] = [8, 9, 10, 11], n_layers: int = 4, n_dim: int = 768, n_head: int = 12, mlp_factor: float = 4.0, drop_path_rate: float = 0.0, mlp_dropout: List[float] = [0.5, 0.5, 0.5, 0.5], clip_pretrained: bool = True, pretrained: str | None = None, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])[source]

UniFormerV2:

A PyTorch implementation of: UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer <https://arxiv.org/abs/2211.09552>

Parameters:
  • input_resolution (int) – Input resolution. Defaults to 224.

  • patch_size (int) – Patch size. Defaults to 16.

  • width (int) – Number of input channels in local UniBlock. Defaults to 768.

  • layers (int) – Number of layers of local UniBlock. Defaults to 12.

  • heads (int) – Number of attention heads in local UniBlock. Defaults to 12.

  • backbone_drop_path_rate (float) – Stochastic depth rate in local UniBlock. Defaults to 0.0.

  • t_size (int) – Number of temporal dimension after patch embedding. Defaults to 8.

  • temporal_downsample (bool) – Whether to downsample the temporal dimension. Defaults to False.

  • dw_reduction (float) – Downsample ratio of input channels in local MHRA. Defaults to 1.5.

  • no_lmhra (bool) – Whether to remove local MHRA in local UniBlock. Defaults to True.

  • double_lmhra (bool) – Whether to use double local MHRA in local UniBlock. Defaults to False.

  • return_list (List[int]) – Layer index of input features for global UniBlock. Defaults to [8, 9, 10, 11].

  • n_layers (int) – Number of layers of global UniBlock. Defaults to 4.

  • n_dim (int) – Number of input channels in global UniBlock. Defaults to 768.

  • n_head (int) – Number of attention heads in global UniBlock. Defaults to 12.

  • mlp_factor (float) – Ratio of hidden dimensions in MLP layers in global UniBlock. Defaults to 4.0.

  • drop_path_rate (float) – Stochastic depth rate in global UniBlock. Defaults to 0.0.

  • mlp_dropout (List[float]) – Stochastic dropout rate in each MLP layer in global UniBlock. Defaults to [0.5, 0.5, 0.5, 0.5].

  • clip_pretrained (bool) – Whether to load pretrained CLIP visual encoder. Defaults to True.

  • pretrained (str) – Name of pretrained model. Defaults to None.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights in backbone.

class mmaction.models.backbones.VisionTransformer(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dims: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: int = 4.0, qkv_bias: bool = True, qk_scale: int | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: ConfigDict | dict = {'eps': 1e-06, 'type': 'LN'}, init_values: int = 0.0, use_learnable_pos_emb: bool = False, num_frames: int = 16, tubelet_size: int = 2, use_mean_pooling: int = True, pretrained: str | None = None, return_feat_map: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}], **kwargs)[source]

Vision Transformer with support for patch or hybrid CNN input stage. An impl of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Parameters:
  • img_size (int or tuple) – Size of input image. Defaults to 224.

  • patch_size (int) – Spatial size of one patch. Defaults to 16.

  • in_channels (int) – The number of channels of the input. Defaults to 3.

  • embed_dims (int) – Dimensions of embedding. Defaults to 768.

  • depth (int) – Number of blocks in the transformer. Defaults to 12.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.

  • mlp_ratio (int) – The ratio between the hidden layer and the input layer in the FFN. Defaults to 4.

  • qkv_bias (bool) – If True, add a learnable bias to q and v. Defaults to True.

  • qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.

  • drop_rate (float) – Dropout ratio of output. Defaults to 0.

  • attn_drop_rate (float) – Dropout ratio of attention weight. Defaults to 0.

  • drop_path_rate (float) – Dropout ratio of the residual branch. Defaults to 0.

  • norm_cfg (dict or ConfigDict) – Config for norm layers. Defaults to dict(type='LN', eps=1e-6).

  • init_values (float) – Value to init the multiplier of the residual branch. Defaults to 0.

  • use_learnable_pos_emb (bool) – If True, use learnable positional embedding, otherwise use sinusoid encoding. Defaults to False.

  • num_frames (int) – Number of frames in the video. Defaults to 16.

  • tubelet_size (int) – Temporal size of one patch. Defaults to 2.

  • use_mean_pooling (bool) – If True, take the mean pooling over all positions. Defaults to True.

  • pretrained (str, optional) – Name of pretrained model. Default: None.

  • return_feat_map (bool) – If True, return the feature in the shape of [B, C, T, H, W]. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].
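The interplay of num_frames, tubelet_size and patch_size, and the sinusoid encoding used when use_learnable_pos_emb is False, can be sketched as follows. The table follows the standard fixed sinusoidal scheme; whether the backbone builds its table in exactly this way is an assumption, and both helper names are hypothetical.

```python
import math

def num_tokens(num_frames=16, tubelet_size=2, img_size=224, patch_size=16):
    """Tokens produced by tubelet embedding for square inputs."""
    return (num_frames // tubelet_size) * (img_size // patch_size) ** 2

def sinusoid_table(n_position, d_hid):
    """Fixed sinusoidal position table of shape [n_position][d_hid]."""
    table = []
    for pos in range(n_position):
        row = []
        for i in range(d_hid):
            # Paired dimensions (2j, 2j + 1) share one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_hid))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

print(num_tokens())  # 1568
table = sinusoid_table(num_tokens(), d_hid=8)  # small d_hid for illustration
```

With the defaults, 16 frames grouped into tubelets of 2 give 8 temporal positions, each with 14x14 spatial patches.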

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Tensor

class mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=-1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters:
  • gamma_w (float) – Global channel width expansion factor. Default: 1.

  • gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.

  • gamma_d (float) – Network depth expansion factor. Default: 1.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Resnet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).

  • frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]

Build residual layer for ResNet3D.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • layer_inplanes (int) – Number of channels for the input feature of the res layer.

  • inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.

  • planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict | None) – Config for conv layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. Default: None.

  • act_cfg (dict | None) – Config for activation layers. Default: None.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns:

A residual layer for the given config.

Return type:

nn.Module
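The channel relations stated for inplanes and planes can be sketched with plain arithmetic. The base width of 24 and the expansion factor of 2.25 below are illustrative values only, and truncating with int() is an assumption; the actual backbone may round widths differently (for example to a multiple of some divisor).

```python
def block_channels(base_channels, gamma_w, gamma_b):
    """Channel widths following the relations stated for make_res_layer.

    inplanes = base_channels * gamma_w
    planes   = base_channels * gamma_w * gamma_b
    """
    inplanes = int(base_channels * gamma_w)
    planes = int(inplanes * gamma_b)
    return inplanes, planes

print(block_channels(24, gamma_w=1.0, gamma_b=2.25))  # (24, 54)
print(block_channels(24, gamma_w=2.0, gamma_b=2.25))  # (48, 108)
```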

train(mode=True)[source]

Set the optimization status when training.

common

class mmaction.models.common.Conv2plus1d(in_channels: int, out_channels: int, kernel_size: int | Tuple[int], stride: int | Tuple[int] = 1, padding: int | Tuple[int] = 0, dilation: int | Tuple[int] = 1, groups: int = 1, bias: bool | str = True, norm_cfg: ConfigDict | dict = {'type': 'BN3d'})[source]

(2+1)d Conv module for R(2+1)d backbone.

https://arxiv.org/pdf/1711.11248.pdf.

Parameters:
  • in_channels (int) – Same as nn.Conv3d.

  • out_channels (int) – Same as nn.Conv3d.

  • kernel_size (Union[int, Tuple[int]]) – Same as nn.Conv3d.

  • stride (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 1.

  • padding (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 0.

  • dilation (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 1.

  • groups (int) – Same as nn.Conv3d. Defaults to 1.

  • bias (Union[bool, str]) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False. Defaults to True.

  • norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Defaults to dict(type='BN3d').
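The R(2+1)D paper factorizes a t x d x d 3D convolution into a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution, choosing the intermediate width so that the parameter count roughly matches the full 3D convolution. The sketch below reproduces that arithmetic. Whether this module computes its intermediate width in exactly this way is an assumption, and the function names are hypothetical.

```python
def mid_channels(in_channels, out_channels, t=3, d=3):
    """Intermediate width M from the R(2+1)D paper (floor division)."""
    return (t * d * d * in_channels * out_channels) // (
        d * d * in_channels + t * out_channels)

def param_counts(in_channels, out_channels, t=3, d=3):
    """Weights (no bias) of a full t x d x d conv vs. its (2+1)D split."""
    full = t * d * d * in_channels * out_channels
    m = mid_channels(in_channels, out_channels, t, d)
    spatial = d * d * in_channels * m   # 1 x d x d convolution
    temporal = t * m * out_channels     # t x 1 x 1 convolution
    return full, spatial + temporal

print(mid_channels(64, 64))   # 144
print(param_counts(64, 64))   # (110592, 110592)
```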

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The output of the module.

Return type:

torch.Tensor

init_weights() None[source]

Initiate the parameters from scratch.

class mmaction.models.common.ConvAudio(in_channels: int, out_channels: int, kernel_size: int | Tuple[int], op: str = 'concat', stride: int | Tuple[int] = 1, padding: int | Tuple[int] = 0, dilation: int | Tuple[int] = 1, groups: int = 1, bias: bool | str = False)[source]

Conv2d module for AudioResNet backbone.

Parameters:
  • in_channels (int) – Same as nn.Conv2d.

  • out_channels (int) – Same as nn.Conv2d.

  • kernel_size (Union[int, Tuple[int]]) – Same as nn.Conv2d.

  • op (str) – Operation to merge the output of freq and time feature map. Choices are sum and concat. Defaults to concat.

  • stride (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 1.

  • padding (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 0.

  • dilation (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 1.

  • groups (int) – Same as nn.Conv2d. Defaults to 1.

  • bias (Union[bool, str]) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False. Defaults to False.

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The output of the module.

Return type:

torch.Tensor

init_weights() None[source]

Initiate the parameters from scratch.

class mmaction.models.common.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]

Spatial Attention in Divided Space Time Attention.

Parameters:
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.0.

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.0.

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type='DropPath', drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs)[source]

Defines the computation performed at every call.

init_weights()[source]

Initialize DividedSpatialAttentionWithNorm with the default scheme.

class mmaction.models.common.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]

Temporal Attention in Divided Space Time Attention.

Parameters:
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.0.

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.0.

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type='DropPath', drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs)[source]

Defines the computation performed at every call.

init_weights()[source]

Initialize weights.

class mmaction.models.common.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs)[source]

FFN with pre normalization layer.

FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.

FFNWithNorm has one main difference with FFN:

  • It applies one normalization layer before forwarding the input data to the feed-forward networks.

Parameters:
  • embed_dims (int) – Dimensions of embedding. Defaults to 256.

  • feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.

  • num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.

  • act_cfg (dict) – Config for activation layers. Defaults to dict(type='ReLU').

  • ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0.0.

  • add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.

  • dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.

  • init_cfg (dict) – The Config for initialization. Defaults to None.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').

forward(x, residual=None)[source]

Defines the computation performed at every call.

class mmaction.models.common.SubBatchNorm3D(num_features, **cfg)[source]

Sub BatchNorm3d splits the batch dimension into N splits and runs BN on each of them separately, so that the stats are computed on each subset of examples (1/N of the batch) independently. During evaluation, it aggregates the stats from all splits into one BN.

Parameters:

num_features (int) – Dimensions of BatchNorm.

aggregate_stats()[source]

Synchronize running_mean and running_var to self.bn.

Call this before evaluation, then call model.eval(). In eval mode, the forward function calls self.bn instead of self.split_bn; by then, the running_mean and running_var of self.bn have been obtained from self.split_bn.
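Merging per-split statistics into a single mean and variance can be done with the law of total variance. The sketch below assumes equal-sized splits and population (biased) variance, and checks the result against statistics computed over the whole batch. The exact formula used by aggregate_stats is an assumption here, and the helpers are hypothetical stand-ins operating on plain lists.

```python
def mean_var(values):
    """Population mean and variance of a list of floats."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

def aggregate(split_means, split_vars):
    """Combine per-split stats of equal-sized splits into global stats."""
    n = len(split_means)
    mean = sum(split_means) / n
    # average within-split variance + variance of the split means
    var = sum(split_vars) / n + sum((m - mean) ** 2 for m in split_means) / n
    return mean, var

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
stats = [mean_var(split) for split in (data[:4], data[4:])]
merged = aggregate([m for m, _ in stats], [v for _, v in stats])
print(merged, mean_var(data))  # (4.5, 5.25) (4.5, 5.25)
```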

forward(x)[source]

Defines the computation performed at every call.

init_weights(cfg)[source]

Initialize weights.

class mmaction.models.common.TAM(in_channels: int, num_segments: int, alpha: int = 2, adaptive_kernel_size: int = 3, beta: int = 4, conv1d_kernel_size: int = 3, adaptive_convolution_stride: int = 1, adaptive_convolution_padding: int = 1, init_std: float = 0.001)[source]

Temporal Adaptive Module(TAM) for TANet.

This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Parameters:
  • in_channels (int) – Channel num of input features.

  • num_segments (int) – Number of frame segments.

  • alpha (int) – alpha in the paper: the ratio of the intermediate channel number to the initial channel number in the global branch. Defaults to 2.

  • adaptive_kernel_size (int) – K in the paper: the size of the adaptive kernel in the global branch. Defaults to 3.

  • beta (int) – beta in the paper: controls the model complexity in the local branch. Defaults to 4.

  • conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Defaults to 3.

  • adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of Temporal Adaptive Aggregation. Defaults to 1.

  • adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of Temporal Adaptive Aggregation. Defaults to 1.

  • init_std (float) – Std value for initialization of nn.Linear. Defaults to 0.001.

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The output of the module.

Return type:

torch.Tensor

data_preprocessors

class mmaction.models.data_preprocessors.ActionDataPreprocessor(mean: Sequence[float | int] | None = None, std: Sequence[float | int] | None = None, to_rgb: bool = False, to_float32: bool = True, blending: dict | None = None, format_shape: str = 'NCHW')[source]

Data pre-processor for action recognition tasks.

Parameters:
  • mean (Sequence[float or int], optional) – The pixel mean of channels of images or stacked optical flow. Defaults to None.

  • std (Sequence[float or int], optional) – The pixel standard deviation of channels of images or stacked optical flow. Defaults to None.

  • to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.

  • to_float32 (bool) – Whether to convert data to float32. Defaults to True.

  • blending (dict, optional) – Config for batch blending. Defaults to None.

  • format_shape (str) – Format shape of input data. Defaults to 'NCHW'.
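The effect of mean, std and to_rgb on a single pixel can be sketched as below. The order of operations (channel flip first, then normalization) is an assumption about the preprocessor, and preprocess_pixel is a hypothetical helper working on plain lists rather than tensors.

```python
def preprocess_pixel(pixel, mean, std, to_rgb=False):
    """Normalize one pixel given per-channel mean and std.

    If ``to_rgb`` is True, the channel order is reversed (BGR -> RGB)
    before normalization, so mean/std are expected in RGB order.
    """
    channels = list(reversed(pixel)) if to_rgb else list(pixel)
    return [(float(c) - m) / s for c, m, s in zip(channels, mean, std)]

bgr_pixel = [10, 20, 30]
print(preprocess_pixel(bgr_pixel, mean=[10, 10, 10], std=[10, 10, 10],
                       to_rgb=True))  # [2.0, 1.0, 0.0]
```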

forward(data: dict | Tuple[dict], training: bool = False) dict | Tuple[dict][source]

Perform normalization, padding, bgr2rgb conversion and batch augmentation based on BaseDataPreprocessor.

Parameters:
  • data (dict or Tuple[dict]) – data sampled from dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns:

Data in the same format as the model input.

Return type:

dict or Tuple[dict]

forward_onesample(data, training: bool = False) dict[source]

Perform normalization, padding, bgr2rgb conversion and batch augmentation on one data sample.

Parameters:
  • data (dict) – data sampled from dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns:

Data in the same format as the model input.

Return type:

dict

class mmaction.models.data_preprocessors.MultiModalDataPreprocessor(preprocessors: Dict)[source]

Multi-Modal data pre-processor for action recognition tasks.

forward(data: Dict, training: bool = False) Dict[source]

Preprocesses the data into the model input format.

Parameters:
  • data (dict) – Data returned by dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns:

Data in the same format as the model input.

Return type:

dict

heads

class mmaction.models.heads.BaseHead(num_classes: int, in_channels: int, loss_cls: Dict = {'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class: bool = False, label_smooth_eps: float = 0.0, topk: int | Tuple[int] = (1, 5), average_clips: Dict | None = None, init_cfg: Dict | None = None)[source]

Base class for head.

All heads should subclass it. All subclasses should override forward(), which must support forwarding for both training and testing.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss', loss_weight=1.0).

  • multi_class (bool) – Determines whether it is a multi-class recognition task. Defaults to False.

  • label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Defaults to 0.

  • topk (int or tuple) – Top-k accuracy. Defaults to (1, 5).

  • average_clips (dict, optional) – Config for averaging class scores over multiple clips. Defaults to None.

  • init_cfg (dict, optional) – Config to control the initialization. Defaults to None.
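To make label_smooth_eps concrete, the sketch below applies the usual label smoothing rule, mixing the one-hot target with a uniform distribution over the classes. That the head uses exactly this rule is an assumption, and smooth_label is a hypothetical helper.

```python
def smooth_label(label, num_classes, eps):
    """Smoothed target: (1 - eps) * one_hot + eps / num_classes."""
    one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
    return [(1.0 - eps) * v + eps / num_classes for v in one_hot]

target = smooth_label(label=1, num_classes=4, eps=0.1)
print(target)  # roughly [0.025, 0.925, 0.025, 0.025]
```

The smoothed target still sums to 1, and eps = 0 recovers the plain one-hot label.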

average_clip(cls_scores: Tensor, num_segs: int = 1) Tensor[source]

Averaging class scores over multiple clips.

Uses the configured averaging type (‘score’, ‘prob’ or None, as defined in test_cfg) to compute the final averaged class score. Only called in test mode.

Parameters:
  • cls_scores (torch.Tensor) – Class scores to be averaged.

  • num_segs (int) – Number of clips for each input sample.

Returns:

Averaged class scores.

Return type:

torch.Tensor
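The difference between the ‘score’ and ‘prob’ averaging types can be sketched on plain lists: ‘score’ averages the raw class scores of the clips, while ‘prob’ applies softmax to each clip first. This is a minimal stand-in under those assumptions, not the head's actual tensor implementation, and the function below is a hypothetical free function rather than the method itself.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_clip(cls_scores, num_segs=1, average_clips=None):
    """Average per-clip scores; cls_scores has batch * num_segs rows."""
    if average_clips is None:
        return cls_scores
    if average_clips not in ('score', 'prob'):
        raise ValueError(f'unsupported average_clips: {average_clips}')
    if average_clips == 'prob':
        cls_scores = [softmax(s) for s in cls_scores]
    averaged = []
    for start in range(0, len(cls_scores), num_segs):
        group = cls_scores[start:start + num_segs]
        averaged.append([sum(col) / num_segs for col in zip(*group)])
    return averaged

clips = [[2.0, 0.0], [0.0, 0.0]]  # one video, two clips, two classes
print(average_clip(clips, num_segs=2, average_clips='score'))  # [[1.0, 0.0]]
```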

abstract forward(x, **kwargs) Dict[str, Tensor] | List[ActionDataSample] | Tuple[Tensor] | Tensor[source]

Defines the computation performed at every call.

loss(feats: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) Dict[source]

Perform forward propagation of head and loss calculation on the features of the upstream network.

Parameters:
  • feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

loss_by_feat(cls_scores: Tensor, data_samples: List[ActionDataSample]) Dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (torch.Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict
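Alongside the loss, classification heads commonly report top-k accuracy for the values given in topk. The sketch below shows how such a metric can be computed from class scores of shape (batch_size, num_classes). Whether loss_by_feat includes it in the returned dictionary is an assumption, and top_k_accuracy here is a hypothetical pure-Python helper.

```python
def top_k_accuracy(cls_scores, labels, topk=(1, 5)):
    """Fraction of samples whose label is among the k highest scores."""
    results = []
    for k in topk:
        hits = 0
        for scores, label in zip(cls_scores, labels):
            ranked = sorted(range(len(scores)),
                            key=lambda i: scores[i], reverse=True)
            hits += label in ranked[:k]
        results.append(hits / len(labels))
    return results

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(top_k_accuracy(scores, labels=[1, 1], topk=(1, 2)))  # [0.5, 1.0]
```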

predict(feats: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) List[ActionDataSample][source]

Perform forward propagation of head and predict recognition results on the features of the upstream network.

Parameters:
  • feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

list[ActionDataSample]

predict_by_feat(cls_scores: Tensor, data_samples: List[ActionDataSample]) List[ActionDataSample][source]

Transform a batch of output features extracted from the head into prediction results.

Parameters:
  • cls_scores (torch.Tensor) – Classification scores, with shape (B*num_segs, num_classes).

  • data_samples (list[ActionDataSample]) – The annotation data of every sample. It usually includes information such as gt_label.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

List[ActionDataSample]

class mmaction.models.heads.FeatureHead(spatial_type: str = 'avg', temporal_type: str = 'avg', backbone_name: str | None = None, num_segments: str | None = None, **kwargs)[source]

General head for feature extraction.

Parameters:
  • spatial_type (str, optional) – Pooling type in spatial dimension. Default: ‘avg’. If set to None, the spatial dimension is kept, and for a GCN backbone, the last two dimensions (T, V) are kept.

  • temporal_type (str, optional) – Pooling type in temporal dimension. Default: ‘avg’. If set to None, the temporal dimension is kept, and for a GCN backbone, dimension M is kept. Please note that the channel order stays the same as the output of the backbone: [N, T, C, H, W] for 2D recognizers and [N, M, C, T, V] for GCN recognizers.

  • backbone_name (str, optional) – Backbone name used to select special operations. Currently supports ‘tsm’, ‘slowfast’ and ‘gcn’. Defaults to None, which means the input is treated as a normal feature.

  • num_segments (int, optional) – Number of frame segments for TSM backbone. Defaults to None.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
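Heads are usually selected through a config dict whose `type` key names the class, in the same way as the `loss_cls` defaults shown in this reference. Below is a minimal sketch of such a fragment for FeatureHead with a TSM backbone. The variable name `feature_head` and the value 8 are illustrative assumptions, not taken from a shipped config.

```python
# Illustrative config fragment; values are assumptions, not a shipped config.
feature_head = dict(
    type='FeatureHead',
    spatial_type='avg',    # average-pool over the spatial dimensions
    temporal_type='avg',   # average-pool over the temporal dimension
    backbone_name='tsm',   # one of the backbone names listed above
    num_segments=8,        # only meaningful for the TSM backbone
)
```

Setting `spatial_type` or `temporal_type` to None instead would keep the corresponding dimension, as described in the parameter list.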

forward(x: Tensor, num_segs: int | None = None, **kwargs) Tensor[source]

Defines the computation performed at every call.

Parameters:
  • x (Tensor) – The input data.

  • num_segs (int) – For 2D backbone. Number of segments into which a video is divided. Defaults to None.

Returns:

The output features after pooling.

Return type:

Tensor

predict_by_feat(feats: Tensor | Tuple[Tensor], data_samples) Tensor[source]

Integrate multi-view features into one tensor.

Parameters:
  • feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

The integrated multi-view features.

Return type:

Tensor

class mmaction.models.heads.GCNHead(num_classes: int, in_channels: int, loss_cls: Dict = {'type': 'CrossEntropyLoss'}, dropout: float = 0.0, average_clips: str = 'prob', init_cfg: Dict | List[Dict] = {'layer': 'Linear', 'std': 0.01, 'type': 'Normal'}, **kwargs)[source]

The classification head for GCN.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • dropout (float) – Probability of dropout layer. Defaults to 0.

  • init_cfg (dict or list[dict]) – Config to control the initialization. Defaults to dict(type='Normal', layer='Linear', std=0.01).

forward(x: Tensor, **kwargs) Tensor[source]

Forward features from the upstream network.

Parameters:

x (torch.Tensor) – Features from the upstream network.

Returns:

Classification scores with shape (B, num_classes).

Return type:

torch.Tensor

class mmaction.models.heads.I3DHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, **kwargs)[source]

Classification head for I3D.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x: Tensor, **kwargs) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() None[source]

Initiate the parameters from scratch.

class mmaction.models.heads.MViTHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, dropout_ratio: float = 0.5, init_std: float = 0.02, init_scale: float = 1.0, with_cls_token: bool = True, **kwargs)[source]

Classification head for Multi-scale ViT.

A PyTorch implementation of MViTv2: Improved Multiscale Vision Transformers for Classification and Detection.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0.5.

  • init_std (float) – Std value for Initiation. Defaults to 0.02.

  • init_scale (float) – Scale factor for Initiation parameters. Defaults to 1.

  • with_cls_token (bool) – Whether the backbone output feature with cls_token. Defaults to True.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x: Tuple[List[Tensor]], **kwargs) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (Tuple[List[Tensor]]) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() None[source]

Initiate the parameters from scratch.

pre_logits(feats: Tuple[List[Tensor]]) Tensor[source]

The process before the final classification head.

The input feats is a tuple of lists of tensors, and each tensor is the feature of a backbone stage.

class mmaction.models.heads.OmniHead(image_classes: int, video_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, image_dropout_ratio: float = 0.2, video_dropout_ratio: float = 0.5, video_nl_head: bool = True, **kwargs)[source]

Classification head for OmniResNet that accepts both image and video inputs.

Parameters:
  • image_classes (int) – Number of image classes to be classified.

  • video_classes (int) – Number of video classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • image_dropout_ratio (float) – Probability of dropout layer for the image head. Defaults to 0.2.

  • video_dropout_ratio (float) – Probability of dropout layer for the video head. Defaults to 0.5.

  • video_nl_head (bool) – If True, use a non-linear head for the video head. Defaults to True.

forward(x: Tensor, **kwargs) Tensor[source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

loss_by_feat(cls_scores: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample]) dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).

  • data_samples (List[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

class mmaction.models.heads.RGBPoseHead(num_classes: int, in_channels: Tuple[int], loss_cls: Dict = {'type': 'CrossEntropyLoss'}, loss_components: List[str] = ['rgb', 'pose'], loss_weights: float | Tuple[float] = 1.0, dropout: float = 0.5, init_std: float = 0.01, **kwargs)[source]

The classification head for RGBPoseConv3D.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (tuple[int]) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • loss_components (list[str]) – The components of the loss. Defaults to ['rgb', 'pose'].

  • loss_weights (float or tuple[float]) – The weights of the losses. Defaults to 1.

  • dropout (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for Initiation. Default: 0.01.

forward(x: Tuple[Tensor]) Dict[source]

Defines the computation performed at every call.

init_weights() None[source]

Initiate the parameters from scratch.

loss(feats: Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) Dict[source]

Perform forward propagation of head and loss calculation on the features of the upstream network.

Parameters:
  • feats (tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

loss_by_feat(cls_scores: Dict[str, Tensor], data_samples: List[ActionDataSample]) Dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (dict[str, torch.Tensor]) – The dict of classification scores.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict
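One way to read `loss_components` and `loss_weights` together: a scalar weight applies to every component, while a tuple gives one weight per component. The sketch below works under that assumption with plain floats standing in for loss tensors. `combine_losses` and the `<component>_loss_cls` key naming are hypothetical and may differ from what the head actually returns.

```python
def combine_losses(component_losses, loss_components=('rgb', 'pose'),
                   loss_weights=1.0):
    """Scale each component's loss by its weight and return a flat dict.

    Hypothetical helper: the `<component>_loss_cls` key naming is an assumption.
    """
    if isinstance(loss_weights, (int, float)):
        # Broadcast a scalar weight to every component.
        loss_weights = (float(loss_weights),) * len(loss_components)
    if len(loss_weights) != len(loss_components):
        raise ValueError("need exactly one weight per loss component")
    return {
        f"{name}_loss_cls": weight * component_losses[name]
        for name, weight in zip(loss_components, loss_weights)
    }


# Per-component losses as plain floats (illustrative values).
losses = combine_losses({'rgb': 0.8, 'pose': 1.2}, loss_weights=(1.0, 0.5))
total = sum(losses.values())
```

Keeping one entry per component in the returned dict lets each branch's loss be logged separately while the total is still their weighted sum.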

loss_by_scores(cls_scores: Tensor, labels: Tensor) Dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (torch.Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).

  • labels (torch.Tensor) – The labels used to calculate the loss.

Returns:

A dictionary of loss components.

Return type:

dict

predict(feats: Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) List[ActionDataSample][source]

Perform forward propagation of head and predict recognition results on the features of the upstream network.

Parameters:
  • feats (tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

list[ActionDataSample]

predict_by_feat(cls_scores: Dict[str, Tensor], data_samples: List[ActionDataSample]) List[ActionDataSample][source]

Transform a batch of output features extracted from the head into prediction results.

Parameters:
  • cls_scores (dict[str, torch.Tensor]) – The dict of classification scores.

  • data_samples (list[ActionDataSample]) – The annotation data of every sample. It usually includes information such as gt_label.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

list[ActionDataSample]

predict_by_scores(cls_scores: Tensor, data_samples: List[ActionDataSample]) Tensor[source]

Transform a batch of output features extracted from the head into prediction results.

Parameters:
  • cls_scores (torch.Tensor) – Classification scores with shape (B*num_segs, num_classes).

  • data_samples (list[ActionDataSample]) – The annotation data of every sample.

Returns:

The averaged classification scores.

Return type:

torch.Tensor

class mmaction.models.heads.SlowFastHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.8, init_std: float = 0.01, **kwargs)[source]

The classification head for SlowFast.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x: Tuple[Tensor], **kwargs) None[source]

Defines the computation performed at every call.

Parameters:

x (tuple[torch.Tensor]) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() None[source]

Initiate the parameters from scratch.

class mmaction.models.heads.TPNHead(*args, **kwargs)[source]

Class head for TPN.

forward(x, num_segs: int | None = None, fcn_test: bool = False, **kwargs) Tensor[source]

Defines the computation performed at every call.

Parameters:
  • x (Tensor) – The input data.

  • num_segs (int, optional) – Number of segments into which a video is divided. Defaults to None.

  • fcn_test (bool) – Whether to apply full convolution (fcn) testing. Defaults to False.

Returns:

The classification scores for input samples.

Return type:

Tensor

class mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]

Class head for TRN.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.

  • hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for Initiation. Default: 0.001.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs, **kwargs)[source]

Defines the computation performed at every call.

Parameters:
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Unused in TRNHead. By default, num_segs equals clip_len * num_clips * num_crops; it is generated automatically in the Recognizer forward phase and is not used by TRN models. The value TRN needs is self.num_segments, a hyperparameter set when building the model.

Returns:

The classification scores for input samples.

Return type:

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.heads.TSMHead(num_classes: int, in_channels: int, num_segments: int = 8, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', consensus: ConfigDict | dict = {'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio: float = 0.8, init_std: float = 0.001, is_shift: bool = True, temporal_pool: bool = False, **kwargs)[source]

Class head for TSM.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict or ConfigDict) – Consensus config dict. Default: dict(type='AvgConsensus', dim=1).

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for Initiation. Default: 0.001.

  • is_shift (bool) – Indicating whether the feature is shifted. Default: True.

  • temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
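As a configuration sketch, the fragment below spells out the defaults from the TSMHead class signature above. `num_classes=400` and `in_channels=2048` are illustrative assumptions that depend on the dataset and the backbone, and the variable name `cls_head` is likewise only an example.

```python
# Illustrative config fragment for a TSM classification head.
cls_head = dict(
    type='TSMHead',
    num_classes=400,    # assumption: depends on the dataset
    in_channels=2048,   # assumption: depends on the backbone output
    num_segments=8,
    loss_cls=dict(type='CrossEntropyLoss'),
    spatial_type='avg',
    consensus=dict(type='AvgConsensus', dim=1),
    dropout_ratio=0.8,
    init_std=0.001,
    is_shift=True,
    temporal_pool=False,
)
```

Only `num_classes` and `in_channels` are required by the signature; the remaining keys can be omitted when the defaults are acceptable.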

forward(x: Tensor, num_segs: int, **kwargs) Tensor[source]

Defines the computation performed at every call.

Parameters:
  • x (Tensor) – The input data.

  • num_segs (int) – Unused in TSMHead. By default, num_segs equals clip_len * num_clips * num_crops; it is generated automatically in the Recognizer forward phase and is not used by TSM models. The value TSM needs is self.num_segments, a hyperparameter set when building the model.

Returns:

The classification scores for input samples.

Return type:

Tensor
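The distinction between num_segs and self.num_segments can be made concrete with a little arithmetic. The sketch below assumes, as the parameter description states, that the recognizer derives num_segs from the sampling settings, and further assumes that the head regroups frame-level rows by its own num_segments before consensus. All numbers and the helper name are illustrative.

```python
def rows_after_consensus(num_rows, num_segments):
    """Number of score rows left after grouping frame rows by num_segments.

    Hypothetical helper: assumes (N * num_segments, C) frame-level scores are
    reshaped to (N, num_segments, C) and reduced over the segment axis.
    """
    if num_rows % num_segments != 0:
        raise ValueError("row count must be divisible by num_segments")
    return num_rows // num_segments


# Illustrative sampling settings (assumptions, not library defaults).
clip_len, num_clips, num_crops = 1, 8, 3
batch_size = 2

# Value the recognizer would pass as num_segs, per the description above.
num_segs = clip_len * num_clips * num_crops   # 24

# Frame-level rows reaching the head for the whole batch.
num_rows = batch_size * num_segs              # 48

# The head groups rows by its own hyperparameter rather than by num_segs.
rows_out = rows_after_consensus(num_rows, num_segments=8)  # 6
```

Under these assumptions the head returns one row per sample and crop (batch_size * num_crops), which is why the passed-in num_segs value is not needed inside the head.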

init_weights() None[source]

Initiate the parameters from scratch.

class mmaction.models.heads.TSNAudioHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.4, init_std: float = 0.01, **kwargs)[source]

Classification head for TSN on audio.

Parameters