mmaction.apis

mmaction.apis.detection_inference(det_config: str | Path | Config | Module, det_checkpoint: str, frame_paths: List[str], det_score_thr: float = 0.9, det_cat_id: int = 0, device: str | device = 'cuda:0', with_score: bool = False) -> tuple [source]

Detect human boxes given frame paths.

Parameters:
  • det_config (Union[str, Path, mmengine.Config, torch.nn.Module]) – Det config file path or detection model object. It can be a Path, a config object, or a module object.

  • det_checkpoint (str) – Checkpoint path/url.

  • frame_paths (List[str]) – The paths of frames to do detection inference.

  • det_score_thr (float) – The threshold of human detection score. Defaults to 0.9.

  • det_cat_id (int) – The category id for human detection. Defaults to 0.

  • device (Union[str, torch.device]) – The desired device of returned tensor. Defaults to 'cuda:0'.

  • with_score (bool) – Whether to append detection score after box. Defaults to False.

Returns:

A tuple of two items. List[np.ndarray]: List of detected human boxes. List[DetDataSample]: List of data samples, generally used to visualize data.

Return type:

tuple
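
To make the roles of det_score_thr, det_cat_id and with_score concrete, here is a minimal pure-Python sketch of the kind of filtering these parameters describe. It is not the library's implementation: the filter_human_boxes helper, the (label, box, score) input layout, and the choice to keep scores equal to the threshold are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (category id, [x1, y1, x2, y2], score)
Detection = Tuple[int, Sequence[float], float]


def filter_human_boxes(detections: List[Detection],
                       det_score_thr: float = 0.9,
                       det_cat_id: int = 0,
                       with_score: bool = False) -> List[List[float]]:
    """Keep boxes of one category whose score reaches the threshold."""
    kept = []
    for label, box, score in detections:
        if label != det_cat_id or score < det_score_thr:
            continue
        row = list(box)
        if with_score:
            # Append the detection score after the four box coordinates.
            row.append(score)
        kept.append(row)
    return kept


frame_dets = [
    (0, [10.0, 20.0, 110.0, 220.0], 0.97),  # confident person
    (0, [5.0, 5.0, 50.0, 60.0], 0.42),      # low-score person, dropped
    (2, [0.0, 0.0, 30.0, 30.0], 0.99),      # other category, dropped
]
boxes = filter_human_boxes(frame_dets)
boxes_with_score = filter_human_boxes(frame_dets, with_score=True)
```

With with_score=True each kept row has five values instead of four, which matches the description of the parameter above.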

mmaction.apis.inference_recognizer(model: Module, video: str | dict, test_pipeline: Compose | None = None) -> ActionDataSample [source]

Run inference on a video with the recognizer.

Parameters:
  • model (nn.Module) – The loaded recognizer.

  • video (Union[str, dict]) – The video file path or the results dictionary (the input of pipeline).

  • test_pipeline (Compose, optional) – The test pipeline. If not specified, the test pipeline in the config will be used. Defaults to None.

Returns:

The inference results. Specifically, the predicted scores are saved at result.pred_score.

Return type:

ActionDataSample

mmaction.apis.inference_skeleton(model: Module, pose_results: List[dict], img_shape: Tuple[int], test_pipeline: Compose | None = None) -> ActionDataSample [source]

Run inference on pose results with the skeleton recognizer.

Parameters:
  • model (nn.Module) – The loaded recognizer.

  • pose_results (List[dict]) – The pose estimation results dictionary (the results of pose_inference).

  • img_shape (Tuple[int]) – The original image shape used for skeleton recognizer inference.

  • test_pipeline (Compose, optional) – The test pipeline. If not specified, the test pipeline in the config will be used. Defaults to None.

Returns:

The inference results. Specifically, the predicted scores are saved at result.pred_score.

Return type:

ActionDataSample

mmaction.apis.init_recognizer(config: str | Path | Config, checkpoint: str | None = None, device: str | device = 'cuda:0') -> Module [source]

Initialize a recognizer from config file.

Parameters:
  • config (str or Path or mmengine.Config) – Config file path, Path or the config object.

  • checkpoint (str, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Defaults to None.

  • device (str | torch.device) – The desired device of returned tensor. Defaults to 'cuda:0'.

Returns:

The constructed recognizer.

Return type:

nn.Module
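
As a hedged usage sketch, init_recognizer and inference_recognizer can be chained as follows. The signatures and the result.pred_score attribute come from this reference; the classify_video wrapper and the choice of device='cpu' are illustrative assumptions. The import sits inside the function so the snippet can be loaded without mmaction installed.

```python
def classify_video(config_path: str, checkpoint_path: str, video_path: str,
                   device: str = 'cpu'):
    """Build a recognizer and return the predicted scores for one video."""
    # Imported lazily; mmaction2 must be installed when this is called.
    from mmaction.apis import inference_recognizer, init_recognizer

    # Build the model from a config and (optionally) load weights.
    model = init_recognizer(config_path, checkpoint_path, device=device)
    # No test_pipeline is passed, so the one in the config is used.
    result = inference_recognizer(model, video_path)
    return result.pred_score
```

A call such as classify_video('path/to/config.py', 'path/to/checkpoint.pth', 'path/to/video.mp4') would return the scores; those paths are placeholders, not files shipped with the library.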

mmaction.apis.pose_inference(pose_config: str | Path | Config | Module, pose_checkpoint: str, frame_paths: List[str], det_results: List[ndarray], device: str | device = 'cuda:0') -> tuple [source]

Perform Top-Down pose estimation.

Parameters:
  • pose_config (Union[str, Path, mmengine.Config, torch.nn.Module]) – Pose config file path or pose model object. It can be a Path, a config object, or a module object.

  • pose_checkpoint (str) – Checkpoint path/url.

  • frame_paths (List[str]) – The paths of frames to do pose inference.

  • det_results (List[np.ndarray]) – List of detected human boxes.

  • device (Union[str, torch.device]) – The desired device of returned tensor. Defaults to 'cuda:0'.

Returns:

A tuple of two items. List[List[Dict[str, np.ndarray]]]: List of pose estimation results. List[PoseDataSample]: List of data samples, generally used to visualize data.

Return type:

tuple

mmaction.datasets

datasets

class mmaction.datasets.AVADataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], exclude_file: str | None = None, label_file: str | None = None, filename_tmpl: str = 'img_{:05}.jpg', start_index: int = 1, proposal_file: str | None = None, person_det_score_thr: float = 0.9, num_classes: int = 81, custom_classes: List[int] | None = None, data_prefix: ConfigDict | dict = {'img': ''}, modality: str = 'RGB', test_mode: bool = False, num_max_proposals: int = 1000, timestamp_start: int = 900, timestamp_end: int = 1800, use_frames: bool = True, fps: int = 30, multilabel: bool = True, **kwargs) [source]

STAD dataset for spatial temporal action detection.

The dataset loads raw frames/video files, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.

This dataset can load information from the following files:

ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt /
              ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl

In particular, the proposal_file is a pickle file whose keys are img_key strings (in the format {video_id},{timestamp}). Example of a pickle file:

{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011   , 0.157   , 0.655   , 0.983   , 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054   , 0.088   , 0.91    , 0.998   , 0.068273],
               [0.016   , 0.161   , 0.519   , 0.974   , 0.984025],
               [0.493   , 0.283   , 0.981   , 0.984   , 0.983621]]),
    ...
}
Parameters:
  • ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.

  • exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Defaults to None.

  • filename_tmpl (str) – Template for each filename. Defaults to ‘img_{:05}.jpg’.

  • start_index (int) – Specify a start index for frames in consideration of different filename formats. It should be set to 1 for AVA, since frame indices start from 1 in the AVA dataset. Defaults to 1.

  • proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Defaults to None.

  • person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used. Default: 0.9.

  • num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)

  • custom_classes (List[int], optional) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.

  • data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • num_max_proposals (int) – Max proposals number to store. Defaults to 1000.

  • timestamp_start (int) – The start point of included timestamps. The default value is taken from the official website. Defaults to 900.

  • timestamp_end (int) – The end point of included timestamps. The default value is taken from the official website. Defaults to 1800.

  • use_frames (bool) – Whether to use rawframes as input. Defaults to True.

  • fps (int) – Overrides the default FPS for the dataset. If set to 1, means counting timestamp by frame, e.g. MultiSports dataset. Otherwise by second. Defaults to 30.

  • multilabel (bool) – Determines whether it is a multilabel recognition task. Defaults to True.
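
The interaction between proposal_file, person_det_score_thr and num_max_proposals can be sketched in plain Python. This is an illustration under stated assumptions, not the dataset's actual code: select_proposals and make_img_key are hypothetical helpers, treating a score equal to the threshold as passing is an assumption, and the 4-digit zero padding of the timestamp is inferred from the pickle example above.

```python
from typing import Dict, List

Proposal = List[float]  # [x1, y1, x2, y2, score], as in the pickle example


def select_proposals(proposals: List[Proposal],
                     person_det_score_thr: float = 0.9,
                     num_max_proposals: int = 1000) -> List[Proposal]:
    """Keep proposals that reach the threshold, with a fallback."""
    assert 0 <= person_det_score_thr <= 1
    kept = [p for p in proposals if p[4] >= person_det_score_thr]
    if not kept and proposals:
        # No proposal passes: fall back to the single highest-scoring one.
        kept = [max(proposals, key=lambda p: p[4])]
    # Cap the number of stored proposals.
    return kept[:num_max_proposals]


def make_img_key(video_id: str, timestamp: int) -> str:
    """Build a key such as '0f39OWEqJ24,0902' (4-digit padding assumed)."""
    return f'{video_id},{timestamp:04d}'


proposal_dict: Dict[str, List[Proposal]] = {
    '0f39OWEqJ24,0902': [[0.011, 0.157, 0.655, 0.983, 0.998163]],
    '0f39OWEqJ24,0912': [[0.054, 0.088, 0.91, 0.998, 0.068273],
                         [0.016, 0.161, 0.519, 0.974, 0.984025],
                         [0.493, 0.283, 0.981, 0.984, 0.983621]],
}
selected = select_proposals(proposal_dict[make_img_key('0f39OWEqJ24', 912)])
fallback = select_proposals([[0.1, 0.1, 0.5, 0.5, 0.3],
                             [0.2, 0.2, 0.6, 0.6, 0.7]])
```

In the second call no proposal reaches 0.9, so only the best-scoring box is returned, which mirrors the fallback rule described for person_det_score_thr.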

filter_data() -> List[dict] [source]

Filter out records in the exclude_file.

get_data_info(idx: int) -> dict [source]

Get annotation by index.

load_data_list() -> List[dict] [source]

Load AVA annotations.

parse_img_record(img_records: List[dict]) -> tuple [source]

Merge image records of the same entity at the same time.

Parameters:

img_records (List[dict]) – List of img_records (lines in AVA annotations).

Returns:

A tuple consisting of lists of bboxes, action labels and entity_ids.

Return type:

Tuple(list)

class mmaction.datasets.AVAKineticsDataset(ann_file: str, exclude_file: str, pipeline: List[ConfigDict | dict | Callable], label_file: str, filename_tmpl: str = 'img_{:05}.jpg', start_index: int = 0, proposal_file: str | None = None, person_det_score_thr: float = 0.9, num_classes: int = 81, custom_classes: List[int] | None = None, data_prefix: ConfigDict | dict = {'img': ''}, modality: str = 'RGB', test_mode: bool = False, num_max_proposals: int = 1000, timestamp_start: int = 900, timestamp_end: int = 1800, fps: int = 30, **kwargs) [source]

AVA-Kinetics dataset for spatial temporal detection.

Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.

This dataset can load information from the following files:

ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt /
              ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl

In particular, the proposal_file is a pickle file whose keys are img_key strings (in the format {video_id},{timestamp}). Example of a pickle file:

{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011   , 0.157   , 0.655   , 0.983   , 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054   , 0.088   , 0.91    , 0.998   , 0.068273],
               [0.016   , 0.161   , 0.519   , 0.974   , 0.984025],
               [0.493   , 0.283   , 0.981   , 0.984   , 0.983621]]),
    ...
}
Parameters:
  • ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.

  • exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt.

  • filename_tmpl (str) – Template for each filename. Defaults to ‘img_{:05}.jpg’.

  • start_index (int) – Specify a start index for frames in consideration of different filename formats. However, when taking frames as input, it should be set to 0, since frames count from 0. Defaults to 0.

  • proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Defaults to None.

  • person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used. Default: 0.9.

  • num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)

  • custom_classes (List[int], optional) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.

  • data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • num_max_proposals (int) – Max proposals number to store. Defaults to 1000.

  • timestamp_start (int) – The start point of included timestamps. The default value is taken from the official website. Defaults to 900.

  • timestamp_end (int) – The end point of included timestamps. The default value is taken from the official website. Defaults to 1800.

  • fps (int) – Overrides the default FPS for the dataset. Defaults to 30.
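
The constraint between custom_classes and num_classes is easy to get wrong, so here is a small config sketch. It assumes the dict-with-type config style commonly used by OpenMMLab projects, which this reference does not itself describe; the file paths are placeholders and the class ids are an arbitrary example subset.

```python
# Hypothetical subset of class ids; 0 must NOT be selected.
custom_classes = [11, 12, 14, 15, 79, 80]

dataset_cfg = dict(
    type='AVAKineticsDataset',  # assumed registry name, same as the class
    ann_file='path/to/ava_train_v2.2.csv',  # placeholder paths
    exclude_file='path/to/ava_train_excluded_timestamps_v2.2.csv',
    label_file='path/to/ava_action_list_v2.2.pbtxt',
    proposal_file=None,
    pipeline=[],  # to be filled with data transforms
    data_prefix=dict(img='path/to/rawframes'),
    person_det_score_thr=0.9,
    custom_classes=custom_classes,
    # One extra slot is reserved on top of the selected classes.
    num_classes=len(custom_classes) + 1,
)
```

Deriving num_classes from len(custom_classes) keeps the two values consistent when the subset changes.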

filter_data() -> List[dict] [source]

Filter out records in the exclude_file.

get_data_info(idx: int) -> dict [source]

Get annotation by index.

load_data_list() -> List[dict] [source]

Load AVA annotations.

parse_img_record(img_records: List[dict]) -> tuple [source]

Merge image records of the same entity at the same time.

Parameters:

img_records (List[dict]) – List of img_records (lines in AVA annotations).

Returns:

A tuple consisting of lists of bboxes, action labels and entity_ids.

Return type:

Tuple(list)

class mmaction.datasets.ActivityNetDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict | None = {'video': ''}, test_mode: bool = False, **kwargs) [source]

ActivityNet dataset for temporal action localization. The dataset loads raw features and applies specified transforms to return a dict containing the frame tensors and other information. The ann_file is a json file with multiple objects. Each object is keyed by the name of a video, and its value holds the total frames of the video, total seconds of the video, annotations of the video, feature frames (frames covered by features) of the video, fps and rfps. Example of an annotation file:
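
As a rough sketch of that structure, the snippet below builds one such entry in Python and round-trips it through json. The key names (duration_frame, duration_second, annotations, feature_frame, fps, rfps), the video name and all values are assumptions chosen for illustration; only the list of fields comes from the description above.

```python
import json

# Field names are hypothetical; the description above only lists what each
# entry holds, not the exact keys.
annotation = {
    'v_example_video': {
        'duration_frame': 6337,      # total frames of the video
        'duration_second': 211.53,   # total seconds of the video
        'annotations': [             # labelled temporal segments
            {'segment': [30.0, 205.2], 'label': 'Rock climbing'},
        ],
        'feature_frame': 6336,       # frames covered by features
        'fps': 30.0,
        'rfps': 29.96,
    },
}

# Serialize as it would appear on disk, then load it back.
text = json.dumps(annotation, indent=4)
loaded = json.loads(text)
video_names = list(loaded)
```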

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

load_data_list() -> List[dict] [source]

Load annotation file to get video information.

class mmaction.datasets.AudioDataset(ann_file: str, pipeline: List[Dict | Callable], data_prefix: Dict = {'audio': ''}, multi_class: bool = False, num_classes: int | None = None, **kwargs) [source]

Audio dataset for action recognition.

The ann_file is a text file with multiple lines, and each line indicates a sample audio or extracted audio feature with the filepath, total frames of the raw video and label, which are split with a whitespace. Example of an annotation file:

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (dict) – Path to a directory where audios are held. Defaults to dict(audio='').

  • multi_class (bool) – Determines whether it is a multi-class recognition dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes in the dataset. Defaults to None.

load_data_list() -> List[Dict] [source]

Load annotation file to get audio information.

class mmaction.datasets.BaseActionDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]

Base class for datasets.

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict, optional) – Path to a directory where videos are held. Defaults to dict(prefix='').

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.

  • modality (str) – Modality of data. Support RGB, Flow, Pose, Audio. Defaults to RGB.

get_data_info(idx: int) -> dict [source]

Get annotation by index.

class mmaction.datasets.CharadesSTADataset(ann_file: str, pipeline: List[dict | Callable], word2id_file: str, fps_file: str, duration_file: str, num_frames_file: str, window_size: int, ft_overlap: float, data_prefix: ConfigDict | dict | None = {'video': ''}, test_mode: bool = False, **kwargs) [source]

get_data_info(idx: int) -> dict [source]

Get annotation by index.

load_data_list() -> List[dict] [source]

Load annotation file to get video information.

class mmaction.datasets.MSRVTTRetrieval(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]

MSR-VTT Retrieval dataset.

load_data_list() -> List[Dict] [source]

Load annotation file to get video information.

class mmaction.datasets.MSRVTTVQA(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]

MSR-VTT Video Question Answering dataset.

load_data_list() -> List[Dict] [source]

Load annotation file to get video information.

class mmaction.datasets.MSRVTTVQAMC(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]

MSR-VTT VQA multiple choices dataset.

load_data_list() -> List[Dict] [source]

Load annotation file to get video information.

class mmaction.datasets.PoseDataset(ann_file: str, pipeline: List[Dict | Callable], split: str | None = None, valid_ratio: float | None = None, box_thr: float = 0.5, **kwargs) [source]

Pose dataset for action recognition.

The dataset loads pose and applies specified transforms to return a dict containing pose information.

The ann_file is a pickle file that contains a list of annotations. The fields of an annotation include frame_dir (video_id), total_frames, label, kp and kpscore.

Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • split (str, optional) – The dataset split used. For UCF101 and HMDB51, allowed choices are ‘train1’, ‘test1’, ‘train2’, ‘test2’, ‘train3’, ‘test3’. For NTURGB+D, allowed choices are ‘xsub_train’, ‘xsub_val’, ‘xview_train’, ‘xview_val’. For NTURGB+D 120, allowed choices are ‘xsub_train’, ‘xsub_val’, ‘xset_train’, ‘xset_val’. For FineGYM, allowed choices are ‘train’, ‘val’. Defaults to None.

  • valid_ratio (float, optional) – The valid_ratio for videos in KineticsPose. For a video with n frames, it is a valid training sample only if n * valid_ratio frames have human pose. None means not applicable (only applicable to Kinetics Pose). Defaults to None.

  • box_thr (float) – The threshold for human proposals. Only boxes with confidence score larger than box_thr are kept. None means not applicable (only applicable to Kinetics). Allowed choices are 0.5, 0.6, 0.7, 0.8, 0.9. Defaults to 0.5.
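
The valid_ratio rule can be sketched as a simple predicate. This is a hedged illustration, not the dataset's code: is_valid_sample and the frames_with_pose field are hypothetical, and reading the rule as "at least n * valid_ratio frames" is an assumption.

```python
from typing import List, Optional


def is_valid_sample(total_frames: int, frames_with_pose: int,
                    valid_ratio: Optional[float]) -> bool:
    """Return True if enough frames of the video contain a human pose."""
    if valid_ratio is None:
        return True  # filtering is not applicable
    return frames_with_pose >= total_frames * valid_ratio


# Hypothetical per-video statistics.
videos = [
    {'frame_dir': 'vid_a', 'total_frames': 100, 'frames_with_pose': 35},
    {'frame_dir': 'vid_b', 'total_frames': 100, 'frames_with_pose': 10},
]
kept: List[str] = [
    v['frame_dir'] for v in videos
    if is_valid_sample(v['total_frames'], v['frames_with_pose'], 0.2)
]
```

With valid_ratio=0.2, a 100-frame video needs at least 20 frames with a detected pose, so only vid_a is kept.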

filter_data() -> List[Dict] [source]

Filter out invalid samples.

get_data_info(idx: int) -> Dict [source]

Get annotation by index.

load_data_list() -> List[Dict] [source]

Load annotation file to get skeleton information.

class mmaction.datasets.RawframeDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict = {'img': ''}, filename_tmpl: str = 'img_{:05}.jpg', with_offset: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 1, modality: str = 'RGB', test_mode: bool = False, **kwargs) [source]

Rawframe dataset for action recognition.

The dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines, and each line indicates the directory to frames of a video, total frames of the video and the label of a video, which are split with a whitespace. Example of an annotation file:

some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3

Example of a multi-class annotation file:

some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3

Example of a with_offset annotation file (clips from long videos), each line indicates the directory to frames of a video, the index of the start frame, total frames of the video clip and the label of a video clip, which are split with a whitespace.

some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3
Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where video frames are held. Defaults to dict(img='').

  • filename_tmpl (str) – Template for each filename. Defaults to img_{:05}.jpg.

  • with_offset (bool) – Determines whether the offset information is in ann_file. Defaults to False.

  • multi_class (bool) – Determines whether it is a multi-class recognition dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes in the dataset. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking frames as input, it should be set to 1, since raw frames count from 1. Defaults to 1.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.
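
The three annotation layouts shown above differ only in which columns are present. The following sketch parses each of them; parse_rawframe_line and the output keys (frame_dir, offset, total_frames, label) are hypothetical names chosen for illustration and are not taken from the library.

```python
from typing import Dict, List


def parse_rawframe_line(line: str, with_offset: bool = False,
                        multi_class: bool = False) -> Dict:
    """Parse one whitespace-separated annotation line into a dict."""
    parts = line.split()
    info: Dict = {'frame_dir': parts[0]}
    idx = 1
    if with_offset:
        # Index of the start frame inside the long video.
        info['offset'] = int(parts[idx])
        idx += 1
    info['total_frames'] = int(parts[idx])
    idx += 1
    # Everything that remains is one or more class labels.
    labels: List[int] = [int(x) for x in parts[idx:]]
    info['label'] = labels if multi_class else labels[0]
    return info


plain = parse_rawframe_line('some/directory-1 163 1')
multi = parse_rawframe_line('some/directory-4 234 2 4 6 8', multi_class=True)
clip = parse_rawframe_line('some/directory-2 213 122 4', with_offset=True)

# filename_tmpl turns a frame index into a file name.
first_frame = 'img_{:05}.jpg'.format(1)
```

The last line shows how the default filename_tmpl expands for frame index 1, which is also the default start_index for raw frames.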

get_data_info(idx: int) -> dict [source]

Get annotation by index.

load_data_list() -> List[dict] [source]

Load annotation file to get video information.

class mmaction.datasets.RepeatAugDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict = {'video': ''}, num_repeats: int = 4, sample_once: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]

Video dataset for action recognition using repeat augment. https://arxiv.org/pdf/1901.09335.pdf.

The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath and label, which are split with a whitespace. Example of an annotation file:

some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').

  • num_repeats (int) – Number of times each video is repeated in a batch. Defaults to 4.

  • sample_once (bool) – Determines whether to use the same frame indices for repeated samples. Defaults to False.

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.

  • modality (str) – Modality of data. Support RGB, Flow. Defaults to RGB.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

prepare_data(idx) -> List[dict] [source]

Get data processed by self.pipeline.

This reduces repeated video loading and decompressing.

Parameters:

idx (int) – The index of data_info.

Returns:

A list of length num_repeats.

Return type:

List[dict]

class mmaction.datasets.VideoDataset(ann_file: str, pipeline: List[dict | Callable], data_prefix: ConfigDict | dict = {'video': ''}, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', test_mode: bool = False, delimiter: str = ' ', **kwargs) [source]

Video dataset for action recognition.

The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath and label, which are split with a whitespace. Example of an annotation file:

some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
Parameters:
  • ann_file (str) – Path to the annotation file.

  • pipeline (List[Union[dict, ConfigDict, Callable]]) – A sequence of data transforms.

  • data_prefix (dict or ConfigDict) – Path to a directory where videos are held. Defaults to dict(video='').

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Defaults to False.

  • num_classes (int, optional) – Number of classes of the dataset, used in multi-class datasets. Defaults to None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0.

  • modality (str) – Modality of data. Support 'RGB', 'Flow'. Defaults to 'RGB'.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • delimiter (str) – Delimiter for the annotation file. Defaults to ' ' (whitespace).
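
The delimiter parameter matters when file paths themselves contain spaces. Below is a minimal sketch of how a line could be split with a configurable delimiter; parse_video_line and its output keys are hypothetical and do not reproduce the dataset's internal parsing.

```python
from typing import Dict, List


def parse_video_line(line: str, delimiter: str = ' ',
                     multi_class: bool = False) -> Dict:
    """Split one annotation line into a file path and its label(s)."""
    parts = line.strip().split(delimiter)
    filename, raw_labels = parts[0], parts[1:]
    labels: List[int] = [int(x) for x in raw_labels]
    return {'filename': filename,
            'label': labels if multi_class else labels[0]}


# Default whitespace delimiter, as in the example annotation file above.
a = parse_video_line('some/path/000.mp4 1')
# A comma delimiter allows paths that contain spaces.
b = parse_video_line('some/path with space/001.mp4,2', delimiter=',')
```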

load_data_list() -> List[dict] [source]

Load annotation file to get video information.

class mmaction.datasets.VideoTextDataset(ann_file: str, pipeline: List[ConfigDict | dict | Callable], data_prefix: ConfigDict | dict | None = {'prefix': ''}, test_mode: bool = False, multi_class: bool = False, num_classes: int | None = None, start_index: int = 0, modality: str = 'RGB', **kwargs) [source]

Video dataset for video-text task like video retrieval.

load_data_list() -> List[Dict] [source]

Load annotation file to get video information.

transforms

class mmaction.datasets.transforms.ArrayDecode [source]

Load and decode frames with given indices from a 4D array.

Required keys are “array” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

transform(results) [source]

Perform the ArrayDecode to pick frames given indices.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.AudioFeatureSelector(fixed_length: int = 128) [source]

Sample the audio feature w.r.t. the frames selected.

Required Keys:

  • audios

  • frame_inds

  • num_clips

  • length

  • total_frames

Modified Keys:

  • audios

Added Keys:

  • audios_shape

Parameters:

fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Defaults to 128.
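
The truncate-or-pad behaviour described for fixed_length can be sketched as follows. This is an illustration only: fix_length is a hypothetical helper, features are modelled as plain lists of rows, and padding with zeros at the end is an assumption about how the padding is done.

```python
from typing import List


def fix_length(features: List[List[float]], fixed_length: int = 128,
               pad_value: float = 0.0) -> List[List[float]]:
    """Truncate or pad a sequence of feature rows to fixed_length rows."""
    if len(features) >= fixed_length:
        # Too long (or exact): keep only the first fixed_length rows.
        return features[:fixed_length]
    width = len(features[0]) if features else 0
    missing = fixed_length - len(features)
    # Too short: append rows filled with pad_value.
    padding = [[pad_value] * width for _ in range(missing)]
    return features + padding


short = fix_length([[1.0, 2.0]] * 3, fixed_length=5)
long = fix_length([[1.0, 2.0]] * 8, fixed_length=5)
```

Either way the result has exactly fixed_length rows, so clips selected from different frame ranges end up with the same shape.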

transform(results: Dict) -> Dict [source]

Perform the AudioFeatureSelector to pick audio feature clips.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.BuildPseudoClip(clip_len) [source]

Build pseudo clips with one single image by repeating it n times.

Required key is “imgs”; added or modified keys are “imgs”, “num_clips” and “clip_len”.

Parameters:

clip_len (int) – Frames of the generated pseudo clips.

transform(results) [source]

Perform the building of pseudo clips.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.CLIPTokenize [source]

Tokenize text and convert to tensor.

transform(results: Dict) -> Dict [source]

The transform function of CLIPTokenize.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.CenterCrop(crop_size, lazy=False) [source]

Crop the center area from images.

Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox”, “lazy” and “img_shape”. The required key in “lazy” is “crop_bbox”; the added or modified key is “crop_bbox”.

Parameters:
  • crop_size (int | tuple[int]) – (w, h) of crop size.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.
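
To illustrate what a center crop computes, the sketch below derives the crop box from an image shape and a crop_size. It is a minimal sketch under assumptions: center_crop_bbox is a hypothetical helper, img_shape is assumed to be (h, w), and the (left, top, right, bottom) ordering of the returned box is chosen for illustration.

```python
from typing import Tuple, Union


def center_crop_bbox(img_shape: Tuple[int, int],
                     crop_size: Union[int, Tuple[int, int]]
                     ) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered crop.

    img_shape is assumed to be (h, w); crop_size is (w, h) or one int.
    """
    if isinstance(crop_size, int):
        # A single int means a square crop.
        crop_size = (crop_size, crop_size)
    img_h, img_w = img_shape
    crop_w, crop_h = crop_size
    left = (img_w - crop_w) // 2
    top = (img_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


bbox = center_crop_bbox((256, 340), 224)
```

For a 256 x 340 (h x w) image and crop_size=224 the box is 224 pixels wide and tall, with equal margins on opposite sides.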

transform(results) [source]

Performs the CenterCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1) [source]

Perform ColorJitter to each img.

Required keys are “imgs”, added or modified keys are “imgs”.

Parameters:
  • brightness (float | tuple[float]) – The jitter range for brightness, if set as a float, the range will be (1 - brightness, 1 + brightness). Default: 0.5.

  • contrast (float | tuple[float]) – The jitter range for contrast, if set as a float, the range will be (1 - contrast, 1 + contrast). Default: 0.5.

  • saturation (float | tuple[float]) – The jitter range for saturation, if set as a float, the range will be (1 - saturation, 1 + saturation). Default: 0.5.

  • hue (float | tuple[float]) – The jitter range for hue, if set as a float, the range will be (-hue, hue). Default: 0.1.
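
The float-to-range rule stated for each parameter can be written down directly. In this sketch jitter_range is a hypothetical helper, and drawing the factor uniformly from the range is an assumption about how the range is used, not something this reference states.

```python
import random
from typing import Tuple, Union

Range = Union[float, Tuple[float, float]]


def jitter_range(value: Range, center: float = 1.0) -> Tuple[float, float]:
    """Expand a float into (center - value, center + value)."""
    if isinstance(value, (int, float)):
        return (center - value, center + value)
    # A tuple is taken as an explicit (low, high) range.
    return (float(value[0]), float(value[1]))


brightness_range = jitter_range(0.5)       # (1 - 0.5, 1 + 0.5)
hue_range = jitter_range(0.1, center=0.0)  # (-hue, hue)
custom_range = jitter_range((0.8, 1.2))    # explicit range kept as-is

# Assumed usage: draw one factor uniformly from the range.
rng = random.Random(0)
brightness_factor = rng.uniform(*brightness_range)
```

Brightness, contrast and saturation are centered on 1, while hue is centered on 0, which is why the hue range is symmetric around zero.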

transform(results) [source]

Perform ColorJitter.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.DecompressPose(squeeze: bool = True, max_person: int = 10) [source]

Load Compressed Pose.

Required Keys:

  • frame_inds

  • total_frames

  • keypoint

  • anno_inds (optional)

Modified Keys:

  • keypoint

  • frame_inds

Added Keys:

  • keypoint_score

  • num_person

Parameters:
  • squeeze (bool) – Whether to remove frames with no human pose. Defaults to True.

  • max_person (int) – The max number of persons in a frame. Defaults to 10.

transform(results: Dict) -> Dict [source]

Perform the pose decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.DecordDecode(mode: str = 'accurate') [source]

Using decord to decode the video.

Decord: https://github.com/dmlc/decord

Required Keys:

  • video_reader

  • frame_inds

Added Keys:

  • imgs

  • original_shape

  • img_shape

Parameters:

mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Defaults to 'accurate'.

transform(results: Dict) → Dict[source]

Perform the Decord decoding.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.DecordInit(io_backend: str = 'disk', num_threads: int = 1, **kwargs)[source]

Using decord to initialize the video_reader.

Decord: https://github.com/dmlc/decord

Required Keys:

  • filename

Added Keys:

  • video_reader

  • total_frames

  • fps

Parameters:
  • io_backend (str) – I/O backend where frames are stored. Defaults to 'disk'.

  • num_threads (int) – Number of threads used to decode the video. Defaults to 1.

  • kwargs (dict) – Args for file client.

transform(results: Dict) → Dict[source]

Perform the Decord initialization.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
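Because DecordDecode needs the video_reader added by DecordInit and the frame_inds added by a frame sampler, the order of the three steps matters. The sketch below shows one possible ordering as a config fragment and checks it against the key lists in this reference. The argument values are arbitrary, `missing_keys` is a hypothetical helper, and start_index is assumed to be supplied by the dataset.

```python
# One possible ordering: initialise the reader, sample indices, then decode.
pipeline = [
    dict(type='DecordInit', io_backend='disk', num_threads=1),
    dict(type='DenseSampleFrames', clip_len=8, frame_interval=8, num_clips=1),
    dict(type='DecordDecode', mode='accurate'),
]

# (required keys, added keys) per transform, copied from this reference.
KEYS = {
    'DecordInit': ({'filename'}, {'video_reader', 'total_frames', 'fps'}),
    'DenseSampleFrames': ({'total_frames', 'start_index'},
                          {'frame_inds', 'clip_len', 'frame_interval',
                           'num_clips'}),
    'DecordDecode': ({'video_reader', 'frame_inds'},
                     {'imgs', 'original_shape', 'img_shape'}),
}


def missing_keys(steps, initial_keys):
    """Return {transform name: missing required keys} for a given ordering."""
    available = set(initial_keys)
    problems = {}
    for step in steps:
        required, added = KEYS[step['type']]
        if not required <= available:
            problems[step['type']] = required - available
        available |= added
    return problems


ok = missing_keys(pipeline, {'filename', 'start_index'})
bad = missing_keys(pipeline[::-1], {'filename', 'start_index'})
```

With the ordering above nothing is missing, while the reversed ordering reports the keys that DecordDecode and the sampler cannot find.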

class mmaction.datasets.transforms.DenseSampleFrames(*args, sample_range: int = 64, num_sample_positions: int = 10, **kwargs)[source]

Select frames from the video by dense sample strategy.

Required keys:

  • total_frames

  • start_index

Added keys:

  • frame_inds

  • clip_len

  • frame_interval

  • num_clips

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.

  • num_clips (int) – Number of clips to be sampled. Defaults to 1.

  • sample_range (int) – Total sample range for dense sample. Defaults to 64.

  • num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Defaults to 10. That is to say, by default, there are at least 10 clips for one input sample in test mode.

  • temporal_jitter (bool) – Whether to apply temporal jittering. Defaults to False.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

transform(results: dict) → dict[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
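The following is a minimal sketch of one plausible reading of the dense sample strategy: clips are spread evenly inside a window of sample_range frames, and the window start is random in training or taken from num_sample_positions evenly spaced positions in test mode. `dense_sample_offsets` is a hypothetical function, and the library's exact rounding and index handling may differ.

```python
import random


def dense_sample_offsets(total_frames, num_clips=1, sample_range=64,
                         num_sample_positions=10, test_mode=False, rng=None):
    """Return the start offset of every sampled clip."""
    rng = rng or random.Random()
    # Number of valid start positions for a window of sample_range frames.
    num_positions = max(1, 1 + total_frames - sample_range)
    # Clips are spaced evenly inside the window.
    interval = sample_range // num_clips
    base = [i * interval for i in range(num_clips)]
    if test_mode:
        if num_sample_positions == 1:
            starts = [0]
        else:
            # Evenly spaced window starts, from the first to the last position.
            starts = [k * (num_positions - 1) // (num_sample_positions - 1)
                      for k in range(num_sample_positions)]
    else:
        starts = [rng.randrange(num_positions)]
    # Wrap around so short videos still yield valid indices.
    return [(s + b) % total_frames for s in starts for b in base]


test_offsets = dense_sample_offsets(100, num_clips=4, test_mode=True)
train_offsets = dense_sample_offsets(100, num_clips=4, rng=random.Random(0))
```

In test mode this yields num_sample_positions * num_clips offsets, which is consistent with the note that there are at least 10 clips per input sample by default.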

class mmaction.datasets.transforms.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, left_kp=None, right_kp=None, lazy=False)[source]

Flip the input images with a probability.

Reverse the order of elements in the given imgs with a specific direction. The shape of the imgs is preserved, but the elements are reordered.

Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “lazy” and “flip_direction”. There are no required keys in “lazy”; the added or modified keys are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations, to make sure crop_quadruple is calculated properly.

Parameters:
  • flip_ratio (float) – Probability of implementing flip. Default: 0.5.

  • direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.

  • flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.

  • left_kp (list[int]) – Indexes of left keypoints, used to flip keypoints. Default: None.

  • right_kp (list[int]) – Indexes of right keypoints, used to flip keypoints. Default: None.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the Flip augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
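To make the roles of left_kp and right_kp concrete, here is a minimal NumPy sketch of a horizontal flip applied to both frames and keypoints. `flip_horizontal` is a hypothetical function; the `width - 1 - x` pixel convention and the M x T x V x 2 keypoint layout are assumptions of this example.

```python
import numpy as np


def flip_horizontal(imgs, keypoints, left_kp, right_kp):
    """imgs: list of H x W x C arrays; keypoints: M x T x V x 2 array of (x, y)."""
    width = imgs[0].shape[1]
    # Reverse the column order; the shape of every frame is preserved.
    flipped_imgs = [np.flip(img, axis=1) for img in imgs]

    kps = keypoints.copy()
    kps[..., 0] = width - 1 - kps[..., 0]  # mirror x (assumed convention)
    # Swap left and right joints so that "left" still means the person's left.
    order = np.arange(kps.shape[2])
    order[left_kp] = right_kp
    order[right_kp] = left_kp
    return flipped_imgs, kps[:, :, order]


img = np.arange(6).reshape(2, 3, 1)
# One person, one frame, three joints: nose, left hand, right hand.
kps = np.array([[[[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]]])
f_imgs, f_kps = flip_horizontal([img], kps, left_kp=[1], right_kp=[2])
```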

class mmaction.datasets.transforms.FormatAudioShape(input_format: str)[source]

Format final audio shape to the given input_format.

Required Keys:

  • audios

Modified Keys:

  • audios

Added Keys:

  • input_shape

Parameters:

input_format (str) – Define the final imgs format.

transform(results: Dict) → Dict[source]

Performs the FormatShape formatting.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.FormatGCNInput(num_person: int = 2, mode: str = 'zero')[source]

Format final skeleton shape.

Required Keys:

  • keypoint

  • keypoint_score (optional)

  • num_clips (optional)

Modified Key:

  • keypoint

Parameters:
  • num_person (int) – The maximum number of people. Defaults to 2.

  • mode (str) – The padding mode. Defaults to 'zero'.

transform(results: Dict) → Dict[source]

The transform function of FormatGCNInput.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.FormatShape(input_format: str, collapse: bool = False)[source]

Format final imgs shape to the given input_format.

Required keys:

  • imgs (optional)

  • heatmap_imgs (optional)

  • modality (optional)

  • num_clips

  • clip_len

Modified Keys:

  • imgs

Added Keys:

  • input_shape

  • heatmap_input_shape (optional)

Parameters:
  • input_format (str) – Define the final data format.

  • collapse (bool) – To collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set as True when training and testing detectors. Defaults to False.

transform(results: Dict) → Dict[source]

Performs the FormatShape formatting.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
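As an illustration of what an input_format such as NCTHW and the collapse flag mean in practice, the sketch below rearranges a flat list of H x W x C frames with NumPy. `to_ncthw` is a hypothetical function, and the frame ordering (crops, then clips, then frames within a clip) is an assumption of this example.

```python
import numpy as np


def to_ncthw(imgs, num_clips, clip_len, collapse=False):
    """imgs: sequence of num_crops * num_clips * clip_len frames (H x W x C)."""
    arr = np.asarray(imgs)                      # (N_crops*N_clips*T, H, W, C)
    h, w, c = arr.shape[1:]
    arr = arr.reshape(-1, num_clips, clip_len, h, w, c)
    arr = arr.transpose(0, 1, 5, 2, 3, 4)       # (N_crops, N_clips, C, T, H, W)
    arr = arr.reshape(-1, c, clip_len, h, w)    # (N, C, T, H, W)
    if collapse and arr.shape[0] == 1:
        arr = arr[0]                            # NCTHW -> CTHW when N is 1
    return arr


# 16 frames of size 4 x 6 with 3 channels; frame i is filled with the value i.
frames = [np.full((4, 6, 3), i) for i in range(16)]
batch = to_ncthw(frames, num_clips=2, clip_len=8)
single = to_ncthw(frames[:8], num_clips=1, clip_len=8, collapse=True)
```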

class mmaction.datasets.transforms.Fuse[source]

Fuse lazy operations.

Fusion order:

crop -> resize -> flip

Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.

transform(results)[source]

Fuse lazy operations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.GenSkeFeat(dataset: str = 'nturgb+d', feats: List[str] = ['j'], axis: int = -1)[source]

Unified interface for generating multi-stream skeleton features.

Required Keys:

  • keypoint

  • keypoint_score (optional)

Parameters:
  • dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.

  • feats (list[str]) – The list of the keys of features. Defaults to ['j'].

  • axis (int) – The axis along which the features will be joined. Defaults to -1.

transform(results: Dict) → Dict[source]

The transform function of GenSkeFeat.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.GenerateLocalizationLabels[source]

Load video label for localizer with given video_name list.

Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.

transform(results)[source]

Perform the GenerateLocalizationLabels loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.GeneratePoseTarget(sigma: float = 0.6, use_score: bool = True, with_kp: bool = True, with_limb: bool = False, skeletons: Tuple[Tuple[int]] = ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), double: bool = False, left_kp: Tuple[int] = (1, 3, 5, 7, 9, 11, 13, 15), right_kp: Tuple[int] = (2, 4, 6, 8, 10, 12, 14, 16), left_limb: Tuple[int] = (0, 2, 4, 5, 6, 10, 11, 12), right_limb: Tuple[int] = (1, 3, 7, 8, 9, 13, 14, 15), scaling: float = 1.0)[source]

Generate pseudo heatmaps based on joint coordinates and confidence.

Required Keys:

  • keypoint

  • keypoint_score (optional)

  • img_shape

Added Keys:

  • imgs (optional)

  • heatmap_imgs (optional)

Parameters:
  • sigma (float) – The sigma of the generated gaussian map. Defaults to 0.6.

  • use_score (bool) – Use the confidence score of keypoints as the maximum of the gaussian maps. Defaults to True.

  • with_kp (bool) – Generate pseudo heatmaps for keypoints. Defaults to True.

  • with_limb (bool) – Generate pseudo heatmaps for limbs. At least one of ‘with_kp’ and ‘with_limb’ should be True. Defaults to False.

  • skeletons (tuple[tuple]) – The definition of human skeletons. Defaults to ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), which is the definition of COCO-17p skeletons.

  • double (bool) – Output both original heatmaps and flipped heatmaps. Defaults to False.

  • left_kp (tuple[int]) – Indexes of left keypoints, which is used when flipping heatmaps. Defaults to (1, 3, 5, 7, 9, 11, 13, 15), which is left keypoints in COCO-17p.

  • right_kp (tuple[int]) – Indexes of right keypoints, which is used when flipping heatmaps. Defaults to (2, 4, 6, 8, 10, 12, 14, 16), which is right keypoints in COCO-17p.

  • left_limb (tuple[int]) – Indexes of left limbs, which is used when flipping heatmaps. Defaults to (0, 2, 4, 5, 6, 10, 11, 12), which is left limbs of skeletons we defined for COCO-17p.

  • right_limb (tuple[int]) – Indexes of right limbs, which is used when flipping heatmaps. Defaults to (1, 3, 7, 8, 9, 13, 14, 15), which is right limbs of skeletons we defined for COCO-17p.

  • scaling (float) – The ratio to scale the heatmaps. Defaults to 1.

gen_an_aug(results: Dict) → ndarray[source]

Generate pseudo heatmaps for all frames.

Parameters:

results (dict) – The dictionary that contains all info of a sample.

Returns:

The generated pseudo heatmaps.

Return type:

np.ndarray

generate_a_heatmap(arr: ndarray, centers: ndarray, max_values: ndarray) → None[source]

Generate pseudo heatmap for one keypoint in one frame.

Parameters:
  • arr (np.ndarray) – The array to store the generated heatmaps. Shape: img_h * img_w.

  • centers (np.ndarray) – The coordinates of corresponding keypoints (of multiple persons). Shape: M * 2.

  • max_values (np.ndarray) – The max values of each keypoint. Shape: M.
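A minimal sketch of how such a pseudo heatmap could be drawn is given below: a Gaussian with standard deviation sigma is rendered around each person's keypoint, scaled by its confidence, and merged into arr with an element-wise maximum. `draw_keypoint_heatmap` is a hypothetical function; the 3-sigma window and the eps cutoff are assumptions of this example and not confirmed library behaviour.

```python
import numpy as np


def draw_keypoint_heatmap(arr, centers, max_values, sigma=0.6, eps=1e-4):
    """arr: img_h x img_w array modified in place; centers: M x 2; max_values: M."""
    img_h, img_w = arr.shape
    for (mu_x, mu_y), max_value in zip(centers, max_values):
        if max_value < eps:
            continue  # skip keypoints with (almost) zero confidence
        # Only a small window around the keypoint is evaluated.
        st_x = max(int(mu_x - 3 * sigma), 0)
        ed_x = min(int(mu_x + 3 * sigma) + 1, img_w)
        st_y = max(int(mu_y - 3 * sigma), 0)
        ed_y = min(int(mu_y + 3 * sigma) + 1, img_h)
        if st_x >= ed_x or st_y >= ed_y:
            continue  # the keypoint lies outside the image
        x = np.arange(st_x, ed_x, dtype=np.float32)
        y = np.arange(st_y, ed_y, dtype=np.float32)[:, None]
        patch = np.exp(-((x - mu_x) ** 2 + (y - mu_y) ** 2) / (2 * sigma ** 2))
        patch = patch * max_value
        # Keep the strongest response where several persons overlap.
        arr[st_y:ed_y, st_x:ed_x] = np.maximum(arr[st_y:ed_y, st_x:ed_x], patch)


heatmap = np.zeros((8, 8), dtype=np.float32)
centers = np.array([[3.0, 4.0], [6.0, 1.0]])  # (x, y) for two persons
scores = np.array([0.9, 0.0])                 # the second person is skipped
draw_keypoint_heatmap(heatmap, centers, scores)
```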

generate_a_limb_heatmap(arr: ndarray, starts: ndarray, ends: ndarray, start_values: ndarray, end_values: ndarray) → None[source]

Generate pseudo heatmap for one limb in one frame.

Parameters:
  • arr (np.ndarray) – The array to store the generated heatmaps. Shape: img_h * img_w.

  • starts (np.ndarray) – The coordinates of one keypoint in the corresponding limbs. Shape: M * 2.

  • ends (np.ndarray) – The coordinates of the other keypoint in the corresponding limbs. Shape: M * 2.

  • start_values (np.ndarray) – The max values of one keypoint in the corresponding limbs. Shape: M.

  • end_values (np.ndarray) – The max values of the other keypoint in the corresponding limbs. Shape: M.

generate_heatmap(arr: ndarray, kps: ndarray, max_values: ndarray) → None[source]

Generate pseudo heatmap for all keypoints and limbs in one frame (if needed).

Parameters:
  • arr (np.ndarray) – The array to store the generated heatmaps. Shape: V * img_h * img_w.

  • kps (np.ndarray) – The coordinates of keypoints in this frame. Shape: M * V * 2.

  • max_values (np.ndarray) – The confidence score of each keypoint. Shape: M * V.

transform(results: Dict) → Dict[source]

Generate pseudo heatmaps based on joint coordinates and confidence.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]

Load and decode images.

Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters:
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.

  • kwargs (dict, optional) – Arguments for FileClient.

transform(results)[source]

Perform the ImageDecode to load image given the file path.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ImgAug(transforms)[source]

Imgaug augmentation.

Adds custom transformations from imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html to get more information. Two demo configs could be found in tsn and i3d config folder.

It’s better to use uint8 images as inputs since imgaug works best with numpy dtype uint8 and isn’t well tested with other dtypes. It should be noted that not all of the augmenters have the same input and output dtype, which may cause unexpected results.

Required keys are “imgs”, “img_shape”(if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.

It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc. So when using Imgaug along with other mmaction2 pipelines, we should pay more attention to required keys.

Two steps to use the Imgaug pipeline:

  1. Create the initialization parameter transforms. There are three ways to create transforms:

    1) string: only ‘default’ is supported for now, e.g. transforms='default'.

    2) list[dict]: create a list of augmenters from a list of dicts, where each dict corresponds to one augmenter. Every dict MUST contain a key named type. type should be a string (an iaa.Augmenter’s name) or an iaa.Augmenter subclass, e.g. transforms=[dict(type='Rotate', rotate=(-20, 20))] or transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))].

    3) iaa.Augmenter: create an imgaug.Augmenter object, e.g. transforms=iaa.Rotate(rotate=(-20, 20)).

  2. Add Imgaug to the dataset pipeline. It is recommended to insert the imgaug pipeline before Normalize. A demo pipeline is listed as follows.

```
pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,
    ),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```

Parameters:

transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create imgaug augmenter.

static default_transforms()[source]

Default transforms for imgaug.

Implement RandAugment by imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.

Augmenters and hyperparameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py

One augmenter, SolarizeAdd, is missing because imgaug does not support it.

Returns:

The constructed RandAugment transforms.

Return type:

dict

imgaug_builder(cfg)[source]

Import a module from imgaug.

It follows the logic of build_from_cfg(). Use a dict object to create an iaa.Augmenter object.

Parameters:

cfg (dict) – Config dict. It should at least contain the key “type”.

Returns:

iaa.Augmenter: The constructed imgaug augmenter.

Return type:

obj

transform(results)[source]

Perform Imgaug augmentations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.JointToBone(dataset: str = 'nturgb+d', target: str = 'keypoint')[source]

Convert the joint information to bone information.

Required Keys:

  • keypoint

Modified Keys:

  • keypoint

Parameters:
  • dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.

  • target (str) – The target key for the bone information. Defaults to 'keypoint'.

transform(results: Dict) → Dict[source]

The transform function of JointToBone.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
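A bone is commonly defined as the vector from a parent joint to its child. The sketch below shows that idea on a toy skeleton. `joint_to_bone` and `TOY_PAIRS` are hypothetical names; the real joint pairs depend on the chosen dataset (‘nturgb+d’, ‘openpose’ or ‘coco’) and are not reproduced here.

```python
import numpy as np

# Hypothetical (child, parent) pairs for a 5-joint toy skeleton.
# Joint 0 is the root and is paired with itself, so its bone is zero.
TOY_PAIRS = ((0, 0), (1, 0), (2, 1), (3, 0), (4, 3))


def joint_to_bone(keypoint, pairs=TOY_PAIRS):
    """keypoint: M x T x V x C array. Returns bones with the same shape."""
    bone = np.zeros_like(keypoint)
    for child, parent in pairs:
        bone[..., child, :] = keypoint[..., child, :] - keypoint[..., parent, :]
    return bone


# One person, one frame, five joints with (x, y) coordinates.
joints = np.array([[[[1.0, 1.0], [2.0, 1.0], [4.0, 1.0],
                     [1.0, 3.0], [1.0, 6.0]]]])
bones = joint_to_bone(joints)
```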

class mmaction.datasets.transforms.LoadAudioFeature(pad_method: str = 'zero')[source]

Load offline extracted audio features.

Required Keys:

  • audio_path

Added Keys:

  • length

  • audios

Parameters:

pad_method (str) – Padding method. Defaults to 'zero'.

transform(results: Dict) → Dict[source]

Perform the numpy loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadHVULabel(**kwargs)[source]

Convert the HVU label from dictionaries to torch tensors.

Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.

init_hvu_info(categories, category_nums)[source]

Initialize hvu information.

transform(results)[source]

Convert the label dictionary to 3 tensors: “label”, “mask” and “category_mask”.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadLocalizationFeature[source]

Load Video features for localizer with given video_name list.

The required key is “feature_path”, added or modified keys are “raw_feature”.

Parameters:

raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.

transform(results)[source]

Perform the LoadLocalizationFeature loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadProposals(top_k, pgm_proposals_dir, pgm_features_dir, proposal_ext='.csv', feature_ext='.npy')[source]

Loading proposals with given proposal results.

Required keys are “video_name”, added or modified keys are ‘bsp_feature’, ‘tmin’, ‘tmax’, ‘tmin_score’, ‘tmax_score’ and ‘reference_temporal_iou’.

Parameters:
  • top_k (int) – The top k proposals to be loaded.

  • pgm_proposals_dir (str) – Directory to load proposals.

  • pgm_features_dir (str) – Directory to load proposal features.

  • proposal_ext (str) – Proposal file extension. Default: ‘.csv’.

  • feature_ext (str) – Feature file extension. Default: ‘.npy’.

transform(results)[source]

Perform the LoadProposals loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.LoadRGBFromFile(to_float32: bool = False, color_type: str = 'color', imdecode_backend: str = 'cv2', io_backend: str = 'disk', ignore_empty: bool = False, **kwargs)[source]

Load an RGB image from file.

Required Keys:

  • img_path

Modified Keys:

  • img

  • img_shape

  • ori_shape

Parameters:
  • to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is an uint8 array. Defaults to False.

  • color_type (str) – The flag argument for mmcv.imfrombytes(). Defaults to ‘color’.

  • imdecode_backend (str) – The image decoding backend type. The backend argument for mmcv.imfrombytes(). See mmcv.imfrombytes() for details. Defaults to ‘cv2’.

  • io_backend (str) – I/O backend where frames are stored. Default: ‘disk’.

  • ignore_empty (bool) – Whether to allow loading empty image or file path not existent. Defaults to False.

  • kwargs (dict) – Args for file client.

transform(results: dict) → dict[source]

Functions to load image.

Parameters:

results (dict) – Result dict from mmcv.BaseDataset.

Returns:

The dict contains loaded image and meta information.

Return type:

dict

class mmaction.datasets.transforms.MMCompact(padding: float = 0.25, threshold: int = 10, hw_ratio: float | Tuple[float] = 1, allow_imgpad: bool = True)[source]

Convert the coordinates of keypoints and crop the images to make them more compact.

Required Keys:

  • imgs

  • keypoint

  • img_shape

Modified Keys:

  • imgs

  • keypoint

  • img_shape

Parameters:
  • padding (float) – The padding size. Defaults to 0.25.

  • threshold (int) – The threshold for the tight bounding box. If the width or height of the tight bounding box is smaller than the threshold, we do not perform the compact operation. Defaults to 10.

  • hw_ratio (float | tuple[float]) – The hw_ratio of the expanded box. Float indicates the specific ratio and tuple indicates a ratio range. If set as None, it means there is no requirement on hw_ratio. Defaults to 1.

  • allow_imgpad (bool) – Whether to allow expanding the box outside the image to meet the hw_ratio requirement. Defaults to True.

transform(results: Dict) → Dict[source]

The transform function of MMCompact.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MMDecode(io_backend: str = 'disk', **kwargs)[source]

Decode RGB videos and skeletons.

transform(results: Dict) → Dict[source]

The transform function of MMDecode.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MMUniformSampleFrames(clip_len: int, num_clips: int = 1, test_mode: bool = False, seed: int = 255)[source]

Uniformly sample frames from the multi-modal data.

transform(results: Dict) → Dict[source]

The transform function of MMUniformSampleFrames.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MergeSkeFeat(feat_list: List[str] = ['keypoint'], target: str = 'keypoint', axis: int = -1)[source]

Merge multi-stream features.

Parameters:
  • feat_list (list[str]) – The list of the keys of features. Defaults to ['keypoint'].

  • target (str) – The target key for the merged multi-stream information. Defaults to 'keypoint'.

  • axis (int) – The axis along which the features will be joined. Defaults to -1.

transform(results: Dict) → Dict[source]

The transform function of MergeSkeFeat.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.MultiScaleCrop(input_size, scales=(1,), max_wh_scale_gap=1, random_crop=False, num_fixed_crops=5, lazy=False)[source]

Crop images with a list of randomly selected scales.

Randomly select the w and h scales from a list of scales. A scale of 1 means the base size, which is the minimum of the image width and height. The gap between the scale levels of w and h is kept below a certain value to prevent aspect ratios that are too large or too small.

Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “crop_bbox”, “img_shape”, “lazy” and “scales”. Required keys in “lazy” are “crop_bbox”, added or modified key is “crop_bbox”.

Parameters:
  • input_size (int | tuple[int]) – (w, h) of network input.

  • scales (tuple[float]) – width and height scales to be selected.

  • max_wh_scale_gap (int) – Maximum gap of w and h scale levels. Default: 1.

  • random_crop (bool) – If set to True, the cropping bbox will be randomly sampled, otherwise it will be sampled from fixed regions. Default: False.

  • num_fixed_crops (int) – If set to 5, the cropping bbox will keep 5 basic fixed regions: “upper left”, “upper right”, “lower left”, “lower right”, “center”. If set to 13, the cropping bbox will append another 8 fixed regions: “center left”, “center right”, “lower center”, “upper center”, “upper left quarter”, “upper right quarter”, “lower left quarter”, “lower right quarter”. Default: 5.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the MultiScaleCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
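The interplay of scales, max_wh_scale_gap and the five basic fixed regions can be sketched as follows. Both functions are hypothetical helpers written for this example; the library may round sizes and offsets differently, and the 8 additional regions used when num_fixed_crops is 13 are not covered.

```python
def candidate_crop_sizes(img_w, img_h, scales, max_wh_scale_gap=1):
    """List the (crop_w, crop_h) pairs whose scale levels are close enough."""
    base = min(img_w, img_h)  # a scale of 1 means the shorter image side
    sizes = [int(base * s) for s in scales]
    return [(sizes[j], sizes[i])
            for i in range(len(sizes))
            for j in range(len(sizes))
            if abs(i - j) <= max_wh_scale_gap]


def basic_crop_offsets(img_w, img_h, crop_w, crop_h):
    """Top-left (x, y) offsets of the five basic fixed regions."""
    max_x, max_y = img_w - crop_w, img_h - crop_h
    return {
        'upper left': (0, 0),
        'upper right': (max_x, 0),
        'lower left': (0, max_y),
        'lower right': (max_x, max_y),
        'center': (max_x // 2, max_y // 2),
    }


sizes = candidate_crop_sizes(340, 256, scales=(1, 0.875, 0.75, 0.66))
offsets = basic_crop_offsets(340, 256, crop_w=224, crop_h=224)
```

With four scales and a gap of 1, only neighbouring scale levels are combined, so pairs such as (168, 256) are excluded.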

class mmaction.datasets.transforms.OpenCVDecode[source]

Using OpenCV to decode the video.

Required keys are 'video_reader', 'filename' and 'frame_inds', added or modified keys are 'imgs', 'img_shape' and 'original_shape'.

transform(results: dict) → dict[source]

Perform the OpenCV decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.OpenCVInit(io_backend: str = 'disk', **kwargs)[source]

Using OpenCV to initialize the video_reader.

Required keys are 'filename', added or modified keys are 'new_path', 'video_reader' and 'total_frames'.

Parameters:

io_backend (str) – I/O backend where frames are stored. Defaults to 'disk'.

transform(results: dict) → dict[source]

Perform the OpenCV initialization.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PIMSDecode[source]

Using PIMS to decode the videos.

PIMS: https://github.com/soft-matter/pims

Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

transform(results)[source]

Perform the PIMS decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PIMSInit(io_backend='disk', mode='accurate', **kwargs)[source]

Use PIMS to initialize the video.

PIMS: https://github.com/soft-matter/pims

Parameters:
  • io_backend (str) – I/O backend where frames are stored. Default: ‘disk’.

  • mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will always use pims.PyAVReaderIndexed to decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking by using pims.PyAVReaderTimed. Both will return the accurate frames in most cases. Default: ‘accurate’.

  • kwargs (dict) – Args for file client.

transform(results)[source]

Perform the PIMS initialization.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PackActionInputs(collect_keys: Tuple[str] | None = None, meta_keys: Sequence[str] = ('img_shape', 'img_key', 'video_id', 'timestamp'), algorithm_keys: Sequence[str] = ())[source]

Pack the inputs data.

Parameters:
  • collect_keys (tuple[str], optional) – The keys to be collected to packed_results['inputs']. Defaults to None.

  • meta_keys (Sequence[str]) – The meta keys to be saved in the metainfo of the data_sample. Defaults to ('img_shape', 'img_key', 'video_id', 'timestamp').

  • algorithm_keys (Sequence[str]) – The keys of custom elements to be used in the algorithm. Defaults to an empty tuple.

transform(results: Dict) → Dict[source]

The transform function of PackActionInputs.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.PackLocalizationInputs(keys=(), meta_keys=('video_name',))[source]

transform(results)[source]

Method to pack the input data.

Parameters:

results (dict) – Result dict from the data pipeline.

Returns:

  • ‘inputs’ (torch.Tensor): The forward data of models.

  • ‘data_samples’ (DetDataSample): The annotation info of the sample.

Return type:

dict

class mmaction.datasets.transforms.PadTo(length: int, mode: str = 'loop')[source]

Sample frames from the video.

To sample an n-frame clip from the video, PadTo samples frames starting from index zero, then loops or zero-pads the frames if the video has fewer frames than length.

Required Keys:

  • keypoint

  • total_frames

  • start_index (optional)

Modified Keys:

  • keypoint

  • total_frames

Parameters:
  • length (int) – The maximum length of the sampled output clip.

  • mode (str) – The padding mode. Defaults to 'loop'.

transform(results: Dict) → Dict[source]

The transform function of PadTo.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
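The two padding modes can be sketched with NumPy as shown below. `pad_to` is a hypothetical function operating on an M x T x V x C keypoint array; returning the input unchanged when it is already long enough is an assumption of this example.

```python
import numpy as np


def pad_to(keypoint, length, mode='loop'):
    """Pad the frame axis (axis 1) of an M x T x V x C array up to length."""
    total_frames = keypoint.shape[1]
    if total_frames >= length:
        return keypoint  # assumption: nothing to pad
    if mode == 'loop':
        # Repeat the sequence from its first frame until length is reached.
        inds = np.arange(length) % total_frames
        return keypoint[:, inds]
    if mode == 'zero':
        pad_shape = list(keypoint.shape)
        pad_shape[1] = length - total_frames
        padding = np.zeros(pad_shape, dtype=keypoint.dtype)
        return np.concatenate([keypoint, padding], axis=1)
    raise ValueError(f'unsupported padding mode: {mode}')


# One person, three frames, one joint, one coordinate; frame i holds i.
clip = np.arange(3, dtype=float).reshape(1, 3, 1, 1)
looped = pad_to(clip, 7, mode='loop')
zeroed = pad_to(clip, 7, mode='zero')
```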

class mmaction.datasets.transforms.PoseCompact(padding: float = 0.25, threshold: int = 10, hw_ratio: float | Tuple[float] | None = None, allow_imgpad: bool = True)[source]

Convert the coordinates of keypoints to make them more compact. Specifically, it first finds a tight bounding box that surrounds all joints in each frame, then expands the tight box by a given padding ratio. For example, if ‘padding == 0.25’, the expanded box has an unchanged center and 1.25x the width and height.

Required Keys:

  • keypoint

  • img_shape

Modified Keys:

  • img_shape

  • keypoint

Added Keys:

  • crop_quadruple

Parameters:
  • padding (float) – The padding size. Defaults to 0.25.

  • threshold (int) – The threshold for the tight bounding box. If the width or height of the tight bounding box is smaller than the threshold, we do not perform the compact operation. Defaults to 10.

  • hw_ratio (float | tuple[float] | None) – The hw_ratio of the expanded box. Float indicates the specific ratio and tuple indicates a ratio range. If set as None, it means there is no requirement on hw_ratio. Defaults to None.

  • allow_imgpad (bool) – Whether to allow expanding the box outside the image to meet the hw_ratio requirement. Defaults to True.

transform(results: Dict) → Dict[source]

Convert the coordinates of keypoints to make them more compact.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
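The box expansion described for PoseCompact can be worked through in a few lines. `expand_box` is a hypothetical helper; it covers only the padding and threshold steps and leaves out hw_ratio, allow_imgpad and the clipping to the image.

```python
def expand_box(keypoints, padding=0.25, threshold=10):
    """keypoints: iterable of (x, y). Returns (min_x, min_y, max_x, max_y) or None."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Boxes that are too small are left alone.
    if max_x - min_x < threshold or max_y - min_y < threshold:
        return None
    center_x, center_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    # Keep the center and scale both sides by (1 + padding).
    half_w = (max_x - min_x) / 2 * (1 + padding)
    half_h = (max_y - min_y) / 2 * (1 + padding)
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)


box = expand_box([(10, 20), (30, 60)])  # tight box is 20 wide and 40 high
too_small = expand_box([(0, 0), (5, 50)])
```

Here the tight box of 20 x 40 grows to 25 x 50 around the same center (20, 40).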

class mmaction.datasets.transforms.PoseDecode[source]

Load and decode pose with given indices.

Required Keys:

  • keypoint

  • total_frames (optional)

  • frame_inds (optional)

  • offset (optional)

  • keypoint_score (optional)

Modified Keys:

  • keypoint

  • keypoint_score (optional)

transform(results: Dict) → Dict[source]

The transform function of PoseDecode.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.PreNormalize2D(img_shape: Tuple[int, int] = (1080, 1920))[source]

Normalize the range of keypoint values.

Required Keys:

  • keypoint

  • img_shape (optional)

Modified Keys:

  • keypoint

Parameters:

img_shape (tuple[int, int]) – The resolution of the original video. Defaults to (1080, 1920).

transform(results: Dict) → Dict[source]

The transform function of PreNormalize2D.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
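The reference only states that the range of keypoint values is normalized. One plausible form, sketched below as an assumption, centers pixel coordinates on the image center and scales them to roughly [-1, 1]. `pre_normalize_2d` is a hypothetical function, and img_shape is taken to be (height, width).

```python
import numpy as np


def pre_normalize_2d(keypoint, img_shape=(1080, 1920)):
    """keypoint: array of shape (..., 2) holding (x, y) in pixels."""
    h, w = img_shape
    out = np.asarray(keypoint, dtype=float).copy()
    out[..., 0] = (out[..., 0] - w / 2) / (w / 2)  # x: 0..w -> -1..1
    out[..., 1] = (out[..., 1] - h / 2) / (h / 2)  # y: 0..h -> -1..1
    return out


normalized = pre_normalize_2d([[0, 0], [960, 540], [1920, 1080]])
```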

class mmaction.datasets.transforms.PreNormalize3D(zaxis: List[int] = [0, 1], xaxis: List[int] = [8, 4], align_spine: bool = True, align_shoulder: bool = True, align_center: bool = True)[source]

PreNormalize for NTURGB+D 3D keypoints (x, y, z).

PreNormalize3D first subtracts the coordinates of the ‘spine’ (joint #1 in ntu) of the first person in the first frame from the coordinates of each joint. Subsequently, it performs a 3D rotation to fix the Z axis parallel to the 3D vector from the ‘hip’ (joint #0) to the ‘spine’ (joint #1) and the X axis toward the 3D vector from the ‘right shoulder’ (joint #8) to the ‘left shoulder’ (joint #4). Codes adapted from https://github.com/lshiwjx/2s-AGCN.

Required Keys:

  • keypoint

  • total_frames (optional)

Modified Keys:

  • keypoint

Added Keys:

  • body_center

Parameters:
  • zaxis (list[int]) – The target Z axis for the 3D rotation. Defaults to [0, 1].

  • xaxis (list[int]) – The target X axis for the 3D rotation. Defaults to [8, 4].

  • align_spine (bool) – Whether to perform a 3D rotation to align the spine. Defaults to True.

  • align_shoulder (bool) – Whether to perform a 3D rotation to align the shoulder. Defaults to True.

  • align_center (bool) – Whether to align the body center. Defaults to True.

angle_between(v1: ndarray, v2: ndarray) → float[source]

Returns the angle in radians between vectors ‘v1’ and ‘v2’.

rotation_matrix(axis: ndarray, theta: float) → ndarray[source]

Returns the rotation matrix associated with counterclockwise rotation about the given axis by theta radians.

transform(results: Dict) → Dict[source]

The transform function of PreNormalize3D.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

unit_vector(vector: ndarray) → ndarray[source]

Returns the unit vector of the vector.
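The three helpers can be sketched with standard vector math. The version below uses Rodrigues' rotation formula, which may differ from the library's internal formulation while producing the same counterclockwise rotation; returning 0 for near-zero vectors in `angle_between` is an assumption of this example.

```python
import numpy as np


def unit_vector(vector):
    """Return the vector scaled to length 1."""
    vector = np.asarray(vector, dtype=float)
    return vector / np.linalg.norm(vector)


def angle_between(v1, v2):
    """Return the angle in radians between v1 and v2."""
    if np.linalg.norm(v1) < 1e-6 or np.linalg.norm(v2) < 1e-6:
        return 0.0  # assumption: degenerate vectors give no rotation
    cos = np.clip(np.dot(unit_vector(v1), unit_vector(v2)), -1.0, 1.0)
    return float(np.arccos(cos))


def rotation_matrix(axis, theta):
    """Counterclockwise rotation about axis by theta radians (Rodrigues)."""
    k = unit_vector(axis)
    cross = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    return (np.cos(theta) * np.eye(3) + np.sin(theta) * cross
            + (1 - np.cos(theta)) * np.outer(k, k))


# Rotate a bone that points along y so that it becomes parallel to z.
bone = np.array([0.0, 1.0, 0.0])
z_axis = np.array([0.0, 0.0, 1.0])
axis = np.cross(bone, z_axis)
theta = angle_between(bone, z_axis)
aligned = rotation_matrix(axis, theta) @ bone
```

The rotation axis is the cross product of the current and target directions, which is the usual way to align one vector with another.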

class mmaction.datasets.transforms.PyAVDecode(multi_thread=False, mode='accurate')[source]

Using PyAV to decode the video.

PyAV: https://github.com/mikeboers/PyAV

Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters:
  • multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.

  • mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return the nearest key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Default: ‘accurate’.

static frame_generator(container, stream)[source]

Frame generator for PyAV.

transform(results)[源代码]

Perform the PyAV decoding.

参数:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PyAVDecodeMotionVector(multi_thread=False, mode='accurate')[source]

Using PyAV to decode the motion vectors from video.

Reference: https://github.com/PyAV-Org/PyAV/blob/main/tests/test_decode.py

Required keys are “video_reader” and “frame_inds”, added or modified keys are “motion_vectors”, “frame_inds”.

transform(results)[source]

Perform the PyAV motion vector decoding.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PyAVInit(io_backend='disk', **kwargs)[source]

Using PyAV to initialize the video.

PyAV: https://github.com/mikeboers/PyAV

Required keys are “filename”, added or modified keys are “video_reader” and “total_frames”.

Parameters:
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • kwargs (dict) – Args for file client.

transform(results)[source]

Perform the PyAV initialization.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.PytorchVideoWrapper(op, **kwargs)[source]

PytorchVideoTrans Augmentations, under pytorchvideo.transforms.

Parameters:

op (str) – The name of the pytorchvideo transformation.

transform(results)[source]

Perform PytorchVideoTrans augmentations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RandomCrop(size, lazy=False)[source]

Vanilla square random crop that specifies the output size.

Required keys in results are “img_shape”, “keypoint” (optional) and “imgs” (optional); added or modified keys are “keypoint”, “imgs” and “lazy”. Required keys in “lazy” are “flip” and “crop_bbox”; the added or modified key is “crop_bbox”.

Parameters:
  • size (int) – The output size of the images.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the RandomCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RandomRescale(scale_range, interpolation='bilinear')[source]

Randomly resize images so that the short edge is resized to a size drawn from a given range. The aspect ratio is unchanged after resizing.

Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “resize_size”, “short_edge”.

Parameters:
  • scale_range (tuple[int]) – The range of short edge length. A closed interval.

  • interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.

transform(results)[source]

Performs the Resize augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RandomResizedCrop(area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.3333333333333333), lazy=False)[source]

Random crop that specifies the area range and the aspect ratio range.

Required keys in results are “img_shape”, “crop_bbox”, “imgs” (optional) and “keypoint” (optional); added or modified keys are “imgs”, “keypoint”, “crop_bbox” and “lazy”. Required keys in “lazy” are “flip” and “crop_bbox”; the added or modified key is “crop_bbox”.

Parameters:
  • area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).

  • aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

static get_crop_bbox(img_shape, area_range, aspect_ratio_range, max_attempts=10)[source]

Get a crop bbox given the area range and aspect ratio range.

Parameters:
  • img_shape (Tuple[int]) – Image shape.

  • area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).

  • aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).

  • max_attempts (int) – The maximum number of attempts to generate a random candidate bounding box. If no qualified candidate is found, the center bounding box is used. Default: 10.

Returns:

(list[int]) A random crop bbox within the area range and aspect ratio range.
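The sampling procedure described for get_crop_bbox can be sketched as follows. This is a minimal standalone illustration, not the library code. The following are all assumptions made for the example: the function name `sample_crop_bbox`, the log-uniform draw of the aspect ratio, the `[left, top, right, bottom]` layout of the returned box, and the choice of a centered square as the fallback.

```python
import math
import random


def sample_crop_bbox(img_shape, area_range=(0.08, 1.0),
                     aspect_ratio_range=(3 / 4, 4 / 3),
                     max_attempts=10, rng=None):
    """Sample a crop box; fall back to a center crop if no attempt fits."""
    rng = rng or random.Random()
    img_h, img_w = img_shape
    img_area = img_h * img_w
    log_lo = math.log(aspect_ratio_range[0])
    log_hi = math.log(aspect_ratio_range[1])

    for _ in range(max_attempts):
        # Draw a target area and an aspect ratio (width / height).
        target_area = rng.uniform(*area_range) * img_area
        ratio = math.exp(rng.uniform(log_lo, log_hi))
        crop_w = int(round(math.sqrt(target_area * ratio)))
        crop_h = int(round(math.sqrt(target_area / ratio)))
        # Accept the candidate only if it fits inside the image.
        if 0 < crop_w <= img_w and 0 < crop_h <= img_h:
            left = rng.randint(0, img_w - crop_w)
            top = rng.randint(0, img_h - crop_h)
            return [left, top, left + crop_w, top + crop_h]

    # Fallback (assumed): a centered square with the short-edge length.
    size = min(img_h, img_w)
    left = (img_w - size) // 2
    top = (img_h - size) // 2
    return [left, top, left + size, top + size]
```

Passing a seeded `random.Random` instance as `rng` makes the sketch deterministic, which is convenient when checking that every sampled box stays inside the image.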

transform(results)[source]

Performs the RandomResizedCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.RawFrameDecode(io_backend: str = 'disk', decoding_backend: str = 'cv2', **kwargs)[source]

Load and decode frames with given indices.

Required Keys:

  • frame_dir

  • filename_tmpl

  • frame_inds

  • modality

  • offset (optional)

Added Keys:

  • imgs

  • img_shape

  • original_shape

Parameters:
  • io_backend (str) – IO backend where frames are stored. Defaults to 'disk'.

  • decoding_backend (str) – Backend used for image decoding. Defaults to 'cv2'.

transform(results: dict) dict[source]

Perform the RawFrameDecode to pick frames given indices.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.Resize(scale, keep_ratio=True, interpolation='bilinear', lazy=False)[source]

Resize images to a specific size.

Required keys are “img_shape”, “modality”, “imgs” (optional) and “keypoint” (optional); added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy” and “resize_size”. No keys are required in “lazy”; the added or modified key is “interpolation”.

Parameters:
  • scale (float | Tuple[int]) – If keep_ratio is True, it serves as scaling factor or maximum size: If it is a float number, the image will be rescaled by this factor, else if it is a tuple of 2 integers, the image will be rescaled as large as possible within the scale. Otherwise, it serves as (w, h) of output size.

  • keep_ratio (bool) – If set to True, Images will be resized without changing the aspect ratio. Otherwise, it will resize images to a given size. Default: True.

  • interpolation (str) – Algorithm used for interpolation, accepted values are “nearest”, “bilinear”, “bicubic”, “area”, “lanczos”. Default: “bilinear”.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

transform(results)[source]

Performs the Resize augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
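To make the interaction between scale and keep_ratio in Resize concrete, here is a small standalone sketch of how an output size could be derived. The helper name `resized_size`, the reading of a 2-tuple as (long edge limit, short edge limit), and the round-half-up rule are assumptions for illustration; the library may compute sizes differently in detail.

```python
def resized_size(img_w, img_h, scale, keep_ratio=True):
    """Return the (width, height) an image would be resized to."""
    if not keep_ratio:
        # The scale is used directly as the (w, h) of the output.
        out_w, out_h = scale
        return out_w, out_h
    if isinstance(scale, (int, float)):
        # A single number is a plain scaling factor.
        factor = float(scale)
    else:
        # A 2-tuple bounds the result: rescale as large as possible within it.
        long_limit, short_limit = max(scale), min(scale)
        factor = min(long_limit / max(img_w, img_h),
                     short_limit / min(img_w, img_h))
    return int(img_w * factor + 0.5), int(img_h * factor + 0.5)


half = resized_size(640, 480, 0.5)                               # factor
bounded = resized_size(640, 480, (340, 256))                     # fits in box
exact = resized_size(640, 480, (340, 256), keep_ratio=False)     # forced size
```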

class mmaction.datasets.transforms.SampleAVAFrames(clip_len, frame_interval=2, test_mode=False)[source]

transform(results)[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.SampleFrames(clip_len: int, frame_interval: int = 1, num_clips: int = 1, temporal_jitter: bool = False, twice_sample: bool = False, out_of_bound_opt: str = 'loop', test_mode: bool = False, keep_tail_frames: bool = False, target_fps: int | None = None, **kwargs)[source]

Sample frames from the video.

Required Keys:

  • total_frames

  • start_index

Added Keys:

  • frame_inds

  • frame_interval

  • num_clips

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.

  • num_clips (int) – Number of clips to be sampled. Default: 1.

  • temporal_jitter (bool) – Whether to apply temporal jittering. Defaults to False.

  • twice_sample (bool) – Whether to use twice sample when testing. If set to True, it will sample frames with and without fixed shift, which is commonly used for testing in TSM model. Defaults to False.

  • out_of_bound_opt (str) – The way to deal with out of bounds frame indexes. Available options are ‘loop’, ‘repeat_last’. Defaults to ‘loop’.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • keep_tail_frames (bool) – Whether to keep tail frames when sampling. Defaults to False.

  • target_fps (int, optional) – Convert input videos with arbitrary frame rates to the unified target FPS before sampling frames. If None, the frame rate will not be adjusted. Defaults to None.

transform(results: dict) dict[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.
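As a hedged illustration of how SampleFrames fits into a data pipeline, the config fragment below chains it with other transforms documented on this page. The parameter values are arbitrary examples, and the fragment assumes that the dataset supplies the “filename”, “start_index” and “modality” keys before the pipeline runs.

```python
# A sketch of a video pipeline, written as a plain Python config list.
train_pipeline = [
    # Reads "filename"; adds "video_reader" and "total_frames".
    dict(type='PyAVInit'),
    # Reads "total_frames" and "start_index"; adds "frame_inds".
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
    # Reads "video_reader" and "frame_inds"; adds "imgs" and "img_shape".
    dict(type='PyAVDecode'),
    # Resizes every decoded frame to a fixed (w, h).
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
]
```

The order matters because each step consumes keys that an earlier step added, as listed under the Required Keys and Added Keys headings of each transform.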

class mmaction.datasets.transforms.TenCrop(crop_size)[source]

Crop the images into 10 crops (corner + center + flip).

Crop the four corners and the center part of the image with the same given crop_size, and flip it horizontally. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters:

crop_size (int | tuple[int]) – (w, h) of crop size.

transform(results)[source]

Performs the TenCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ThreeCrop(crop_size)[source]

Crop images into three crops.

Crop the images equally into three crops with equal intervals along the shorter side. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters:

crop_size (int | tuple[int]) – (w, h) of crop size.

transform(results)[source]

Performs the ThreeCrop augmentation.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.ToMotion(dataset: str = 'nturgb+d', source: str = 'keypoint', target: str = 'motion')[source]

Convert the joint information or bone information to corresponding motion information.

Required Keys:

  • keypoint

Added Keys:

  • motion

Parameters:
  • dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose’, ‘coco’. Defaults to 'nturgb+d'.

  • source (str) – The source key for the joint or bone information. Defaults to 'keypoint'.

  • target (str) – The target key for the motion information. Defaults to 'motion'.

transform(results: Dict) Dict[source]

The transform function of ToMotion.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.TorchVisionWrapper(op, **kwargs)[source]

Torchvision Augmentations, under torchvision.transforms.

Parameters:

op (str) – The name of the torchvision transformation.

transform(results)[source]

Perform Torchvision augmentations.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.Transpose(keys, order)[source]

Transpose image channels to a given order.

Parameters:
  • keys (Sequence[str]) – Required keys to be converted.

  • order (Sequence[int]) – Image channel order.

transform(results)[source]

Performs the Transpose formatting.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

class mmaction.datasets.transforms.UniformSample(clip_len: int, num_clips: int = 1, test_mode: bool = False)[source]

Uniformly sample frames from the video.

Modified from https://github.com/facebookresearch/SlowFast/blob/64abcc90ccfdcbb11cf91d6e525bed60e92a8796/slowfast/datasets/ssv2.py#L159.

To sample an n-frame clip from the video, UniformSample divides the video into n segments of equal length and randomly samples one frame from each segment.

Required keys:

  • total_frames

  • start_index

Added keys:

  • frame_inds

  • clip_len

  • frame_interval

  • num_clips

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • num_clips (int) – Number of clips to be sampled. Defaults to 1.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

transform(results: Dict) Dict[source]

Perform the Uniform Sampling.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict
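The segment-based sampling that UniformSample describes can be sketched in a few lines. This is a standalone illustration under stated assumptions, not the library code: the function name `uniform_sample_indices` is hypothetical, and taking the segment centre when `test_mode` is set is one possible deterministic choice rather than the documented behaviour.

```python
import random


def uniform_sample_indices(total_frames, clip_len, test_mode=False, rng=None):
    """Pick one frame index from each of `clip_len` equal-length segments."""
    rng = rng or random.Random()
    indices = []
    for i in range(clip_len):
        # Segment i covers the half-open range [start, end) in frame units.
        start = i * total_frames / clip_len
        end = (i + 1) * total_frames / clip_len
        # Assumed: use the segment centre for deterministic (test) sampling.
        pick = (start + end) / 2 if test_mode else rng.uniform(start, end)
        # Guard against floating-point results that land on total_frames.
        indices.append(min(int(pick), total_frames - 1))
    return indices


# A 32-frame video split into 8 segments of 4 frames each.
centres = uniform_sample_indices(32, 8, test_mode=True)
jittered = uniform_sample_indices(32, 8, rng=random.Random(0))
```

In this sketch, when the video has fewer frames than clip_len, some indices repeat, since several segments then map to the same frame.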

class mmaction.datasets.transforms.UniformSampleFrames(clip_len: int, num_clips: int = 1, test_mode: bool = False, seed: int = 255)[source]

Uniformly sample frames from the video.

To sample an n-frame clip from the video, UniformSampleFrames divides the video into n segments of equal length and randomly samples one frame from each segment. To make the testing results reproducible, a random seed is set during testing so that the sampling results are deterministic.

Required Keys:

  • total_frames

  • start_index (optional)

Added Keys:

  • frame_inds

  • frame_interval

  • num_clips

  • clip_len

Parameters:
  • clip_len (int) – Frames of each sampled output clip.

  • num_clips (int) – Number of clips to be sampled. Defaults to 1.

  • test_mode (bool) – Store True when building test or validation dataset. Defaults to False.

  • seed (int) – The random seed used during test time. Defaults to 255.

transform(results: Dict) Dict[source]

The transform function of UniformSampleFrames.

Parameters:

results (dict) – The result dict.

Returns:

The result dict.

Return type:

dict

class mmaction.datasets.transforms.UntrimmedSampleFrames(clip_len=1, clip_interval=16, frame_interval=1)[source]

Sample frames from the untrimmed video.

Required keys are “filename”, “total_frames”, added or modified keys are “frame_inds”, “clip_interval” and “num_clips”.

Parameters:
  • clip_len (int) – The length of sampled clips. Defaults to 1.

  • clip_interval (int) – Clip interval of adjacent center of sampled clips. Defaults to 16.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Defaults to 1.

transform(results)[source]

Perform the SampleFrames loading.

Parameters:

results (dict) – The resulting dict to be modified and passed to the next transform in pipeline.

mmaction.engine

hooks

class mmaction.engine.hooks.OutputHook(module, outputs=None, as_tensor=False)[source]

Output feature map of some layers.

Parameters:
  • module (nn.Module) – The whole module to get layers.

  • outputs (tuple[str] | list[str]) – Layer name to output. Default: None.

  • as_tensor (bool) – Determine to return a tensor or a numpy array. Default: False.

class mmaction.engine.hooks.VisualizationHook(enable=False, interval: int = 5000, show: bool = False, out_dir: str | None = None, **kwargs)[source]

Classification Visualization Hook. Used to visualize validation and testing prediction results.

  • If out_dir is specified, all storage backends are ignored and the image is saved to out_dir.

  • If show is True, the result image is plotted in a window; make sure you have access to a graphical interface.

Parameters:
  • enable (bool) – Whether to enable this hook. Defaults to False.

  • interval (int) – The interval of samples to visualize. Defaults to 5000.

  • show (bool) – Whether to display the drawn image. Defaults to False.

  • out_dir (str, optional) – directory where painted images will be saved in the testing process. If None, handle with the backends of the visualizer. Defaults to None.

  • **kwargs – other keyword arguments of mmcls.visualization.ClsVisualizer.add_datasample().
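For orientation, a config entry that enables this hook might look like the fragment below. It is a sketch only: it assumes the hook can be referenced by its class name through the custom_hooks list of an MMEngine-style config, and the interval and out_dir values are arbitrary examples.

```python
# Hypothetical config fragment: draw every 100th sample and save to disk.
custom_hooks = [
    dict(
        type='VisualizationHook',
        enable=True,              # the hook is disabled by default
        interval=100,             # visualize every 100 samples
        out_dir='work_dirs/vis',  # example path; bypasses storage backends
    ),
]
```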

after_test_iter(runner: Runner, batch_idx: int, data_batch: dict, outputs: Sequence[ActionDataSample]) None[source]

Visualize every self.interval samples during test.

Parameters:
  • runner (Runner) – The runner of the testing process.

  • batch_idx (int) – The index of the current batch in the test loop.

  • data_batch (dict) – Data from dataloader.

  • outputs (Sequence[ActionDataSample]) – Outputs from model.

after_val_iter(runner: Runner, batch_idx: int, data_batch: dict, outputs: Sequence[ActionDataSample]) None[source]

Visualize every self.interval samples during validation.

Parameters:
  • runner (Runner) – The runner of the validation process.

  • batch_idx (int) – The index of the current batch in the val loop.

  • data_batch (dict) – Data from dataloader.

  • outputs (Sequence[ActionDataSample]) – Outputs from model.

optimizers

class mmaction.engine.optimizers.LearningRateDecayOptimizerConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[source]

Different learning rates are set for different layers of backbone. Note: Currently, this optimizer constructor is built for MViT.

Inspired by the implementations in PySlowFast and MMDetection (https://github.com/open-mmlab/mmdetection/tree/dev-3.x).

add_params(params: List[dict], module: Module, **kwargs) None[source]

Add all parameters of module to the params list.

The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.

Parameters:
  • params (list[dict]) – A list of param groups, it will be modified in place.

  • module (nn.Module) – The module to be added.

class mmaction.engine.optimizers.SwinOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[source]

add_params(params: List[dict], module: Module, prefix: str = 'base', **kwargs) None[source]

Add all parameters of module to the params list.

The parameters of the given module will be added to the list of param groups, with specific rules defined by paramwise_cfg.

Parameters:
  • params (list[dict]) – A list of param groups, it will be modified in place.

  • module (nn.Module) – The module to be added.

  • prefix (str) – The prefix of the module. Defaults to 'base'.

class mmaction.engine.optimizers.TSMOptimWrapperConstructor(optim_wrapper_cfg: dict, paramwise_cfg: dict | None = None)[source]

Optimizer constructor in TSM model.

This constructor builds optimizer in different ways from the default one.

  1. Parameters of the first conv layer have default lr and weight decay.

  2. Parameters of BN layers have default lr and zero weight decay.

  3. If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.

  4. Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.
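The rules above are selected through the optimizer wrapper config. The fragment below is a sketch of how this constructor might be enabled; the optimizer type and its hyperparameter values are illustrative assumptions, while constructor and fc_lr5 are the names documented here.

```python
# Hypothetical optimizer wrapper config using the TSM constructor.
optim_wrapper = dict(
    constructor='TSMOptimWrapperConstructor',
    # Rule 3: give the last fc layer a 5x lr and 10x weight decay multiplier.
    paramwise_cfg=dict(fc_lr5=True),
    # Base lr and weight decay that the multipliers are applied to
    # (example values only).
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
)
```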

add_params(params, model, **kwargs)[source]

Add parameters and their corresponding lr and wd to the params.

Parameters:
  • params (list) – The list to be modified, containing all parameter groups and their corresponding lr and wd configurations.

  • model (nn.Module) – The model to be trained with the optimizer.

runner

class mmaction.engine.runner.MultiLoaderEpochBasedTrainLoop(runner, dataloader: Dict | DataLoader, other_loaders: List[Dict | DataLoader], max_epochs: int, val_begin: int = 1, val_interval: int = 1)[source]

EpochBasedTrainLoop with multiple dataloaders.

Parameters:
  • runner (Runner) – A reference of runner.

  • dataloader (Dataloader or Dict) – A dataloader object or a dict to build a dataloader for training the model.

  • other_loaders (List of Dataloader or Dict) – A list of other loaders. Each item in the list is a dataloader object or a dict to build a dataloader.

  • max_epochs (int) – Total training epochs.

  • val_begin (int) – The epoch that begins validating. Defaults to 1.

  • val_interval (int) – Validation interval. Defaults to 1.

run_epoch() None[source]

Iterate one epoch.

class mmaction.engine.runner.RetrievalTestLoop(runner, dataloader: DataLoader | Dict, evaluator: Evaluator | Dict | List, fp16: bool = False)[source]

Loop for multimodal retrieval test.

Parameters:
  • runner (Runner) – A reference of runner.

  • dataloader (Dataloader or dict) – A dataloader object or a dict to build a dataloader.

  • evaluator (Evaluator or dict or list) – Used for computing metrics.

  • fp16 (bool) – Whether to enable fp16 testing. Defaults to False.

run() dict[source]

Launch test.

class mmaction.engine.runner.RetrievalValLoop(runner, dataloader: DataLoader | Dict, evaluator: Evaluator | Dict | List, fp16: bool = False)[source]

Loop for multimodal retrieval val.

Parameters:
  • runner (Runner) – A reference of runner.

  • dataloader (Dataloader or dict) – A dataloader object or a dict to build a dataloader.

  • evaluator (Evaluator or dict or list) – Used for computing metrics.

  • fp16 (bool) – Whether to enable fp16 validation. Defaults to False.

run() dict[source]

Launch val.

mmaction.evaluation

functional

class mmaction.evaluation.functional.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[source]

Class to evaluate detection results on ActivityNet.

Parameters:
  • ground_truth_filename (str | None) – The filename of groundtruth. Default: None.

  • prediction_filename (str | None) – The filename of action detection results. Default: None.

  • tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).

  • verbose (bool) – Whether to print verbose logs. Default: False.

evaluate()[source]

Evaluates a prediction file.

For the detection task we measure the interpolated mean average precision to measure the performance of a method.

wrapper_compute_average_precision()[source]

Computes average precision for each class.

mmaction.evaluation.functional.ava_eval(result_file, result_type, label_file, ann_file, exclude_file, verbose=True, ignore_empty_frames=True, custom_classes=None)[source]

Perform ava evaluation.

mmaction.evaluation.functional.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]

Compute average precision (in detection task) between ground truth and predicted data frames. If multiple predictions match the same ground-truth segment, only the one with the highest score is counted as a true positive. This code is greatly inspired by the Pascal VOC devkit.

Parameters:
  • ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.

  • prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.

  • temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns:

1D array of average precision score.

Return type:

np.ndarray

mmaction.evaluation.functional.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]

Computes the average recall given an average number (percentile) of proposals per video.

Parameters:
  • ground_truth (dict) – Dict containing the ground truth instances.

  • proposals (dict) – Dict containing the proposal instances.

  • total_num_proposals (int) – Total number of proposals in the proposal dict.

  • max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.

  • temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns:

(recall, average_recall, proposals_per_video, auc). In recall, recall[i, j] is the recall at the i-th temporal_iou threshold and the j-th average number (percentile) of proposals per video. The average_recall is the recall averaged over the list of temporal_iou thresholds (1D array), which is equivalent to recall.mean(axis=0). The proposals_per_video is the average number of proposals per video. The auc is the area under the AR@AN curve.

Return type:

tuple([np.ndarray, np.ndarray, np.ndarray, float])

mmaction.evaluation.functional.confusion_matrix(y_pred, y_real, normalize=None)[source]

Compute confusion matrix.

Parameters:
  • y_pred (list[int] | np.ndarray[int]) – Prediction labels.

  • y_real (list[int] | np.ndarray[int]) – Ground truth labels.

  • normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.

Returns:

Confusion matrix.

Return type:

np.ndarray

mmaction.evaluation.functional.get_weighted_score(score_list, coeff_list)[source]

Get weighted score with given scores and coefficients.

Given n predictions by different classifier: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n

Parameters:
  • score_list (list[list[np.ndarray]]) – List of list of scores, with shape n(number of predictions) X num_samples X num_classes

  • coeff_list (list[float]) – List of coefficients, with shape n.

Returns:

List of weighted scores.

Return type:

list[np.ndarray]
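The weighted-sum formula given for get_weighted_score can be worked through with a tiny example. The sketch below is a standalone pure-Python illustration that uses nested lists in place of np.ndarray; the function name `weighted_score_sketch` is hypothetical and this is not the library code.

```python
def weighted_score_sketch(score_list, coeff_list):
    """Fuse per-model scores: sum over models of coeff * score."""
    assert len(score_list) == len(coeff_list)
    num_samples = len(score_list[0])
    fused = []
    for i in range(num_samples):
        num_classes = len(score_list[0][i])
        fused.append([
            sum(coeff * scores[i][c]
                for scores, coeff in zip(score_list, coeff_list))
            for c in range(num_classes)
        ])
    return fused


# Two classifiers, one sample, two classes.
model_a = [[0.2, 0.8]]
model_b = [[0.6, 0.4]]
fused = weighted_score_sketch([model_a, model_b], [1.0, 0.5])
# fused[0] is approximately [0.2 + 0.5 * 0.6, 0.8 + 0.5 * 0.4] = [0.5, 1.0]
```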

mmaction.evaluation.functional.interpolated_precision_recall(precision, recall)[source]

Interpolated AP - VOCdevkit from VOC 2011.

Parameters:
  • precision (np.ndarray) – The precision of different thresholds.

  • recall (np.ndarray) – The recall of different thresholds.

Returns:

Average precision score.

Return type:

float
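As a hedged illustration of VOC-style interpolated average precision, the standalone sketch below pads the curves, takes the running maximum of precision from right to left, and sums the area over the recall steps. The function name `interpolated_ap_sketch` is hypothetical, and the sketch uses plain lists rather than np.ndarray.

```python
def interpolated_ap_sketch(precision, recall):
    """Area under the precision envelope, VOC 2011 style."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mprec = [0.0] + list(precision) + [0.0]
    mrec = [0.0] + list(recall) + [1.0]
    # Precision envelope: each point becomes the max of everything to its right.
    for i in range(len(mprec) - 2, -1, -1):
        mprec[i] = max(mprec[i], mprec[i + 1])
    # Sum rectangle areas wherever recall changes.
    return sum((mrec[i] - mrec[i - 1]) * mprec[i]
               for i in range(1, len(mrec))
               if mrec[i] != mrec[i - 1])


# Precision 1.0 at recall 0.5, then precision 0.5 at recall 1.0.
ap = interpolated_ap_sketch([1.0, 0.5], [0.5, 1.0])
# Area: 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```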

mmaction.evaluation.functional.mean_average_precision(scores, labels)[source]

Mean average precision for multi-label recognition.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores of different classes for each sample.

  • labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns:

The mean average precision.

Return type:

np.float64

mmaction.evaluation.functional.mean_class_accuracy(scores, labels)[source]

Calculate mean class accuracy.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int]) – Ground truth labels.

Returns:

Mean class accuracy.

Return type:

np.ndarray
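Mean class accuracy differs from plain accuracy in that each class contributes equally, however many samples it has. The pure-Python sketch below illustrates the idea. The function name `mean_class_accuracy_sketch` is hypothetical, and averaging only over classes that appear in the labels is an assumption that may differ from how the library treats absent classes.

```python
def mean_class_accuracy_sketch(scores, labels):
    """Average the per-class recall over the classes present in labels."""
    # Predicted class = index of the highest score for each sample.
    preds = [max(range(len(s)), key=lambda c: s[c]) for s in scores]
    totals, hits = {}, {}
    for pred, label in zip(preds, labels):
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + int(pred == label)
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
# Class 1: one of two samples correct (0.5). Class 0: none correct (0.0).
mca = mean_class_accuracy_sketch(scores, labels)
```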

mmaction.evaluation.functional.mmit_mean_average_precision(scores, labels)[source]

Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Time. The difference is that this method calculates average precision for each sample and averages them among samples.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores of different classes for each sample.

  • labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns:

The MMIT style mean average precision.

Return type:

np.float64

mmaction.evaluation.functional.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[source]

Compute intersection over union between segments.

Parameters:
  • candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end]/[m x 2:=[init, end]].

  • target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].

  • calculate_overlap_self (bool) – Whether to calculate overlap_self (intersection / candidate_length) or not. Default: False.

Returns:

t_iou (np.ndarray): 1-dim array [n] / 2-dim array [n x m] with IoU ratio.

t_overlap_self (np.ndarray, optional): 1-dim array [n] / 2-dim array [n x m] with overlap_self, returned only when calculate_overlap_self is True.
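For a single candidate segment, temporal IoU reduces to a few lines of arithmetic. The following standalone sketch illustrates the computation with plain Python lists; the function name `temporal_iou_sketch` is hypothetical, and returning 0.0 for a zero-length union is an assumption added to avoid a division by zero.

```python
def temporal_iou_sketch(candidate, targets):
    """IoU between one [start, end] candidate and each target segment."""
    c_start, c_end = candidate
    ious = []
    for t_start, t_end in targets:
        # Overlap length, clipped at zero for disjoint segments.
        inter = max(0.0, min(c_end, t_end) - max(c_start, t_start))
        union = (c_end - c_start) + (t_end - t_start) - inter
        ious.append(inter / union if union > 0 else 0.0)
    return ious


# Partial overlap, no overlap, and an identical segment.
ious = temporal_iou_sketch((0, 10), [(5, 15), (20, 30), (0, 10)])
# Partial overlap: intersection 5, union 15, so IoU is 1/3.
```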

mmaction.evaluation.functional.read_labelmap(labelmap_file)[source]

Reads a labelmap without the dependency on protocol buffers.

Parameters:

labelmap_file – A file object containing a label map protocol buffer.

Returns:

labelmap: The label map in the form used by the object_detection_evaluation module, a list of {“id”: integer, “name”: classname} dicts.

class_ids: A set containing all of the valid class id integers.

mmaction.evaluation.functional.results2csv(results, out_file, custom_classes=None)[source]

Convert detection results to csv file.

mmaction.evaluation.functional.softmax(x, dim=1)[source]

Compute softmax values for each set of scores in x.

mmaction.evaluation.functional.top_k_accuracy(scores, labels, topk=(1,))[source]

Calculate top k accuracy score.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int]) – Ground truth labels.

  • topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).

Returns:

Top k accuracy score for each k.

Return type:

list[float]
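Top-k accuracy counts a sample as correct when its ground-truth label is among the k highest-scoring classes. The pure-Python sketch below illustrates this with plain lists; the function name `top_k_accuracy_sketch` is hypothetical and this is not the library implementation.

```python
def top_k_accuracy_sketch(scores, labels, topk=(1,)):
    """Return one accuracy value per k in `topk`."""
    results = []
    for k in topk:
        hits = 0
        for sample_scores, label in zip(scores, labels):
            # Class indices ordered from highest to lowest score.
            ranked = sorted(range(len(sample_scores)),
                            key=lambda c: sample_scores[c], reverse=True)
            hits += int(label in ranked[:k])
        results.append(hits / len(labels))
    return results


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
# Top-1 is right for the first sample only; top-2 also recovers the second.
top1, top2 = top_k_accuracy_sketch(scores, labels, topk=(1, 2))
```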

mmaction.evaluation.functional.top_k_classes(scores, labels, k=10, mode='accurate')[source]

Calculate the K most accurate (or inaccurate) classes.

Given the prediction scores, ground truth labels and top-k value, compute the top K accurate (or inaccurate) classes.

Parameters:
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int] | np.ndarray) – Ground truth labels.

  • k (int) – Top-k values. Default: 10.

  • mode (str) – Comparison mode for Top-k. Options are ‘accurate’ and ‘inaccurate’. Default: ‘accurate’.

Returns:

List of sorted (from high accuracy to low accuracy for ‘accurate’ mode, and from low accuracy to high accuracy for ‘inaccurate’ mode) top K classes in format of (label_id, acc_ratio).

Return type:

list

metrics

class mmaction.evaluation.metrics.ANetMetric(metric_type: str = 'TEM', collect_device: str = 'cpu', prefix: str | None = None, metric_options: dict = {}, dump_config: ConfigDict | dict = {'out': ''})[source]

ActivityNet dataset evaluation metric.

compute_ARAN(results: list) dict[source]

AR@AN evaluation metric.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

If metric_type is ‘TEM’, only dump middle results and do not compute any metrics.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

dump_results(results, version='VERSION 1.3')[source]

Save middle or final results to disk.

process(data_batch: Sequence[Tuple[Any, dict]], predictions: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • predictions (Sequence[dict]) – A batch of outputs from the model.

static proposals2json(results, show_progress=False)[source]

Convert all proposals to a final dict(json) format.

Parameters:
  • results (list[dict]) – All proposals.

  • show_progress (bool) – Whether to show the progress bar. Defaults to False.

Returns:

The final result dict. E.g.

dict(video-1=[dict(segment=[1.1, 2.0], score=0.9),
              dict(segment=[50.1, 129.3], score=0.6)])

Return type:

dict

class mmaction.evaluation.metrics.AVAMetric(ann_file: str, exclude_file: str, label_file: str, options: Tuple[str] = ('mAP',), action_thr: float = 0.002, num_classes: int = 81, custom_classes: List[int] | None = None, collect_device: str = 'cpu', prefix: str | None = None)[source]

AVA evaluation metric.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, dict]], data_samples: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.AccMetric(metric_list: str | Tuple[str] | None = ('top_k_accuracy', 'mean_class_accuracy'), collect_device: str = 'cpu', metric_options: Dict | None = {'top_k_accuracy': {'topk': (1, 5)}}, prefix: str | None = None)[source]

Accuracy evaluation metric.

calculate(preds: List[ndarray], labels: List[int | ndarray]) Dict[source]

Compute the metrics from processed results.

Parameters:
  • preds (list[np.ndarray]) – List of the prediction scores.

  • labels (list[int | np.ndarray]) – List of the labels.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

compute_metrics(results: List) Dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, Dict]], data_samples: Sequence[Dict]) None[源代码]

Process one batch of data samples and data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

参数:
  • data_batch (Sequence[dict]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.
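The division of work between process and compute_metrics described above can be sketched as follows. ToyAccuracyMetric and the keys pred_label and gt_label are hypothetical names chosen for this illustration; they are not part of mmaction.

```python
from typing import Dict, List, Sequence


class ToyAccuracyMetric:
    """Hypothetical stand-in that mimics the process/compute_metrics flow."""

    def __init__(self) -> None:
        self.results: List[dict] = []

    def process(self, data_batch, data_samples: Sequence[dict]) -> None:
        # Called once per batch: keep only what is needed later.
        for sample in data_samples:
            self.results.append(
                dict(pred=sample['pred_label'], label=sample['gt_label']))

    def compute_metrics(self, results: List[dict]) -> Dict[str, float]:
        # Called once after all batches: reduce the stored entries.
        correct = sum(r['pred'] == r['label'] for r in results)
        return {'top1': correct / len(results)}


metric = ToyAccuracyMetric()
metric.process(None, [dict(pred_label=1, gt_label=1),
                      dict(pred_label=0, gt_label=2)])
metric.process(None, [dict(pred_label=2, gt_label=2)])
summary = metric.compute_metrics(metric.results)
print(summary)
```

Keeping process cheap and deferring the reduction to compute_metrics means per-batch results can be gathered from several ranks before any metric is computed.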

class mmaction.evaluation.metrics.ConfusionMatrix(num_classes: int | None = None, collect_device: str = 'cpu', prefix: str | None = None)[source]

A metric to calculate the confusion matrix for single-label tasks.

Parameters:
  • num_classes (int, optional) – The number of classes. Defaults to None.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

Examples

  1. The basic usage.

>>> import torch
>>> from mmaction.evaluation import ConfusionMatrix
>>> y_pred = [0, 1, 1, 3]
>>> y_true = [0, 2, 1, 3]
>>> ConfusionMatrix.calculate(y_pred, y_true, num_classes=4)
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]])
>>> # plot the confusion matrix
>>> import matplotlib.pyplot as plt
>>> y_score = torch.rand((1000, 10))
>>> y_true = torch.randint(10, (1000, ))
>>> matrix = ConfusionMatrix.calculate(y_score, y_true)
>>> ConfusionMatrix().plot(matrix)
>>> plt.show()
  2. In the config file

val_evaluator = dict(type='ConfusionMatrix')
test_evaluator = dict(type='ConfusionMatrix')

static calculate(pred, target, num_classes=None) dict[source]

Calculate the confusion matrix for single-label task.

Parameters:
  • pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. It can be labels (N, ), or scores of every class (N, C).

  • target (torch.Tensor | np.ndarray | Sequence) – The target of each prediction with shape (N, ).

  • num_classes (int, optional) – The number of classes. If pred contains labels instead of scores, this argument is required. Defaults to None.

Returns:

The confusion matrix.

Return type:

torch.Tensor
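The layout of the returned matrix can be reproduced with a pure-Python sketch. The convention used here (rows index the target, columns index the prediction) is inferred from the example output shown for this class; the function itself is an illustration, not the library's code.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count samples per (target, prediction) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1  # row: ground truth, column: prediction
    return matrix


# Same inputs as the basic usage example for ConfusionMatrix.calculate.
matrix = confusion_matrix([0, 1, 1, 3], [0, 2, 1, 3], num_classes=4)
for row in matrix:
    print(row)
```

The single off-diagonal entry at row 2, column 1 is the sample of class 2 that was predicted as class 1.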

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

static plot(confusion_matrix: Tensor, include_values: bool = False, cmap: str = 'viridis', classes: List[str] | None = None, colorbar: bool = True, show: bool = True)[source]

Draw a confusion matrix with matplotlib.

Modified from Scikit-Learn.

Parameters:
  • confusion_matrix (torch.Tensor) – The confusion matrix to draw.

  • include_values (bool) – Whether to draw the values in the figure. Defaults to False.

  • cmap (str) – The color map to use. Defaults to use “viridis”.

  • classes (list[str], optional) – The names of categories. Defaults to None, which means to use index number.

  • colorbar (bool) – Whether to show the colorbar. Defaults to True.

  • show (bool) – Whether to show the figure immediately. Defaults to True.

process(data_batch, data_samples: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Any) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.MultiSportsMetric(ann_file: str, metric_options: dict | None = {'F_mAP': {'thr': 0.5}, 'V_mAP': {'all': True, 'thr': (0.2, 0.5), 'tube_thr': 15}}, collect_device: str = 'cpu', verbose: bool = True, prefix: str | None = None)[source]

mAP metric for the MultiSports dataset.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, dict]], data_samples: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.RecallatTopK(topK_list: Tuple[int] = (1, 5), threshold: float = 0.5, collect_device: str = 'cpu', prefix: str | None = None)[source]

ActivityNet dataset evaluation metric.

compute_metrics(results: list) dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Sequence[Tuple[Any, dict]], predictions: Sequence[dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[Tuple[Any, dict]]) – A batch of data from the dataloader.

  • predictions (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.ReportVQA(file_path: str, collect_device: str = 'cpu', prefix: str | None = None)[source]

Dump VQA results to the standard JSON format for VQA evaluation.

Parameters:
  • file_path (str) – The file path to save the result file.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Should be modified according to the retrieval_type for unambiguous results. Defaults to None.

compute_metrics(results: List)[source]

Dump the result to a JSON file.

process(data_batch, data_samples) None[source]

Transfer tensors in predictions to CPU.

class mmaction.evaluation.metrics.RetrievalMetric(metric_list: Tuple[str] | str = ('R1', 'R5', 'R10', 'MdR', 'MnR'), collect_device: str = 'cpu', prefix: str | None = None)[source]

Metric for video retrieval task.

Parameters:
  • metric_list (str | tuple[str]) – The list of the metrics to be computed. Defaults to ('R1', 'R5', 'R10', 'MdR', 'MnR').

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
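The default metric names can be read as recall at rank 1, 5 and 10 plus the median and mean rank. The sketch below assumes those usual definitions, 1-based ranks of the correct item, and recall reported as a percentage; the helper retrieval_summary is hypothetical and the library's scaling may differ.

```python
from statistics import mean, median
from typing import Dict, Sequence


def retrieval_summary(ranks: Sequence[int]) -> Dict[str, float]:
    """Summarise the 1-based rank of the correct item for each query."""
    n = len(ranks)
    # R{k}: percentage of queries whose correct item is ranked within the top k.
    summary = {f'R{k}': 100.0 * sum(r <= k for r in ranks) / n
               for k in (1, 5, 10)}
    summary['MdR'] = float(median(ranks))  # median rank, lower is better
    summary['MnR'] = float(mean(ranks))    # mean rank, lower is better
    return summary


summary = retrieval_summary([1, 3, 7, 12, 1])
print(summary)
```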

compute_metrics(results: List) Dict[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

dict

process(data_batch: Dict | None, data_samples: Sequence[Dict]) None[source]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (dict, optional) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.RetrievalRecall(topk: int | Sequence[int], collect_device: str = 'cpu', prefix: str | None = None)[source]

Recall evaluation metric for image retrieval.

Parameters:
  • topk (int | Sequence[int]) – If the ground truth label matches one of the best k predictions, the sample will be regarded as a positive prediction. If a sequence is given, the recall for every k is calculated and reported together.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

static calculate(pred: ndarray | Tensor, target: ndarray | Tensor, topk: int | Sequence[int], pred_indices: bool = False, target_indices: bool = False) float[source]

Calculate the average recall.

Parameters:
  • pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. A torch.Tensor or np.ndarray with shape (N, M) or a sequence of index/onehot format labels.

  • target (torch.Tensor | np.ndarray | Sequence) – The target of each prediction. A torch.Tensor or np.ndarray with shape (N, M) or a sequence of index/onehot format labels.

  • topk (int, Sequence[int]) – Predictions with the k highest scores are considered as positive.

  • pred_indices (bool) – Whether the pred is a sequence of category index labels. Defaults to False.

  • target_indices (bool) – Whether the target is a sequence of category index labels. Defaults to False.

Returns:

The average recalls.

Return type:

List[float]
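The rule stated for topk (a query counts as a hit when a ground-truth item is among the best k predictions) can be sketched in pure Python. The helper topk_recall, the index-list target format and the percentage scaling are assumptions for illustration and not the library's implementation.

```python
from typing import List, Sequence


def topk_recall(scores: Sequence[Sequence[float]],
                targets: Sequence[Sequence[int]],
                topk: Sequence[int]) -> List[float]:
    """Percentage of queries with at least one relevant item in the top k."""
    recalls = []
    for k in topk:
        hits = 0
        for row, relevant in zip(scores, targets):
            # Gallery indices ordered from highest to lowest score.
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
            hits += int(any(i in relevant for i in ranked[:k]))
        recalls.append(100.0 * hits / len(scores))
    return recalls


scores = [[0.9, 0.1, 0.3], [0.2, 0.5, 0.4], [0.3, 0.2, 0.6]]
targets = [[0], [2], [1]]  # relevant gallery indices per query
recalls = topk_recall(scores, targets, topk=(1, 2))
print(recalls)
```

One value is returned per entry of topk, which matches the documented List[float] return type.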

compute_metrics(results: List)[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

Dict

process(data_batch: Sequence[dict], data_samples: Sequence[dict])[source]

Process one batch of data and predictions.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch (Sequence[dict]) – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.VQAAcc(full_score_weight: float = 0.3, collect_device: str = 'cpu', prefix: str | None = None)[source]

VQA Acc metric.

Parameters:
  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Should be modified according to the retrieval_type for unambiguous results. Defaults to None.

compute_metrics(results: List)[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

Dict

process(data_batch, data_samples)[source]

Process one batch of data samples.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

class mmaction.evaluation.metrics.VQAMCACC(collect_device: str = 'cpu', prefix: str | None = None)[source]

VQA multiple choice Acc metric.

Parameters:
  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Should be modified according to the retrieval_type for unambiguous results. Defaults to None.

compute_metrics(results: List)[source]

Compute the metrics from processed results.

Parameters:

results (list) – The processed results of each batch.

Returns:

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type:

Dict

process(data_batch, data_samples)[source]

Process one batch of data samples.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters:
  • data_batch – A batch of data from the dataloader.

  • data_samples (Sequence[dict]) – A batch of outputs from the model.

mmaction.models

backbones

class mmaction.models.backbones.AAGCN(graph_cfg: Dict, in_channels: int = 3, base_channels: int = 64, data_bn_type: str = 'MVC', num_person: int = 2, num_stages: int = 10, inflate_stages: List[int] = [5, 8], down_stages: List[int] = [5, 8], init_cfg: Dict | List[Dict] | None = None, **kwargs)[source]

AAGCN backbone, the attention-enhanced version of 2s-AGCN.

More details can be found in the papers “Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks” and “Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition”.

Parameters:
  • graph_cfg (dict) – Config for building the graph.

  • in_channels (int) – Number of input channels. Defaults to 3.

  • base_channels (int) – Number of base channels. Defaults to 64.

  • data_bn_type (str) – Type of the data bn layer. Defaults to 'MVC'.

  • num_person (int) – Maximum number of people. Only used when data_bn_type == ‘MVC’. Defaults to 2.

  • num_stages (int) – Total number of stages. Defaults to 10.

  • inflate_stages (list[int]) – Stages to inflate the number of channels. Defaults to [5, 8].

  • down_stages (list[int]) – Stages to perform downsampling in the time dimension. Defaults to [5, 8].

  • init_cfg (dict or list[dict], optional) – Config to control the initialization. Defaults to None.

Examples

>>> import torch
>>> from mmaction.models import AAGCN
>>> from mmaction.utils import register_all_modules
>>>
>>> register_all_modules()
>>> mode = 'stgcn_spatial'
>>> batch_size, num_person, num_frames = 2, 2, 150
>>>
>>> # openpose-18 layout
>>> num_joints = 18
>>> model = AAGCN(graph_cfg=dict(layout='openpose', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 18])
>>>
>>> # nturgb+d layout
>>> num_joints = 25
>>> model = AAGCN(graph_cfg=dict(layout='nturgb+d', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 25])
>>>
>>> # coco layout
>>> num_joints = 17
>>> model = AAGCN(graph_cfg=dict(layout='coco', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 17])
>>>
>>> # custom settings
>>> # disable the attention module to degenerate AAGCN to AGCN
>>> model = AAGCN(graph_cfg=dict(layout='coco', mode=mode),
...               gcn_attention=False)
>>> model.init_weights()
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 17])

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.
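The output shapes listed for AAGCN follow from the defaults base_channels=64, inflate_stages=[5, 8] and down_stages=[5, 8]. The sketch below assumes that every inflate stage doubles the channel count and that every down stage halves the temporal length with rounding up; both assumptions are consistent with the documented shapes but are not taken from the implementation, and the helper name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def aagcn_output_shape(batch_size: int, num_person: int, num_frames: int,
                       num_joints: int, base_channels: int = 64,
                       num_stages: int = 10,
                       inflate_stages: Sequence[int] = (5, 8),
                       down_stages: Sequence[int] = (5, 8)
                       ) -> Tuple[int, int, int, int, int]:
    """Estimate the (N, M, C, T, V) shape produced by the backbone."""
    channels, frames = base_channels, num_frames
    for stage in range(1, num_stages + 1):
        if stage in inflate_stages:
            channels *= 2                   # assumed: channels double
        if stage in down_stages:
            frames = math.ceil(frames / 2)  # assumed: stride-2 temporal conv
    return (batch_size, num_person, channels, frames, num_joints)


print(aagcn_output_shape(2, 2, 150, 18))
```

With 150 input frames the two down stages give 75 and then 38 frames, and the two inflate stages raise 64 channels to 256.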

class mmaction.models.backbones.C2D(depth: int, pretrained: str | None = None, torchvision_pretrain: bool = True, in_channels: int = 3, num_stages: int = 4, out_indices: Sequence[int] = (3,), strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), style: str = 'pytorch', frozen_stages: int = -1, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, partial_bn: bool = False, with_cp: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': 'BatchNorm2d', 'val': 1.0}])[source]

C2D backbone.

Compared to ResNet-50, a temporal pool is added after the first bottleneck. The detailed structure is kept the same as in the “video-nonlocal-net” repo. Please refer to https://github.com/facebookresearch/video-nonlocal-net/blob/main/scripts/run_c2d_baseline_400k.sh. Please note that there are some improvements compared to the “Non-local Neural Networks” paper (https://arxiv.org/abs/1711.07971). Differences are noted at https://github.com/facebookresearch/video-nonlocal-net#modifications-for-improving-speed.

forward(x: Tensor) Tensor | Tuple[Tensor][source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor or Tuple[torch.Tensor]

class mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]

C3D backbone.

Parameters:
  • pretrained (str | None) – Name of pretrained model.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.

  • act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.

  • out_dim (int) – The dimension of last layer feature (after flatten). Depends on the input shape. Default: 8192.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for initialization of fc layers. Default: 0.005.

forward(x)[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data. The size of x is (num_batches, 3, 16, 112, 112).

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor
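The default out_dim=8192 can be related to the documented input size (num_batches, 3, 16, 112, 112) with a little arithmetic. The sketch assumes the common C3D layout: size-preserving convolutions, five max-pooling layers with the kernel, stride and padding values listed in the code, and 512 channels in the last stage. These values are assumptions for illustration and are not read from the source shown here.

```python
def pool_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a max-pooling layer along one axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def c3d_flat_dim(frames: int = 16, height: int = 112, width: int = 112,
                 last_channels: int = 512) -> int:
    """Flattened feature size after the last pooling layer."""
    # (kernel_t, kernel_s, stride_t, stride_s, pad_s) per pooling layer.
    # Assumed layout: pool1 keeps time, pool5 pads the spatial axes by 1.
    pools = [(1, 2, 1, 2, 0)] + [(2, 2, 2, 2, 0)] * 3 + [(2, 2, 2, 2, 1)]
    t, h, w = frames, height, width
    for k_t, k_s, s_t, s_s, p_s in pools:
        t = pool_out(t, k_t, s_t)
        h = pool_out(h, k_s, s_s, p_s)
        w = pool_out(w, k_s, s_s, p_s)
    return last_channels * t * h * w


print(c3d_flat_dim())              # 16 x 112 x 112 input
print(c3d_flat_dim(16, 128, 128))  # a larger crop needs a different out_dim
```

Under these assumptions the default input yields 512 x 1 x 4 x 4 features, which is why out_dim has to be changed when the input shape changes.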

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.MViT(arch: str = 'base', spatial_size: int = 224, temporal_size: int = 16, in_channels: int = 3, pretrained: str | None = None, pretrained_type: str | None = None, out_scales: int | Sequence[int] = -1, drop_path_rate: float = 0.0, use_abs_pos_embed: bool = False, interpolate_mode: str = 'trilinear', pool_kernel: tuple = (3, 3, 3), dim_mul: int = 2, head_mul: int = 2, adaptive_kv_stride: tuple = (1, 8, 8), rel_pos_embed: bool = True, residual_pooling: bool = True, dim_mul_in_attention: bool = True, with_cls_token: bool = True, output_cls_token: bool = True, rel_pos_zero_init: bool = False, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_cfg: Dict = {'eps': 1e-06, 'type': 'LN'}, patch_cfg: Dict = {'kernel_size': (3, 7, 7), 'padding': (1, 3, 3), 'stride': (2, 4, 4)}, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': ['Conv2d', 'Conv3d'], 'std': 0.02}, {'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.02}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.02}])[source]

Multi-scale ViT v2.

A PyTorch implementation of MViTv2: Improved Multiscale Vision Transformers for Classification and Detection.

Inspired by the official implementation and the mmclassification implementation.

Parameters:
  • arch (str | dict) –

    MViT architecture. If use string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If use dict, it should have below keys:

    • embed_dims (int): The dimensions of embedding.

    • num_layers (int): The number of layers.

    • num_heads (int): The number of heads in attention modules of the initial layer.

    • downscale_indices (List[int]): The layer indices to downscale the feature map.

    Defaults to ‘base’.

  • spatial_size (int) – The expected input spatial_size shape. Defaults to 224.

  • temporal_size (int) – The expected input temporal_size shape. Defaults to 16.

  • in_channels (int) – The num of input channels. Defaults to 3.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • pretrained_type (str, optional) – Type of pretrained model. choose from ‘imagenet’, ‘maskfeat’, None. Defaults to None, which means load from same architecture.

  • out_scales (int | Sequence[int]) – The output scale indices. They should not exceed the length of downscale_indices. Defaults to -1, which means the last scale.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.

  • use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.

  • interpolate_mode (str) – Select the interpolate mode for absolute position embedding vector resize. Defaults to “trilinear”.

  • pool_kernel (tuple) – kernel size for qkv pooling layers. Defaults to (3, 3, 3).

  • dim_mul (int) – The magnification for embed_dims in the downscale layers. Defaults to 2.

  • head_mul (int) – The magnification for num_heads in the downscale layers. Defaults to 2.

  • adaptive_kv_stride (tuple) – The stride size for kv pooling in the initial layer. Defaults to (1, 8, 8).

  • rel_pos_embed (bool) – Whether to enable the spatial and temporal relative position embedding. Defaults to True.

  • residual_pooling (bool) – Whether to enable the residual connection after attention pooling. Defaults to True.

  • dim_mul_in_attention (bool) – Whether to multiply the embed_dims in attention layers. If False, multiply it in MLP layers. Defaults to True.

  • with_cls_token (bool) – Whether concatenating class token into video tokens as transformer input. Defaults to True.

  • output_cls_token (bool) – Whether output the cls_token. If set True, with_cls_token must be True. Defaults to True.

  • rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters. Defaults to False.

  • mlp_ratio (float) – Ratio of hidden dimensions in MLP layers. Defaults to 4.0.

  • qkv_bias (bool) – enable bias for qkv if True. Defaults to True.

  • norm_cfg (dict) – Config dict for normalization layer for all output features. Defaults to dict(type='LN', eps=1e-6).

  • patch_cfg (dict) – Config dict for the patch embedding layer. Defaults to dict(kernel_size=(3, 7, 7), stride=(2, 4, 4), padding=(1, 3, 3)).

  • init_cfg (dict or list[dict], optional) – The Config for initialization. Defaults to [ dict(type='TruncNormal', layer=['Conv2d', 'Conv3d'], std=0.02), dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.02), dict(type='Constant', layer='LayerNorm', val=1., bias=0.02), ]

Examples

>>> import torch
>>> from mmaction.registry import MODELS
>>> from mmaction.utils import register_all_modules
>>> register_all_modules()
>>>
>>> cfg = dict(type='MViT', arch='tiny', out_scales=[0, 1, 2, 3])
>>> model = MODELS.build(cfg)
>>> model.init_weights()
>>> inputs = torch.rand(1, 3, 16, 224, 224)
>>> outputs = model(inputs)
>>> for i, output in enumerate(outputs):
...     print(f'scale{i}: {output.shape}')
scale0: torch.Size([1, 96, 8, 56, 56])
scale1: torch.Size([1, 192, 8, 28, 28])
scale2: torch.Size([1, 384, 8, 14, 14])
scale3: torch.Size([1, 768, 8, 7, 7])

forward(x: Tensor) Tuple[Tensor | List[Tensor]][source]

Forward the MViT.

init_weights(pretrained: str | None = None) None[source]

Initialize the weights.
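The per-scale shapes printed in the MViT example can be derived from the documented defaults: the patch stride (2, 4, 4) and dim_mul=2. The sketch assumes that embed_dims is 96 for the 'tiny' architecture (read off the example output), that each downscale halves the spatial size, and that the temporal size stays fixed. The helper is hypothetical and ignores padding effects that happen to cancel for these input sizes.

```python
from typing import List, Tuple


def mvit_scale_shapes(embed_dims: int = 96, temporal_size: int = 16,
                      spatial_size: int = 224,
                      patch_stride: Tuple[int, int, int] = (2, 4, 4),
                      dim_mul: int = 2, num_scales: int = 4
                      ) -> List[Tuple[int, int, int, int]]:
    """Return (C, T, H, W) for every output scale."""
    t = temporal_size // patch_stride[0]
    h = spatial_size // patch_stride[1]
    w = spatial_size // patch_stride[2]
    shapes, channels = [], embed_dims
    for _ in range(num_scales):
        shapes.append((channels, t, h, w))
        channels *= dim_mul    # embed_dims grow by dim_mul per downscale
        h, w = h // 2, w // 2  # assumed: only the spatial size is halved
    return shapes


for i, shape in enumerate(mvit_scale_shapes()):
    print(f'scale{i}: {shape}')
```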

class mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7,), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': ['GroupNorm', '_BatchNorm'], 'val': 1.0}])[source]

MobileNetV2 backbone.

Parameters:
  • pretrained (str | None) – Name of pretrained model. Defaults to None.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Defaults to 1.0.

  • out_indices (None or Sequence[int]) – Output from which stages. Defaults to (7, ).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Note that the last stage in MobileNetV2 is conv2. Defaults to -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Defaults to dict(type='Conv').

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='BN2d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU6', inplace=True).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='Kaiming', layer='Conv2d',), dict(type='Constant', layer=['GroupNorm', '_BatchNorm'], val=1.) ].

forward(x)[source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Tensor or Tuple[Tensor]

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters:
  • out_channels (int) – Output channels of the block.

  • num_blocks (int) – Number of blocks.

  • stride (int) – Stride of the first block.

  • expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, pretrained2d=True, **kwargs)[source]

MobileNetV2 backbone for TSM.

Parameters:
  • num_segments (int) – Number of frame segments. Defaults to 8.

  • is_shift (bool) – Whether to make temporal shift in reset layers. Defaults to True.

  • shift_div (int) – Number of div for shift. Defaults to 8.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • **kwargs (keyword arguments, optional) – Arguments for MobileNetV2.

init_structure()[source]

Initiate the parameters either from existing checkpoint or from scratch.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_shift()[source]

Make temporal shift for some layers.
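The role of shift_div can be illustrated with a pure-Python sketch of the temporal shift idea from the TSM paper: one 1/shift_div slice of the channels takes its values from the next frame, another slice from the previous frame, and the rest stay in place. The function name, the list-based layout and the zero padding at the clip borders are assumptions for illustration; this is not the layer used by the backbone.

```python
from typing import List

Frame = List[float]  # one value per channel, spatial dimensions omitted


def temporal_shift(clip: List[Frame], shift_div: int = 8) -> List[Frame]:
    """Shift a fraction of the channels along the time axis."""
    num_frames, channels = len(clip), len(clip[0])
    fold = channels // shift_div  # channels per shifted group
    out = [[0.0] * channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(channels):
            if c < fold:          # first group: read from the next frame
                src = t + 1
            elif c < 2 * fold:    # second group: read from the previous frame
                src = t - 1
            else:                 # remaining channels are not shifted
                src = t
            if 0 <= src < num_frames:
                out[t][c] = clip[src][c]  # out-of-range sources stay zero
    return out


# Three frames with eight channels; value 10 * t + c encodes frame and channel.
clip = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip, shift_div=8)
print(shifted[1])
```

A larger shift_div moves fewer channels, so less information is exchanged between neighbouring frames.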

class mmaction.models.backbones.OmniResNet(layers: List[int] = [3, 4, 6, 3], pretrain_2d: str | None = None, init_cfg: ConfigDict | dict | None = None)[source]

Omni-ResNet that accepts both image and video inputs.

Parameters:
  • layers (List[int]) – Number of layers in each residual stage. Defaults to [3, 4, 6, 3].

  • pretrain_2d (str, optional) – Path to the 2D pretraining checkpoint. Defaults to None.

  • init_cfg (dict or ConfigDict, optional) – The Config for initialization. Defaults to None.

forward(x: Tensor) Tensor[source]

Defines the computation performed at every call.

Accept both 3D (BCTHW for videos) and 2D (BCHW for images) tensors.

forward_2d(x: Tensor) Tensor[source]

Forward call for 2D tensors.

class mmaction.models.backbones.RGBPoseConv3D(pretrained: str | None = None, speed_ratio: int = 4, channel_ratio: int = 4, rgb_detach: bool = False, pose_detach: bool = False, rgb_drop_path: float = 0, pose_drop_path: float = 0, rgb_pathway: Dict = {'base_channels': 64, 'conv1_kernel': (1, 7, 7), 'fusion_kernel': 7, 'inflate': (0, 0, 1, 1), 'lateral': True, 'lateral_activate': (0, 0, 1, 1), 'lateral_infl': 1, 'num_stages': 4, 'with_pool2': False}, pose_pathway: Dict = {'base_channels': 32, 'conv1_kernel': (1, 7, 7), 'conv1_stride_s': 1, 'conv1_stride_t': 1, 'dilations': (1, 1, 1), 'fusion_kernel': 7, 'in_channels': 17, 'inflate': (0, 1, 1), 'lateral': True, 'lateral_activate': (0, 1, 1), 'lateral_infl': 16, 'lateral_inv': True, 'num_stages': 3, 'out_indices': (2,), 'pool1_stride_s': 1, 'pool1_stride_t': 1, 'spatial_strides': (2, 2, 2), 'stage_blocks': (4, 6, 3), 'temporal_strides': (1, 1, 1), 'with_pool2': False}, init_cfg: Dict | List[Dict] | None = None)[source]

RGBPoseConv3D backbone.

Parameters:
  • pretrained (str) – The file path to a pretrained model. Defaults to None.

  • speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Defaults to 4.

  • channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Defaults to 4.

  • rgb_detach (bool) – Whether to detach the gradients from the pose path. Defaults to False.

  • pose_detach (bool) – Whether to detach the gradients from the rgb path. Defaults to False.

  • rgb_drop_path (float) – The drop rate for dropping the features from the pose path. Defaults to 0.

  • pose_drop_path (float) – The drop rate for dropping the features from the rgb path. Defaults to 0.

  • rgb_pathway (dict) – Configuration of rgb branch. Defaults to dict(num_stages=4, lateral=True, lateral_infl=1, lateral_activate=(0, 0, 1, 1), fusion_kernel=7, base_channels=64, conv1_kernel=(1, 7, 7), inflate=(0, 0, 1, 1), with_pool2=False).

  • pose_pathway (dict) – Configuration of pose branch. Defaults to dict(num_stages=3, stage_blocks=(4, 6, 3), lateral=True, lateral_inv=True, lateral_infl=16, lateral_activate=(0, 1, 1), fusion_kernel=7, in_channels=17, base_channels=32, out_indices=(2, ), conv1_kernel=(1, 7, 7), conv1_stride_s=1, conv1_stride_t=1, pool1_stride_s=1, pool1_stride_t=1, inflate=(0, 1, 1), spatial_strides=(2, 2, 2), temporal_strides=(1, 1, 1), with_pool2=False).

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(imgs: Tensor, heatmap_imgs: Tensor) tuple[source]

Defines the computation performed at every call.

Parameters:
  • imgs (torch.Tensor) – The input data.

  • heatmap_imgs (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

tuple[torch.Tensor]

init_weights() None[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.ResNet(depth: int, pretrained: str | None = None, torchvision_pretrain: bool = True, in_channels: int = 3, num_stages: int = 4, out_indices: Sequence[int] = (3,), strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), style: str = 'pytorch', frozen_stages: int = -1, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, partial_bn: bool = False, with_cp: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': 'BatchNorm2d', 'val': 1.0}])[source]

ResNet backbone.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • torchvision_pretrain (bool) – Whether to load pretrained model from torchvision. Defaults to True.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • num_stages (int) – Resnet stages. Defaults to 4.

  • out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).

  • strides (Sequence[int]) – Strides of the first block of each stage. Defaults to (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).

  • style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to pytorch.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.

  • conv_cfg (dict or ConfigDict) – Config for conv layers. Defaults to dict(type='Conv').

  • norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN2d', requires_grad=True).

  • act_cfg (Union[dict, ConfigDict]) – Config for activate layers. Defaults to dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • partial_bn (bool) – Whether to use partial bn. Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='Kaiming', layer='Conv2d'), dict(type='Constant', layer='BatchNorm2d', val=1.) ].
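As a rough illustration of how strides and style interact, the sketch below builds a config fragment from the documented arguments and computes the overall downsampling factor. The helper names (stride_layer, total_stride) are hypothetical, and the stem stride of 4 (a stride-2 conv followed by a stride-2 max pool, as in the standard ResNet stem) is an assumption rather than something stated in this reference.

```python
from typing import Sequence


def stride_layer(style: str) -> str:
    """Return which bottleneck conv carries the stride, per the `style` docs."""
    if style == 'pytorch':
        return '3x3 conv'
    if style == 'caffe':
        return 'first 1x1 conv'
    raise ValueError(f'unknown style: {style}')


def total_stride(strides: Sequence[int], stem_stride: int = 4) -> int:
    """Multiply the stem stride by the first-block stride of every stage.

    stem_stride=4 is an assumption (stride-2 conv + stride-2 max pool).
    """
    result = stem_stride
    for stride in strides:
        result *= stride
    return result


# Config fragment using only arguments documented above.
backbone_cfg = dict(
    type='ResNet',
    depth=50,
    out_indices=(3, ),
    strides=(1, 2, 2, 2),
    style='pytorch',
    norm_eval=False)

factor = total_stride(backbone_cfg['strides'])
print(factor)         # 4 * 1 * 2 * 2 * 2 = 32
print(224 // factor)  # a 224x224 input would give a 7x7 feature map
print(stride_layer(backbone_cfg['style']))
```

Under these assumptions, changing an entry of strides from 2 to 1 halves the total factor and doubles the output resolution of the last stage.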

forward(x: Tensor) → Tensor | Tuple[Tensor] [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Union[torch.Tensor, Tuple[torch.Tensor]]

init_weights() → None [source]

Initiate the parameters either from existing checkpoint or from scratch.

train(mode: bool = True) → None [source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs) [source]

ResNet (2+1)d backbone.

This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition

forward(x) [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

class mmaction.models.backbones.ResNet3d(depth: int = 50, pretrained: str | None = None, stage_blocks: Tuple | None = None, pretrained2d: bool = True, in_channels: int = 3, num_stages: int = 4, base_channels: int = 64, out_indices: Sequence[int] = (3,), spatial_strides: Sequence[int] = (1, 2, 2, 2), temporal_strides: Sequence[int] = (1, 1, 1, 1), dilations: Sequence[int] = (1, 1, 1, 1), conv1_kernel: Sequence[int] = (3, 7, 7), conv1_stride_s: int = 2, conv1_stride_t: int = 1, pool1_stride_s: int = 2, pool1_stride_t: int = 1, with_pool1: bool = True, with_pool2: bool = True, style: str = 'pytorch', frozen_stages: int = -1, inflate: Sequence[int] = (1, 1, 1, 1), inflate_style: str = '3x1x1', conv_cfg: Dict = {'type': 'Conv3d'}, norm_cfg: Dict = {'requires_grad': True, 'type': 'BN3d'}, act_cfg: Dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, with_cp: bool = False, non_local: Sequence[int] = (0, 0, 0, 0), non_local_cfg: Dict = {}, zero_init_residual: bool = True, init_cfg: Dict | List[Dict] | None = None, **kwargs) [source]

ResNet 3d backbone.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}. Defaults to 50.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • stage_blocks (tuple, optional) – Number of residual blocks in each stage. Defaults to None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • num_stages (int) – Resnet stages. Defaults to 4.

  • base_channels (int) – Channel num of stem output features. Defaults to 64.

  • out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Defaults to (1, 2, 2, 2).

  • temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Defaults to (1, 1, 1, 1).

  • dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).

  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Defaults to (3, 7, 7).

  • conv1_stride_s (int) – Spatial stride of the first conv layer. Defaults to 2.

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Defaults to 1.

  • pool1_stride_s (int) – Spatial stride of the first pooling layer. Defaults to 2.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Defaults to 1.

  • with_pool2 (bool) – Whether to use pool2. Defaults to True.

  • style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.

  • inflate (Sequence[int]) – Inflate Dims of each block. Defaults to (1, 1, 1, 1).

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to 3x1x1.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Defaults to dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Defaults to (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Defaults to dict().

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(x: Tensor) → Tensor | Tuple[Tensor] [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor or tuple[torch.Tensor]

inflate_weights(logger: MMLogger) → None [source]

Inflate weights.
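A minimal sketch of one common inflation rule (the I3D-style scheme): a 2D kernel is repeated along the new temporal axis and divided by the temporal kernel size. Whether inflate_weights follows exactly this rule is an assumption here, and inflate_kernel is a hypothetical helper written with plain lists so the arithmetic stays visible.

```python
from typing import List

Kernel2d = List[List[float]]        # indexed as [kh][kw]
Kernel3d = List[List[List[float]]]  # indexed as [kt][kh][kw]


def inflate_kernel(kernel_2d: Kernel2d, kernel_t: int) -> Kernel3d:
    """Repeat a 2D kernel kernel_t times along time, scaled by 1 / kernel_t."""
    return [[[w / kernel_t for w in row] for row in kernel_2d]
            for _ in range(kernel_t)]


kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
kernel_3d = inflate_kernel(kernel_2d, kernel_t=3)

# Summing over the temporal axis recovers the 2D kernel, so a clip made of
# identical frames yields the same response as the 2D model on one frame.
recovered = [[sum(kernel_3d[t][i][j] for t in range(3)) for j in range(2)]
             for i in range(2)]
print(recovered)
```

The rescaling is what keeps activations at the same magnitude as in the pretrained 2D network right after inflation.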

init_weights(pretrained: str | None = None) → None [source]

Initialize weights.

static make_res_layer(block: Module, inplanes: int, planes: int, blocks: int, spatial_stride: int | Sequence[int] = 1, temporal_stride: int | Sequence[int] = 1, dilation: int = 1, style: str = 'pytorch', inflate: int | Sequence[int] = 1, inflate_style: str = '3x1x1', non_local: int | Sequence[int] = 0, non_local_cfg: Dict = {}, norm_cfg: Dict | None = None, act_cfg: Dict | None = None, conv_cfg: Dict | None = None, with_cp: bool = False, **kwargs) → Module [source]

Build residual layer for ResNet3D.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Defaults to 1.

  • temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Defaults to 1.

  • dilation (int) – Spacing between kernel elements. Defaults to 1.

  • style (str) – 'pytorch' or 'caffe'. If set to 'pytorch', the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.

  • inflate (int | Sequence[int]) – Determine whether to inflate for each block. Defaults to 1.

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to '3x1x1'.

  • non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Defaults to 0.

  • non_local_cfg (dict) – Config for non-local module. Defaults to dict().

  • conv_cfg (dict, optional) – Config for conv layers. Defaults to None.

  • norm_cfg (dict, optional) – Config for norm layers. Defaults to None.

  • act_cfg (dict, optional) – Config for activate layers. Defaults to None.

  • with_cp (bool, optional) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

Returns:

A residual layer for the given config.

Return type:

nn.Module
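The inflate and inflate_style arguments jointly pick the kernel sizes of conv1 and conv2 in a bottleneck block. The mapping below is a sketch of one plausible reading of those options, not a statement of what the library does internally: bottleneck_kernels is a hypothetical helper, and the specific (time, height, width) tuples are assumptions.

```python
from typing import Dict, Tuple

Kernel = Tuple[int, int, int]  # (time, height, width)


def bottleneck_kernels(inflate: bool, inflate_style: str) -> Dict[str, Kernel]:
    """Return assumed conv1/conv2 kernel sizes for a 3D bottleneck block."""
    if inflate_style not in ('3x1x1', '3x3x3'):
        raise ValueError(f'unknown inflate_style: {inflate_style}')
    if not inflate:
        # No temporal extent: the block behaves like a per-frame 2D block.
        return {'conv1': (1, 1, 1), 'conv2': (1, 3, 3)}
    if inflate_style == '3x1x1':
        # Temporal modelling is placed in conv1, spatial in conv2.
        return {'conv1': (3, 1, 1), 'conv2': (1, 3, 3)}
    # '3x3x3': conv2 becomes a full spatiotemporal convolution.
    return {'conv1': (1, 1, 1), 'conv2': (3, 3, 3)}


for inflate in (False, True):
    for style in ('3x1x1', '3x3x3'):
        print(inflate, style, bottleneck_kernels(inflate, style))
```

Under this reading, '3x1x1' adds fewer parameters than '3x3x3' because the temporal kernel sits on a 1x1 spatial footprint.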

train(mode: bool = True) → None [source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs) [source]

ResNet backbone for CSN.

Parameters:
  • depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True, eps=1e-3).

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x3x3'.

  • bottleneck_mode (str) –

    Determines how a 3D bottleneck block is factorized using channel-separated convolutions.

    If set to 'ip', the 3x3x3 conv2 layer is replaced with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., the interaction-preserved channel-separated bottleneck block. If set to 'ir', the 3x3x3 conv2 layer is replaced with a 3x3x3 depthwise convolution, which is derived from the interaction-preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., the interaction-reduced channel-separated bottleneck block.

    Default: 'ir'.

  • kwargs (dict, optional) – Keyword arguments for make_res_layer.
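To make the difference between the two bottleneck_mode values concrete, the sketch below counts conv2 weights for a standard 3x3x3 convolution, the 'ip' variant, and the 'ir' variant. It is plain arithmetic under stated assumptions: equal input and output channel counts, no bias terms, and hypothetical helper names (conv3d_params, conv2_params); the 'standard' mode exists only as a baseline for comparison.

```python
from typing import Tuple


def conv3d_params(in_ch: int, out_ch: int,
                  kernel: Tuple[int, int, int] = (3, 3, 3),
                  groups: int = 1) -> int:
    """Weight count of a 3D convolution (bias ignored)."""
    kt, kh, kw = kernel
    return (in_ch // groups) * out_ch * kt * kh * kw


def conv2_params(channels: int, mode: str) -> int:
    """Weights used in place of conv2 for each factorization mode."""
    depthwise = conv3d_params(channels, channels, (3, 3, 3), groups=channels)
    if mode == 'ir':
        # Depthwise 3x3x3 only.
        return depthwise
    if mode == 'ip':
        # 1x1x1 traditional convolution followed by depthwise 3x3x3.
        return conv3d_params(channels, channels, (1, 1, 1)) + depthwise
    if mode == 'standard':
        # Baseline for comparison, not a documented bottleneck_mode value.
        return conv3d_params(channels, channels, (3, 3, 3))
    raise ValueError(f'unknown mode: {mode}')


for mode in ('standard', 'ip', 'ir'):
    print(mode, conv2_params(64, mode))
```

With 64 channels this gives 110592 weights for the standard convolution, 5824 for 'ip', and 1728 for 'ir', which shows where the channel-separated savings come from.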

train(mode=True) [source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dLayer(depth: int, pretrained: str | None = None, pretrained2d: bool = True, stage: int = 3, base_channels: int = 64, spatial_stride: int = 2, temporal_stride: int = 1, dilation: int = 1, style: str = 'pytorch', all_frozen: bool = False, inflate: int = 1, inflate_style: str = '3x1x1', conv_cfg: Dict = {'type': 'Conv3d'}, norm_cfg: Dict = {'requires_grad': True, 'type': 'BN3d'}, act_cfg: Dict = {'inplace': True, 'type': 'ReLU'}, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = True, init_cfg: Dict | List[Dict] | None = None, **kwargs) [source]

ResNet 3d Layer.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • stage (int) – The index of Resnet stage. Defaults to 3.

  • base_channels (int) – Channel num of stem output features. Defaults to 64.

  • spatial_stride (int) – The 1st res block’s spatial stride. Defaults to 2.

  • temporal_stride (int) – The 1st res block’s temporal stride. Defaults to 1.

  • dilation (int) – The dilation. Defaults to 1.

  • style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Defaults to 'pytorch'.

  • all_frozen (bool) – Whether to freeze all modules in the layer. Defaults to False.

  • inflate (int) – Inflate dims of each block. Defaults to 1.

  • inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Defaults to '3x1x1'.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Defaults to dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the residual layer.

Return type:

torch.Tensor

inflate_weights(logger: MMLogger) → None [source]

Inflate weights.

init_weights(pretrained: str | None = None) → None [source]

Initialize weights.

train(mode: bool = True) → None [source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dSlowFast(pretrained: str | None = None, resample_rate: int = 8, speed_ratio: int = 8, channel_ratio: int = 8, slow_pathway: Dict = {'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway: Dict = {'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, init_cfg: Dict | List[Dict] | None = None) [source]

Slowfast backbone.

This module is proposed in SlowFast Networks for Video Recognition

Parameters:
  • pretrained (str) – The file path to a pretrained model.

  • resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline by resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Defaults to 8.

  • speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Defaults to 8.

  • channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Defaults to 8.

  • slow_pathway (dict) – Configuration of slow branch. Defaults to dict(type='resnet3d', depth=50, pretrained=None, lateral=True, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1)).

  • fast_pathway (dict) – Configuration of fast branch. Defaults to dict(type='resnet3d', depth=50, pretrained=None, lateral=False, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1).

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Defaults to None.
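The three ratios above can be worked through with simple arithmetic. This sketch assumes the slow pathway keeps one frame out of every resample_rate input frames and the fast pathway one out of every resample_rate // speed_ratio; the helper names (slowfast_frames, fast_channels) are hypothetical, and the integer-division rule is an assumption rather than a quote from the implementation.

```python
from typing import Tuple


def slowfast_frames(num_frames: int, resample_rate: int = 8,
                    speed_ratio: int = 8) -> Tuple[int, int]:
    """Return (slow_frames, fast_frames) seen by the two pathways."""
    if resample_rate % speed_ratio != 0:
        raise ValueError('resample_rate should be divisible by speed_ratio')
    fast_stride = resample_rate // speed_ratio
    slow_frames = num_frames // resample_rate
    fast_frames = num_frames // fast_stride
    return slow_frames, fast_frames


def fast_channels(slow_base_channels: int = 64, channel_ratio: int = 8) -> int:
    """Base channel count of the fast pathway (slow width / channel_ratio)."""
    return slow_base_channels // channel_ratio


slow, fast = slowfast_frames(num_frames=32)
print(slow, fast)       # temporal lengths of the slow and fast inputs
print(fast // slow)     # equals speed_ratio
print(fast_channels())  # matches base_channels=8 in the default fast_pathway
```

With the defaults and a 32-frame clip, the slow pathway sees 4 frames and the fast pathway 32, so the fast pathway is 8 times denser in time but 8 times thinner in channels.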

forward(x: Tensor) → tuple [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

tuple[torch.Tensor]

init_weights(pretrained: str | None = None) → None [source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.ResNet3dSlowOnly(conv1_kernel: Sequence[int] = (1, 7, 7), conv1_stride_t: int = 1, pool1_stride_t: int = 1, inflate: Sequence[int] = (0, 0, 1, 1), with_pool2: bool = False, **kwargs) [source]

SlowOnly backbone based on ResNet3dPathway.

Parameters:
  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Defaults to (1, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Defaults to 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Defaults to 1.

  • inflate (Sequence[int]) – Inflate dims of each block. Defaults to (0, 0, 1, 1).

  • with_pool2 (bool) – Whether to use pool2. Defaults to False.

class mmaction.models.backbones.ResNetAudio(depth: int, pretrained: str | None = None, in_channels: int = 1, num_stages: int = 4, base_channels: int = 32, strides: Sequence[int] = (1, 2, 2, 2), dilations: Sequence[int] = (1, 1, 1, 1), conv1_kernel: int = 9, conv1_stride: int = 1, frozen_stages: int = -1, factorize: Sequence[int] = (1, 1, 0, 0), norm_eval: bool = False, with_cp: bool = False, conv_cfg: ConfigDict | dict = {'type': 'Conv'}, norm_cfg: ConfigDict | dict = {'requires_grad': True, 'type': 'BN2d'}, act_cfg: ConfigDict | dict = {'inplace': True, 'type': 'ReLU'}, zero_init_residual: bool = True) [source]

ResNet 2d audio backbone. Reference:

Parameters:
  • depth (int) – Depth of resnet, from {50, 101, 152}.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • in_channels (int) – Channel num of input features. Defaults to 1.

  • base_channels (int) – Channel num of stem output features. Defaults to 32.

  • num_stages (int) – Resnet stages. Defaults to 4.

  • strides (Sequence[int]) – Strides of residual blocks of each stage. Defaults to (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Defaults to (1, 1, 1, 1).

  • conv1_kernel (int) – Kernel size of the first conv layer. Defaults to 9.

  • conv1_stride (Union[int, Tuple[int]]) – Stride of the first conv layer. Defaults to 1.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Defaults to -1.

  • factorize (Sequence[int]) – Whether to factorize the blocks of each stage for audio. Defaults to (1, 1, 0, 0).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Defaults to False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • conv_cfg (Union[dict, ConfigDict]) – Config for conv layers. Defaults to dict(type='Conv').

  • norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Required keys are type and requires_grad. Defaults to dict(type='BN2d', requires_grad=True).

  • act_cfg (Union[dict, ConfigDict]) – Config for activate layers. Defaults to dict(type='ReLU', inplace=True).

  • zero_init_residual (bool) – Whether to use zero initialization for residual block. Defaults to True.

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

init_weights() → None [source]

Initiate the parameters either from existing checkpoint or from scratch.

static make_res_layer(block: Module, inplanes: int, planes: int, blocks: int, stride: int = 1, dilation: int = 1, factorize: int = 1, norm_cfg: ConfigDict | dict | None = None, with_cp: bool = False) → Module [source]

Build residual layer for ResNetAudio.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • stride (int) – Strides of residual blocks of each stage. Defaults to 1.

  • dilation (int) – Spacing between kernel elements. Defaults to 1.

  • factorize (Union[int, Sequence[int]]) – Determine whether to factorize for each block. Defaults to 1.

  • norm_cfg (Union[dict, ConfigDict], optional) – Config for norm layers. Defaults to None.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

Returns:

A residual layer for the given config.

Return type:

nn.Module

train(mode: bool = True) → None [source]

Set the optimization status when training.

class mmaction.models.backbones.ResNetTIN(depth, is_tin=True, **kwargs) [source]

ResNet backbone for TIN.

Parameters:
  • depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments. Default: 8.

  • is_tin (bool) – Whether to apply temporal interlace. Default: True.

  • shift_div (int) – Number of division parts for shift. Default: 4.

  • kwargs (dict, optional) – Arguments for ResNet.

init_structure() [source]

Initialize structure for tsm.

make_temporal_interlace() [source]

Make temporal interlace for some layers.

class mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, pretrained2d=True, **kwargs) [source]

ResNet backbone for TSM.

Parameters:
  • num_segments (int) – Number of frame segments. Defaults to 8.

  • is_shift (bool) – Whether to make temporal shift in resnet layers. Defaults to True.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Defaults to (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Defaults to dict().

  • shift_div (int) – Number of div for shift. Defaults to 8.

  • shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Defaults to ‘blockres’.

  • temporal_pool (bool) – Whether to add temporal pooling. Defaults to False.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • **kwargs (keyword arguments, optional) – Arguments for ResNet.
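The role of shift_div is easiest to see on a toy tensor. The sketch below applies a temporal shift to a single clip stored as a list of frames, each a list of channel values: the first num_channels // shift_div channels are taken from the next frame, the following num_channels // shift_div from the previous frame, and the rest stay in place. The shift directions and zero padding at the clip boundaries follow the usual description of TSM and are assumptions here; temporal_shift is a hypothetical helper, not the library's function.

```python
from typing import List


def temporal_shift(x: List[List[float]], shift_div: int = 8) -> List[List[float]]:
    """Shift part of the channels along time for one clip x[t][c]."""
    num_frames = len(x)
    num_channels = len(x[0])
    fold = num_channels // shift_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # channel group 1: read from the next frame
            elif c < 2 * fold:
                src = t - 1   # channel group 2: read from the previous frame
            else:
                src = t       # remaining channels are left untouched
            if 0 <= src < num_frames:
                out[t][c] = x[src][c]
            # Positions without a source frame keep the zero padding.
    return out


# Three frames, eight channels; value encodes (frame, channel) as 10*t + c.
clip = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip, shift_div=4)
for frame in shifted:
    print(frame)
```

A larger shift_div moves a smaller fraction of channels, trading temporal mixing against preservation of the per-frame features.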

init_structure() [source]

Initialize structure for tsm.

init_weights() [source]

Initiate the parameters either from existing checkpoint or from scratch.

load_original_weights(logger) [source]

Load weights from the original checkpoint, which requires converting keys.

make_non_local() [source]

Wrap resnet layer into non local wrapper.

make_temporal_pool() [source]

Make temporal pooling between layer1 and layer2, using a 3D max pooling layer.

make_temporal_shift() [source]

Make temporal shift for some layers.

class mmaction.models.backbones.STGCN(graph_cfg: Dict, in_channels: int = 3, base_channels: int = 64, data_bn_type: str = 'VC', ch_ratio: int = 2, num_person: int = 2, num_stages: int = 10, inflate_stages: List[int] = [5, 8], down_stages: List[int] = [5, 8], init_cfg: Dict | List[Dict] | None = None, **kwargs) [source]

STGCN backbone.

An implementation of Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. More details can be found in the paper.

Parameters:
  • graph_cfg (dict) – Config for building the graph.

  • in_channels (int) – Number of input channels. Defaults to 3.

  • base_channels (int) – Number of base channels. Defaults to 64.

  • data_bn_type (str) – Type of the data bn layer. Defaults to 'VC'.

  • ch_ratio (int) – Inflation ratio of the number of channels. Defaults to 2.

  • num_person (int) – Maximum number of people. Only used when data_bn_type == 'MVC'. Defaults to 2.

  • num_stages (int) – Total number of stages. Defaults to 10.

  • inflate_stages (list[int]) – Stages to inflate the number of channels. Defaults to [5, 8].

  • down_stages (list[int]) – Stages to perform downsampling in the time dimension. Defaults to [5, 8].

  • stage_cfgs (dict) – Extra config dict for each stage. Defaults to dict().

  • init_cfg (dict or list[dict], optional) – Config to control the initialization. Defaults to None.

Examples

>>> import torch
>>> from mmaction.models import STGCN
>>>
>>> mode = 'stgcn_spatial'
>>> batch_size, num_person, num_frames = 2, 2, 150
>>>
>>> # openpose-18 layout
>>> num_joints = 18
>>> model = STGCN(graph_cfg=dict(layout='openpose', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # nturgb+d layout
>>> num_joints = 25
>>> model = STGCN(graph_cfg=dict(layout='nturgb+d', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # coco layout
>>> num_joints = 17
>>> model = STGCN(graph_cfg=dict(layout='coco', mode=mode))
>>> model.init_weights()
>>> inputs = torch.randn(batch_size, num_person,
...                      num_frames, num_joints, 3)
>>> output = model(inputs)
>>> print(output.shape)
>>>
>>> # custom settings
>>> # instantiate STGCN++
>>> model = STGCN(graph_cfg=dict(layout='coco', mode='spatial'),
...               gcn_adaptive='init', gcn_with_res=True,
...               tcn_type='mstcn')
>>> model.init_weights()
>>> output = model(inputs)
>>> print(output.shape)
torch.Size([2, 2, 256, 38, 18])
torch.Size([2, 2, 256, 38, 25])
torch.Size([2, 2, 256, 38, 17])
torch.Size([2, 2, 256, 38, 17])

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.
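The output shapes listed in the class example, such as torch.Size([2, 2, 256, 38, 18]) for 150 input frames and 18 joints, can be reproduced with a small calculation. The sketch assumes that each stage in inflate_stages multiplies the channel count by ch_ratio and that each stage in down_stages halves the temporal length, rounding up; stgcn_output_shape is a hypothetical helper and these rules are inferred from the documented defaults and shapes, not read from the implementation.

```python
import math
from typing import Sequence, Tuple


def stgcn_output_shape(batch_size: int, num_person: int, num_frames: int,
                       num_joints: int, base_channels: int = 64,
                       ch_ratio: int = 2, num_stages: int = 10,
                       inflate_stages: Sequence[int] = (5, 8),
                       down_stages: Sequence[int] = (5, 8)
                       ) -> Tuple[int, int, int, int, int]:
    """Estimate the (N, M, C, T, V) shape of the backbone output."""
    channels = base_channels
    frames = num_frames
    for stage in range(1, num_stages + 1):
        if stage in inflate_stages:
            channels *= ch_ratio              # widen the features
        if stage in down_stages:
            frames = math.ceil(frames / 2)    # assumed stride-2 temporal conv
    return (batch_size, num_person, channels, frames, num_joints)


print(stgcn_output_shape(2, 2, 150, 18))
print(stgcn_output_shape(2, 2, 150, 25))
```

With the defaults, the channels go 64 → 128 → 256 and the frames 150 → 75 → 38, while the joint dimension is unchanged.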

class mmaction.models.backbones.SwinTransformer3D(arch: str | Dict, pretrained: str | None = None, pretrained2d: bool = True, patch_size: int | Sequence[int] = (2, 4, 4), in_channels: int = 3, window_size: Sequence[int] = (8, 7, 7), mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, act_cfg: Dict = {'type': 'GELU'}, norm_cfg: Dict = {'type': 'LN'}, patch_norm: bool = True, frozen_stages: int = -1, with_cp: bool = False, out_indices: Sequence[int] = (3,), out_after_downsample: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}]) [source]

Video Swin Transformer backbone.

A PyTorch implementation of Video Swin Transformer.

Parameters:
  • arch (str or dict) – Video Swin Transformer architecture. If a string, choose from 'tiny', 'small', 'base' and 'large'. If a dict, it should have the following keys:

    - embed_dims (int): The dimensions of embedding.
    - depths (Sequence[int]): The number of blocks in each stage.
    - num_heads (Sequence[int]): The number of heads in attention modules of each stage.

  • pretrained (str, optional) – Name of pretrained model. Defaults to None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Defaults to True.

  • patch_size (int or Sequence(int)) – Patch size. Defaults to (2, 4, 4).

  • in_channels (int) – Number of input image channels. Defaults to 3.

  • window_size (Sequence[int]) – Window size. Defaults to (8, 7, 7).

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Defaults to 4.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Defaults to True.

  • qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.

  • drop_rate (float) – Dropout rate. Defaults to 0.0.

  • attn_drop_rate (float) – Attention dropout rate. Defaults to 0.0.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type='GELU').

  • norm_cfg (dict) – Config dict for norm layer. Defaults to dict(type='LN').

  • patch_norm (bool) – If True, add normalization after patch embedding. Defaults to True.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Defaults to False.

  • out_indices (Sequence[int]) – Indices of output feature. Defaults to (3, ).

  • out_after_downsample (bool) – Whether to output the feature map of a stage after the following downsample layer. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].
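When arch is passed as a dict, it needs the three keys listed above. The sketch below checks a custom dict and computes the token grid produced by the default patch_size of (2, 4, 4). Both helpers (check_arch, patch_grid) are hypothetical, the example values are the ones commonly quoted for the tiny variant in the Swin paper and serve only as illustration, and the grid formula assumes input sizes that are exact multiples of the patch size.

```python
from typing import Dict, Sequence, Tuple

REQUIRED_KEYS = {'embed_dims', 'depths', 'num_heads'}


def check_arch(arch: Dict) -> Dict:
    """Validate a custom arch dict against the documented keys."""
    missing = REQUIRED_KEYS - set(arch)
    if missing:
        raise KeyError(f'arch is missing keys: {sorted(missing)}')
    if len(arch['depths']) != len(arch['num_heads']):
        raise ValueError('depths and num_heads need one entry per stage')
    return arch


def patch_grid(num_frames: int, height: int, width: int,
               patch_size: Sequence[int] = (2, 4, 4)) -> Tuple[int, int, int]:
    """Token grid (T, H, W) after patch embedding for divisible inputs."""
    pt, ph, pw = patch_size
    return (num_frames // pt, height // ph, width // pw)


custom_arch = check_arch(
    dict(embed_dims=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24]))
print(len(custom_arch['depths']))  # number of stages
print(patch_grid(32, 224, 224))    # (16, 56, 56)
```

The resulting grid is what the window_size argument partitions, so a (8, 7, 7) window divides a (16, 56, 56) grid evenly.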

forward(x: Tensor) → Tuple[Tensor] | Tensor [source]

Forward function for Swin3d Transformer.

inflate_weights(logger: MMLogger) → None [source]

Inflate the swin2d parameters to swin3d.

The differences between swin3d and swin2d mainly lie in an extra axis. To utilize the pretrained parameters in 2d model, the weight of swin2d models should be inflated to fit in the shapes of the 3d counterpart.

Parameters:

logger (MMLogger) – The logger used to print debugging information.

init_weights() → None [source]

Initialize the weights in backbone.

train(mode: bool = True) → None [source]

Convert the model into training mode while keeping layers frozen.

class mmaction.models.backbones.TANet(depth: int, num_segments: int, tam_cfg: dict | None = None, **kwargs) [source]

Temporal Adaptive Network (TANet) backbone.

This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments.

  • tam_cfg (dict, optional) – Config for temporal adaptive module (TAM). Defaults to None.

init_weights() [source]

Initialize weights.

make_tam_modeling() [source]

Replace ResNet-Block with TA-Block.

class mmaction.models.backbones.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs) [source]

TimeSformer. A PyTorch implementation of Is Space-Time Attention All You Need for Video Understanding?

Parameters:
  • num_frames (int) – Number of frames in the video.

  • img_size (int | tuple) – Size of input image.

  • patch_size (int) – Size of one patch.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • embed_dims (int) – Dimensions of embedding. Defaults to 768.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.

  • num_transformer_layers (int) – Number of transformer layers. Defaults to 12.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.

  • transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of the transformer layers in TransformerCoder. If it is a single mmcv.ConfigDict, it is repeated num_transformer_layers times to form a list[mmcv.ConfigDict]. Defaults to None.

  • attention_type (str) – Type of attention in TransformerCoder. Choices are 'divided_space_time', 'space_only' and 'joint_space_time'. Defaults to 'divided_space_time'.

  • norm_cfg (dict) – Config for norm layers. Defaults to dict(type='LN', eps=1e-6).
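
As a rough illustration of how these parameters interact, the sketch below writes a backbone config in the dict style these signatures suggest and counts the patches the model would see. The concrete values (8 frames, 224-pixel input, 16-pixel patches) are assumptions picked for the example, not recommended settings, and the patch arithmetic assumes non-overlapping square patches.

```python
# Hypothetical backbone config; every value here is an illustrative assumption.
backbone = dict(
    type='TimeSformer',
    num_frames=8,
    img_size=224,
    patch_size=16,
    embed_dims=768,
    num_heads=12,
    num_transformer_layers=12,
    attention_type='divided_space_time',
    norm_cfg=dict(type='LN', eps=1e-6),
)

# Each frame is cut into non-overlapping square patches.
patches_per_side = backbone['img_size'] // backbone['patch_size']
patches_per_frame = patches_per_side ** 2
patches_per_clip = patches_per_frame * backbone['num_frames']

# Multi-head attention needs embed_dims to split evenly across the heads.
head_dim, remainder = divmod(backbone['embed_dims'], backbone['num_heads'])
assert remainder == 0, 'embed_dims must be divisible by num_heads'

print(patches_per_frame, patches_per_clip, head_dim)
```

Checking the divisibility of embed_dims by num_heads before building a model is a cheap way to catch a mismatched config early.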

forward(x) [source]

Defines the computation performed at every call.

init_weights(pretrained=None) [source]

Initialize the parameters either from an existing checkpoint or from scratch.

class mmaction.models.backbones.UniFormer(depth: List[int] = [5, 8, 20, 7], img_size: int = 224, in_chans: int = 3, embed_dim: List[int] = [64, 128, 320, 512], head_dim: int = 64, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, pretrained2d: bool = True, pretrained: str | None = None, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}]) [source]

UniFormer.

A PyTorch implementation of: UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning <https://arxiv.org/abs/2201.04676>

Parameters:
  • depth (List[int]) – List of depth in each stage. Defaults to [5, 8, 20, 7].

  • img_size (int) – Size of the input image. Defaults to 224.

  • in_chans (int) – Number of input channels. Defaults to 3.

  • head_dim (int) – Dimension of attention head. Defaults to 64.

  • embed_dim (List[int]) – List of embedding dimension in each layer. Defaults to [64, 128, 320, 512].

  • mlp_ratio (float) – Ratio of mlp hidden dimension to embedding dimension. Defaults to 4.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Defaults to True.

  • qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.

  • drop_rate (float) – Dropout rate. Defaults to 0.0.

  • attn_drop_rate (float) – Attention dropout rate. Defaults to 0.0.

  • drop_path_rate (float) – Stochastic depth rates. Defaults to 0.0.

  • pretrained2d (bool) – Whether to load pretrained from 2D model. Defaults to True.

  • pretrained (str) – Name of pretrained model. Defaults to None.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights() [source]

Initialize the weights in backbone.

class mmaction.models.backbones.UniFormerV2(input_resolution: int = 224, patch_size: int = 16, width: int = 768, layers: int = 12, heads: int = 12, backbone_drop_path_rate: float = 0.0, t_size: int = 8, kernel_size: int = 3, dw_reduction: float = 1.5, temporal_downsample: bool = False, no_lmhra: bool = True, double_lmhra: bool = False, return_list: List[int] = [8, 9, 10, 11], n_layers: int = 4, n_dim: int = 768, n_head: int = 12, mlp_factor: float = 4.0, drop_path_rate: float = 0.0, mlp_dropout: List[float] = [0.5, 0.5, 0.5, 0.5], clip_pretrained: bool = True, pretrained: str | None = None, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}]) [source]

UniFormerV2.

A PyTorch implementation of: UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer <https://arxiv.org/abs/2211.09552>

Parameters:
  • input_resolution (int) – Input resolution. Defaults to 224.

  • patch_size (int) – Patch size. Defaults to 16.

  • width (int) – Number of input channels in local UniBlock. Defaults to 768.

  • layers (int) – Number of layers of local UniBlock. Defaults to 12.

  • heads (int) – Number of attention heads in local UniBlock. Defaults to 12.

  • backbone_drop_path_rate (float) – Stochastic depth rate in local UniBlock. Defaults to 0.0.

  • t_size (int) – Number of temporal dimension after patch embedding. Defaults to 8.

  • temporal_downsample (bool) – Whether to downsample the temporal dimension. Defaults to False.

  • dw_reduction (float) – Downsample ratio of input channels in local MHRA. Defaults to 1.5.

  • no_lmhra (bool) – Whether to remove local MHRA in local UniBlock. Defaults to True.

  • double_lmhra (bool) – Whether to use double local MHRA in local UniBlock. Defaults to False.

  • return_list (List[int]) – Layer index of input features for global UniBlock. Defaults to [8, 9, 10, 11].

  • n_layers (int) – Number of layers of global UniBlock. Defaults to 4.

  • n_dim (int) – Number of input channels in global UniBlock. Defaults to 768.

  • n_head (int) – Number of attention heads in global UniBlock. Defaults to 12.

  • mlp_factor (float) – Ratio of hidden dimensions in MLP layers in global UniBlock. Defaults to 4.0.

  • drop_path_rate (float) – Stochastic depth rate in global UniBlock. Defaults to 0.0.

  • mlp_dropout (List[float]) – Stochastic dropout rate in each MLP layer in global UniBlock. Defaults to [0.5, 0.5, 0.5, 0.5].

  • clip_pretrained (bool) – Whether to load pretrained CLIP visual encoder. Defaults to True.

  • pretrained (str) – Name of pretrained model. Defaults to None.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].
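
Several of these parameters describe the same set of global UniBlocks, so their lengths are easy to get out of step when editing a config. The sketch below is a hypothetical sanity check, not part of the library: the helper name and the rules it enforces (one global block and one dropout rate per returned layer, indices within the local layer range) are assumptions inferred from the documented defaults.

```python
# Hypothetical UniFormerV2 settings mirroring the documented defaults.
cfg = dict(
    type='UniFormerV2',
    layers=12,
    return_list=[8, 9, 10, 11],
    n_layers=4,
    mlp_dropout=[0.5, 0.5, 0.5, 0.5],
)


def check_global_block_cfg(cfg):
    """Return a list of problems found in the config (empty if none).

    Assumed rules, not taken from the library source:
    - every index in return_list refers to an existing local layer;
    - there is one global UniBlock and one dropout rate per returned layer.
    """
    problems = []
    if any(i < 0 or i >= cfg['layers'] for i in cfg['return_list']):
        problems.append('return_list index out of range')
    if len(cfg['return_list']) != cfg['n_layers']:
        problems.append('len(return_list) != n_layers')
    if len(cfg['mlp_dropout']) != cfg['n_layers']:
        problems.append('len(mlp_dropout) != n_layers')
    return problems


print(check_global_block_cfg(cfg))
```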

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights() [source]

Initialize the weights in backbone.

class mmaction.models.backbones.VisionTransformer(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dims: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: int = 4.0, qkv_bias: bool = True, qk_scale: int | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_cfg: ConfigDict | dict = {'eps': 1e-06, 'type': 'LN'}, init_values: int = 0.0, use_learnable_pos_emb: bool = False, num_frames: int = 16, tubelet_size: int = 2, use_mean_pooling: int = True, pretrained: str | None = None, return_feat_map: bool = False, init_cfg: Dict | List[Dict] | None = [{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02, 'bias': 0.0}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}], **kwargs) [source]

Vision Transformer with support for patch or hybrid CNN input stage. An implementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Parameters:
  • img_size (int or tuple) – Size of input image. Defaults to 224.

  • patch_size (int) – Spatial size of one patch. Defaults to 16.

  • in_channels (int) – The number of channels of the input. Defaults to 3.

  • embed_dims (int) – Dimensions of embedding. Defaults to 768.

  • depth (int) – Number of blocks in the transformer. Defaults to 12.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.

  • mlp_ratio (int) – The ratio between the hidden layer and the input layer in the FFN. Defaults to 4.

  • qkv_bias (bool) – If True, add a learnable bias to q and v. Defaults to True.

  • qk_scale (float, optional) – Override default qk scale of head_dim ** -0.5 if set. Defaults to None.

  • drop_rate (float) – Dropout ratio of output. Defaults to 0.

  • attn_drop_rate (float) – Dropout ratio of attention weight. Defaults to 0.

  • drop_path_rate (float) – Dropout ratio of the residual branch. Defaults to 0.

  • norm_cfg (dict or ConfigDict) – Config for norm layers. Defaults to dict(type='LN', eps=1e-6).

  • init_values (float) – Value to init the multiplier of the residual branch. Defaults to 0.

  • use_learnable_pos_emb (bool) – If True, use learnable positional embedding, otherwise use sinusoid encoding. Defaults to False.

  • num_frames (int) – Number of frames in the video. Defaults to 16.

  • tubelet_size (int) – Temporal size of one patch. Defaults to 2.

  • use_mean_pooling (bool) – If True, take the mean pooling over all positions. Defaults to True.

  • pretrained (str, optional) – Name of pretrained model. Default: None.

  • return_feat_map (bool) – If True, return the feature in the shape of [B, C, T, H, W]. Defaults to False.

  • init_cfg (dict or list[dict]) – Initialization config dict. Defaults to [ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), dict(type='Constant', layer='LayerNorm', val=1., bias=0.) ].
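
To make the relationship between img_size, patch_size, num_frames and tubelet_size concrete, the following sketch counts the space-time patches for one clip. It assumes non-overlapping patches, which is how patch_size and tubelet_size are described above; the function name is hypothetical and is not part of the library.

```python
def num_tokens(img_size=224, patch_size=16, num_frames=16, tubelet_size=2):
    """Count the space-time patches (tubelets) produced for one clip."""
    spatial = (img_size // patch_size) ** 2   # patches per frame
    temporal = num_frames // tubelet_size     # tubelets along time
    return temporal * spatial


print(num_tokens())               # documented defaults
print(num_tokens(num_frames=32))  # a longer clip doubles the token count
```

Because attention cost grows with the square of the token count, doubling num_frames at a fixed tubelet_size is considerably more expensive than it first appears.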

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

Tensor

class mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=-1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs) [source]

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters:
  • gamma_w (float) – Global channel width expansion factor. Default: 1.

  • gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.

  • gamma_d (float) – Network depth expansion factor. Default: 1.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Resnet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).

  • frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. The required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual blocks. Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.
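
To give a feel for what a depth expansion factor such as gamma_d does, the sketch below scales a list of per-stage block counts. The base depths [1, 2, 5, 3] and the round-up rule are assumptions based on the X3D paper, not confirmed against this class, and scale_depth is a hypothetical helper.

```python
import math

# Assumed per-stage block counts at gamma_d = 1 (taken from the X3D paper).
BASE_STAGE_BLOCKS = [1, 2, 5, 3]


def scale_depth(gamma_d, base=BASE_STAGE_BLOCKS):
    """Scale the number of blocks per stage, rounding up to whole blocks."""
    return [int(math.ceil(gamma_d * b)) for b in base]


print(scale_depth(1.0))
print(scale_depth(2.2))
```

Rounding up rather than to the nearest integer means every stage keeps at least one block even for small expansion factors.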

forward(x) [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

init_weights() [source]

Initialize the parameters either from an existing checkpoint or from scratch.

make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs) [source]

Build residual layer for ResNet3D.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • layer_inplanes (int) – Number of channels for the input feature of the res layer.

  • inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.

  • planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict | None) – Config for conv layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. Default: None.

  • act_cfg (dict | None) – Config for activation layers. Default: None.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns:

A residual layer for the given config.

Return type:

nn.Module

train(mode=True) [source]

Set the optimization status when training.

common

class mmaction.models.common.Conv2plus1d(in_channels: int, out_channels: int, kernel_size: int | Tuple[int], stride: int | Tuple[int] = 1, padding: int | Tuple[int] = 0, dilation: int | Tuple[int] = 1, groups: int = 1, bias: bool | str = True, norm_cfg: ConfigDict | dict = {'type': 'BN3d'}) [source]

(2+1)d Conv module for R(2+1)d backbone.

https://arxiv.org/pdf/1711.11248.pdf.

Parameters:
  • in_channels (int) – Same as nn.Conv3d.

  • out_channels (int) – Same as nn.Conv3d.

  • kernel_size (Union[int, Tuple[int]]) – Same as nn.Conv3d.

  • stride (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 1.

  • padding (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 0.

  • dilation (Union[int, Tuple[int]]) – Same as nn.Conv3d. Defaults to 1.

  • groups (int) – Same as nn.Conv3d. Defaults to 1.

  • bias (Union[bool, str]) – If specified as 'auto', it will be decided by norm_cfg: bias is set to True if norm_cfg is None, otherwise False. Defaults to True.

  • norm_cfg (Union[dict, ConfigDict]) – Config for norm layers. Defaults to dict(type='BN3d').
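
A (2+1)D block replaces one t x d x d 3D convolution with a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. The sketch below uses the intermediate-channel formula from the R(2+1)D paper linked above, which is chosen so the factorized block has roughly the same number of weights as the full 3D convolution. Whether Conv2plus1d computes its hidden width exactly this way is an assumption, and the function name is hypothetical.

```python
def mid_channels_2plus1d(in_channels, out_channels, t=3, d=3):
    """Intermediate channels M for a (2+1)D block.

    M = floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out)),
    the formula from the R(2+1)D paper (bias terms ignored).
    """
    full_3d = t * d * d * in_channels * out_channels
    return full_3d // (d * d * in_channels + t * out_channels)


m = mid_channels_2plus1d(64, 64)
params_3d = 3 * 3 * 3 * 64 * 64               # one 3x3x3 convolution
params_2plus1d = 3 * 3 * 64 * m + 3 * m * 64  # 1x3x3 conv, then 3x1x1 conv
print(m, params_3d, params_2plus1d)
```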

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The output of the module.

Return type:

torch.Tensor

init_weights() → None [source]

Initialize the parameters from scratch.

class mmaction.models.common.ConvAudio(in_channels: int, out_channels: int, kernel_size: int | Tuple[int], op: str = 'concat', stride: int | Tuple[int] = 1, padding: int | Tuple[int] = 0, dilation: int | Tuple[int] = 1, groups: int = 1, bias: bool | str = False) [source]

Conv2d module for AudioResNet backbone.

Parameters:
  • in_channels (int) – Same as nn.Conv2d.

  • out_channels (int) – Same as nn.Conv2d.

  • kernel_size (Union[int, Tuple[int]]) – Same as nn.Conv2d.

  • op (str) – Operation to merge the output of freq and time feature map. Choices are sum and concat. Defaults to concat.

  • stride (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 1.

  • padding (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 0.

  • dilation (Union[int, Tuple[int]]) – Same as nn.Conv2d. Defaults to 1.

  • groups (int) – Same as nn.Conv2d. Defaults to 1.

  • bias (Union[bool, str]) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False. Defaults to False.

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The output of the module.

Return type:

torch.Tensor

init_weights() → None [source]

Initialize the parameters from scratch.

class mmaction.models.common.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs) [source]

Spatial Attention in Divided Space Time Attention.

Parameters:
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.0.

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.0.

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type='DropPath', drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs) [source]

Defines the computation performed at every call.

init_weights() [source]

Initialize DividedSpatialAttentionWithNorm with default settings.

class mmaction.models.common.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs) [source]

Temporal Attention in Divided Space Time Attention.

Parameters:
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.0.

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.0.

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type='DropPath', drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs) [source]

Defines the computation performed at every call.

init_weights() [source]

Initialize weights.

class mmaction.models.common.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs) [source]

FFN with pre normalization layer.

FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.

FFNWithNorm has one main difference with FFN:

  • It applies one normalization layer before forwarding the input data to the feed-forward networks.

Parameters:
  • embed_dims (int) – Dimensions of embedding. Defaults to 256.

  • feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.

  • num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.

  • act_cfg (dict) – Config for activation layers. Defaults to dict(type='ReLU').

  • ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0.0.

  • add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.

  • dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.

  • init_cfg (dict) – The Config for initialization. Defaults to None.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

forward(x, residual=None) [source]

Defines the computation performed at every call.

class mmaction.models.common.SubBatchNorm3D(num_features, **cfg) [source]

Sub BatchNorm3d splits the batch dimension into N splits and runs BN on each of them separately, so that the stats are computed on each subset of examples (1/N of the batch) independently. During evaluation, it aggregates the stats from all splits into one BN.

Parameters:

num_features (int) – Dimensions of BatchNorm.

aggregate_stats() [source]

Synchronize running_mean and running_var to self.bn.

Call this before evaluation, then call model.eval(). During evaluation, the forward function calls self.bn instead of self.split_bn; by then, the running_mean and running_var of self.bn have been obtained from self.split_bn.
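
One way to picture the aggregation step is the law of total variance: the whole-batch variance equals the average within-split variance plus the variance of the split means. The sketch below demonstrates this on plain Python lists. It assumes equally sized splits and population (biased) variances, and the helper names are hypothetical rather than taken from the library.

```python
def aggregate_split_stats(means, variances):
    """Merge per-split mean/variance into whole-batch statistics.

    Assumes equally sized splits and population (biased) variances.
    """
    n = len(means)
    mean = sum(means) / n
    # Average within-split variance plus the variance of the split means.
    var = sum(variances) / n + sum((m - mean) ** 2 for m in means) / n
    return mean, var


def pop_stats(xs):
    """Population mean and variance of a list of numbers."""
    mu = sum(xs) / len(xs)
    return mu, sum((x - mu) ** 2 for x in xs) / len(xs)


split_a, split_b = [1.0, 2.0, 3.0], [5.0, 6.0, 7.0]
(ma, va), (mb, vb) = pop_stats(split_a), pop_stats(split_b)

merged = aggregate_split_stats([ma, mb], [va, vb])
direct = pop_stats(split_a + split_b)
print(merged, direct)
```

The merged statistics agree with those computed directly on the concatenated data, which is why a single BN layer can stand in for the splits at evaluation time.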

forward(x) [source]

Defines the computation performed at every call.

init_weights(cfg) [source]

Initialize weights.

class mmaction.models.common.TAM(in_channels: int, num_segments: int, alpha: int = 2, adaptive_kernel_size: int = 3, beta: int = 4, conv1d_kernel_size: int = 3, adaptive_convolution_stride: int = 1, adaptive_convolution_padding: int = 1, init_std: float = 0.001) [source]

Temporal Adaptive Module (TAM) for TANet.

This module is proposed in "TAM: Temporal Adaptive Module for Video Recognition".

Parameters:
  • in_channels (int) – Channel num of input features.

  • num_segments (int) – Number of frame segments.

  • alpha (int) – The alpha from the paper: the ratio of the intermediate channel number to the initial channel number in the global branch. Defaults to 2.

  • adaptive_kernel_size (int) – K from the paper: the size of the adaptive kernel in the global branch. Defaults to 3.

  • beta (int) – The beta from the paper, which controls the model complexity in the local branch. Defaults to 4.

  • conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Defaults to 3.

  • adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of Temporal Adaptive Aggregation. Defaults to 1.

  • adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of Temporal Adaptive Aggregation. Defaults to 1.

  • init_std (float) – Std value for the initialization of nn.Linear. Defaults to 0.001.

forward(x: Tensor) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The output of the module.

Return type:

torch.Tensor

data_preprocessors

class mmaction.models.data_preprocessors.ActionDataPreprocessor(mean: Sequence[float | int] | None = None, std: Sequence[float | int] | None = None, to_rgb: bool = False, to_float32: bool = True, blending: dict | None = None, format_shape: str = 'NCHW') [source]

Data pre-processor for action recognition tasks.

Parameters:
  • mean (Sequence[float or int], optional) – The pixel mean of channels of images or stacked optical flow. Defaults to None.

  • std (Sequence[float or int], optional) – The pixel standard deviation of channels of images or stacked optical flow. Defaults to None.

  • to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.

  • to_float32 (bool) – Whether to convert data to float32. Defaults to True.

  • blending (dict, optional) – Config for batch blending. Defaults to None.

  • format_shape (str) – Format shape of input data. Defaults to 'NCHW'.
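
The effect of mean, std and to_rgb on a single pixel can be sketched in a few lines. This is an illustration under stated assumptions, not the class's implementation: it assumes the channel swap happens before normalization, and the mean/std values are the commonly used ImageNet statistics in RGB order, chosen only as an example.

```python
def normalize_pixel(pixel, mean, std, to_rgb=False):
    """Normalize one pixel given as a list of per-channel values."""
    if to_rgb:
        pixel = pixel[::-1]  # reverse the channel order: BGR -> RGB
    return [(p - m) / s for p, m, s in zip(pixel, mean, std)]


# Example statistics in RGB order (an assumption for illustration).
mean = [123.675, 116.28, 103.53]
std = [58.395, 57.12, 57.375]

# A BGR pixel whose channels equal the RGB means in reversed order.
bgr_pixel = [103.53, 116.28, 123.675]
normalized = normalize_pixel(bgr_pixel, mean, std, to_rgb=True)
print(normalized)
```

If to_rgb were left as False here, the blue and red means would be subtracted from the wrong channels, which is a common source of silently degraded accuracy.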

forward(data: dict | Tuple[dict], training: bool = False) → dict | Tuple[dict] [source]

Perform normalization, padding, BGR-to-RGB conversion and batch augmentation based on BaseDataPreprocessor.

Parameters:
  • data (dict or Tuple[dict]) – Data sampled from the dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns:

Data in the same format as the model input.

Return type:

dict or Tuple[dict]

forward_onesample(data, training: bool = False) → dict [source]

Perform normalization, padding, BGR-to-RGB conversion and batch augmentation on one data sample.

Parameters:
  • data (dict) – Data sampled from the dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns:

Data in the same format as the model input.

Return type:

dict

class mmaction.models.data_preprocessors.MultiModalDataPreprocessor(preprocessors: Dict) [source]

Multi-Modal data pre-processor for action recognition tasks.

forward(data: Dict, training: bool = False) → Dict [source]

Preprocesses the data into the model input format.

Parameters:
  • data (dict) – Data returned by dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns:

Data in the same format as the model input.

Return type:

dict

heads

class mmaction.models.heads.BaseHead(num_classes: int, in_channels: int, loss_cls: Dict = {'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class: bool = False, label_smooth_eps: float = 0.0, topk: int | Tuple[int] = (1, 5), average_clips: Dict | None = None, init_cfg: Dict | None = None) [source]

Base class for head.

All heads should subclass it. All subclasses should override forward(), which must support forwarding for both training and testing.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss', loss_weight=1.0).

  • multi_class (bool) – Determines whether it is a multi-class recognition task. Defaults to False.

  • label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Defaults to 0.

  • topk (int or tuple) – Top-k accuracy. Defaults to (1, 5).

  • average_clips (dict, optional) – Config for averaging class scores over multiple clips. Defaults to None.

  • init_cfg (dict, optional) – Config to control the initialization. Defaults to None.
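
As a small worked example of what label_smooth_eps controls, the sketch below applies the standard label smoothing rule to a one-hot target: each entry is scaled by (1 - eps) and eps / num_classes is added everywhere. The function name is hypothetical, and whether the head applies exactly this variant is an assumption.

```python
def smooth_one_hot(label, num_classes, eps):
    """Return a label-smoothed target distribution for one sample."""
    one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
    return [(1 - eps) * v + eps / num_classes for v in one_hot]


target = smooth_one_hot(label=1, num_classes=4, eps=0.1)
print(target)
```

The smoothed target still sums to one, and eps = 0 recovers the plain one-hot vector, so the default of 0 leaves the loss unchanged.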

average_clip(cls_scores: Tensor, num_segs: int = 1) → Tensor [source]

Averaging class scores over multiple clips.

Uses different averaging types ('score', 'prob' or None, as defined in test_cfg) to compute the final averaged class score. Only called in test mode.

Parameters:
  • cls_scores (torch.Tensor) – Class scores to be averaged.

  • num_segs (int) – Number of clips for each input sample.

Returns:

Averaged class scores.

Return type:

torch.Tensor
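
The difference between the 'score' and 'prob' averaging types can be shown on plain lists. This is a minimal sketch under the assumption that 'prob' applies a softmax to each clip before averaging and 'score' averages the raw scores; it uses only the standard library and is not the library's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_clip(clip_scores, mode='prob'):
    """Average per-clip class scores for one video.

    mode='score' averages raw scores, mode='prob' applies softmax to
    each clip first, and mode=None returns the scores unchanged.
    """
    if mode is None:
        return clip_scores
    if mode == 'prob':
        clip_scores = [softmax(s) for s in clip_scores]
    elif mode != 'score':
        raise ValueError(f'unsupported mode: {mode}')
    n = len(clip_scores)
    return [sum(col) / n for col in zip(*clip_scores)]


clips = [[4.0, 0.0], [0.0, 1.0]]  # two clips, two classes
by_score = average_clip(clips, 'score')
by_prob = average_clip(clips, 'prob')
print(by_score, by_prob)
```

Averaging probabilities limits the influence of a single clip with very large raw scores, whereas averaging scores lets such a clip dominate.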

abstract forward(x, **kwargs) → Dict[str, Tensor] | List[ActionDataSample] | Tuple[Tensor] | Tensor [source]

Defines the computation performed at every call.

loss(feats: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) → Dict [source]

Perform forward propagation of head and loss calculation on the features of the upstream network.

Parameters:
  • feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

loss_by_feat(cls_scores: Tensor, data_samples: List[ActionDataSample]) → Dict [source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (torch.Tensor) – Classification prediction results for all classes, with shape (batch_size, num_classes).

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

predict(feats: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) → List[ActionDataSample] [source]

Perform forward propagation of head and predict recognition results on the features of the upstream network.

Parameters:
  • feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

list[ActionDataSample]

predict_by_feat(cls_scores: Tensor, data_samples: List[ActionDataSample]) → List[ActionDataSample] [source]

Transform a batch of output features extracted from the head into prediction results.

Parameters:
  • cls_scores (torch.Tensor) – Classification scores, with shape (B*num_segs, num_classes).

  • data_samples (list[ActionDataSample]) – The annotation data of every sample. It usually includes information such as gt_label.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

List[ActionDataSample]

class mmaction.models.heads.FeatureHead(spatial_type: str = 'avg', temporal_type: str = 'avg', backbone_name: str | None = None, num_segments: str | None = None, **kwargs) [source]

General head for feature extraction.

Parameters:
  • spatial_type (str, optional) – Pooling type in spatial dimension. Default: 'avg'. If set to None, the spatial dimension is kept; for a GCN backbone, the last two dimensions (T, V) are kept.

  • temporal_type (str, optional) – Pooling type in temporal dimension. Default: 'avg'. If set to None, the temporal dimension is kept; for a GCN backbone, dimension M is kept. Please note that the channel order stays the same as the output of the backbone: [N, T, C, H, W] for 2D recognizers and [N, M, C, T, V] for GCN recognizers.

  • backbone_name (str, optional) – Backbone name used to select special operations. Currently supports 'tsm', 'slowfast' and 'gcn'. Defaults to None, which means the input is treated as a normal feature.

  • num_segments (int, optional) – Number of frame segments for TSM backbone. Defaults to None.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x: Tensor, num_segs: int | None = None, **kwargs) → Tensor [source]

Defines the computation performed at every call.

Parameters:
  • x (Tensor) – The input data.

  • num_segs (int) – For 2D backbone. Number of segments into which a video is divided. Defaults to None.

Returns:

The output features after pooling.

Return type:

Tensor

predict_by_feat(feats: Tensor | Tuple[Tensor], data_samples) → Tensor [source]

Integrate multi-view features into one tensor.

Parameters:
  • feats (torch.Tensor | tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

The integrated multi-view features.

Return type:

Tensor

class mmaction.models.heads.GCNHead(num_classes: int, in_channels: int, loss_cls: Dict = {'type': 'CrossEntropyLoss'}, dropout: float = 0.0, average_clips: str = 'prob', init_cfg: Dict | List[Dict] = {'layer': 'Linear', 'std': 0.01, 'type': 'Normal'}, **kwargs) [source]

The classification head for GCN.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • dropout (float) – Probability of dropout layer. Defaults to 0.

  • init_cfg (dict or list[dict]) – Config to control the initialization. Defaults to dict(type='Normal', layer='Linear', std=0.01).

forward(x: Tensor, **kwargs) → Tensor [source]

Forward features from the upstream network.

Parameters:

x (torch.Tensor) – Features from the upstream network.

Returns:

Classification scores with shape (B, num_classes).

Return type:

torch.Tensor

class mmaction.models.heads.I3DHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, **kwargs) [source]

Classification head for I3D.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x: Tensor, **kwargs) → Tensor [source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() → None [source]

Initialize the parameters from scratch.

class mmaction.models.heads.MViTHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, dropout_ratio: float = 0.5, init_std: float = 0.02, init_scale: float = 1.0, with_cls_token: bool = True, **kwargs) [source]

Classification head for Multi-scale ViT.

A PyTorch implementation of: MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0.5.

  • init_std (float) – Std value for initialization. Defaults to 0.02.

  • init_scale (float) – Scale factor for initialization parameters. Defaults to 1.

  • with_cls_token (bool) – Whether the backbone output feature with cls_token. Defaults to True.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x: Tuple[List[Tensor]], **kwargs) → Tensor[source]

Defines the computation performed at every call.

Parameters:

x (Tuple[List[Tensor]]) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() → None[source]

Initialize the parameters from scratch.

pre_logits(feats: Tuple[List[Tensor]]) → Tensor[source]

The processing applied before the final classification layer.

The input feats is a tuple of lists of tensors, and each tensor is the feature from a backbone stage.
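The following plain-Python sketch illustrates one way such a pre-logits step could select its input. It assumes, without confirmation from this page, that only the last stage is used, that each stage is laid out as `[patch_tokens, cls_token]` when `with_cls_token` is true, and that patch tokens are averaged otherwise. Nested lists stand in for tensors, and `pre_logits_sketch` and `mean_over_tokens` are hypothetical names.

```python
def mean_over_tokens(tokens):
    """Element-wise mean of a list of equally sized feature vectors."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


def pre_logits_sketch(feats, with_cls_token=True):
    """Pick the classifier input from the last backbone stage (illustrative)."""
    last_stage = feats[-1]
    if with_cls_token:
        # Assumed layout: [patch_tokens, cls_token].
        _patch_tokens, cls_token = last_stage
        return cls_token
    # Assumed layout without a cls_token: [patch_tokens].
    patch_tokens = last_stage[0]
    return mean_over_tokens(patch_tokens)


stage_with_cls = [[[1.0, 3.0], [3.0, 5.0]], [9.0, 9.0]]
stage_without_cls = [[[1.0, 3.0], [3.0, 5.0]]]
cls_feature = pre_logits_sketch((stage_with_cls,), with_cls_token=True)
pooled_feature = pre_logits_sketch((stage_without_cls,), with_cls_token=False)
```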

class mmaction.models.heads.OmniHead(image_classes: int, video_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, image_dropout_ratio: float = 0.2, video_dropout_ratio: float = 0.5, video_nl_head: bool = True, **kwargs)[source]

Classification head for OmniResNet that accepts both image and video inputs.

Parameters:
  • image_classes (int) – Number of image classes to be classified.

  • video_classes (int) – Number of video classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • image_dropout_ratio (float) – Probability of dropout layer for the image head. Defaults to 0.2.

  • video_dropout_ratio (float) – Probability of dropout layer for the video head. Defaults to 0.5.

  • video_nl_head (bool) – If True, use a non-linear head for the video head. Defaults to True.
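A head that serves both modalities needs some rule for choosing between its image and video classifiers. The sketch below assumes the choice is made from the dimensionality of the incoming feature map, 4-D for images and 5-D for videos; this is an assumption for illustration, and `select_branch` is a hypothetical helper rather than part of the documented API.

```python
def select_branch(input_shape):
    """Return which classifier a feature map of this shape would use."""
    if len(input_shape) == 4:
        # (N, C, H, W): image features, scored over image_classes.
        return 'image'
    if len(input_shape) == 5:
        # (N, C, T, H, W): video features, scored over video_classes.
        return 'video'
    raise ValueError(f'Expected a 4-D or 5-D shape, got {len(input_shape)}-D')


image_branch = select_branch((2, 2048, 7, 7))
video_branch = select_branch((2, 2048, 8, 7, 7))
```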

forward(x: Tensor, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

loss_by_feat(cls_scores: Tensor | Tuple[Tensor], data_samples: List[ActionDataSample]) → dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (Tensor) – Classification predictions for all classes, with shape (batch_size, num_classes).

  • data_samples (List[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

class mmaction.models.heads.RGBPoseHead(num_classes: int, in_channels: Tuple[int], loss_cls: Dict = {'type': 'CrossEntropyLoss'}, loss_components: List[str] = ['rgb', 'pose'], loss_weights: float | Tuple[float] = 1.0, dropout: float = 0.5, init_std: float = 0.01, **kwargs)[source]

The classification head for RGBPoseConv3D.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (tuple[int]) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • loss_components (list[str]) – The components of the loss. Defaults to ['rgb', 'pose'].

  • loss_weights (float or tuple[float]) – The weights of the losses. Defaults to 1.

  • dropout (float) – Probability of dropout layer. Defaults to 0.5.

  • init_std (float) – Standard deviation used for initialization. Defaults to 0.01.
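Because loss_weights may be a single float or one value per entry of loss_components, a scalar has to be expanded before it can be paired with the components. The sketch below shows one plausible way to do that and to scale each component's loss. The helper names and the `<component>_loss_cls` key format are assumptions made for illustration, not confirmed by this page.

```python
def expand_loss_weights(loss_components, loss_weights):
    """Return one weight per loss component."""
    if isinstance(loss_weights, (int, float)):
        # A scalar is shared by every component.
        return [float(loss_weights)] * len(loss_components)
    weights = list(loss_weights)
    if len(weights) != len(loss_components):
        raise ValueError('Need exactly one weight per loss component')
    return weights


def weighted_losses(raw_losses, loss_components, loss_weights):
    """Scale each component's loss by its weight (key names are assumed)."""
    weights = expand_loss_weights(loss_components, loss_weights)
    return {
        f'{name}_loss_cls': raw_losses[name] * weight
        for name, weight in zip(loss_components, weights)
    }


raw = {'rgb': 2.0, 'pose': 4.0}
shared = weighted_losses(raw, ['rgb', 'pose'], 1.0)
per_component = weighted_losses(raw, ['rgb', 'pose'], (1.0, 0.5))
```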

forward(x: Tuple[Tensor]) → Dict[source]

Defines the computation performed at every call.

init_weights() → None[source]

Initialize the parameters from scratch.

loss(feats: Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) → Dict[source]

Perform forward propagation of head and loss calculation on the features of the upstream network.

Parameters:
  • feats (tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

loss_by_feat(cls_scores: Dict[str, Tensor], data_samples: List[ActionDataSample]) → Dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (dict[str, torch.Tensor]) – The dict of classification scores.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

A dictionary of loss components.

Return type:

dict

loss_by_scores(cls_scores: Tensor, labels: Tensor) → Dict[source]

Calculate the loss based on the features extracted by the head.

Parameters:
  • cls_scores (torch.Tensor) – Classification predictions for all classes, with shape (batch_size, num_classes).

  • labels (torch.Tensor) – The labels used to calculate the loss.

Returns:

A dictionary of loss components.

Return type:

dict
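To make the score-to-loss step concrete, here is a plain-Python sketch of a mean cross-entropy over a batch, returned as a one-entry dictionary. It assumes the default CrossEntropyLoss with mean reduction and a `loss_cls` key; nested lists stand in for tensors, and the real head may report additional entries such as accuracy values.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]


def loss_by_scores_sketch(cls_scores, labels):
    """Average the per-sample cross-entropy over the batch."""
    per_sample = [cross_entropy(row, y) for row, y in zip(cls_scores, labels)]
    return {'loss_cls': sum(per_sample) / len(per_sample)}


# Two samples with uniform scores over two classes.
losses = loss_by_scores_sketch([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

With uniform scores over two classes, each sample contributes ln 2, so the averaged loss is also ln 2.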

predict(feats: Tuple[Tensor], data_samples: List[ActionDataSample], **kwargs) → List[ActionDataSample][source]

Perform forward propagation of head and predict recognition results on the features of the upstream network.

Parameters:
  • feats (tuple[torch.Tensor]) – Features from upstream network.

  • data_samples (list[ActionDataSample]) – The batch data samples.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

list[ActionDataSample]

predict_by_feat(cls_scores: Dict[str, Tensor], data_samples: List[ActionDataSample]) → List[ActionDataSample][source]

Transform a batch of output features extracted from the head into prediction results.

Parameters:
  • cls_scores (dict[str, torch.Tensor]) – The dict of classification scores.

  • data_samples (list[ActionDataSample]) – The annotation data of every sample. It usually includes information such as gt_label.

Returns:

Recognition results wrapped by ActionDataSample.

Return type:

list[ActionDataSample]

predict_by_scores(cls_scores: Tensor, data_samples: List[ActionDataSample]) → Tensor[source]

Transform a batch of output features extracted from the head into prediction results.

Parameters:
  • cls_scores (torch.Tensor) – Classification scores with shape (B*num_segs, num_classes).

  • data_samples (list[ActionDataSample]) – The annotation data of every sample.

Returns:

The averaged classification scores.

Return type:

torch.Tensor
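The reduction from (B*num_segs, num_classes) scores to one averaged row per sample can be sketched as follows. This plain-Python version assumes that the rows of each sample are contiguous and that num_segs is inferred from the number of data samples. It uses nested lists instead of tensors and a simple mean, whereas the real head may normalize the scores before averaging.

```python
def average_over_segments(cls_scores, batch_size):
    """Average each sample's per-segment score rows into one row."""
    if batch_size <= 0 or len(cls_scores) % batch_size != 0:
        raise ValueError('Row count must be a multiple of the batch size')
    num_segs = len(cls_scores) // batch_size
    averaged = []
    for start in range(0, len(cls_scores), num_segs):
        group = cls_scores[start:start + num_segs]
        # Column-wise mean over the segments of one sample.
        averaged.append([sum(column) / num_segs for column in zip(*group)])
    return averaged


scores = [[0.5, 0.5], [1.0, 0.0],   # sample 0, two segments
          [0.0, 1.0], [0.5, 0.5]]   # sample 1, two segments
averaged_scores = average_over_segments(scores, batch_size=2)
```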

class mmaction.models.heads.SlowFastHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.8, init_std: float = 0.01, **kwargs)[source]

The classification head for SlowFast.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Standard deviation used for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
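Since this head's forward takes a tuple of tensors, in_channels has to account for every pathway that feeds it. The config sketch below assumes the slow and fast pathway features are concatenated along the channel axis, so in_channels is their sum. The channel counts and num_classes are illustrative assumptions, not values from this page.

```python
# Hypothetical pathway widths; real values depend on the backbone config.
slow_channels = 2048
fast_channels = 256

cls_head = dict(
    type='SlowFastHead',
    num_classes=400,                            # assumed dataset size
    in_channels=slow_channels + fast_channels,  # assumed channel concatenation
    spatial_type='avg',                         # documented default
    dropout_ratio=0.8,                          # documented default
    init_std=0.01,                              # documented default
)
```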

forward(x: Tuple[Tensor], **kwargs) → None[source]

Defines the computation performed at every call.

Parameters:

x (tuple[torch.Tensor]) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() → None[source]

Initialize the parameters from scratch.

class mmaction.models.heads.TPNHead(*args, **kwargs)[source]

Classification head for TPN.

forward(x, num_segs: int | None = None, fcn_test: bool = False, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Parameters:
  • x (Tensor) – The input data.

  • num_segs (int, optional) – Number of segments into which a video is divided. Defaults to None.

  • fcn_test (bool) – Whether to apply fully convolutional (FCN) testing. Defaults to False.

Returns:

The classification scores for input samples.

Return type:

Tensor

class mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]

Classification head for TRN.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.

  • relation_type (str) – The relation module type. Choices are 'TRN' or 'TRNMultiScale'. Default: 'TRNMultiScale'.

  • hidden_dim (int) – The dimension of the hidden layer of the MLP in the relation module. Default: 256.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Standard deviation used for initialization. Default: 0.001.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs, **kwargs)[source]

Defines the computation performed at every call.

Parameters:
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Unused in TRNHead. By default, num_segs equals clip_len * num_clips * num_crops; it is generated automatically in the recognizer's forward phase and is not needed by TRN models. The head relies instead on self.num_segments, a hyperparameter set when the TRN model is built.

Returns:

The classification scores for input samples.

Return type:

torch.Tensor
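The distinction between the ignored num_segs argument and the head's own num_segments can be illustrated with a small regrouping sketch. It assumes per-frame features arrive as a flat batch of N * num_segments rows and are regrouped per video before temporal relations are modeled. `group_by_segments` is a hypothetical helper and nested lists stand in for tensors.

```python
def group_by_segments(frame_features, num_segments):
    """Regroup a flat list of per-frame features into per-video groups."""
    if len(frame_features) % num_segments != 0:
        raise ValueError('Row count must be a multiple of num_segments')
    return [
        frame_features[start:start + num_segments]
        for start in range(0, len(frame_features), num_segments)
    ]


# 16 one-dimensional frame features with num_segments=8 form 2 videos.
flat_features = [[float(i)] for i in range(16)]
videos = group_by_segments(flat_features, num_segments=8)
```

The grouping depends only on the hyperparameter fixed at construction time, which is why a value passed in at call time plays no role.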

init_weights()[source]

Initialize the parameters from scratch.

class mmaction.models.heads.TSMHead(num_classes: int, in_channels: int, num_segments: int = 8, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', consensus: ConfigDict | dict = {'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio: float = 0.8, init_std: float = 0.001, is_shift: bool = True, temporal_pool: bool = False, **kwargs)[source]

Classification head for TSM.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict or ConfigDict) – Config for building loss. Default: dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.

  • consensus (dict or ConfigDict) – Consensus config dict. Default: dict(type='AvgConsensus', dim=1).

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Standard deviation used for initialization. Default: 0.001.

  • is_shift (bool) – Whether the feature is shifted. Default: True.

  • temporal_pool (bool) – Whether the feature is temporally pooled. Default: False.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
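A config fragment for this head might look like the sketch below. The optional keys repeat the defaults from the constructor signature above, while `num_classes` and `in_channels` are illustrative assumptions rather than values from this page.

```python
# Hypothetical config fragment for TSMHead.
cls_head = dict(
    type='TSMHead',
    num_classes=400,       # assumed dataset size
    in_channels=2048,      # assumed backbone output channels
    num_segments=8,        # signature default
    loss_cls=dict(type='CrossEntropyLoss'),
    spatial_type='avg',
    consensus=dict(type='AvgConsensus', dim=1),
    dropout_ratio=0.8,     # signature default
    init_std=0.001,        # signature default
    is_shift=True,
    temporal_pool=False,
)
```

Keeping num_segments here in step with the number of frames sampled per clip matters, because the head uses this value rather than the num_segs passed to forward.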

forward(x: Tensor, num_segs: int, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Parameters:
  • x (Tensor) – The input data.

  • num_segs (int) – Unused in TSMHead. By default, num_segs equals clip_len * num_clips * num_crops; it is generated automatically in the recognizer's forward phase and is not needed by TSM models. The head relies instead on self.num_segments, a hyperparameter set when the TSM model is built.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() → None[source]

Initialize the parameters from scratch.

class mmaction.models.heads.TSNAudioHead(num_classes: int, in_channels: int, loss_cls: ConfigDict | dict = {'type': 'CrossEntropyLoss'}, spatial_type: str = 'avg', dropout_ratio: float = 0.4, init_std: float = 0.01, **kwargs)[source]

Classification head for TSN on audio.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (Union[dict, ConfigDict]) – Config for building loss. Defaults to dict(type='CrossEntropyLoss').

  • spatial_type (str) – Pooling type in spatial dimension. Defaults to 'avg'.

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0.4.

  • init_std (float) – Standard deviation used for initialization. Defaults to 0.01.

forward(x: Tensor) → Tensor[source]

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

torch.Tensor

init_weights() → None[source]

Initialize the parameters from scratch.