
mmaction.apis

mmaction.apis.inference_recognizer(model, video, outputs=None, as_tensor=True, **kwargs)[source]

Run inference on a video with the recognizer.

Parameters
  • model (nn.Module) – The loaded recognizer.

  • video (str | dict | ndarray) – The video file path / url or the rawframes directory path / results dictionary (the input of pipeline) / a 4D array T x H x W x 3 (The input video).

  • outputs (list(str) | tuple(str) | str | None) – Names of layers whose outputs need to be returned, default: None.

  • as_tensor (bool) – Same as that in OutputHook. Default: True.

Returns

Top-5 recognition result. If outputs is specified, a dict (dict[torch.Tensor | np.ndarray]) of output feature maps from the specified layers is also returned.

Return type

dict[tuple(str, float)]

mmaction.apis.init_random_seed(seed=None, device='cpu', distributed=True)[source]

Initialize random seed.

If the seed is not set, the seed will be automatically randomized, and then broadcast to all processes to prevent some potential bugs.

Parameters
  • seed (int, optional) – The seed. Default: None.

  • device (str) – The device where the seed will be put on. Default: ‘cuda’.

  • distributed (bool) – Whether to use distributed training. Default: True.

Returns

Seed to be used.

Return type

int
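A brief, hedged usage sketch: in a distributed launch every rank ends up with the same seed that rank 0 drew (the explicit device argument below is illustrative).

from mmaction.apis import init_random_seed

# Draw one seed and broadcast it so that all processes share it.
seed = init_random_seed(None, device='cuda', distributed=True)
# `seed` is now identical on every process and can be passed to the usual
# per-library seeding utilities before building the dataloaders.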

mmaction.apis.init_recognizer(config, checkpoint=None, device='cuda:0', **kwargs)[source]

Initialize a recognizer from config file.

Parameters
  • config (str | mmcv.Config) – Config file path or the config object.

  • checkpoint (str | None, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Default: None.

  • device (str | torch.device) – The desired device of returned tensor. Default: ‘cuda:0’.

Returns

The constructed recognizer.

Return type

nn.Module
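A minimal end-to-end sketch combining init_recognizer with inference_recognizer (documented above); the config, checkpoint and video paths are placeholders.

from mmaction.apis import inference_recognizer, init_recognizer

config_file = 'configs/recognition/tsn/tsn_r50_video_1x1x3_100e_kinetics400_rgb.py'  # placeholder
checkpoint_file = 'checkpoints/tsn_r50.pth'  # placeholder; use None to skip loading weights

model = init_recognizer(config_file, checkpoint_file, device='cuda:0')

# `video` may be a file path / URL, a rawframes directory, a results dict,
# or a T x H x W x 3 array, per the parameter description above.
results = inference_recognizer(model, 'demo/demo.mp4')
print(results)  # top-5 recognition result, per the Returns section above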

mmaction.apis.multi_gpu_test(model: torch.nn.modules.module.Module, data_loader: torch.utils.data.dataloader.DataLoader, tmpdir: Optional[str] = None, gpu_collect: bool = False) → Optional[list][source]

Test model with multiple gpus.

This method tests the model with multiple GPUs and collects the results under two different modes: GPU and CPU. By setting gpu_collect=True, it encodes results to GPU tensors and uses GPU communication for result collection. In CPU mode it saves the results from different GPUs to tmpdir and collects them by the rank 0 worker.

Parameters
  • model (nn.Module) – Model to be tested.

  • data_loader (DataLoader) – PyTorch data loader.

  • tmpdir (str) – Path of directory to save the temporary results from different gpus under cpu mode.

  • gpu_collect (bool) – Option to use either gpu or cpu to collect results.

Returns

The prediction results.

Return type

list

mmaction.apis.single_gpu_test(model: torch.nn.modules.module.Module, data_loader: torch.utils.data.dataloader.DataLoader) → list[source]

Test model with a single gpu.

This method tests model with a single gpu and displays test progress bar.

Parameters
  • model (nn.Module) – Model to be tested.

  • data_loader (DataLoader) – PyTorch data loader.

Returns

The prediction results.

Return type

list
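A hedged sketch of the usual single-GPU test flow; the builder helpers (build_dataset, build_dataloader, build_model), the dataloader settings and the config path follow the standard MMAction2 tools/test.py layout and are assumptions here.

import mmcv
from mmcv.parallel import MMDataParallel

from mmaction.apis import single_gpu_test
from mmaction.datasets import build_dataloader, build_dataset
from mmaction.models import build_model

cfg = mmcv.Config.fromfile('configs/recognition/tsn/tsn_r50_video_1x1x3_100e_kinetics400_rgb.py')

# Build the test dataset and a non-distributed, non-shuffled dataloader.
dataset = build_dataset(cfg.data.test, dict(test_mode=True))
data_loader = build_dataloader(
    dataset, videos_per_gpu=1, workers_per_gpu=2, dist=False, shuffle=False)

# Build the model, wrap it for single-GPU inference and run the test loop.
model = build_model(cfg.model, train_cfg=None, test_cfg=cfg.get('test_cfg'))
model = MMDataParallel(model, device_ids=[0])
results = single_gpu_test(model, data_loader)  # list of per-sample predictions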

mmaction.apis.train_model(model, dataset, cfg, distributed=False, validate=False, test={'test_best': False, 'test_last': False}, timestamp=None, meta=None)[source]

Train model entry function.

Parameters
  • model (nn.Module) – The model to be trained.

  • dataset (Dataset) – Train dataset.

  • cfg (dict) – The config dict for training.

  • distributed (bool) – Whether to use distributed training. Default: False.

  • validate (bool) – Whether to do evaluation. Default: False.

  • test (dict) – The testing option, with two keys: test_last & test_best. The value is True or False, indicating whether to test the corresponding checkpoint. Default: dict(test_best=False, test_last=False).

  • timestamp (str | None) – Local time for runner. Default: None.

  • meta (dict | None) – Meta dict to record some important information. Default: None
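A hedged sketch of the training entry flow around train_model; the builder helpers and the config path mirror the standard tools/train.py layout and are assumptions here.

import mmcv

from mmaction.apis import train_model
from mmaction.datasets import build_dataset
from mmaction.models import build_model

cfg = mmcv.Config.fromfile('configs/recognition/tsn/tsn_r50_video_1x1x3_100e_kinetics400_rgb.py')
cfg.work_dir = './work_dirs/tsn_r50'  # where logs and checkpoints are written

model = build_model(cfg.model, train_cfg=cfg.get('train_cfg'), test_cfg=cfg.get('test_cfg'))
dataset = build_dataset(cfg.data.train)

# Single-machine, non-distributed training with evaluation enabled.
train_model(model, dataset, cfg, distributed=False, validate=True)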

mmaction.core

optimizer

class mmaction.core.optimizer.CopyOfSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, *, maximize=False, foreach: Optional[bool] = None, differentiable=False)[source]

A clone of torch.optim.SGD.

A customized optimizer could be defined like CopyOfSGD. You may derive from built-in optimizers in torch.optim, or directly implement a new optimizer.

class mmaction.core.optimizer.TSMOptimizerConstructor(optimizer_cfg: Dict, paramwise_cfg: Optional[Dict] = None)[source]

Optimizer constructor in TSM model.

This constructor builds optimizer in different ways from the default one.

  1. Parameters of the first conv layer have default lr and weight decay.

  2. Parameters of BN layers have default lr and zero weight decay.

  3. If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.

  4. Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.

add_params(params, model)[source]

Add parameters and their corresponding lr and wd to the params.

Parameters
  • params (list) – The list to be modified, containing all parameter groups and their corresponding lr and wd configurations.

  • model (nn.Module) – The model to be trained with the optimizer.

evaluation

class mmaction.core.evaluation.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[source]

Class to evaluate detection results on ActivityNet.

Parameters
  • ground_truth_filename (str | None) – The filename of groundtruth. Default: None.

  • prediction_filename (str | None) – The filename of action detection results. Default: None.

  • tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).

  • verbose (bool) – Whether to print verbose logs. Default: False.

evaluate()[source]

Evaluates a prediction file.

For the detection task, the interpolated mean average precision is used to measure the performance of a method.

wrapper_compute_average_precision()[source]

Computes average precision for each class.
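A short usage sketch for ActivityNetLocalization; the annotation and result file paths are placeholders, and evaluate() measures the interpolated mean average precision as described above.

from mmaction.core.evaluation import ActivityNetLocalization

anet_eval = ActivityNetLocalization(
    ground_truth_filename='data/ActivityNet/anet_anno_val.json',  # placeholder
    prediction_filename='results/detection_results.json',         # placeholder
    verbose=True)
results = anet_eval.evaluate()  # interpolated mAP over the default tiou_thresholds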

class mmaction.core.evaluation.DistEvalHook(*args, save_best='auto', **kwargs)[source]
class mmaction.core.evaluation.EvalHook(*args, save_best='auto', **kwargs)[source]
mmaction.core.evaluation.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]

Compute average precision (in the detection task) between ground truth and predicted data frames. If multiple predictions match the same ground truth segment, only the one with the highest score is counted as a true positive. This code is greatly inspired by the Pascal VOC devkit.

Parameters
  • ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.

  • prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.

  • temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns

1D array of average precision score.

Return type

np.ndarray

mmaction.core.evaluation.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]

Computes the average recall given an average number (percentile) of proposals per video.

Parameters
  • ground_truth (dict) – Dict containing the ground truth instances.

  • proposals (dict) – Dict containing the proposal instances.

  • total_num_proposals (int) – Total number of proposals in the proposal dict.

  • max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.

  • temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns

(recall, average_recall, proposals_per_video, auc) In recall, recall[i,j] is the recall at the i-th temporal_iou threshold for the j-th average number (percentile) of proposals per video. The average_recall is the recall averaged over a list of temporal_iou thresholds (1D array), which is equivalent to recall.mean(axis=0). The proposals_per_video is the average number of proposals per video. The auc is the area under the AR@AN curve.

Return type

tuple([np.ndarray, np.ndarray, np.ndarray, float])

mmaction.core.evaluation.confusion_matrix(y_pred, y_real, normalize=None)[source]

Compute confusion matrix.

Parameters
  • y_pred (list[int] | np.ndarray[int]) – Prediction labels.

  • y_real (list[int] | np.ndarray[int]) – Ground truth labels.

  • normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.

Returns

Confusion matrix.

Return type

np.ndarray
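A tiny illustrative example of confusion_matrix with and without normalization.

from mmaction.core.evaluation import confusion_matrix

y_pred = [0, 1, 1, 2]
y_real = [0, 1, 2, 2]

cm = confusion_matrix(y_pred, y_real)                         # raw counts, shape (3, 3)
cm_true = confusion_matrix(y_pred, y_real, normalize='true')  # each row sums to 1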

mmaction.core.evaluation.get_weighted_score(score_list, coeff_list)[source]

Get weighted score with given scores and coefficients.

Given n predictions by different classifiers: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return the weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n

Parameters
  • score_list (list[list[np.ndarray]]) – List of list of scores, with shape n (number of predictions) x num_samples x num_classes.

  • coeff_list (list[float]) – List of coefficients, with shape n.

Returns

List of weighted scores.

Return type

list[np.ndarray]
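An illustrative late-fusion sketch for get_weighted_score with two classifiers, two samples and three classes.

import numpy as np

from mmaction.core.evaluation import get_weighted_score

scores_a = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]  # classifier 1
scores_b = [np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.6, 0.2])]  # classifier 2

# weighted_score = score_1 * 1.0 + score_2 * 0.5, per the formula above
fused = get_weighted_score([scores_a, scores_b], [1.0, 0.5])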

mmaction.core.evaluation.interpolated_precision_recall(precision, recall)[source]

Interpolated AP - VOCdevkit from VOC 2011.

Parameters
  • precision (np.ndarray) – The precision of different thresholds.

  • recall (np.ndarray) – The recall of different thresholds.

Returns

Average precision score.

Return type

float

mmaction.core.evaluation.mean_average_precision(scores, labels)[source]

Mean average precision for multi-label recognition.

Parameters
  • scores (list[np.ndarray]) – Prediction scores of different classes for each sample.

  • labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns

The mean average precision.

Return type

np.float64

mmaction.core.evaluation.mean_class_accuracy(scores, labels)[source]

Calculate mean class accuracy.

Parameters
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int]) – Ground truth labels.

Returns

Mean class accuracy.

Return type

np.ndarray

mmaction.core.evaluation.mmit_mean_average_precision(scores, labels)[source]

Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Times. The difference is that this method calculates average-precision for each sample and averages them among samples.

Parameters
  • scores (list[np.ndarray]) – Prediction scores of different classes for each sample.

  • labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns

The MMIT style mean average precision.

Return type

np.float64

mmaction.core.evaluation.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[source]

Compute intersection over union between segments.

Parameters
  • candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end]/[m x 2:=[init, end]].

  • target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].

  • calculate_overlap_self (bool) – Whether to calculate overlap_self (union / candidate_length) or not. Default: False.

Returns

t_iou (np.ndarray): 1-dim array [n] or 2-dim array [n x m] with the IoU ratio.

t_overlap_self (np.ndarray, optional): 1-dim array [n] or 2-dim array [n x m] with overlap_self; returned only when calculate_overlap_self is True.

Return type

np.ndarray

mmaction.core.evaluation.softmax(x, dim=1)[source]

Compute softmax values for each set of scores in x.

mmaction.core.evaluation.top_k_accuracy(scores, labels, topk=(1, ))[source]

Calculate top k accuracy score.

Parameters
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int]) – Ground truth labels.

  • topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).

Returns

Top k accuracy score for each k.

Return type

list[float]
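An illustrative example for top_k_accuracy; mean_class_accuracy (documented above) takes the same (scores, labels) inputs.

import numpy as np

from mmaction.core.evaluation import mean_class_accuracy, top_k_accuracy

scores = [np.array([0.1, 0.7, 0.2]),   # sample 0, predicted class 1
          np.array([0.5, 0.3, 0.2]),   # sample 1, predicted class 0
          np.array([0.2, 0.2, 0.6])]   # sample 2, predicted class 2
labels = [1, 2, 2]

top1, top2 = top_k_accuracy(scores, labels, topk=(1, 2))
mca = mean_class_accuracy(scores, labels)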

mmaction.core.evaluation.top_k_classes(scores, labels, k=10, mode='accurate')[source]

Calculate the top K most accurate (or inaccurate) classes.

Given the prediction scores, ground truth label and top-k value, compute the top K accurate (inaccurate) classes.

Parameters
  • scores (list[np.ndarray]) – Prediction scores for each class.

  • labels (list[int] | np.ndarray) – Ground truth labels.

  • k (int) – Top-k values. Default: 10.

  • mode (str) – Comparison mode for Top-k. Options are ‘accurate’ and ‘inaccurate’. Default: ‘accurate’.

Returns

List of the top K classes in the format (label_id, acc_ratio), sorted from high accuracy to low accuracy for ‘accurate’ mode and from low accuracy to high accuracy for ‘inaccurate’ mode.

Return type

list

scheduler

mmaction.localization

localization

mmaction.localization.eval_ap(detections, gt_by_cls, iou_range)[source]

Evaluate average precisions.

Parameters
  • detections (dict) – Results of detections.

  • gt_by_cls (dict) – Information of groundtruth.

  • iou_range (list) – Ranges of iou.

Returns

Average precision values of classes at ious.

Return type

list

mmaction.localization.generate_bsp_feature(video_list, video_infos, tem_results_dir, pgm_proposals_dir, top_k=1000, bsp_boundary_ratio=0.2, num_sample_start=8, num_sample_end=8, num_sample_action=16, num_sample_interp=3, tem_results_ext='.csv', pgm_proposal_ext='.csv', result_dict=None)[source]

Generate Boundary-Sensitive Proposal Feature with given proposals.

Parameters
  • video_list (list[int]) – List of video indexes to generate bsp_feature.

  • video_infos (list[dict]) – List of video_info dict that contains ‘video_name’.

  • tem_results_dir (str) – Directory to load temporal evaluation results.

  • pgm_proposals_dir (str) – Directory to load proposals.

  • top_k (int) – Number of proposals to be considered. Default: 1000

  • bsp_boundary_ratio (float) – Ratio for proposal boundary (start/end). Default: 0.2.

  • num_sample_start (int) – Num of samples for actionness in start region. Default: 8.

  • num_sample_end (int) – Num of samples for actionness in end region. Default: 8.

  • num_sample_action (int) – Num of samples for actionness in center region. Default: 16.

  • num_sample_interp (int) – Num of samples for interpolation for each sample point. Default: 3.

  • tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.

  • pgm_proposal_ext (str) – File extension for proposals. Default: ‘.csv’.

  • result_dict (dict | None) – The dict to save the results. Default: None.

Returns

A dict containing video_name as keys and bsp_feature as values. If result_dict is not None, the results are also saved into it.

Return type

bsp_feature_dict (dict)

mmaction.localization.generate_candidate_proposals(video_list, video_infos, tem_results_dir, temporal_scale, peak_threshold, tem_results_ext='.csv', result_dict=None)[source]

Generate Candidate Proposals with given temporal evaluation results. Each proposal file will contain: ‘tmin,tmax,tmin_score,tmax_score,score,match_iou,match_ioa’.

Parameters
  • video_list (list[int]) – List of video indexes to generate proposals.

  • video_infos (list[dict]) – List of video_info dict that contains ‘video_name’, ‘duration_frame’, ‘duration_second’, ‘feature_frame’, and ‘annotations’.

  • tem_results_dir (str) – Directory to load temporal evaluation results.

  • temporal_scale (int) – The number (scale) on temporal axis.

  • peak_threshold (float) – The threshold for proposal generation.

  • tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.

  • result_dict (dict | None) – The dict to save the results. Default: None.

Returns

A dict containing video_name as keys and proposal lists as values. If result_dict is not None, the results are also saved into it.

Return type

dict

mmaction.localization.load_localize_proposal_file(filename)[source]

Load the proposal file and split it into many parts which contain one video’s information separately.

Parameters

filename (str) – Path to the proposal file.

Returns

List of all videos’ information.

Return type

list

mmaction.localization.perform_regression(detections)[source]

Perform regression on detection results.

Parameters

detections (list) – Detection results before regression.

Returns

Detection results after regression.

Return type

list

mmaction.localization.soft_nms(proposals, alpha, low_threshold, high_threshold, top_k)[source]

Soft NMS for temporal proposals.

Parameters
  • proposals (np.ndarray) – Proposals generated by network.

  • alpha (float) – Alpha value of Gaussian decaying function.

  • low_threshold (float) – Low threshold for soft nms.

  • high_threshold (float) – High threshold for soft nms.

  • top_k (int) – Top k values to be considered.

Returns

The updated proposals.

Return type

np.ndarray

mmaction.localization.temporal_iop(proposal_min, proposal_max, gt_min, gt_max)[source]

Compute IoP score between a groundtruth bbox and the proposals.

Compute the IoP which is defined as the overlap ratio with groundtruth proportional to the duration of this proposal.

Parameters
  • proposal_min (list[float]) – List of temporal anchor min.

  • proposal_max (list[float]) – List of temporal anchor max.

  • gt_min (float) – Groundtruth temporal box min.

  • gt_max (float) – Groundtruth temporal box max.

Returns

List of intersection over anchor scores.

Return type

list[float]

mmaction.localization.temporal_iou(proposal_min, proposal_max, gt_min, gt_max)[source]

Compute IoU score between a groundtruth bbox and the proposals.

Parameters
  • proposal_min (list[float]) – List of temporal anchor min.

  • proposal_max (list[float]) – List of temporal anchor max.

  • gt_min (float) – Groundtruth temporal box min.

  • gt_max (float) – Groundtruth temporal box max.

Returns

List of iou scores.

Return type

list[float]
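A small numeric example for temporal_iou: one groundtruth segment [2.0, 6.0] against two temporal anchors (arrays are used for the anchor lists, which is an assumption about the accepted input type).

import numpy as np

from mmaction.localization import temporal_iou

proposal_min = np.array([1.0, 5.0])
proposal_max = np.array([4.0, 9.0])

iou = temporal_iou(proposal_min, proposal_max, 2.0, 6.0)
# anchor [1, 4]: overlap 2 / union 5 = 0.4; anchor [5, 9]: overlap 1 / union 7 ≈ 0.14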

mmaction.localization.temporal_nms(detections, threshold)[source]

Apply non-maximum suppression (NMS) to temporal detection results.

Parameters
  • detections (list) – Detection results before NMS.

  • threshold (float) – Threshold of NMS.

Returns

Detection results after NMS.

Return type

list

mmaction.models

models

class mmaction.models.ACRNHead(in_channels, out_channels, stride=1, num_convs=1, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, **kwargs)[source]

ACRN Head: Tile + 1x1 convolution + 3x3 convolution.

This module is proposed in Actor-Centric Relation Network

Parameters
  • in_channels (int) – The input channel.

  • out_channels (int) – The output channel.

  • stride (int) – The spatial stride.

  • num_convs (int) – The number of 3x3 convolutions in ACRNHead.

  • conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv3d’).

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True).

  • act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).

  • kwargs (dict) – Other new arguments, to be compatible with MMDet update.

forward(x, feat, rois, **kwargs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The extracted RoI feature.

  • feat (torch.Tensor) – The context feature.

  • rois (torch.Tensor) – The regions of interest.

Returns

The RoI features that have interacted with the context feature.

Return type

torch.Tensor

init_weights(**kwargs)[source]

Weight Initialization for ACRNHead.

class mmaction.models.AudioRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

Audio recognizer model framework.

forward(audios, label=None, return_loss=True)[source]

Define the computation performed at every call.

forward_gradcam(audios)[source]

Defines the computation performed at every call when using gradcam utils.

forward_test(audios)[source]

Defines the computation performed at every call when evaluation and testing.

forward_train(audios, labels)[source]

Defines the computation performed at every call when training.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

class mmaction.models.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]

Classification head for TSN on audio.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.4.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, focal_gamma=0.0, focal_alpha=1.0, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]

Simplest RoI head, with only two fc layers for classification and regression respectively.

Parameters
  • temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.

  • spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

  • in_channels (int) – The number of input channels. Default: 2048.

  • focal_alpha (float) – The hyper-parameter alpha for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 1.

  • focal_gamma (float) – The hyper-parameter gamma for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 0.

  • num_classes (int) – The number of classes. Default: 81.

  • dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.

  • dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.

  • topk (int or tuple[int]) – Parameter for evaluating Top-K accuracy. Default: (3, 5)

  • multilabel (bool) – Whether used for a multilabel task. Default: True.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

static get_recall_prec(pred_vec, target_vec)[source]

Computes the Recall/Precision for both multi-label and single label scenarios.

Note that the computation calculates the micro average.

Note that in both cases, the concept of correct/incorrect is the same.

Parameters
  • pred_vec (tensor[N x C]) – Each element is either 0 or 1.

  • target_vec – Each element is either 0 or 1. For single label it is expected that only one element is on (1), although this is not enforced.

topk_accuracy(pred, target, thr=0.5)[source]

Computes the Top-K Accuracies for both single and multi-label scenarios.

static topk_to_matrix(probs, k)[source]

Converts top-k to binary matrix.

class mmaction.models.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]

Binary Cross Entropy Loss with logits.

Parameters
  • loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

  • class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.

class mmaction.models.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]

Boundary Matching Network for temporal action proposal generation.

Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network

Parameters
  • temporal_dim (int) – Total frames selected for each video.

  • boundary_ratio (float) – Ratio for determining video boundaries.

  • num_samples (int) – Number of samples for each proposal.

  • num_samples_per_bin (int) – Number of bin samples for each sample.

  • feat_dim (int) – Feature dimension.

  • soft_nms_alpha (float) – Soft NMS alpha.

  • soft_nms_low_threshold (float) – Soft NMS low threshold.

  • soft_nms_high_threshold (float) – Soft NMS high threshold.

  • post_process_top_k (int) – Top k proposals in post process.

  • feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.

  • loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').

  • hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.

  • hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.

  • hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]

Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]

Define the computation performed at every call when testing.

forward_train(raw_feature, label_confidence, label_start, label_end)[source]

Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]

Generate training labels.

class mmaction.models.BMNLoss[source]

BMN Loss.

From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It will calculate loss for the BMN model. This loss is a weighted sum of

  1. temporal evaluation loss based on confidence scores of start and end positions.

  2. proposal evaluation regression loss based on confidence scores of candidate proposals.

  3. proposal evaluation classification loss based on classification results of candidate proposals.

forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]

Calculate Boundary Matching Network Loss.

Parameters
  • pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.

  • pred_start (torch.Tensor) – Predicted confidence score for start.

  • pred_end (torch.Tensor) – Predicted confidence score for end.

  • gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.

  • gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.

  • gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.

  • bm_mask (torch.Tensor) – Boundary-Matching mask.

  • weight_tem (float) – Weight for tem loss. Default: 1.0.

  • weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.

  • weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.

Returns

(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.

Return type

tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])

static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]

Calculate Proposal Evaluation Module Classification Loss.

Parameters
  • pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.

  • gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.

  • mask (torch.Tensor) – Boundary-Matching mask.

  • threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.

  • ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)

  • eps (float) – Epsilon for small value. Default: 1e-5

Returns

Proposal evaluation classification loss.

Return type

torch.Tensor

static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]

Calculate Proposal Evaluation Module Regression Loss.

Parameters
  • pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.

  • gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.

  • mask (torch.Tensor) – Boundary-Matching mask.

  • high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.

  • low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.

Returns

Proposal evaluation regression loss.

Return type

torch.Tensor

static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]

Calculate Temporal Evaluation Module Loss.

This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.

Parameters
  • pred_start (torch.Tensor) – Predicted start score by BMN model.

  • pred_end (torch.Tensor) – Predicted end score by BMN model.

  • gt_start (torch.Tensor) – Groundtruth confidence score for start.

  • gt_end (torch.Tensor) – Groundtruth confidence score for end.

Returns

Returned binary logistic loss.

Return type

torch.Tensor

class mmaction.models.BaseGCN(backbone, cls_head=None, train_cfg=None, test_cfg=None)[source]

Base class for GCN-based action recognition.

All GCN-based recognizers should subclass it. All subclasses should overwrite:

  • forward_train, supporting forward computation when training.

  • forward_test, supporting forward computation when testing.

Parameters
  • backbone (dict) – Backbone modules to extract feature.

  • cls_head (dict | None) – Classification head to process feature. Default: None.

  • train_cfg (dict | None) – Config for training. Default: None.

  • test_cfg (dict | None) – Config for testing. Default: None.

extract_feat(skeletons)[source]

Extract features through a backbone.

Parameters

skeletons (torch.Tensor) – The input skeletons.

Returns

The extracted features.

Return type

torch.tensor

forward(keypoint, label=None, return_loss=True, **kwargs)[source]

Define the computation performed at every call.

abstract forward_test(*args)[source]

Defines the computation performed at testing.

abstract forward_train(*args, **kwargs)[source]

Defines the computation performed at training.

init_weights()[source]

Initialize the model network weights.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

property with_cls_head

whether the recognizer has a cls_head

Type

bool

class mmaction.models.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0, topk=(1, 5))[source]

Base class for head.

All heads should subclass it. All subclasses should overwrite: init_weights, initializing weights in some modules; and forward, supporting forward computation for both training and testing.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).

  • multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.

  • label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.

  • topk (int | tuple) – Top-k accuracy. Default: (1, 5).

abstract forward(x)[source]

Defines the computation performed at every call.

abstract init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

loss(cls_score, labels, **kwargs)[source]

Calculate the loss given output cls_score, target labels.

Parameters
  • cls_score (torch.Tensor) – The output of the model.

  • labels (torch.Tensor) – The target output of the model.

Returns

A dict containing field ‘loss_cls’(mandatory) and ‘topk_acc’(optional).

Return type

dict
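A minimal subclass sketch illustrating the two methods BaseHead requires overriding; the single fc layer and the initialization std are illustrative only.

import torch.nn as nn

from mmaction.models import BaseHead

class ToyHead(BaseHead):
    """Toy classification head: one fc layer on a pooled feature."""

    def __init__(self, num_classes, in_channels, **kwargs):
        super().__init__(num_classes, in_channels, **kwargs)
        self.fc_cls = nn.Linear(in_channels, num_classes)

    def init_weights(self):
        nn.init.normal_(self.fc_cls.weight, std=0.01)
        nn.init.constant_(self.fc_cls.bias, 0)

    def forward(self, x):
        # x: (N, in_channels) pooled feature -> (N, num_classes) class scores
        return self.fc_cls(x)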

class mmaction.models.BaseRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

Base class for recognizers.

All recognizers should subclass it. All subclasses should overwrite:

  • forward_train, supporting forward computation when training.

  • forward_test, supporting forward computation when testing.

Parameters
  • backbone (dict) – Backbone modules to extract feature.

  • cls_head (dict | None) – Classification head to process feature. Default: None.

  • neck (dict | None) – Neck for feature fusion. Default: None.

  • train_cfg (dict | None) – Config for training. Default: None.

  • test_cfg (dict | None) – Config for testing. Default: None.

average_clip(cls_score, num_segs=1)[source]

Averaging class score over multiple clips.

Different averaging types (‘score’, ‘prob’ or None, defined in test_cfg) are used to compute the final averaged class score. Only called in test mode.

Parameters
  • cls_score (torch.Tensor) – Class score to be averaged.

  • num_segs (int) – Number of clips for each input sample.

Returns

Averaged class score.

Return type

torch.Tensor

extract_feat(imgs)[source]

Extract features through a backbone.

Parameters

imgs (torch.Tensor) – The input images.

Returns

The extracted features.

Return type

torch.tensor

forward(imgs, label=None, return_loss=True, **kwargs)[source]

Define the computation performed at every call.

abstract forward_gradcam(imgs)[source]

Defines the computation performed at every call when using gradcam utils.

abstract forward_test(imgs)[source]

Defines the computation performed at every call when evaluation and testing.

abstract forward_train(imgs, labels, **kwargs)[source]

Defines the computation performed at every call when training.

init_weights()[source]

Initialize the model network weights.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

property with_cls_head

whether the recognizer has a cls_head

Type

bool

property with_neck

whether the recognizer has a neck

Type

bool

class mmaction.models.BinaryLogisticRegressionLoss[source]

Binary Logistic Regression Loss.

It will calculate binary logistic regression loss given reg_score and label.

forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]

Calculate Binary Logistic Regression Loss.

Parameters
  • reg_score (torch.Tensor) – Predicted score by model.

  • label (torch.Tensor) – Groundtruth labels.

  • threshold (float) – Threshold for positive instances. Default: 0.5.

  • ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)

  • eps (float) – Epsilon for small value. Default: 1e-5.

Returns

Returned binary logistic loss.

Return type

torch.Tensor

class mmaction.models.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]

C3D backbone.

Parameters
  • pretrained (str | None) – Name of pretrained model.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. required keys are type, Default: None.

  • act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.

  • out_dim (int) – The dimension of last layer feature (after flatten). Depends on the input shape. Default: 8192.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for Initiation of fc layers. Default: 0.005.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data. the size of x is (num_batches, 3, 16, 112, 112).

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.CBFocalLoss(loss_weight=1.0, samples_per_cls=[], beta=0.9999, gamma=2.0)[source]

Class Balanced Focal Loss. Adapted from https://github.com/abhinanda-punnakkal/BABEL/. This loss is used in the skeleton-based action recognition baseline for BABEL.

Parameters
  • loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

  • samples_per_cls (list[int]) – The number of samples per class. Default: [].

  • beta (float) – Hyperparameter that controls the per class loss weight. Default: 0.9999.

  • gamma (float) – Hyperparameter of the focal loss. Default: 2.0.

class mmaction.models.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]

(2+1)d Conv module for R(2+1)d backbone.

https://arxiv.org/pdf/1711.11248.pdf.

Parameters
  • in_channels (int) – Same as nn.Conv3d.

  • out_channels (int) – Same as nn.Conv3d.

  • kernel_size (int | tuple[int]) – Same as nn.Conv3d.

  • stride (int | tuple[int]) – Same as nn.Conv3d.

  • padding (int | tuple[int]) – Same as nn.Conv3d.

  • dilation (int | tuple[int]) – Same as nn.Conv3d.

  • groups (int) – Same as nn.Conv3d.

  • bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]

Conv2d module for AudioResNet backbone.

Parameters
  • in_channels (int) – Same as nn.Conv2d.

  • out_channels (int) – Same as nn.Conv2d.

  • kernel_size (int | tuple[int]) – Same as nn.Conv2d.

  • op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.

  • stride (int | tuple[int]) – Same as nn.Conv2d.

  • padding (int | tuple[int]) – Same as nn.Conv2d.

  • dilation (int | tuple[int]) – Same as nn.Conv2d.

  • groups (int) – Same as nn.Conv2d.

  • bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]

Cross Entropy Loss.

Support two kinds of labels and their corresponding loss types. It’s worth mentioning that the loss type will be detected by the shape of cls_score and label.

  1. Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.

  2. Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft labels are supported.

Parameters
  • loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

  • class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
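A hedged sketch of the two supported label formats; calling the loss module directly as below assumes the standard nn.Module call pattern.

import torch

from mmaction.models import CrossEntropyLoss

loss_fn = CrossEntropyLoss(loss_weight=1.0)
cls_score = torch.randn(4, 10)                 # (batch, num_classes)

# Hard label: shape equals cls_score's shape with the class dimension removed.
hard_label = torch.randint(0, 10, (4,))
loss_hard = loss_fn(cls_score, hard_label)

# Soft label: a probability distribution with the same shape as cls_score.
soft_label = torch.softmax(torch.randn(4, 10), dim=1)
loss_soft = loss_fn(cls_score, soft_label)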

class mmaction.models.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]

Spatial Attention in Divided Space Time Attention.

Parameters
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0..

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0..

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights.

class mmaction.models.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]

Temporal Attention in Divided Space Time Attention.

Parameters
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0..

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0..

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights.

class mmaction.models.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]

Feature Bank Operator Head.

Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.

Parameters
  • lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.

  • fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of fbo is also in the config dict and supported fbo type is fbo_dict.

  • temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.

  • spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights(pretrained=None)[source]

Initialize the weights in the module.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Default: None.

sample_lfb(rois, img_metas)[source]

Sample long-term features for each ROI feature.

class mmaction.models.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs)[source]

FFN with pre normalization layer.

FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.

FFNWithNorm has one main difference with FFN:

  • It applies one normalization layer before forwarding the input data to the feed-forward networks.

Parameters
  • embed_dims (int) – Dimensions of embedding. Defaults to 256.

  • feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.

  • num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.

  • act_cfg (dict) – Config for activate layers. Defaults to dict(type=’ReLU’)

  • ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0..

  • add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.

  • dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.

  • init_cfg (dict) – The Config for initialization. Defaults to None.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

forward(x, residual=None)[source]

Forward function for FFN.

The function adds x to the output tensor if residual is None.

class mmaction.models.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]

Calculate the BCELoss for HVU.

Parameters
  • categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: [‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’].

  • category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).

  • category_loss_weights (tuple[float]) – Loss weights of categories, it applies only if loss_type == ‘individual’. The loss weights will be normalized so that the sum equals to 1, so that you can give any positive number as loss weight. Default: (1, 1, 1, 1, 1, 1).

  • loss_type (str) – The loss type we calculate, we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.

  • with_mask (bool) – Since some tag categories are missing for some video clips. If with_mask == True, we will not calculate loss for these missing categories. Otherwise, these missing categories are treated as negative samples.

  • reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.

  • loss_weight (float) – The loss weight. Default: 1.0.

class mmaction.models.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]

Classification head for I3D.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]

Long-Term Feature Bank (LFB).

LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding

The ROI features of videos are stored in the feature bank. The feature bank was generated by inferring with a lfb infer config.

Formally, LFB is a Dict whose keys are video IDs and its values are also Dicts whose keys are timestamps in seconds. Example of LFB:
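A hedged illustration of the structure just described (video IDs mapping to per-timestamp ROI features); the IDs, timestamps and feature sizes below are made up.

import torch

lfb = {
    'video_0001': {                                   # video ID
        901: [torch.randn(2048)],                     # ROI features stored at t = 901 s
        902: [torch.randn(2048), torch.randn(2048)],  # two boxes at t = 902 s
    },
    'video_0002': {
        15: [torch.randn(2048)],
    },
}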

Parameters
  • lfb_prefix_path (str) – The storage path of lfb.

  • max_num_sampled_feat (int) – The max number of sampled features. Default: 5.

  • window_size (int) – Window size of sampling long term feature. Default: 60.

  • lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.

  • dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).

  • device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.

  • lmdb_map_size (int) – Map size of lmdb. Default: 4e9.

  • construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.

class mmaction.models.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]

Long-Term Feature Bank Infer Head.

This head is used to derive and save the LFB without affecting the input.

Parameters
  • lfb_prefix_path (str) – The prefix path to store the lfb.

  • dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.

  • use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.

  • temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.

  • spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7, ), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]

MobileNetV2 backbone.

Parameters
  • pretrained (str | None) – Name of pretrained model. Default: None.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.

  • out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Note that the last stage in MobileNetV2 is conv2. Default: -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters
  • out_channels (int) – Number of output channels of the block.

  • num_blocks (int) – Number of blocks.

  • stride (int) – Stride of the first block. Default: 1.

  • expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.

train(mode=True)[source]

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns

self

Return type

Module
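
A minimal usage sketch, assuming MobileNetV2 is registered in the backbone registry under the name 'MobileNetV2'; input sizes are illustrative.

    import torch
    from mmaction.models import build_backbone

    # Build the backbone from a config dict and initialize its weights.
    backbone = build_backbone(dict(type='MobileNetV2', widen_factor=1.0))
    backbone.init_weights()

    # 2D backbones consume frames as a flat batch of images: (N * T, C, H, W).
    frames = torch.randn(8, 3, 224, 224)
    feats = backbone(frames)  # feature map(s) from the stages listed in out_indices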

class mmaction.models.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]

MobileNetV2 backbone for TSM.

Parameters
  • num_segments (int) – Number of frame segments. Default: 8.

  • is_shift (bool) – Whether to make temporal shift in res layers. Default: True.

  • shift_div (int) – Number of div for shift. Default: 8.

  • **kwargs (keyword arguments, optional) – Arguments for MobileNetV2.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_shift()[source]

Make temporal shift for some layers.

class mmaction.models.NLLLoss(loss_weight=1.0)[source]

NLL Loss.

It will calculate NLL loss given cls_score and label.

class mmaction.models.OHEMHingeLoss(*args, **kwargs)[source]

This class is the core implementation for the completeness loss in the paper.

It computes class-wise hinge loss and performs online hard example mining (OHEM).

static backward(ctx, grad_output)[source]

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as forward() returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs the gradient computed w.r.t. the output.

static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]

Calculate OHEM hinge loss.

Parameters
  • pred (torch.Tensor) – Predicted completeness score.

  • labels (torch.Tensor) – Groundtruth class label.

  • is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.

  • ohem_ratio (float) – Ratio of hard examples.

  • group_size (int) – Number of proposals sampled per video.

Returns

Returned class-wise hinge loss.

Return type

torch.Tensor
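
Since OHEMHingeLoss is an autograd Function, it is invoked through .apply. The sketch below is illustrative only; the 1-based label convention and the ratio/group values are assumptions, not verified against the repository.

    import torch
    from mmaction.models import OHEMHingeLoss

    # Completeness scores for 8 proposals over 3 classes (one group of 8 per video).
    pred = torch.randn(8, 3, requires_grad=True)
    labels = torch.randint(1, 4, (8,))  # assumed to be 1-based class indices

    # Treat the proposals as positives (is_positive=1) and keep the hardest half of the group.
    loss = OHEMHingeLoss.apply(pred, labels, 1, 0.5, 8)
    loss.backward()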

class mmaction.models.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]

Proposals Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters
  • pem_feat_dim (int) – Feature dimension.

  • pem_hidden_dim (int) – Hidden layer dimension.

  • pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.

  • pem_u_ratio_l (float) – Ratio for low score proposals to balance data.

  • pem_high_temporal_iou_threshold (float) – High IoU threshold.

  • pem_low_temporal_iou_threshold (float) – Low IoU threshold.

  • soft_nms_alpha (float) – Soft NMS alpha.

  • soft_nms_low_threshold (float) – Soft NMS low threshold.

  • soft_nms_high_threshold (float) – Soft NMS high threshold.

  • post_process_top_k (int) – Top k proposals in post process.

  • feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.

  • fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.

  • fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.

  • output_dim (int) – Output dimension. Default: 1.

forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]

Define the computation performed at every call.

forward_test(bsp_feature, tmin, tmax, tmin_score, tmax_score, video_meta)[source]

Define the computation performed at every call when testing.

forward_train(bsp_feature, reference_temporal_iou)[source]

Define the computation performed at every call when training.

class mmaction.models.Recognizer2D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

2D recognizer model framework.

forward_dummy(imgs, softmax=False)[source]

Used for computing network FLOPs.

See tools/analysis/get_flops.py.

Parameters

imgs (torch.Tensor) – Input images.

Returns

Class score.

Return type

Tensor

forward_gradcam(imgs)[source]

Defines the computation performed at every call when using gradcam utils.

forward_test(imgs)[source]

Defines the computation performed at every call when evaluation and testing.

forward_train(imgs, labels, **kwargs)[source]

Defines the computation performed at every call when training.
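
A hedged sketch of building a 2D recognizer from a config dict and calling forward_dummy for FLOPs counting; the backbone/head combination and all values are illustrative rather than a canonical config.

    import torch
    from mmaction.models import build_model

    # Hypothetical TSN-style recognizer config (values illustrative).
    model_cfg = dict(
        type='Recognizer2D',
        backbone=dict(type='ResNet', depth=50, pretrained=None),
        cls_head=dict(
            type='TSNHead',
            num_classes=400,
            in_channels=2048,
            consensus=dict(type='AvgConsensus', dim=1)))
    model = build_model(model_cfg)

    # forward_dummy expects (N, num_segs, C, H, W); see tools/analysis/get_flops.py.
    imgs = torch.randn(1, 8, 3, 224, 224)
    score = model.forward_dummy(imgs)  # class score(s) for the dummy batch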

class mmaction.models.Recognizer3D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

3D recognizer model framework.

forward_dummy(imgs, softmax=False)[source]

Used for computing network FLOPs.

See tools/analysis/get_flops.py.

Parameters

imgs (torch.Tensor) – Input images.

Returns

Class score.

Return type

Tensor

forward_gradcam(imgs)[source]

Defines the computation performed at every call when using gradcam utils.

forward_test(imgs)[source]

Defines the computation performed at every call when evaluation and testing.

forward_train(imgs, labels, **kwargs)[source]

Defines the computation performed at every call when training.

class mmaction.models.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3, ), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]

ResNet backbone.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Resnet stages. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).

  • dilations (Sequence[int]) – Dilation of each stage.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).

  • act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • partial_bn (bool) – Whether to use partial bn. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.ResNet2Plus1d(*args, **kwargs)[source]

ResNet (2+1)d backbone.

This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

class mmaction.models.ResNet3d(depth, pretrained, stage_blocks=None, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3, ), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(3, 7, 7), conv1_stride_s=2, conv1_stride_t=1, pool1_stride_s=2, pool1_stride_t=1, with_pool1=True, with_pool2=True, style='pytorch', frozen_stages=- 1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]

ResNet 3d backbone.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • stage_blocks (tuple | None) – Set number of stages for each res layer. Default: None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.

  • in_channels (int) – Channel num of input features. Default: 3.

  • base_channels (int) – Channel num of stem output features. Default: 64.

  • out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).

  • num_stages (int) – Resnet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).

  • conv1_stride_s (int) – Spatial stride of the first conv layer. Default: 2.

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_s (int) – Spatial stride of the first pooling layer. Default: 2.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • with_pool1 (bool) – Whether to use the first pooling layer. Default: True.

  • with_pool2 (bool) – Whether to use the second pooling layer. Default: True.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.

  • inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).

  • inflate_style (str) – ‘3x1x1’ or ‘3x3x3’, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Default: dict().

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

static make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]

Build residual layer for ResNet3D.

Parameters
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.

  • temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.

  • dilation (int) – Spacing between kernel elements. Default: 1.

  • style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.

  • inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.

  • inflate_style (str) – ‘3x1x1’ or ‘3x3x3’, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.

  • non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: 0.

  • non_local_cfg (dict) – Config for non-local module. Default: dict().

  • conv_cfg (dict | None) – Config for conv layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. Default: None.

  • act_cfg (dict | None) – Config for activate layers. Default: None.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module

train(mode=True)[source]

Set the optimization status when training.
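
A minimal forward-pass sketch; the SlowOnly-style settings (conv1_kernel, inflate) and the clip size are illustrative assumptions.

    import torch
    from mmaction.models import build_backbone

    # `pretrained` is a required argument; pass None to initialize from scratch.
    backbone = build_backbone(
        dict(type='ResNet3d', depth=50, pretrained=None,
             conv1_kernel=(1, 7, 7), inflate=(0, 0, 1, 1)))
    backbone.init_weights()

    # 3D backbones consume clips shaped (N, C, T, H, W).
    clip = torch.randn(2, 3, 8, 56, 56)
    feat = backbone(clip)  # feature map(s) from the stages listed in out_indices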

class mmaction.models.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]

ResNet backbone for CSN.

Parameters
  • depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).

  • inflate_style (str) – ‘3x1x1’ or ‘3x3x3’, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.

  • bottleneck_mode (str) – Determines how a 3D bottleneck block is factorized with channel-separated convolutions. If set to ‘ip’, the 3x3x3 conv2 layer is replaced with a 1x1x1 conventional convolution followed by a 3x3x3 depthwise convolution, i.e. the interaction-preserved channel-separated bottleneck block. If set to ‘ir’, the 3x3x3 conv2 layer is replaced with a 3x3x3 depthwise convolution only, i.e. the interaction-reduced channel-separated bottleneck block, which is obtained from the interaction-preserved block by removing the extra 1x1x1 convolution. Default: ‘ir’.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]

ResNet 3d Layer.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.

  • stage (int) – The index of Resnet stage. Default: 3.

  • base_channels (int) – Channel num of stem output features. Default: 64.

  • spatial_stride (int) – The 1st res block’s spatial stride. Default 2.

  • temporal_stride (int) – The 1st res block’s temporal stride. Default 1.

  • dilation (int) – The dilation. Default: 1.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • all_frozen (bool) – Frozen all modules in the layer. Default: False.

  • inflate (int) – Inflate Dims of each block. Default: 1.

  • inflate_style (str) – ‘3x1x1’ or ‘3x3x3’, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]

Slowfast backbone.

This module is proposed in SlowFast Networks for Video Recognition

Parameters
  • pretrained (str) – The file path to a pretrained model.

  • resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline with resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Default: 8.

  • speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.

  • channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.

  • slow_pathway (dict) –

    Configuration of the slow branch. It should contain the arguments needed to build the specified pathway type, plus: type (str), the type of backbone the pathway is based on; lateral (bool), whether to build lateral connections for the pathway. Default:

    dict(type='ResNetPathway',
    lateral=True, depth=50, pretrained=None,
    conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1),
    conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1))
    

  • fast_pathway (dict) –

    Configuration of the fast branch, similar to slow_pathway. Default:

    dict(type='ResNetPathway',
    lateral=False, depth=50, pretrained=None, base_channels=8,
    conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1)
    

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

tuple[torch.Tensor]

init_weights(pretrained=None)[source]

Initiate the parameters either from existing checkpoint or from scratch.
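
The defaults above already describe a standard SlowFast-R50, so only pretrained needs to be given. The sketch below only illustrates the input layout and that the backbone yields one feature map per pathway; the clip size is illustrative and T should be divisible by resample_rate.

    import torch
    from mmaction.models import build_backbone

    backbone = build_backbone(dict(type='ResNet3dSlowFast', pretrained=None))
    backbone.init_weights()

    # (N, C, T, H, W); with resample_rate=8 the slow pathway sees T/8 frames.
    clip = torch.randn(1, 3, 32, 64, 64)
    feats = backbone(clip)  # tuple of feature maps, one per pathway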

class mmaction.models.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]

SlowOnly backbone based on ResNet3dPathway.

Parameters
  • *args (arguments) – Arguments same as ResNet3dPathway.

  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).

  • **kwargs (keyword arguments) – Keywords arguments for ResNet3dPathway.

class mmaction.models.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=- 1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]

ResNet 2d audio backbone. Reference:

Parameters
  • depth (int) – Depth of resnet, from {50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • in_channels (int) – Channel num of input features. Default: 1.

  • base_channels (int) – Channel num of stem output features. Default: 32.

  • num_stages (int) – Resnet stages. Default: 4.

  • strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.

  • conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.

  • factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).

  • act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

static make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]

Build residual layer for ResNetAudio.

Parameters
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • stride (int) – Stride of the first block. Default: 1.

  • dilation (int) – Spacing between kernel elements. Default: 1.

  • factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: None.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]

ResNet backbone for TIN.

Parameters
  • depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments. Default: 8.

  • is_tin (bool) – Whether to apply temporal interlace. Default: True.

  • shift_div (int) – Number of division parts for shift. Default: 4.

  • kwargs (dict, optional) – Arguments for ResNet.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_interlace()[source]

Make temporal interlace for some layers.

class mmaction.models.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]

ResNet backbone for TSM.

Parameters
  • depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments. Default: 8.

  • is_shift (bool) – Whether to make temporal shift in res layers. Default: True.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Default: dict().

  • shift_div (int) – Number of div for shift. Default: 8.

  • shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.

  • temporal_pool (bool) – Whether to add temporal pooling. Default: False.

  • **kwargs (keyword arguments, optional) – Arguments for ResNet.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_pool()[source]

Make temporal pooling between layer1 and layer2, using a 3D max pooling layer.

make_temporal_shift()[source]

Make temporal shift for some layers.
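
A minimal sketch; init_weights() is the step that also inserts the temporal-shift wrappers (via make_temporal_shift) when is_shift=True, so it should be called before the forward pass. Sizes are illustrative.

    import torch
    from mmaction.models import build_backbone

    backbone = build_backbone(
        dict(type='ResNetTSM', depth=50, num_segments=8, shift_div=8))
    backbone.init_weights()  # also applies make_temporal_shift() when is_shift=True

    # TSM consumes frames as a flat batch of images: (N * num_segments, C, H, W).
    frames = torch.randn(2 * 8, 3, 224, 224)
    feat = backbone(frames)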

class mmaction.models.SSNLoss[source]
static activity_loss(activity_score, labels, activity_indexer)[source]

Activity Loss.

It will calculate activity loss given activity_score and label.

Parameters
  • activity_score (torch.Tensor) – Predicted activity score.

  • labels (torch.Tensor) – Groundtruth class label.

  • activity_indexer (torch.Tensor) – Index slices of proposals.

Returns

Returned cross entropy loss.

Return type

torch.Tensor

static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]

Classwise Regression Loss.

It will calculate classwise_regression loss given class_reg_pred and targets.

Parameters
  • bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.

  • labels (torch.Tensor) – Groundtruth class label.

  • bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.

  • regression_indexer (torch.Tensor) – Index slices of positive proposals.

Returns

Returned class-wise regression loss.

Return type

torch.Tensor

static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]

Completeness Loss.

It will calculate completeness loss given completeness_score and label.

Parameters
  • completeness_score (torch.Tensor) – Predicted completeness score.

  • labels (torch.Tensor) – Groundtruth class label.

  • completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.

  • positive_per_video (int) – Number of positive proposals sampled per video.

  • incomplete_per_video (int) – Number of incomplete proposals sampled per video.

  • ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.

Returns

Returned class-wise completeness loss.

Return type

torch.Tensor

forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]

Calculate SSN loss.

Parameters
  • activity_score (torch.Tensor) – Predicted activity score.

  • completeness_score (torch.Tensor) – Predicted completeness score.

  • bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.

  • proposal_type (torch.Tensor) – Type index slices of proposals.

  • labels (torch.Tensor) – Groundtruth class label.

  • bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.

  • train_cfg (dict) – Config for training.

Returns

(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.

Return type

dict([torch.Tensor, torch.Tensor, torch.Tensor])

class mmaction.models.STGCN(in_channels, graph_cfg, edge_importance_weighting=True, data_bn=True, pretrained=None, **kwargs)[source]

Backbone of Spatial temporal graph convolutional networks.

Parameters
  • in_channels (int) – Number of channels in the input data.

  • graph_cfg (dict) – The arguments for building the graph.

  • edge_importance_weighting (bool) – If True, adds a learnable importance weighting to the edges of the graph. Default: True.

  • data_bn (bool) – If True, adds data normalization to the inputs. Default: True.

  • pretrained (str | None) – Name of pretrained model.

  • **kwargs (optional) – Other parameters for graph convolution units.

Shape:
  • Input: \((N, in_channels, T_{in}, V_{in}, M_{in})\)

  • Output: \((N, num_class)\) where

    \(N\) is the batch size, \(T_{in}\) is the length of the input sequence, \(V_{in}\) is the number of graph nodes, and \(M_{in}\) is the number of instances in a frame.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.
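
A hedged sketch of running the backbone on COCO-style 17-joint skeletons; the graph_cfg values and tensor sizes are assumptions borrowed from typical skeleton configs and should be checked against the config files.

    import torch
    from mmaction.models import build_backbone

    backbone = build_backbone(
        dict(type='STGCN', in_channels=3,
             graph_cfg=dict(layout='coco', strategy='spatial')))
    backbone.init_weights()

    # Input layout: (N, in_channels, T, V, M) = (batch, channels, frames, joints, persons).
    keypoints = torch.randn(2, 3, 100, 17, 2)
    feat = backbone(keypoints)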

class mmaction.models.STGCNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', num_person=2, init_std=0.01, **kwargs)[source]

The classification head for STGCN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • num_person (int) – Number of person. Default: 2.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.SingleRoIExtractor3D(roi_layer_type='RoIAlign', featmap_stride=16, output_size=16, sampling_ratio=0, pool_mode='avg', aligned=True, with_temporal_pool=True, temporal_pool_mode='avg', with_global=False)[source]

Extract RoI features from a single level feature map.

Parameters
  • roi_layer_type (str) – Specify the RoI layer type. Default: ‘RoIAlign’.

  • featmap_stride (int) – Strides of input feature maps. Default: 16.

  • output_size (int | tuple) – Size or (Height, Width). Default: 16.

  • sampling_ratio (int) – Number of input samples to take for each output sample. 0 means taking samples densely for the current models. Default: 0.

  • pool_mode (str, 'avg' or 'max') – pooling mode in each bin. Default: ‘avg’.

  • aligned (bool) – if False, use the legacy implementation in MMDetection. If True, align the results more perfectly. Default: True.

  • with_temporal_pool (bool) – if True, avgpool the temporal dim. Default: True.

  • with_global (bool) – if True, concatenate the RoI feature with global feature. Default: False.

Note that sampling_ratio, pool_mode, aligned only apply when roi_layer_type is set as RoIAlign.

forward(feat, rois)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.SkeletonGCN(backbone, cls_head=None, train_cfg=None, test_cfg=None)[source]

Spatial temporal graph convolutional networks.

forward_test(skeletons)[source]

Defines the computation performed at every call when evaluation and testing.

forward_train(skeletons, labels, **kwargs)[source]

Defines the computation performed at every call when training.

class mmaction.models.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]

The classification head for SlowFast.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.SubBatchNorm3D(num_features, **cfg)[source]

Sub BatchNorm3d splits the batch dimension into N splits and runs BN on each of them separately, so that the statistics are computed independently on each subset (1/N of the batch) of examples. During evaluation, it aggregates the stats from all splits into one BN.

Parameters

num_features (int) – Dimensions of BatchNorm.

aggregate_stats()[source]

Synchronize running_mean and running_var to self.bn.

Call this before evaluation, then call model.eval(). In eval mode, the forward function uses self.bn instead of self.split_bn, and by that point the running_mean and running_var of self.bn have been aggregated from self.split_bn.
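
A minimal sketch of the evaluation recipe described above: fold the per-split statistics into the main BN of every SubBatchNorm3D module before switching the model to eval mode.

    from mmaction.models import SubBatchNorm3D

    def switch_to_eval(model):
        # Aggregate split_bn statistics into self.bn, then enable eval mode so
        # that forward() uses the aggregated statistics.
        for m in model.modules():
            if isinstance(m, SubBatchNorm3D):
                m.aggregate_stats()
        model.eval()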

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]

Temporal Adaptive Module (TAM) for TANet.

This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Parameters
  • in_channels (int) – Channel num of input features.

  • num_segments (int) – Number of frame segments.

  • alpha (int) – `alpha` in the paper and is the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.

  • adaptive_kernel_size (int) – `K` in the paper and is the size of the adaptive kernel size in the global branch. Default: 3.

  • beta (int) – `beta` in the paper and is set to control the model complexity in the local branch. Default: 4.

  • conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.

  • adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.

  • adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.

  • init_std (float) – Std value for initiation of nn.Linear. Default: 0.001.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor
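
A stand-alone usage sketch of the module; the channel count and spatial size are illustrative, and the input is assumed to be a flat batch of per-frame feature maps.

    import torch
    from mmaction.models import TAM

    tam = TAM(in_channels=256, num_segments=8)

    # (N * num_segments, C, H, W): 2 clips of 8 segments each.
    x = torch.randn(2 * 8, 256, 14, 14)
    out = tam(x)  # temporally re-aggregated features with the same layout as the input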

class mmaction.models.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]

Temporal Adaptive Network (TANet) backbone.

This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments.

  • tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().

  • **kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_tam_modeling()[source]

Replace ResNet-Block with TA-Block.

class mmaction.models.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]

Temporal Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters
  • tem_feat_dim (int) – Feature dimension.

  • tem_hidden_dim (int) – Hidden layer dimension.

  • tem_match_threshold (float) – Temporal evaluation match threshold.

  • loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').

  • loss_weight (float) – Weight term for action_loss. Default: 2.

  • output_dim (int) – Output dimension. Default: 3.

  • conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.

  • conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.

  • conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]

Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]

Define the computation performed at every call when testing.

forward_train(raw_feature, label_action, label_start, label_end)[source]

Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]

Generate training labels.

class mmaction.models.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]

TPN neck.

This module is proposed in Temporal Pyramid Network for Action Recognition

Parameters
  • in_channels (tuple[int]) – Channel numbers of input features tuple.

  • out_channels (int) – Channel number of output feature.

  • spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.

  • temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.

  • upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in nn.Upsample. Default: None.

  • downsample_cfg (dict | None) – Config for downsample layers. Default: None.

  • level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.

  • aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.

  • flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.

forward(x, target=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
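
In practice the neck is assembled from a config dict. The fragment below is adapted from the flavour of TPN configs shipped with the codebase, but the concrete channel numbers, scales and the aux head class count here are illustrative and should be verified against the actual config files.

    # Hypothetical TPN neck config for a ResNet-50-style 3D backbone.
    neck = dict(
        type='TPN',
        in_channels=(1024, 2048),
        out_channels=1024,
        spatial_modulation_cfg=dict(in_channels=(1024, 2048), out_channels=2048),
        temporal_modulation_cfg=dict(downsample_scales=(8, 8)),
        upsample_cfg=dict(scale_factor=(1, 1, 1)),
        downsample_cfg=dict(downsample_scale=(1, 1, 1)),
        level_fusion_cfg=dict(
            in_channels=(1024, 1024),
            mid_channels=(1024, 1024),
            out_channels=2048,
            downsample_scales=((1, 1, 1), (1, 1, 1))),
        aux_head_cfg=dict(out_channels=400, loss_weight=0.5))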

class mmaction.models.TPNHead(*args, **kwargs)[source]

Class head for TPN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict) – Consensus config dict.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.4.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.

  • label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.

forward(x, num_segs=None, fcn_test=False)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int | None) – Number of segments into which a video is divided. Default: None.

  • fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.

Returns

The classification scores for input samples.

Return type

torch.Tensor

class mmaction.models.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]

Class head for TRN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.

  • hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for Initiation. Default: 0.001.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Unused in TRNHead. By default, num_segs equals clip_len * num_clips * num_crops, which is automatically generated in the Recognizer forward phase and not needed by TRN models. The self.num_segments that TRN actually uses is a hyperparameter given when building the model.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]

Class head for TSM.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict) – Consensus config dict.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for Initiation. Default: 0.001.

  • is_shift (bool) – Indicating whether the feature is shifted. Default: True.

  • temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Unused in TSMHead. By default, num_segs equals clip_len * num_clips * num_crops, which is automatically generated in the Recognizer forward phase and not needed by TSM models. The self.num_segments that TSM actually uses is a hyperparameter given when building the model.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]

Class head for TSN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict) – Consensus config dict.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.4.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Number of segments into which a video is divided.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs)[source]

TimeSformer. A PyTorch implementation of Is Space-Time Attention All You Need for Video Understanding?

Parameters
  • num_frames (int) – Number of frames in the video.

  • img_size (int | tuple) – Size of input image.

  • patch_size (int) – Size of one patch.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • embed_dims (int) – Dimensions of embedding. Defaults to 768.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.

  • num_transformer_layers (int) – Number of transformer layers. Defaults to 12.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.

  • transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of the transformer layers in TransformerCoder. If it is a single mmcv.ConfigDict, it will be repeated num_transformer_layers times to form a list. Defaults to None.

  • attention_type (str) – Type of attentions in TransformerCoder. Choices are ‘divided_space_time’, ‘space_only’ and ‘joint_space_time’. Defaults to ‘divided_space_time’.

  • norm_cfg (dict) – Config for norm layers. Defaults to dict(type=’LN’, eps=1e-6).

forward(x)[source]

Defines the computation performed at every call.

init_weights(pretrained=None)[source]

Initiate the parameters either from existing checkpoint or from scratch.
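
A hedged construction-and-forward sketch; the frame count, image size and patch size follow common TimeSformer settings and are illustrative.

    import torch
    from mmaction.models import build_backbone

    backbone = build_backbone(
        dict(type='TimeSformer', num_frames=8, img_size=224, patch_size=16,
             attention_type='divided_space_time'))
    backbone.init_weights()

    # Clips are (N, C, T, H, W); T must equal num_frames.
    clip = torch.randn(1, 3, 8, 224, 224)
    feat = backbone(clip)  # clip-level embedding consumed by TimeSformerHead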

class mmaction.models.TimeSformerHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, init_std=0.02, **kwargs)[source]

Classification head for TimeSformer.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type=’CrossEntropyLoss’).

  • init_std (float) – Std value for Initiation. Defaults to 0.02.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=- 1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters
  • gamma_w (float) – Global channel width expansion factor. Default: 1.

  • gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.

  • gamma_d (float) – Network depth expansion factor. Default: 1.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Resnet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).

  • frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]

Build residual layer for ResNet3D.

Parameters
  • block (nn.Module) – Residual module to be built.

  • layer_inplanes (int) – Number of channels for the input feature of the res layer.

  • inplanes (int) – Number of channels for the input feature in each block, which equals to base_channels * gamma_w.

  • planes (int) – Number of channels for the output feature in each block, which equals to base_channel * gamma_w * gamma_b.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict | None) – Config for norm layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. Default: None.

  • act_cfg (dict | None) – Config for activate layers. Default: None.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]

Classification head for X3D.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for Initiation. Default: 0.01.

  • fc1_bias (bool) – If the first fc layer has bias. Default: False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

mmaction.models.build_backbone(cfg)[source]

Build backbone.

mmaction.models.build_head(cfg)[source]

Build head.

mmaction.models.build_localizer(cfg)[source]

Build localizer.

mmaction.models.build_loss(cfg)[source]

Build loss.

mmaction.models.build_model(cfg, train_cfg=None, test_cfg=None)[source]

Build model.

mmaction.models.build_neck(cfg)[source]

Build neck.

mmaction.models.build_recognizer(cfg, train_cfg=None, test_cfg=None)[source]

Build recognizer.
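
Each builder reads the type key of a config dict and instantiates the corresponding registered class, recursively building sub-modules such as the backbone and head. Below is a hedged end-to-end sketch; the backbone/head combination (including the assumed I3DHead) and all values are illustrative rather than a canonical config.

    from mmaction.models import build_model

    cfg = dict(
        type='Recognizer3D',
        backbone=dict(type='ResNet3dSlowOnly', depth=50, pretrained=None),
        cls_head=dict(
            type='I3DHead',  # assumed 3D classification head; use the head from your config
            num_classes=400, in_channels=2048,
            spatial_type='avg', dropout_ratio=0.5),
        test_cfg=dict(average_clips='prob'))

    model = build_model(cfg)  # dispatches to build_recognizer for recognizer types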

recognizers

class mmaction.models.recognizers.AudioRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

Audio recognizer model framework.

forward(audios, label=None, return_loss=True)[source]

Define the computation performed at every call.

forward_gradcam(audios)[source]

Defines the computation performed at every call when using gradcam utils.

forward_test(audios)[source]

Defines the computation performed at every call when evaluation and testing.

forward_train(audios, labels)[source]

Defines the computation performed at every call when training.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict
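For illustration only, a minimal sketch of the dictionary train_step is expected to return (loss names and values are hypothetical):

    import torch

    outputs = dict(
        loss=torch.tensor(1.23),                     # scalar tensor used for back propagation
        log_vars=dict(loss_cls=1.23, top1_acc=0.5),  # variables forwarded to the logger
        num_samples=8)                               # per-GPU batch size, used to average logs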

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

class mmaction.models.recognizers.BaseRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

Base class for recognizers.

All recognizers should subclass it. All subclasses should overwrite:

  • forward_train, supporting the forward pass during training.

  • forward_test, supporting the forward pass during testing.

Parameters
  • backbone (dict) – Backbone modules to extract feature.

  • cls_head (dict | None) – Classification head to process feature. Default: None.

  • neck (dict | None) – Neck for feature fusion. Default: None.

  • train_cfg (dict | None) – Config for training. Default: None.

  • test_cfg (dict | None) – Config for testing. Default: None.

average_clip(cls_score, num_segs=1)[source]

Averaging class score over multiple clips.

Uses the averaging type defined in test_cfg (‘score’, ‘prob’ or None) to compute the final averaged class score. Only called in test mode.

Parameters
  • cls_score (torch.Tensor) – Class score to be averaged.

  • num_segs (int) – Number of clips for each input sample.

Returns

Averaged class score.

Return type

torch.Tensor
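A hedged sketch of what the two averaging types mean (shapes are illustrative; this is not the exact implementation):

    import torch
    import torch.nn.functional as F

    num_segs, num_classes = 10, 400
    cls_score = torch.randn(2 * num_segs, num_classes)      # 2 samples x 10 clips each

    scores = cls_score.view(-1, num_segs, num_classes)
    avg_by_score = scores.mean(dim=1)                        # average_clips='score'
    avg_by_prob = F.softmax(scores, dim=2).mean(dim=1)       # average_clips='prob'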

extract_feat(imgs)[source]

Extract features through a backbone.

Parameters

imgs (torch.Tensor) – The input images.

Returns

The extracted features.

Return type

torch.tensor

forward(imgs, label=None, return_loss=True, **kwargs)[source]

Define the computation performed at every call.

abstract forward_gradcam(imgs)[source]

Defines the computation performed at every call when using gradcam utils.

abstract forward_test(imgs)[source]

Defines the computation performed at every call during evaluation and testing.

abstract forward_train(imgs, labels, **kwargs)[source]

Defines the computation performed at every call when training.

init_weights()[source]

Initialize the model network weights.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

property with_cls_head

whether the recognizer has a cls_head

Type

bool

property with_neck

whether the recognizer has a neck

Type

bool

class mmaction.models.recognizers.Recognizer2D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

2D recognizer model framework.

forward_dummy(imgs, softmax=False)[source]

Used for computing network FLOPs.

See tools/analysis/get_flops.py.

Parameters

imgs (torch.Tensor) – Input images.

Returns

Class score.

Return type

Tensor

forward_gradcam(imgs)[source]

Defines the computation performed at every call when using gradcam utils.

forward_test(imgs)[source]

Defines the computation performed at every call during evaluation and testing.

forward_train(imgs, labels, **kwargs)[source]

Defines the computation performed at every call when training.

class mmaction.models.recognizers.Recognizer3D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]

3D recognizer model framework.

forward_dummy(imgs, softmax=False)[source]

Used for computing network FLOPs.

See tools/analysis/get_flops.py.

Parameters

imgs (torch.Tensor) – Input images.

Returns

Class score.

Return type

Tensor

forward_gradcam(imgs)[source]

Defines the computation performed at every call when using gradcam utils.

forward_test(imgs)[source]

Defines the computation performed at every call during evaluation and testing.

forward_train(imgs, labels, **kwargs)[source]

Defines the computation performed at every call when training.
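A hedged, illustrative config for a 3D recognizer assembled from classes documented in this reference (all numeric values are examples, not taken from a released config):

    model = dict(
        type='Recognizer3D',
        backbone=dict(type='X3D', gamma_w=1.0, gamma_b=2.25, gamma_d=2.2),
        cls_head=dict(
            type='X3DHead',
            num_classes=400,
            in_channels=432,
            spatial_type='avg',
            dropout_ratio=0.5,
            fc1_bias=False),
        # test_cfg controls clip averaging, see BaseRecognizer.average_clip.
        test_cfg=dict(average_clips='prob'))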

localizers

class mmaction.models.localizers.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]

Boundary Matching Network for temporal action proposal generation.

Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network

Parameters
  • temporal_dim (int) – Total frames selected for each video.

  • boundary_ratio (float) – Ratio for determining video boundaries.

  • num_samples (int) – Number of samples for each proposal.

  • num_samples_per_bin (int) – Number of bin samples for each sample.

  • feat_dim (int) – Feature dimension.

  • soft_nms_alpha (float) – Soft NMS alpha.

  • soft_nms_low_threshold (float) – Soft NMS low threshold.

  • soft_nms_high_threshold (float) – Soft NMS high threshold.

  • post_process_top_k (int) – Top k proposals in post process.

  • feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.

  • loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').

  • hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.

  • hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.

  • hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
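A hedged, illustrative config wiring the required BMN arguments together (values are examples only):

    model = dict(
        type='BMN',
        temporal_dim=100,
        boundary_ratio=0.5,
        num_samples=32,
        num_samples_per_bin=3,
        feat_dim=400,
        soft_nms_alpha=0.4,
        soft_nms_low_threshold=0.5,
        soft_nms_high_threshold=0.9,
        post_process_top_k=100)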

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]

Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]

Define the computation performed at every call when testing.

forward_train(raw_feature, label_confidence, label_start, label_end)[source]

Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]

Generate training labels.

class mmaction.models.localizers.BaseTAGClassifier(backbone, cls_head, train_cfg=None, test_cfg=None)[source]

Base class for temporal action proposal classifier.

All temporal action proposal classifiers should subclass it. All subclasses should overwrite forward_train (forward pass during training) and forward_test (forward pass during testing).

extract_feat(imgs)[source]

Extract features through a backbone.

Parameters

imgs (torch.Tensor) – The input images.

Returns

The extracted features.

Return type

torch.tensor

forward(*args, return_loss=True, **kwargs)[source]

Define the computation performed at every call.

abstract forward_test(*args, **kwargs)[source]

Defines the computation performed at testing.

abstract forward_train(*args, **kwargs)[source]

Defines the computation performed at training.

init_weights()[source]

Weight initialization for model.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

class mmaction.models.localizers.BaseTAPGenerator[source]

Base class for temporal action proposal generator.

All temporal action proposal generators should subclass it. All subclasses should overwrite forward_train (forward pass during training) and forward_test (forward pass during testing).

abstract forward(*args, **kwargs)[source]

Define the computation performed at every call.

abstract forward_test(*args)[source]

Defines the computation performed at testing.

abstract forward_train(*args, **kwargs)[source]

Defines the computation performed at training.

train_step(data_batch, optimizer, **kwargs)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data_batch (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

class mmaction.models.localizers.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]

Proposals Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters
  • pem_feat_dim (int) – Feature dimension.

  • pem_hidden_dim (int) – Hidden layer dimension.

  • pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.

  • pem_u_ratio_l (float) – Ratio for low score proposals to balance data.

  • pem_high_temporal_iou_threshold (float) – High IoU threshold.

  • pem_low_temporal_iou_threshold (float) – Low IoU threshold.

  • soft_nms_alpha (float) – Soft NMS alpha.

  • soft_nms_low_threshold (float) – Soft NMS low threshold.

  • soft_nms_high_threshold (float) – Soft NMS high threshold.

  • post_process_top_k (int) – Top k proposals in post process.

  • feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.

  • fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.

  • fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.

  • output_dim (int) – Output dimension. Default: 1.

forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]

Define the computation performed at every call.

forward_test(bsp_feature, tmin, tmax, tmin_score, tmax_score, video_meta)[source]

Define the computation performed at every call when testing.

forward_train(bsp_feature, reference_temporal_iou)[source]

Define the computation performed at every call when training.

class mmaction.models.localizers.SSN(backbone, cls_head, in_channels=3, spatial_type='avg', dropout_ratio=0.5, loss_cls={'type': 'SSNLoss'}, train_cfg=None, test_cfg=None)[source]

Temporal Action Detection with Structured Segment Networks.

Parameters
  • backbone (dict) – Config for building backbone.

  • cls_head (dict) – Config for building classification head.

  • in_channels (int) – Number of channels for input data. Default: 3.

  • spatial_type (str) – Type of spatial pooling. Default: ‘avg’.

  • dropout_ratio (float) – Ratio of dropout. Default: 0.5.

  • loss_cls (dict) – Config for building loss. Default: dict(type='SSNLoss').

  • train_cfg (dict | None) – Config for training. Default: None.

  • test_cfg (dict | None) – Config for testing. Default: None.

forward_test(imgs, relative_proposal_list, scale_factor_list, proposal_tick_list, reg_norm_consts, **kwargs)[source]

Define the computation performed at every call when testing.

forward_train(imgs, proposal_scale_factor, proposal_type, proposal_labels, reg_targets, **kwargs)[source]

Define the computation performed at every call when training.

class mmaction.models.localizers.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]

Temporal Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters
  • tem_feat_dim (int) – Feature dimension.

  • tem_hidden_dim (int) – Hidden layer dimension.

  • tem_match_threshold (float) – Temporal evaluation match threshold.

  • loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').

  • loss_weight (float) – Weight term for action_loss. Default: 2.

  • output_dim (int) – Output dimension. Default: 3.

  • conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.

  • conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.

  • conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]

Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]

Define the computation performed at every call when testing.

forward_train(raw_feature, label_action, label_start, label_end)[source]

Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]

Generate training labels.

common

class mmaction.models.common.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]

(2+1)d Conv module for R(2+1)d backbone.

https://arxiv.org/pdf/1711.11248.pdf.

Parameters
  • in_channels (int) – Same as nn.Conv3d.

  • out_channels (int) – Same as nn.Conv3d.

  • kernel_size (int | tuple[int]) – Same as nn.Conv3d.

  • stride (int | tuple[int]) – Same as nn.Conv3d.

  • padding (int | tuple[int]) – Same as nn.Conv3d.

  • dilation (int | tuple[int]) – Same as nn.Conv3d.

  • groups (int) – Same as nn.Conv3d.

  • bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
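A minimal usage sketch (the input shape is illustrative):

    import torch
    from mmaction.models.common import Conv2plus1d

    conv = Conv2plus1d(
        in_channels=3, out_channels=64,
        kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
    x = torch.randn(1, 3, 8, 112, 112)   # (N, C, T, H, W)
    out = conv(x)                        # spatial 2D conv followed by temporal 1D conv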

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.common.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]

Conv2d module for AudioResNet backbone.

Parameters
  • in_channels (int) – Same as nn.Conv2d.

  • out_channels (int) – Same as nn.Conv2d.

  • kernel_size (int | tuple[int]) – Same as nn.Conv2d.

  • op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.

  • stride (int | tuple[int]) – Same as nn.Conv2d.

  • padding (int | tuple[int]) – Same as nn.Conv2d.

  • dilation (int | tuple[int]) – Same as nn.Conv2d.

  • groups (int) – Same as nn.Conv2d.

  • bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.common.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]

Spatial Attention in Divided Space Time Attention.

Parameters
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0..

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0..

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights.

class mmaction.models.common.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]

Temporal Attention in Divided Space Time Attention.

Parameters
  • embed_dims (int) – Dimensions of embedding.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder.

  • num_frames (int) – Number of frames in the video.

  • attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0..

  • proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0..

  • dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

  • init_cfg (dict | None) – The Config for initialization. Defaults to None.

forward(query, key=None, value=None, residual=None, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights.

class mmaction.models.common.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs)[source]

FFN with pre normalization layer.

FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.

FFNWithNorm has one main difference from FFN:

  • It applies one normalization layer before forwarding the input data to the feed-forward networks.

Parameters
  • embed_dims (int) – Dimensions of embedding. Defaults to 256.

  • feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.

  • num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.

  • act_cfg (dict) – Config for activate layers. Defaults to dict(type=’ReLU’)

  • ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0..

  • add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.

  • dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.

  • init_cfg (dict) – The Config for initialization. Defaults to None.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).

forward(x, residual=None)[source]

Forward function for FFN.

The function adds x to the output tensor if residual is None.

class mmaction.models.common.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]

Long-Term Feature Bank (LFB).

LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding

The ROI features of videos are stored in the feature bank. The feature bank is generated by running inference with an LFB infer config.

Formally, LFB is a Dict whose keys are video IDs and its values are also Dicts whose keys are timestamps in seconds. Example of LFB:
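A hedged sketch of the nested structure described above (video IDs, timestamps and feature dimensions are hypothetical):

    import torch

    lfb = {
        'video_0001': {                               # video ID
            905: [torch.zeros(2048)],                 # timestamp (s) -> list of ROI features
            906: [torch.zeros(2048), torch.zeros(2048)],
        },
        'video_0002': {
            17: [torch.zeros(2048)],
        },
    }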

Parameters
  • lfb_prefix_path (str) – The storage path of lfb.

  • max_num_sampled_feat (int) – The max number of sampled features. Default: 5.

  • window_size (int) – Window size of sampling long term feature. Default: 60.

  • lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.

  • dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).

  • device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.

  • lmdb_map_size (int) – Map size of lmdb. Default: 4e9.

  • construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.

class mmaction.models.common.SubBatchNorm3D(num_features, **cfg)[source]

Sub BatchNorm3d splits the batch dimension into N splits and runs BN on each of them separately (so that the stats are computed on each subset of examples, 1/N of the batch, independently). During evaluation, it aggregates the stats from all splits into one BN.

Parameters

num_features (int) – Dimensions of BatchNorm.

aggregate_stats()[source]

Synchronize running_mean and running_var to self.bn.

Call this before evaluation, then call model.eval(). In eval mode, the forward function uses self.bn instead of self.split_bn; by then the running_mean and running_var of self.bn have been aggregated from self.split_bn.
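A hedged usage sketch of the call order described above:

    import torch.nn as nn
    from mmaction.models.common import SubBatchNorm3D

    def prepare_for_eval(model: nn.Module) -> None:
        # Aggregate split-BN statistics into self.bn before switching to eval mode.
        for m in model.modules():
            if isinstance(m, SubBatchNorm3D):
                m.aggregate_stats()
        model.eval()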

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.common.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]

Temporal Adaptive Module(TAM) for TANet.

This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Parameters
  • in_channels (int) – Channel num of input features.

  • num_segments (int) – Number of frame segments.

  • alpha (int) – `alpha` in the paper and is the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.

  • adaptive_kernel_size (int) – `K` in the paper and is the size of the adaptive kernel size in the global branch. Default: 3.

  • beta (int) – `beta` in the paper and is set to control the model complexity in the local branch. Default: 4.

  • conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.

  • adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.

  • adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.

  • init_std (float) – Std value for initiation of nn.Linear. Default: 0.001.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

backbones

class mmaction.models.backbones.AGCN(in_channels, graph_cfg, data_bn=True, pretrained=None, **kwargs)[source]

Backbone of Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition.

Parameters
  • in_channels (int) – Number of channels in the input data.

  • graph_cfg (dict) – The arguments for building the graph.

  • data_bn (bool) – If ‘True’, adds data normalization to the inputs. Default: True.

  • pretrained (str | None) – Name of pretrained model.

  • **kwargs (optional) – Other parameters for graph convolution units.

Shape:
  • Input: \((N, in_channels, T_{in}, V_{in}, M_{in})\)

  • Output: \((N, num_class)\) where

    \(N\) is the batch size, \(T_{in}\) is the length of the input sequence, \(V_{in}\) is the number of graph nodes, and \(M_{in}\) is the number of instances in a frame.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]

C3D backbone.

Parameters
  • pretrained (str | None) – Name of pretrained model.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. required keys are type, Default: None.

  • act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.

  • out_dim (int) – The dimension of last layer feature (after flatten). Depends on the input shape. Default: 8192.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for initialization of fc layers. Default: 0.005.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data. the size of x is (num_batches, 3, 16, 112, 112).

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7), frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]

MobileNetV2 backbone.

Parameters
  • pretrained (str | None) – Name of pretrained model. Default: None.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.

  • out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Note that the last stage in MobileNetV2 is conv2. Default: -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Default: dict(type=’Conv’).

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN2d’, requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters
  • out_channels (int) – out_channels of block.

  • num_blocks (int) – number of blocks.

  • stride (int) – stride of the first block. Default: 1

  • expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.

train(mode=True)[source]

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns

self

Return type

Module

class mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]

MobileNetV2 backbone for TSM.

Parameters
  • num_segments (int) – Number of frame segments. Default: 8.

  • is_shift (bool) – Whether to make temporal shift in res layers. Default: True.

  • shift_div (int) – Number of div for shift. Default: 8.

  • **kwargs (keyword arguments, optional) – Arguments for MobileNetV2.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_shift()[source]

Make temporal shift for some layers.

class mmaction.models.backbones.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]

ResNet backbone.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Resnet stages. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage.

  • out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).

  • dilations (Sequence[int]) – Dilation of each stage.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).

  • act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • partial_bn (bool) – Whether to use partial bn. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs)[source]

ResNet (2+1)d backbone.

This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

class mmaction.models.backbones.ResNet3d(depth, pretrained, stage_blocks=None, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(3, 7, 7), conv1_stride_s=2, conv1_stride_t=1, pool1_stride_s=2, pool1_stride_t=1, with_pool1=True, with_pool2=True, style='pytorch', frozen_stages=- 1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]

ResNet 3d backbone.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • stage_blocks (tuple | None) – Set number of stages for each res layer. Default: None.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.

  • in_channels (int) – Channel num of input features. Default: 3.

  • base_channels (int) – Channel num of stem output features. Default: 64.

  • out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).

  • num_stages (int) – Resnet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).

  • conv1_stride_s (int) – Spatial stride of the first conv layer. Default: 2.

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_s (int) – Spatial stride of the first pooling layer. Default: 2.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • with_pool2 (bool) – Whether to use pool2. Default: True.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.

  • inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).

  • inflate_style (str) – 3x1x1 or 3x3x3. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.

  • conv_cfg (dict) – Config for conv layers. required keys are type Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Default: dict().

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

static make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]

Build residual layer for ResNet3D.

Parameters
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.

  • temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.

  • dilation (int) – Spacing between kernel elements. Default: 1.

  • style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.

  • inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.

  • inflate_style (str) – 3x1x1 or 3x3x3. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.

  • non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: 0.

  • non_local_cfg (dict) – Config for non-local module. Default: dict().

  • conv_cfg (dict | None) – Config for conv layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. Default: None.

  • act_cfg (dict | None) – Config for activate layers. Default: None.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]

ResNet backbone for CSN.

Parameters
  • depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).

  • inflate_style (str) – 3x1x1 or 3x3x3. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.

  • bottleneck_mode (str) –

    Determines how to factorize a 3D bottleneck block using channel-separated convolutional networks.

    If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., an Interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from the interaction-preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., an Interaction-reduced channel-separated bottleneck block.

    Default: ‘ir’.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]

ResNet 3d Layer.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.

  • stage (int) – The index of Resnet stage. Default: 3.

  • base_channels (int) – Channel num of stem output features. Default: 64.

  • spatial_stride (int) – The 1st res block’s spatial stride. Default 2.

  • temporal_stride (int) – The 1st res block’s temporal stride. Default 1.

  • dilation (int) – The dilation. Default: 1.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.

  • all_frozen (bool) – Frozen all modules in the layer. Default: False.

  • inflate (int) – Inflate Dims of each block. Default: 1.

  • inflate_style (str) – 3x1x1 or 3x3x3. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.

  • conv_cfg (dict) – Config for conv layers. required keys are type Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

  • kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]

Slowfast backbone.

This module is proposed in SlowFast Networks for Video Recognition

Parameters
  • pretrained (str) – The file path to a pretrained model.

  • resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline with resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Default: 8.

  • speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.

  • channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.

  • slow_pathway (dict) –

    Configuration of the slow branch. It should contain the necessary arguments for building the specific type of pathway, plus: type (str): type of backbone the pathway is based on; lateral (bool): whether to build lateral connections for the pathway. Default:

    dict(type='ResNetPathway',
    lateral=True, depth=50, pretrained=None,
    conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1),
    conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1))
    

  • fast_pathway (dict) –

    Configuration of fast branch, similar to slow_pathway. Default:

    dict(type='ResNetPathway',
    lateral=False, depth=50, pretrained=None, base_channels=8,
    conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1)
    

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

tuple[torch.Tensor]
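A hedged usage sketch showing the two-pathway output (the input shape is illustrative):

    import torch
    from mmaction.models import build_backbone

    backbone = build_backbone(dict(type='ResNet3dSlowFast', pretrained=None))
    x = torch.randn(1, 3, 32, 224, 224)   # (N, C, T, H, W)
    x_slow, x_fast = backbone(x)          # one feature map per pathway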

init_weights(pretrained=None)[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]

SlowOnly backbone based on ResNet3dPathway.

Parameters
  • *args (arguments) – Arguments same as ResNet3dPathway.

  • conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).

  • conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.

  • pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.

  • inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).

  • **kwargs (keyword arguments) – Keywords arguments for ResNet3dPathway.

class mmaction.models.backbones.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=- 1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]

ResNet 2d audio backbone. Reference:

Parameters
  • depth (int) – Depth of resnet, from {50, 101, 152}.

  • pretrained (str | None) – Name of pretrained model.

  • in_channels (int) – Channel num of input features. Default: 1.

  • base_channels (int) – Channel num of stem output features. Default: 32.

  • num_stages (int) – Resnet stages. Default: 4.

  • strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.

  • conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.

  • factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).

  • act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

static make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]

Build residual layer for ResNetAudio.

Parameters
  • block (nn.Module) – Residual module to be built.

  • inplanes (int) – Number of channels for the input feature in each block.

  • planes (int) – Number of channels for the output feature in each block.

  • blocks (int) – Number of residual blocks.

  • stride (int) – Stride of the first block. Default: 1.

  • dilation (int) – Spacing between kernel elements. Default: 1.

  • factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: None.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

train(mode=True)[source]

Set the optimization status when training.

class mmaction.models.backbones.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]

ResNet backbone for TIN.

Parameters
  • depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments. Default: 8.

  • is_tin (bool) – Whether to apply temporal interlace. Default: True.

  • shift_div (int) – Number of division parts for shift. Default: 4.

  • kwargs (dict, optional) – Arguments for ResNet.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_interlace()[source]

Make temporal interlace for some layers.

class mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]

ResNet backbone for TSM.

Parameters
  • num_segments (int) – Number of frame segments. Default: 8.

  • is_shift (bool) – Whether to make temporal shift in res layers. Default: True.

  • non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).

  • non_local_cfg (dict) – Config for non-local module. Default: dict().

  • shift_div (int) – Number of div for shift. Default: 8.

  • shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.

  • temporal_pool (bool) – Whether to add temporal pooling. Default: False.

  • **kwargs (keyword arguments, optional) – Arguments for ResNet.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_pool()[source]

Make temporal pooling between layer1 and layer2, using a 3D max pooling layer.

make_temporal_shift()[source]

Make temporal shift for some layers.
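A hedged, illustrative config for a 2D recognizer using this backbone (the head type and numbers are examples; TSMHead is assumed here):

    model = dict(
        type='Recognizer2D',
        backbone=dict(
            type='ResNetTSM',
            depth=50,
            num_segments=8,
            is_shift=True,
            shift_div=8,
            shift_place='blockres'),
        cls_head=dict(
            type='TSMHead',        # assumed head type for this sketch
            num_classes=400,
            in_channels=2048,
            num_segments=8))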

class mmaction.models.backbones.STGCN(in_channels, graph_cfg, edge_importance_weighting=True, data_bn=True, pretrained=None, **kwargs)[source]

Backbone of Spatial temporal graph convolutional networks.

Parameters
  • in_channels (int) – Number of channels in the input data.

  • graph_cfg (dict) – The arguments for building the graph.

  • edge_importance_weighting (bool) – If True, adds a learnable importance weighting to the edges of the graph. Default: True.

  • data_bn (bool) – If ‘True’, adds data normalization to the inputs. Default: True.

  • pretrained (str | None) – Name of pretrained model.

  • **kwargs (optional) – Other parameters for graph convolution units.

Shape:
  • Input: \((N, in_channels, T_{in}, V_{in}, M_{in})\)

  • Output: \((N, num_class)\) where

    \(N\) is the batch size, \(T_{in}\) is the length of the input sequence, \(V_{in}\) is the number of graph nodes, and \(M_{in}\) is the number of instances in a frame.
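A hedged sketch of a graph_cfg dictionary (the layout and strategy names are assumptions based on common skeleton layouts, not prescribed by this class):

    backbone = dict(
        type='STGCN',
        in_channels=3,
        edge_importance_weighting=True,
        graph_cfg=dict(layout='coco', strategy='spatial'))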

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The output of the module.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]

Temporal Adaptive Network (TANet) backbone.

This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.

Parameters
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.

  • num_segments (int) – Number of frame segments.

  • tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().

  • **kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_tam_modeling()[source]

Replace ResNet-Block with TA-Block.

class mmaction.models.backbones.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs)[source]

TimeSformer. A PyTorch implementation of Is Space-Time Attention All You Need for Video Understanding?

Parameters
  • num_frames (int) – Number of frames in the video.

  • img_size (int | tuple) – Size of input image.

  • patch_size (int) – Size of one patch.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • embed_dims (int) – Dimensions of embedding. Defaults to 768.

  • num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.

  • num_transformer_layers (int) – Number of transformer layers. Defaults to 12.

  • in_channels (int) – Channel num of input features. Defaults to 3.

  • dropout_ratio (float) – Probability of dropout layer. Defaults to 0..

  • transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of the transformer layers in TransformerCoder. If it is an mmcv.ConfigDict, it will be repeated num_transformer_layers times into a list[mmcv.ConfigDict]. Defaults to None.

  • attention_type (str) – Type of attentions in TransformerCoder. Choices are ‘divided_space_time’, ‘space_only’ and ‘joint_space_time’. Defaults to ‘divided_space_time’.

  • norm_cfg (dict) – Config for norm layers. Defaults to dict(type=’LN’, eps=1e-6).
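A hedged, illustrative backbone config following the signature defaults above (frame count, image size and patch size are examples):

    backbone = dict(
        type='TimeSformer',
        num_frames=8,
        img_size=224,
        patch_size=16,
        embed_dims=768,
        num_heads=12,
        num_transformer_layers=12,
        attention_type='divided_space_time',
        norm_cfg=dict(type='LN', eps=1e-6))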

forward(x)[source]

Defines the computation performed at every call.

init_weights(pretrained=None)[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=- 1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters
  • gamma_w (float) – Global channel width expansion factor. Default: 1.

  • gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.

  • gamma_d (float) – Network depth expansion factor. Default: 1.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Resnet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).

  • frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').

  • norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

  • kwargs (dict, optional) – Keyword arguments for “make_res_layer”.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]

Build residual layer for ResNet3D.

Parameters
  • block (nn.Module) – Residual module to be built.

  • layer_inplanes (int) – Number of channels for the input feature of the res layer.

  • inplanes (int) – Number of channels for the input feature in each block, which equals to base_channels * gamma_w.

  • planes (int) – Number of channels for the output feature in each block, which equals to base_channel * gamma_w * gamma_b.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • conv_cfg (dict | None) – Config for conv layers. Default: None.

  • norm_cfg (dict | None) – Config for norm layers. Default: None.

  • act_cfg (dict | None) – Config for activation layers. Default: None.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module

train(mode=True)[source]

Set the optimization status when training.
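
A minimal usage sketch; the expansion factors below loosely follow an X3D-M style setting and, together with the 5D (N, C, T, H, W) input layout, are assumptions for illustration:

import torch
from mmaction.models.backbones import X3D

backbone = X3D(gamma_w=1.0, gamma_b=2.25, gamma_d=2.2)
backbone.init_weights()

clip = torch.randn(1, 3, 16, 224, 224)
feat = backbone(clip)
print(feat.shape)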

heads

class mmaction.models.heads.ACRNHead(in_channels, out_channels, stride=1, num_convs=1, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, **kwargs)[source]

ACRN Head: Tile + 1x1 convolution + 3x3 convolution.

This module is proposed in Actor-Centric Relation Network

Parameters
  • in_channels (int) – The input channel.

  • out_channels (int) – The output channel.

  • stride (int) – The spatial stride.

  • num_convs (int) – The number of 3x3 convolutions in ACRNHead.

  • conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv3d’).

  • norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True).

  • act_cfg (dict) – Config for activation layers. Default: dict(type=’ReLU’, inplace=True).

  • kwargs (dict) – Other new arguments, to be compatible with MMDet update.

forward(x, feat, rois, **kwargs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The extracted RoI feature.

  • feat (torch.Tensor) – The context feature.

  • rois (torch.Tensor) – The regions of interest.

Returns

The RoI features that have interacted with the context feature.

Return type

torch.Tensor

init_weights(**kwargs)[source]

Weight Initialization for ACRNHead.

class mmaction.models.heads.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]

Classification head for TSN on audio.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.4.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.heads.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, focal_gamma=0.0, focal_alpha=1.0, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]

Simplest RoI head, with only two fc layers for classification and regression respectively.

Parameters
  • temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.

  • spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

  • in_channels (int) – The number of input channels. Default: 2048.

  • focal_alpha (float) – The hyper-parameter alpha for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 1.

  • focal_gamma (float) – The hyper-parameter gamma for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 0.

  • num_classes (int) – The number of classes. Default: 81.

  • dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.

  • dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.

  • topk (int or tuple[int]) – Parameter for evaluating Top-K accuracy. Default: (3, 5)

  • multilabel (bool) – Whether used for a multilabel task. Default: True.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

static get_recall_prec(pred_vec, target_vec)[source]

Computes the Recall/Precision for both multi-label and single label scenarios.

Note that the computation calculates the micro average.

Note that in both cases, the concept of correct/incorrect is the same.

Parameters
  • pred_vec (tensor[N x C]) – Each element is either 0 or 1.

  • target_vec – Each element is either 0 or 1. For single label, it is expected that only one element is on (1), although this is not enforced.
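
A standalone sketch of the micro-averaged recall/precision idea behind this method (this is not the library implementation, only an illustration of the computation on binary matrices):

import torch

def micro_recall_prec(pred_vec, target_vec, eps=1e-6):
    # pred_vec / target_vec: N x C binary (bool) matrices.
    tp = (pred_vec & target_vec).sum().float()       # true positives
    recall = tp / (target_vec.sum().float() + eps)   # share of ground truth recovered
    precision = tp / (pred_vec.sum().float() + eps)  # share of predictions correct
    return recall, precision

pred = torch.tensor([[1, 0, 1], [0, 1, 0]], dtype=torch.bool)
target = torch.tensor([[1, 0, 0], [0, 1, 1]], dtype=torch.bool)
print(micro_recall_prec(pred, target))  # both are 2/3 here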

topk_accuracy(pred, target, thr=0.5)[source]

Computes the Top-K Accuracies for both single and multi-label scenarios.

static topk_to_matrix(probs, k)[source]

Converts top-k to binary matrix.

class mmaction.models.heads.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0, topk=(1, 5))[source]

Base class for head.

All heads should subclass it. All subclasses should overwrite:

  • Methods: init_weights, initializing weights in some modules.

  • Methods: forward, supporting forward for both training and testing.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).

  • multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.

  • label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.

  • topk (int | tuple) – Top-k accuracy. Default: (1, 5).

abstract forward(x)[source]

Defines the computation performed at every call.

abstract init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

loss(cls_score, labels, **kwargs)[source]

Calculate the loss given output cls_score, target labels.

Parameters
  • cls_score (torch.Tensor) – The output of the model.

  • labels (torch.Tensor) – The target output of the model.

Returns

A dict containing field ‘loss_cls’(mandatory) and ‘topk_acc’(optional).

Return type

dict
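
A hypothetical minimal subclass, sketched to show the two methods that must be overridden and how loss() is then used; the layer and feature shapes are illustrative assumptions:

import torch
import torch.nn as nn
from mmaction.models.heads import BaseHead

class DummyHead(BaseHead):
    # Toy head: global-average-pool the feature, then classify.
    def __init__(self, num_classes, in_channels, **kwargs):
        super().__init__(num_classes, in_channels, **kwargs)
        self.fc = nn.Linear(in_channels, num_classes)

    def init_weights(self):
        nn.init.normal_(self.fc.weight, std=0.01)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        # x: (N, in_channels, ...) -> pool away every trailing dim.
        x = x.flatten(2).mean(dim=2)
        return self.fc(x)

head = DummyHead(num_classes=10, in_channels=64)
head.init_weights()
cls_score = head(torch.randn(4, 64, 8, 7, 7))
losses = head.loss(cls_score, torch.randint(0, 10, (4,)))
print(losses['loss_cls'])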

class mmaction.models.heads.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]

Feature Bank Operator Head.

Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.

Parameters
  • lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.

  • fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of fbo is also in the config dict and supported fbo type is fbo_dict.

  • temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.

  • spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights(pretrained=None)[source]

Initialize the weights in the module.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Default: None.

sample_lfb(rois, img_metas)[source]

Sample long-term features for each ROI feature.

class mmaction.models.heads.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]

Classification head for I3D.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.
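
A minimal usage sketch; the (N, in_channels, T, H, W) feature shape below is an illustrative assumption about what the 3D backbone produces:

import torch
from mmaction.models.heads import I3DHead

head = I3DHead(num_classes=400, in_channels=2048)
head.init_weights()

feat = torch.randn(2, 2048, 4, 7, 7)
cls_score = head(feat)
print(cls_score.shape)  # (2, 400)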

class mmaction.models.heads.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]

Long-Term Feature Bank Infer Head.

This head is used to derive and save the LFB without affecting the input.

Parameters
  • lfb_prefix_path (str) – The prefix path to store the lfb.

  • dataset_mode (str, optional) – The dataset split to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.

  • use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.

  • temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.

  • spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.heads.SSNHead(dropout_ratio=0.8, in_channels=1024, num_classes=20, consensus={'num_seg': (2, 5, 2), 'standalong_classifier': True, 'stpp_cfg': (1, 1, 1), 'type': 'STPPTrain'}, use_regression=True, init_std=0.001)[source]

The classification head for SSN.

Parameters
  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • in_channels (int) – Number of channels for input data. Default: 1024.

  • num_classes (int) – Number of classes to be classified. Default: 20.

  • consensus (dict) – Config of segmental consensus.

  • use_regression (bool) – Whether to perform regression or not. Default: True.

  • init_std (float) – Std value for initialization. Default: 0.001.

forward(x, test_mode=False)[source]

Defines the computation performed at every call.

init_weights()[source]

Initiate the parameters from scratch.

prepare_test_fc(stpp_feat_multiplier)[source]

Reorganize the shape of the fully connected layer at test time to improve testing efficiency.

Parameters

stpp_feat_multiplier (int) – Total number of parts.

Returns

Whether the shape transformation is ready for testing.

Return type

bool

class mmaction.models.heads.STGCNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', num_person=2, init_std=0.01, **kwargs)[source]

The classification head for STGCN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • num_person (int) – Number of persons. Default: 2.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

init_weights()[source]

Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.heads.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]

The classification head for SlowFast.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.heads.TPNHead(*args, **kwargs)[source]

Class head for TPN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict) – Consensus config dict.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.4.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.

  • label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.

forward(x, num_segs=None, fcn_test=False)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int | None) – Number of segments into which a video is divided. Default: None.

  • fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.

Returns

The classification scores for input samples.

Return type

torch.Tensor

class mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]

Class head for TRN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.

  • hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for initialization. Default: 0.001.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Useless in TRNHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is automatically generated in the Recognizer forward phase and useless in TRN models. The self.num_segments we need is a hyperparameter used to build TRN models.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.heads.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]

Class head for TSM.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • num_segments (int) – Number of frame segments. Default: 8.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict) – Consensus config dict.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.8.

  • init_std (float) – Std value for initialization. Default: 0.001.

  • is_shift (bool) – Indicating whether the feature is shifted. Default: True.

  • temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Useless in TSMHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is automatically generated in the Recognizer forward phase and useless in TSM models. The self.num_segments we need is a hyperparameter used to build TSM models.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.heads.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]

Class head for TSN.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • consensus (dict) – Consensus config dict.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.4.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]

Defines the computation performed at every call.

Parameters
  • x (torch.Tensor) – The input data.

  • num_segs (int) – Number of segments into which a video is divided.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.
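
A minimal usage sketch; the (N * num_segs, in_channels, H, W) feature layout is an assumption based on the usual 2D recognizer flow:

import torch
from mmaction.models.heads import TSNHead

head = TSNHead(num_classes=400, in_channels=2048)
head.init_weights()

feat = torch.randn(2 * 3, 2048, 7, 7)  # 2 videos x 3 segments
cls_score = head(feat, num_segs=3)
print(cls_score.shape)  # (2, 400)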

class mmaction.models.heads.TimeSformerHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, init_std=0.02, **kwargs)[source]

Classification head for TimeSformer.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Defaults to dict(type=’CrossEntropyLoss’).

  • init_std (float) – Std value for initialization. Defaults to 0.02.

  • kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]

Defines the computation performed at every call.

init_weights()[source]

Initiate the parameters from scratch.

class mmaction.models.heads.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]

Classification head for X3D.

Parameters
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Std value for initialization. Default: 0.01.

  • fc1_bias (bool) – Whether the first fc layer has bias. Default: False.

forward(x)[source]

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]

Initiate the parameters from scratch.

necks

class mmaction.models.necks.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]

TPN neck.

This module is proposed in Temporal Pyramid Network for Action Recognition

Parameters
  • in_channels (tuple[int]) – Channel numbers of input features tuple.

  • out_channels (int) – Channel number of output feature.

  • spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.

  • temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.

  • upsample_cfg (dict | None) – Config for upsample layers. The keys are same as that in :class:nn.Upsample. Default: None.

  • downsample_cfg (dict | None) – Config for downsample layers. Default: None.

  • level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.

  • aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.

  • flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.

forward(x, target=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

losses

class mmaction.models.losses.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]

Binary Cross Entropy Loss with logits.

Parameters
  • loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

  • class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.

class mmaction.models.losses.BMNLoss[source]

BMN Loss.

From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It will calculate loss for the BMN model. This loss is a weighted sum of:

  1. temporal evaluation loss, based on the confidence scores of start and end positions;

  2. proposal evaluation regression loss, based on the confidence scores of candidate proposals;

  3. proposal evaluation classification loss, based on the classification results of candidate proposals.

forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]

Calculate Boundary Matching Network Loss.

Parameters
  • pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.

  • pred_start (torch.Tensor) – Predicted confidence score for start.

  • pred_end (torch.Tensor) – Predicted confidence score for end.

  • gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.

  • gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.

  • gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.

  • bm_mask (torch.Tensor) – Boundary-Matching mask.

  • weight_tem (float) – Weight for tem loss. Default: 1.0.

  • weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.

  • weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.

Returns

(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.

Return type

tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
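
To make the weighted sum above concrete, here is a small sketch of how the total loss is assembled from the three terms, using hypothetical per-term values in place of the tensors returned by the sub-losses:

# Default weights documented above.
weight_tem, weight_pem_reg, weight_pem_cls = 1.0, 10.0, 1.0

# Hypothetical per-term values standing in for tem_loss, pem_reg_loss
# and pem_cls_loss.
tem_loss, pem_reg_loss, pem_cls_loss = 0.40, 0.02, 0.30

loss = (weight_tem * tem_loss
        + weight_pem_reg * pem_reg_loss
        + weight_pem_cls * pem_cls_loss)
print(loss)  # 0.4 + 0.2 + 0.3 = 0.9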

static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]

Calculate Proposal Evaluation Module Classification Loss.

Parameters
  • pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.

  • gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.

  • mask (torch.Tensor) – Boundary-Matching mask.

  • threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.

  • ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)

  • eps (float) – Epsilon for small value. Default: 1e-5

Returns

Proposal evaluation classification loss.

Return type

torch.Tensor

static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]

Calculate Proposal Evaluation Module Regression Loss.

Parameters
  • pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.

  • gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.

  • mask (torch.Tensor) – Boundary-Matching mask.

  • high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.

  • low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.

Returns

Proposal evaluation regression loss.

Return type

torch.Tensor

static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]

Calculate Temporal Evaluation Module Loss.

This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.

Parameters
  • pred_start (torch.Tensor) – Predicted start score by BMN model.

  • pred_end (torch.Tensor) – Predicted end score by BMN model.

  • gt_start (torch.Tensor) – Groundtruth confidence score for start.

  • gt_end (torch.Tensor) – Groundtruth confidence score for end.

Returns

Returned binary logistic loss.

Return type

torch.Tensor

class mmaction.models.losses.BaseWeightedLoss(loss_weight=1.0)[source]

Base class for loss.

All subclasses should overwrite the _forward() method, which returns the normal loss without loss weights.

Parameters

loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Parameters
  • *args – The positional arguments for the corresponding loss.

  • **kwargs – The keyword arguments for the corresponding loss.

Returns

The calculated loss.

Return type

torch.Tensor
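
Because the loss weighting is applied in forward(), a subclass only needs to implement _forward(). The sketch below, with hypothetical names, shows a simple L1-style loss built on top of it:

import torch
from mmaction.models.losses import BaseWeightedLoss

class ToyL1Loss(BaseWeightedLoss):
    # Hypothetical example: mean absolute error scaled by loss_weight.
    def _forward(self, pred, target):
        return torch.abs(pred - target).mean()

loss_fn = ToyL1Loss(loss_weight=0.5)
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 2.0, 2.0])
print(loss_fn(pred, target))  # 0.5 * mean(|pred - target|) = 0.25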

class mmaction.models.losses.BinaryLogisticRegressionLoss[source]

Binary Logistic Regression Loss.

It will calculate binary logistic regression loss given reg_score and label.

forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]

Calculate Binary Logistic Regression Loss.

Parameters
  • reg_score (torch.Tensor) – Predicted score by model.

  • label (torch.Tensor) – Groundtruth labels.

  • threshold (float) – Threshold for positive instances. Default: 0.5.

  • ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)

  • eps (float) – Epsilon for small value. Default: 1e-5.

Returns

Returned binary logistic loss.

Return type

torch.Tensor

class mmaction.models.losses.CBFocalLoss(loss_weight=1.0, samples_per_cls=[], beta=0.9999, gamma=2.0)[source]

Class Balanced Focal Loss. Adapted from https://github.com/abhinanda-punnakkal/BABEL/. This loss is used in the skeleton-based action recognition baseline for BABEL.

Parameters
  • loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

  • samples_per_cls (list[int]) – The number of samples per class. Default: [].

  • beta (float) – Hyperparameter that controls the per class loss weight. Default: 0.9999.

  • gamma (float) – Hyperparameter of the focal loss. Default: 2.0.

class mmaction.models.losses.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]

Cross Entropy Loss.

Support two kinds of labels and their corresponding loss types. It is worth mentioning that the loss type will be detected by the shape of cls_score and label.

  1. Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.

  2. Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft label is supported.

Parameters
  • loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

  • class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
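
A minimal sketch of the two label modes described above (shapes are illustrative):

import torch
from mmaction.models.losses import CrossEntropyLoss

loss_fn = CrossEntropyLoss()
cls_score = torch.randn(4, 5)            # (N, num_classes) logits

# Hard labels: integer class indices, shape (N,).
hard_labels = torch.tensor([0, 2, 1, 4])
print(loss_fn(cls_score, hard_labels))

# Soft labels: per-class probabilities, same shape as cls_score.
soft_labels = torch.softmax(torch.randn(4, 5), dim=1)
print(loss_fn(cls_score, soft_labels))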

class mmaction.models.losses.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]

Calculate the BCELoss for HVU.

Parameters
  • categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: [‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’].

  • category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).

  • category_loss_weights (tuple[float]) – Loss weights of categories, applied only if loss_type == ‘individual’. The loss weights will be normalized so that the sum equals 1, so you can give any positive number as a loss weight. Default: (1, 1, 1, 1, 1, 1).

  • loss_type (str) – The loss type we calculate: we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.

  • with_mask (bool) – Since some tag categories are missing for some video clips, if with_mask == True, we will not calculate loss for these missing categories; otherwise, these missing categories are treated as negative samples.

  • reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.

  • loss_weight (float) – The loss weight. Default: 1.0.

class mmaction.models.losses.NLLLoss(loss_weight=1.0)[source]

NLL Loss.

It will calculate NLL loss given cls_score and label.

class mmaction.models.losses.OHEMHingeLoss(*args, **kwargs)[source]

This class is the core implementation for the completeness loss in the paper.

It computes the class-wise hinge loss and performs online hard example mining (OHEM).

static backward(ctx, grad_output)[source]

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]

Calculate OHEM hinge loss.

Parameters
  • pred (torch.Tensor) – Predicted completeness score.

  • labels (torch.Tensor) – Groundtruth class label.

  • is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.

  • ohem_ratio (float) – Ratio of hard examples.

  • group_size (int) – Number of proposals sampled per video.

Returns

Returned class-wise hinge loss.

Return type

torch.Tensor

class mmaction.models.losses.SSNLoss[source]
static activity_loss(activity_score, labels, activity_indexer)[source]

Activity Loss.

It will calculate activity loss given activity_score and label.

Parameters
  • activity_score (torch.Tensor) – Predicted activity score.

  • labels (torch.Tensor) – Groundtruth class label.

  • activity_indexer (torch.Tensor) – Index slices of proposals.

Returns

Returned cross entropy loss.

Return type

torch.Tensor

static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]

Classwise Regression Loss.

It will calculate classwise_regression loss given class_reg_pred and targets.

Parameters
  • bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.

  • labels (torch.Tensor) – Groundtruth class label.

  • bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.

  • regression_indexer (torch.Tensor) – Index slices of positive proposals.

Returns

Returned class-wise regression loss.

Return type

torch.Tensor

static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]

Completeness Loss.

It will calculate completeness loss given completeness_score and label.

Parameters
  • completeness_score (torch.Tensor) – Predicted completeness score.

  • labels (torch.Tensor) – Groundtruth class label.

  • completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.

  • positive_per_video (int) – Number of positive proposals sampled per video.

  • incomplete_per_video (int) – Number of incomplete proposals sampled per video.

  • ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.

Returns

Returned class-wise completeness loss.

Return type

torch.Tensor

forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]

Calculate SSN loss.

Parameters
  • activity_score (torch.Tensor) – Predicted activity score.

  • completeness_score (torch.Tensor) – Predicted completeness score.

  • bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.

  • proposal_type (torch.Tensor) – Type index slices of proposals.

  • labels (torch.Tensor) – Groundtruth class label.

  • bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.

  • train_cfg (dict) – Config for training.

Returns

(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.

Return type

dict([torch.Tensor, torch.Tensor, torch.Tensor])

mmaction.datasets

datasets

class mmaction.datasets.AVADataset(ann_file, exclude_file, pipeline, label_file=None, filename_tmpl='img_{:05}.jpg', start_index=0, proposal_file=None, person_det_score_thr=0.9, num_classes=81, custom_classes=None, data_prefix=None, test_mode=False, modality='RGB', num_max_proposals=1000, timestamp_start=900, timestamp_end=1800, fps=30)[source]

AVA dataset for spatial temporal detection.

Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.

This dataset can load information from the following files:

ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt /
              ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl

Particularly, the proposal_file is a pickle file which contains img_key (in format of {video_id},{timestamp}). Example of a pickle file:

{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011   , 0.157   , 0.655   , 0.983   , 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054   , 0.088   , 0.91    , 0.998   , 0.068273],
               [0.016   , 0.161   , 0.519   , 0.974   , 0.984025],
               [0.493   , 0.283   , 0.981   , 0.984   , 0.983621]]),
    ...
}
Parameters
  • ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.

  • exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Default: None.

  • filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 0.

  • proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Default: None.

  • person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Default: 0.9. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used.

  • num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)

  • custom_classes (list[int]) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1

  • data_prefix (str) – Path to a directory where videos are held. Default: None.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

  • modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.

  • num_max_proposals (int) – Max proposals number to store. Default: 1000.

  • timestamp_start (int) – The start point of included timestamps. The default value is taken from the official website. Default: 900.

  • timestamp_end (int) – The end point of included timestamps. The default value is taken from the official website. Default: 1800.

  • fps (int) – Overrides the default FPS for the dataset. Default: 30.

dump_results(results, out)[source]

Dump predictions into a csv file.

evaluate(results, metrics=('mAP'), metric_options=None, logger=None)[source]

Evaluate the prediction results and report mAP.

filter_exclude_file()[source]

Filter out records in the exclude_file.

load_annotations()[source]

Load AVA annotations.

parse_img_record(img_records)[source]

Merge image records of the same entity at the same time.

Parameters

img_records (list[dict]) – List of img_records (lines in AVA annotations).

Returns

A tuple consisting of lists of bboxes, action labels and entity_ids.

Return type

tuple(list)

prepare_test_frames(idx)[source]

Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]

Prepare the frames for training given the index.
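
A hedged configuration sketch for this dataset; the paths below are placeholders that only mirror the annotation naming pattern described above, and the pipeline is left empty to be filled in:

data_root = 'data/ava/rawframes'
anno_root = 'data/ava/annotations'

train_dataset = dict(
    type='AVADataset',
    ann_file=anno_root + '/ava_train_v2.1.csv',
    exclude_file=anno_root + '/ava_train_excluded_timestamps_v2.1.csv',
    label_file=anno_root + '/ava_action_list_v2.1.pbtxt',
    proposal_file=anno_root + '/ava_dense_proposals_train.FAIR.recall_93.9.pkl',
    data_prefix=data_root,
    person_det_score_thr=0.9,
    pipeline=[])  # fill in the data-loading / augmentation pipeline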

class mmaction.datasets.ActivityNetDataset(ann_file, pipeline, data_prefix=None, test_mode=False)[source]

ActivityNet dataset for temporal action localization.

The dataset loads raw features and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a json file with multiple objects; each object has the name of a video as its key, and its value holds the total frames of the video, the total seconds of the video, the annotations of the video, the feature frames (frames covered by features) of the video, the fps and the rfps. Example of an annotation file:

{
    "v_--1DO2V4K74":  {
        "duration_second": 211.53,
        "duration_frame": 6337,
        "annotations": [
            {
                "segment": [
                    30.025882995319815,
                    205.2318595943838
                ],
                "label": "Rock climbing"
            }
        ],
        "feature_frame": 6336,
        "fps": 30.0,
        "rfps": 29.9579255898
    },
    "v_--6bJUbfpnQ": {
        "duration_second": 26.75,
        "duration_frame": 647,
        "annotations": [
            {
                "segment": [
                    2.578755070202808,
                    24.914101404056165
                ],
                "label": "Drinking beer"
            }
        ],
        "feature_frame": 624,
        "fps": 24.0,
        "rfps": 24.1869158879
    },
    ...
}
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (str | None) – Path to a directory where videos are held. Default: None.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

dump_results(results, out, output_format, version='VERSION 1.3')[source]

Dump data to json/csv files.

evaluate(results, metrics='AR@AN', metric_options={'AR@AN': {'max_avg_proposals': 100, 'temporal_iou_thresholds': array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])}}, logger=None, **deprecated_kwargs)[source]

Evaluation in feature dataset.

Parameters
  • results (list[dict]) – Output results.

  • metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘AR@AN’.

  • metric_options (dict) – Dict for metric options. Options are max_avg_proposals, temporal_iou_thresholds for AR@AN. default: {'AR@AN': dict(max_avg_proposals=100, temporal_iou_thresholds=np.linspace(0.5, 0.95, 10))}.

  • logger (logging.Logger | None) – Training logger. Defaults: None.

  • deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results for evaluation metrics.

Return type

dict

load_annotations()[source]

Load the annotation according to ann_file into video_infos.

prepare_test_frames(idx)[source]

Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]

Prepare the frames for training given the index.

static proposals2json(results, show_progress=False)[source]

Convert all proposals to a final dict(json) format.

Parameters
  • results (list[dict]) – All proposals.

  • show_progress (bool) – Whether to show the progress bar. Defaults: False.

Returns

The final result dict. E.g.

dict(video-1=[dict(segment=[1.1, 2.0], score=0.9),
              dict(segment=[50.1, 129.3], score=0.6)])

Return type

dict

class mmaction.datasets.AudioDataset(ann_file, pipeline, suffix='.wav', **kwargs)[source]

Audio dataset for video recognition. Extracts the audio feature on-the-fly. Annotation file can be that of the rawframe dataset, or:

some/directory-1.wav 163 1
some/directory-2.wav 122 1
some/directory-3.wav 258 2
some/directory-4.wav 234 2
some/directory-5.wav 295 3
some/directory-6.wav 121 3
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • suffix (str) – The suffix of the audio file. Default: ‘.wav’.

  • kwargs (dict) – Other keyword args for BaseDataset.

load_annotations()[source]

Load annotation file to get video information.

class mmaction.datasets.AudioFeatureDataset(ann_file, pipeline, suffix='.npy', **kwargs)[source]

Audio feature dataset for video recognition. Reads the features extracted off-line. Annotation file can be that of the rawframe dataset, or:

some/directory-1.npy 163 1
some/directory-2.npy 122 1
some/directory-3.npy 258 2
some/directory-4.npy 234 2
some/directory-5.npy 295 3
some/directory-6.npy 121 3
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • suffix (str) – The suffix of the audio feature file. Default: ‘.npy’.

  • kwargs (dict) – Other keyword args for BaseDataset.

load_annotations()[source]

Load annotation file to get video information.

class mmaction.datasets.AudioVisualDataset(ann_file, pipeline, audio_prefix, **kwargs)[source]

Dataset that reads both audio and visual data, supporting both rawframes and videos. The annotation file is the same as that of the rawframe dataset, such as:

some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • audio_prefix (str) – Directory of the audio files.

  • kwargs (dict) – Other keyword args for RawframeDataset. video_prefix is also allowed if pipeline is designed for videos.

load_annotations()[source]

Load annotation file to get video information.

class mmaction.datasets.BaseDataset(ann_file, pipeline, data_prefix=None, test_mode=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=0, dynamic_length=False)[source]

Base class for datasets.

All datasets to process video should subclass it. All subclasses should overwrite:

  • Methods: load_annotations, supporting to load information from an annotation file.

  • Methods: prepare_train_frames, providing train data.

  • Methods: prepare_test_frames, providing test data.

Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (str | None) – Path to a directory where videos are held. Default: None.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

  • multi_class (bool) – Determines whether the dataset is a multi-class dataset. Default: False.

  • num_classes (int | None) – Number of classes of the dataset, used in multi-class datasets. Default: None.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 1.

  • modality (str) – Modality of data. Support ‘RGB’, ‘Flow’, ‘Audio’. Default: ‘RGB’.

  • sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.

  • power (float) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: 0.

  • dynamic_length (bool) – If the dataset length is dynamic (used by ClassSpecificDistributedSampler). Default: False.

static dump_results(results, out)[source]

Dump data to json/yaml/pickle strings or files.

evaluate(results, metrics='top_k_accuracy', metric_options={'top_k_accuracy': {'topk': (1, 5)}}, logger=None, **deprecated_kwargs)[source]

Perform evaluation for common datasets.

Parameters
  • results (list) – Output results.

  • metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘top_k_accuracy’.

  • metric_options (dict) – Dict for metric options. Options are topk for top_k_accuracy. Default: dict(top_k_accuracy=dict(topk=(1, 5))).

  • logger (logging.Logger | None) – Logger for recording. Default: None.

  • deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results dict.

Return type

dict

abstract load_annotations()[source]

Load the annotation according to ann_file into video_infos.

load_json_annotations()[source]

Load json annotation file to get video information.

prepare_test_frames(idx)[source]

Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]

Prepare the frames for training given the index.
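
A hypothetical minimal subclass that overrides only load_annotations; the '<filename> <label>' annotation format is an assumption for illustration:

import os.path as osp
from mmaction.datasets import BaseDataset

class TxtVideoDataset(BaseDataset):
    # Toy dataset: each annotation line is '<filename> <label>'.
    def load_annotations(self):
        video_infos = []
        with open(self.ann_file, 'r') as fin:
            for line in fin:
                filename, label = line.strip().split()
                if self.data_prefix is not None:
                    filename = osp.join(self.data_prefix, filename)
                video_infos.append(dict(filename=filename, label=int(label)))
        return video_infos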

class mmaction.datasets.BaseMiniBatchBlending(num_classes)[source]

Base class for image blending in a mini-batch.

class mmaction.datasets.ConcatDataset(datasets, test_mode=False)[source]

A wrapper of concatenated dataset.

The length of concatenated dataset will be the sum of lengths of all datasets. This is useful when you want to train a model with multiple data sources.

Parameters
  • datasets (list[dict]) – The configs of the datasets.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

class mmaction.datasets.CutmixBlending(num_classes, alpha=0.2)[source]

Implementing Cutmix in a mini-batch.

This module is proposed in CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Code Reference https://github.com/clovaai/CutMix-PyTorch

Parameters
  • num_classes (int) – The number of classes.

  • alpha (float) – Parameters for Beta distribution.

do_blending(imgs, label, **kwargs)[source]

Blending images with cutmix.

static rand_bbox(img_size, lam)[source]

Generate a random bounding box.

class mmaction.datasets.HVUDataset(ann_file, pipeline, tag_categories, tag_category_nums, filename_tmpl=None, **kwargs)[source]

HVU dataset, which supports the recognition tags of multiple categories. Accepts both video annotation files and rawframe annotation files.

The dataset loads videos or raw frames and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a json file with multiple dictionaries, and each dictionary indicates a sample video with the filename and tags; the tags are organized as different categories. Example of a video dictionary:

{
    'filename': 'gD_G1b0wV5I_001015_001035.mp4',
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}

Example of a rawframe dictionary:

{
    'frame_dir': 'gD_G1b0wV5I_001015_001035',
    'total_frames': 61,
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}
Parameters
  • ann_file (str) – Path to the annotation file, should be a json file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • tag_categories (list[str]) – List of category names of tags.

  • tag_category_nums (list[int]) – List of number of tags in each category.

  • filename_tmpl (str | None) – Template for each filename. If set to None, video dataset is used. Default: None.

  • **kwargs – Keyword arguments for BaseDataset.

evaluate(results, metrics='mean_average_precision', metric_options=None, logger=None)[source]

Evaluation in HVU Video Dataset. We only support evaluating mAP for each tag category. Since some tag categories are missing for some videos, we cannot evaluate mAP for all tags.

Parameters
  • results (list) – Output results.

  • metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘mean_average_precision’.

  • metric_options (dict | None) – Dict for metric options. Default: None.

  • logger (logging.Logger | None) – Logger for recording. Default: None.

Returns

Evaluation results dict.

Return type

dict

load_annotations()[source]

Load annotation file to get video information.

load_json_annotations()[source]

Load json annotation file to get video information.

class mmaction.datasets.ImageDataset(ann_file, pipeline, **kwargs)[source]

Image dataset for action recognition, used in the Project OmniSource.

The dataset loads an image list and applies specified transforms to return a dict containing the image tensors and other information.

For the ImageDataset, the ann_file is a text file with multiple lines, and each line indicates the image path and the image label, which are split with a whitespace. Example of an annotation file:

path/to/image1.jpg 1
path/to/image2.jpg 1
path/to/image3.jpg 2
path/to/image4.jpg 2
path/to/image5.jpg 3
path/to/image6.jpg 3

Example of a multi-class annotation file:

path/to/image1.jpg 1 3 5
path/to/image2.jpg 1 2
path/to/image3.jpg 2
path/to/image4.jpg 2 4 6 8
path/to/image5.jpg 3
path/to/image6.jpg 3
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • **kwargs – Keyword arguments for BaseDataset.

class mmaction.datasets.MixupBlending(num_classes, alpha=0.2)[source]

Implementing Mixup in a mini-batch.

This module is proposed in mixup: Beyond Empirical Risk Minimization. Code Reference https://github.com/open-mmlab/mmclassification/blob/master/mmcls/models/utils/mixup.py # noqa

Parameters
  • num_classes (int) – The number of classes.

  • alpha (float) – Parameters for Beta distribution.

do_blending(imgs, label, **kwargs)[source]

Blending images with mixup.
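
A minimal usage sketch; mixup blends one-hot (or soft) labels, so hard labels are converted first, and the clip shape is an illustrative assumption:

import torch
import torch.nn.functional as F
from mmaction.datasets import MixupBlending

blending = MixupBlending(num_classes=10, alpha=0.2)

imgs = torch.randn(4, 3, 8, 224, 224)   # a mini-batch of clips
label = F.one_hot(torch.randint(0, 10, (4,)), num_classes=10).float()
mixed_imgs, mixed_label = blending.do_blending(imgs, label)
print(mixed_imgs.shape, mixed_label.shape)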

class mmaction.datasets.PoseDataset(ann_file, pipeline, split=None, valid_ratio=None, box_thr=None, class_prob=None, **kwargs)[source]

Pose dataset for action recognition.

The dataset loads pose annotations and applies the specified transforms to return a dict containing pose information.

The ann_file is a pickle file that contains a list of annotations; the fields of an annotation include frame_dir (video_id), total_frames, label, kp and kpscore.

Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • split (str | None) – The dataset split used. Only applicable to UCF or HMDB. Allowed choices are ‘train1’, ‘test1’, ‘train2’, ‘test2’, ‘train3’, ‘test3’. Default: None.

  • valid_ratio (float | None) – The valid_ratio for videos in KineticsPose. For a video with n frames, it is a valid training sample only if n * valid_ratio frames have human pose. None means not applicable (only applicable to Kinetics Pose). Default: None.

  • box_thr (str | None) – The threshold for human proposals. Only boxes with a confidence score larger than box_thr are kept. None means not applicable (only applicable to Kinetics Pose [ours]). Allowed choices are ‘0.5’, ‘0.6’, ‘0.7’, ‘0.8’, ‘0.9’. Default: None.

  • class_prob (dict | None) – The per class sampling probability. If not None, it will override the class_prob calculated in BaseDataset.__init__(). Default: None.

  • **kwargs – Keyword arguments for BaseDataset.
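
A minimal, hypothetical config sketch; the pickle path, split name and pipeline are placeholders.

train_dataset = dict(
    type='PoseDataset',
    ann_file='data/skeleton/hmdb51.pkl',  # pickle file with the fields listed above
    split='train1',                       # only meaningful for UCF / HMDB style splits
    pipeline=train_pipeline)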

load_annotations()[source]

Load annotation file to get video information.

class mmaction.datasets.RawVideoDataset(ann_file, pipeline, clipname_tmpl='part_{}.mp4', sampling_strategy='positive', **kwargs)[source]

RawVideo dataset for action recognition, used in the Project OmniSource.

The dataset loads clips of raw videos and applies the specified transforms to return a dict containing the frame tensors and other information. Note that for this dataset, multi_class should be False.

The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath (without suffix), label, number of clips and index of positive clips (starting from 0), which are split with a whitespace. Raw videos should be first trimmed into 10 second clips, organized in the following format:

some/path/D32_1gwq35E/part_0.mp4
some/path/D32_1gwq35E/part_1.mp4
......
some/path/D32_1gwq35E/part_n.mp4

Example of an annotation file:

some/path/D32_1gwq35E 66 10 0 1 2
some/path/-G-5CJ0JkKY 254 5 3 4
some/path/T4h1bvOd9DA 33 1 0
some/path/4uZ27ivBl00 341 2 0 1
some/path/0LfESFkfBSw 186 234 7 9 11
some/path/-YIsNpBEx6c 169 100 9 10 11

The first line indicates that the raw video some/path/D32_1gwq35E has action label 66, consists of 10 clips (from part_0.mp4 to part_9.mp4). The 1st, 2nd and 3rd clips are positive clips.

Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • sampling_strategy (str) – The strategy to sample clips from raw videos. Choices are ‘random’ or ‘positive’. Default: ‘positive’.

  • clipname_tmpl (str) – The template of clip name in the raw video. Default: ‘part_{}.mp4’.

  • **kwargs – Keyword arguments for BaseDataset.

load_annotations()[source]

Load annotation file to get video information.

load_json_annotations()[source]

Load json annotation file to get video information.

prepare_test_frames(idx)[source]

Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]

Prepare the frames for training given the index.

sample_clip(results)[source]

Sample a clip from the raw video given the sampling strategy.

class mmaction.datasets.RawframeDataset(ann_file, pipeline, data_prefix=None, test_mode=False, filename_tmpl='img_{:05}.jpg', with_offset=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=0.0, dynamic_length=False, **kwargs)[source]

Rawframe dataset for action recognition.

The dataset loads raw frames and applies the specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines, and each line indicates the directory to frames of a video, total frames of the video and the label of a video, which are split with a whitespace. Example of an annotation file:

some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3

Example of a multi-class annotation file:

some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3

Example of a with_offset annotation file (clips from long videos), where each line indicates the directory to frames of a video, the index of the start frame, total frames of the video clip and the label of a video clip, which are split with a whitespace.

some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • data_prefix (str | None) – Path to a directory where videos are held. Default: None.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

  • filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.

  • with_offset (bool) – Determines whether the offset information is in ann_file. Default: False.

  • multi_class (bool) – Determines whether it is a multi-class recognition dataset. Default: False.

  • num_classes (int | None) – Number of classes in the dataset. Default: None.

  • modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.

  • sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.

  • power (float) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: 0.

  • dynamic_length (bool) – If the dataset length is dynamic (used by ClassSpecificDistributedSampler). Default: False.
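
A minimal, hypothetical config sketch matching the first annotation format above; the paths and pipeline are placeholders.

train_dataset = dict(
    type='RawframeDataset',
    ann_file='data/kinetics400/kinetics400_train_list_rawframes.txt',
    data_prefix='data/kinetics400/rawframes_train',
    filename_tmpl='img_{:05}.jpg',  # frames stored as img_00001.jpg, img_00002.jpg, ...
    modality='RGB',
    pipeline=train_pipeline)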

load_annotations()[source]

Load annotation file to get video information.

prepare_test_frames(idx)[source]

Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]

Prepare the frames for training given the index.

class mmaction.datasets.RepeatDataset(dataset, times, test_mode=False)[source]

A wrapper of repeated dataset.

The length of repeated dataset will be times larger than the original dataset. This is useful when the data loading time is long but the dataset is small. Using RepeatDataset can reduce the data loading time between epochs.

Parameters
  • dataset (dict) – The config of the dataset to be repeated.

  • times (int) – Repeat times.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.
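
A hedged sketch of wrapping a small dataset; the inner dataset config and paths are placeholders.

train_dataset = dict(
    type='RepeatDataset',
    times=10,  # iterate the inner dataset 10 times per epoch
    dataset=dict(
        type='VideoDataset',
        ann_file='data/ucf101/ucf101_train_list.txt',
        pipeline=train_pipeline))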

class mmaction.datasets.SSNDataset(ann_file, pipeline, train_cfg, test_cfg, data_prefix, test_mode=False, filename_tmpl='img_{:05d}.jpg', start_index=1, modality='RGB', video_centric=True, reg_normalize_constants=None, body_segments=5, aug_segments=(2, 2), aug_ratio=(0.5, 0.5), clip_len=1, frame_interval=1, filter_gt=True, use_regression=True, verbose=False)[source]

Proposal frame dataset for Structured Segment Networks.

Based on proposal information, the dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines and each video’s information takes up several lines. This file can be a normalized file with percent or standard file with specific frame indexes. If the file is a normalized file, it will be converted into a standard file first.

Template information of a video in a standard file:

# index
video_id
num_frames
fps
num_gts
label, start_frame, end_frame
label, start_frame, end_frame
…
num_proposals
label, best_iou, overlap_self, start_frame, end_frame
label, best_iou, overlap_self, start_frame, end_frame
…

Example of a standard annotation file:

# 0
video_validation_0000202
5666
1
3
8 130 185
8 832 1136
8 1303 1381
5
8 0.0620 0.0620 790 5671
8 0.1656 0.1656 790 2619
8 0.0833 0.0833 3945 5671
8 0.0960 0.0960 4173 5671
8 0.0614 0.0614 3327 5671

Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • train_cfg (dict) – Config for training.

  • test_cfg (dict) – Config for testing.

  • data_prefix (str) – Path to a directory where videos are held.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

  • filename_tmpl (str) – Template for each filename. Default: ‘img_{:05d}.jpg’.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. Default: 1.

  • modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.

  • video_centric (bool) – Whether to sample proposals just from this video or sample proposals randomly from the entire dataset. Default: True.

  • reg_normalize_constants (list) – Regression target normalized constants, including mean and standard deviation of location and duration.

  • body_segments (int) – Number of segments in course period. Default: 5.

  • aug_segments (list[int]) – Number of segments in starting and ending period. Default: (2, 2).

  • aug_ratio (int | float | tuple[int | float]) – The ratio of the length of augmentation to that of the proposal. Default: (0.5, 0.5).

  • clip_len (int) – Frames of each sampled output clip. Default: 1.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.

  • filter_gt (bool) – Whether to filter videos with no annotation during training. Default: True.

  • use_regression (bool) – Whether to perform regression. Default: True.

  • verbose (bool) – Whether to print full information or not. Default: False.

construct_proposal_pools()[source]

Construct positive proposal pool, incomplete proposal pool and background proposal pool of the entire dataset.

evaluate(results, metrics='mAP', metric_options={'mAP': {'eval_dataset': 'thumos14'}}, logger=None, **deprecated_kwargs)[source]

Evaluation in SSN proposal dataset.

Parameters
  • results (list[dict]) – Output results.

  • metrics (str | sequence[str]) – Metrics to be performed. Default: ‘mAP’.

  • metric_options (dict) – Dict for metric options. Options are eval_dataset for mAP. Default: dict(mAP=dict(eval_dataset='thumos14')).

  • logger (logging.Logger | None) – Logger for recording. Default: None.

  • deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results for evaluation metrics.

Return type

dict

get_all_gts()[source]

Fetch groundtruth instances of the entire dataset.

static get_negatives(proposals, incomplete_iou_threshold, background_iou_threshold, background_coverage_threshold=0.01, incomplete_overlap_threshold=0.7)[source]

Get negative proposals, including incomplete proposals and background proposals.

Parameters
  • proposals (list) – List of proposal instances(SSNInstance).

  • incomplete_iou_threshold (float) – Maximum threshold of overlap of incomplete proposals and groundtruths.

  • background_iou_threshold (float) – Maximum threshold of overlap of background proposals and groundtruths.

  • background_coverage_threshold (float) – Minimum coverage of background proposals in video duration. Default: 0.01.

  • incomplete_overlap_threshold (float) – Minimum percent of incomplete proposals’ own span contained in a groundtruth instance. Default: 0.7.

Returns

(incompletes, backgrounds), where incompletes and backgrounds are lists comprised of incomplete proposal instances and background proposal instances, respectively.

Return type

list[SSNInstance]

static get_positives(gts, proposals, positive_threshold, with_gt=True)[source]

Get positive/foreground proposals.

Parameters
  • gts (list) – List of groundtruth instances(SSNInstance).

  • proposals (list) – List of proposal instances(SSNInstance).

  • positive_threshold (float) – Minimum threshold of overlap of positive/foreground proposals and groundtruths.

  • with_gt (bool) – Whether to include groundtruth instances in positive proposals. Default: True.

Returns

(positives), where positives is a list comprised of positive proposal instances.

Return type

list[SSNInstance]

load_annotations()[source]

Load annotation file to get video information.

prepare_test_frames(idx)[source]

Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]

Prepare the frames for training given the index.

results_to_detections(results, top_k=2000, **kwargs)[source]

Convert prediction results into detections.

Parameters
  • results (list) – Prediction results.

  • top_k (int) – Number of top results. Default: 2000.

Returns

Detection results.

Return type

list

class mmaction.datasets.VideoDataset(ann_file, pipeline, start_index=0, **kwargs)[source]

Video dataset for action recognition.

The dataset loads raw videos and applies the specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath and label, which are split with a whitespace. Example of an annotation file:

some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
Parameters
  • ann_file (str) – Path to the annotation file.

  • pipeline (list[dict | callable]) – A sequence of data transforms.

  • start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 0.

  • **kwargs – Keyword arguments for BaseDataset.
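
A minimal, hypothetical config sketch; the annotation path, data prefix and pipeline are placeholders.

train_dataset = dict(
    type='VideoDataset',
    ann_file='data/kinetics400/kinetics400_train_list_videos.txt',
    data_prefix='data/kinetics400/videos_train',  # passed to BaseDataset via **kwargs
    start_index=0,                                # frames decoded from videos count from 0
    pipeline=train_pipeline)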

load_annotations()[source]

Load annotation file to get video information.

mmaction.datasets.build_dataloader(dataset, videos_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, drop_last=False, pin_memory=True, persistent_workers=False, **kwargs)[source]

Build PyTorch DataLoader.

In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.

Parameters
  • dataset (Dataset) – A PyTorch dataset.

  • videos_per_gpu (int) – Number of videos on each GPU, i.e., batch size of each GPU.

  • workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.

  • num_gpus (int) – Number of GPUs. Only used in non-distributed training. Default: 1.

  • dist (bool) – Distributed training/test or not. Default: True.

  • shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.

  • seed (int | None) – Seed to be used. Default: None.

  • drop_last (bool) – Whether to drop the last incomplete batch in epoch. Default: False

  • pin_memory (bool) – Whether to use pin_memory in DataLoader. Default: True

  • persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the worker Dataset instances alive. The argument takes effect only when PyTorch >= 1.8.0 is used. Default: False

  • kwargs (dict, optional) – Any keyword argument to be used to initialize DataLoader.

Returns

A PyTorch dataloader.

Return type

DataLoader
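
A hedged, single-GPU usage sketch, assuming dataset has already been built (for example with build_dataset below):

from mmaction.datasets import build_dataloader

data_loader = build_dataloader(
    dataset,
    videos_per_gpu=8,   # batch size per GPU
    workers_per_gpu=2,  # dataloader worker processes per GPU
    num_gpus=1,
    dist=False,         # non-distributed: a single dataloader serves all GPUs
    shuffle=True)
for data_batch in data_loader:
    ...  # feed data_batch to the model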

mmaction.datasets.build_dataset(cfg, default_args=None)[source]

Build a dataset from config dict.

Parameters
  • cfg (dict) – Config dict. It should at least contain the key “type”.

  • default_args (dict | None, optional) – Default initialization arguments. Default: None.

Returns

The constructed dataset.

Return type

Dataset
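
A hedged sketch of building a dataset from a config dict (the annotation path and pipeline are placeholders); the result can then be passed to build_dataloader above.

from mmaction.datasets import build_dataset

dataset_cfg = dict(
    type='VideoDataset',                        # the registered dataset class to construct
    ann_file='data/kinetics400/train_list.txt',
    pipeline=train_pipeline)                    # transform configs defined elsewhere
dataset = build_dataset(dataset_cfg)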

pipelines

class mmaction.datasets.pipelines.ArrayDecode[source]

Load and decode frames with given indices from a 4D array.

Required keys are “array” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

class mmaction.datasets.pipelines.AudioAmplify(ratio)[source]

Amplify the waveform.

Required keys are “audios”, added or modified keys are “audios”, “amplify_ratio”.

Parameters

ratio (float) – The ratio used to amplify the audio waveform.

class mmaction.datasets.pipelines.AudioDecode(fixed_length=32000)[source]

Sample the audio w.r.t. the frames selected.

Parameters

fixed_length (int) – As the audio clip selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 32000.

Required keys are “frame_inds”, “num_clips”, “total_frames”, “length”, added or modified keys are “audios”, “audios_shape”.

class mmaction.datasets.pipelines.AudioDecodeInit(io_backend='disk', sample_rate=16000, pad_method='zero', **kwargs)[source]

Using librosa to initialize the audio reader.

Required keys are “audio_path”, added or modified keys are “length”, “sample_rate”, “audios”.

Parameters
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • sample_rate (int) – Audio sampling times per second. Default: 16000.

class mmaction.datasets.pipelines.AudioFeatureSelector(fixed_length=128)[source]

Sample the audio feature w.r.t. the frames selected.

Required keys are “audios”, “frame_inds”, “num_clips”, “length”, “total_frames”, added or modified keys are “audios”, “audios_shape”.

Parameters

fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 128.

class mmaction.datasets.pipelines.BuildPseudoClip(clip_len)[source]

Build pseudo clips with one single image by repeating it n times.

Required key is “imgs”, added or modified keys are “imgs”, “num_clips” and “clip_len”.

Parameters

clip_len (int) – Frames of the generated pseudo clips.

class mmaction.datasets.pipelines.CenterCrop(crop_size, lazy=False)[source]

Crop the center area from images.

Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox”, “lazy” and “img_shape”. Required keys in “lazy” is “crop_bbox”, added or modified key is “crop_bbox”.

Parameters
  • crop_size (int | tuple[int]) – (w, h) of crop size.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

class mmaction.datasets.pipelines.Collect(keys, meta_keys=('filename', 'label', 'original_shape', 'img_shape', 'pad_shape', 'flip_direction', 'img_norm_cfg'), meta_name='img_metas', nested=False)[source]

Collect data from the loader relevant to the specific task.

This keeps the items in keys as they are, and collects the items in meta_keys into a meta item called meta_name. This is usually the last stage of the data loading pipeline. For example, when keys=’imgs’, meta_keys=(‘filename’, ‘label’, ‘original_shape’) and meta_name=’img_metas’, the results will be a dict with keys ‘imgs’ and ‘img_metas’, where ‘img_metas’ is a DataContainer of another dict with keys ‘filename’, ‘label’ and ‘original_shape’.

Parameters
  • keys (Sequence[str]) – Required keys to be collected.

  • meta_name (str) – The name of the key that contains meta information. This key is always populated. Default: “img_metas”.

  • meta_keys (Sequence[str]) –

    Keys that are collected under meta_name. The contents of the meta_name dictionary depend on meta_keys. By default this includes:

    • ”filename”: path to the image file

    • ”label”: label of the image file

    • ”original_shape”: original shape of the image as a tuple

      (h, w, c)

    • ”img_shape”: shape of the image input to the network as a tuple

      (h, w, c). Note that images may be zero padded on the bottom/right, if the batch tensor is larger than this shape.

    • ”pad_shape”: image shape after padding

    • ”flip_direction”: a str in (“horizontal”, “vertical”) to

      indicate if the image is flipped horizontally or vertically.

    • ”img_norm_cfg”: a dict of normalization information:
      • mean - per channel mean subtraction

      • std - per channel std divisor

      • to_rgb - bool indicating if bgr was converted to rgb

  • nested (bool) – If set as True, will apply data[x] = [data[x]] to all items in data. The arg is added for compatibility. Default: False.
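
A hedged example of a typical tail of a training pipeline, where Collect keeps the tensors needed for the loss and gathers metadata under img_metas; the key choices are illustrative.

pipeline_tail = [
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),  # keep only what training needs
    dict(type='ToTensor', keys=['imgs', 'label'])
]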

class mmaction.datasets.pipelines.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1)[source]

Perform ColorJitter to each img.

Required keys are “imgs”, added or modified keys are “imgs”.

Parameters
  • brightness (float | tuple[float]) – The jitter range for brightness, if set as a float, the range will be (1 - brightness, 1 + brightness). Default: 0.5.

  • contrast (float | tuple[float]) – The jitter range for contrast, if set as a float, the range will be (1 - contrast, 1 + contrast). Default: 0.5.

  • saturation (float | tuple[float]) – The jitter range for saturation, if set as a float, the range will be (1 - saturation, 1 + saturation). Default: 0.5.

  • hue (float | tuple[float]) – The jitter range for hue, if set as a float, the range will be (-hue, hue). Default: 0.1.

class mmaction.datasets.pipelines.Compose(transforms)[source]

Compose a data pipeline with a sequence of transforms.

Parameters

transforms (list[dict | callable]) – Either config dicts of transforms or transform objects.
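
A hedged sketch of composing transforms directly from config dicts; the chosen transforms are illustrative, and the composed object is then called on the results dict that flows through the pipeline.

from mmaction.datasets.pipelines import Compose

pipeline = Compose([
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5)
])
# results = pipeline(results)  # apply to a results dict produced by the loading steps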

class mmaction.datasets.pipelines.DecordDecode(mode='accurate')[source]

Using decord to decode the video.

Decord: https://github.com/dmlc/decord

Required keys are “video_reader”, “filename” and “frame_inds”, added or modified keys are “imgs” and “original_shape”.

Parameters

mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Default: ‘accurate’.

class mmaction.datasets.pipelines.DecordInit(io_backend='disk', num_threads=1, **kwargs)[source]

Using decord to initialize the video_reader.

Decord: https://github.com/dmlc/decord

Required keys are “filename”, added or modified keys are “video_reader” and “total_frames”.

Parameters
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • num_threads (int) – Number of thread to decode the video. Default: 1.

  • kwargs (dict) – Args for file client.
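
A hedged sketch of the usual decord-based loading steps in a video pipeline; the frame sampling parameters are placeholders.

load_steps = [
    dict(type='DecordInit', io_backend='disk', num_threads=1),              # open the video
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),  # pick frame indices
    dict(type='DecordDecode', mode='accurate')                              # decode the chosen frames
]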

class mmaction.datasets.pipelines.DenseSampleFrames(*args, sample_range=64, num_sample_positions=10, **kwargs)[source]

Select frames from the video by dense sample strategy.

Required keys are “filename”, added or modified keys are “total_frames”, “frame_inds”, “frame_interval” and “num_clips”.

Parameters
  • clip_len (int) – Frames of each sampled output clip.

  • frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.

  • num_clips (int) – Number of clips to be sampled. Default: 1.

  • sample_range (int) – Total sample range for dense sample. Default: 64.

  • num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Default: 10. That is to say, by default, there are at least 10 clips for one input sample in test mode.

  • temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.

  • test_mode (bool) – Store True when building test or validation dataset. Default: False.

class mmaction.datasets.pipelines.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, left_kp=None, right_kp=None, lazy=False)[source]

Flip the input images with a probability.

Reverse the order of elements in the given imgs with a specific direction. The shape of the imgs is preserved, but the elements are reordered.

Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “lazy” and “flip_direction”. Required keys in “lazy” is None, added or modified key are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations, to make sure crop_quadruple is calculated properly.

Parameters
  • flip_ratio (float) – Probability of implementing flip. Default: 0.5.

  • direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.

  • flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.

  • left_kp (list[int]) – Indexes of left keypoints, used to flip keypoints. Default: None.

  • right_kp (list[int]) – Indexes of right keypoints, used to flip keypoints. Default: None.

  • lazy (bool) – Determine whether to apply lazy operation. Default: False.

class mmaction.datasets.pipelines.FormatAudioShape(input_format)[source]

Format final audio shape to the given input_format.

Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.

Parameters

input_format (str) – Define the final imgs format.

class mmaction.datasets.pipelines.FormatGCNInput(input_format, num_person=2)[source]

Format final skeleton shape to the given input_format.

Required keys are “keypoint” and “keypoint_score”(optional), added or modified keys are “keypoint” and “input_shape”.

Parameters

input_format (str) – Define the final skeleton format.

class mmaction.datasets.pipelines.FormatShape(input_format, collapse=False)[source]

Format final imgs shape to the given input_format.

Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.

Parameters
  • input_format (str) – Define the final imgs format.

  • collapse (bool) – To collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set as True when training and testing detectors. Default: False.

class mmaction.datasets.pipelines.Fuse[source]

Fuse lazy operations.

Fusion order:

crop -> resize -> flip

Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.

class mmaction.datasets.pipelines.GenerateLocalizationLabels[source]

Load video label for localizer with given video_name list.

Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.

class mmaction.datasets.pipelines.GeneratePoseTarget(sigma=0.6, use_score=True, with_kp=True, with_limb=False, skeletons=((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), double=False, left_kp=(1, 3, 5, 7, 9, 11, 13, 15), right_kp=(2, 4, 6, 8, 10, 12, 14, 16))[source]

Generate pseudo heatmaps based on joint coordinates and confidence.

Required keys are “keypoint”, “img_shape”, “keypoint_score” (optional), added or modified keys are “imgs”.

Parameters
  • sigma (float) – The sigma of the generated gaussian map. Default: 0.6.

  • use_score (bool) – Use the confidence score of keypoints as the maximum of the gaussian maps. Default: True.

  • with_kp (bool) – Generate pseudo heatmaps for keypoints. Default: True.

  • with_limb (bool) – Generate pseudo heatmaps for limbs. At least one of ‘with_kp’ and ‘with_limb’ should be True. Default: False.

  • skeletons (tuple[tuple]) –

    The definition of human skeletons. Default: ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9),

    (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)),

    which is the definition of COCO-17p skeletons.

  • double (bool) – Output both original heatmaps and flipped heatmaps. Default: False.

  • left_kp (tuple[int]) – Indexes of left keypoints, which is used when flipping heatmaps. Default: (1, 3, 5, 7, 9, 11, 13, 15), which is left keypoints in COCO-17p.

  • right_kp (tuple[int]) – Indexes of right keypoints, which is used when flipping heatmaps. Default: (2, 4, 6, 8, 10, 12, 14, 16), which is right keypoints in COCO-17p.

gen_an_aug(results)[source]

Generate pseudo heatmaps for all frames.

Parameters

results (dict) – The dictionary that contains all info of a sample.

Returns

The generated pseudo heatmaps.

Return type

list[np.ndarray]

generate_a_heatmap(img_h, img_w, centers, sigma, max_values)[source]

Generate pseudo heatmap for one keypoint in one frame.

Parameters
  • img_h (int) – The height of the heatmap.

  • img_w (int) – The width of the heatmap.

  • centers (np.ndarray) – The coordinates of corresponding keypoints (of multiple persons).

  • sigma (float) – The sigma of generated gaussian.

  • max_values (np.ndarray) – The max values of each keypoint.

Returns

The generated pseudo heatmap.

Return type

np.ndarray

generate_a_limb_heatmap(img_h, img_w, starts, ends, sigma, start_values, end_values)[source]

Generate pseudo heatmap for one limb in one frame.

Parameters
  • img_h (int) – The height of the heatmap.

  • img_w (int) – The width of the heatmap.

  • starts (np.ndarray) – The coordinates of one keypoint in the corresponding limbs (of multiple persons).

  • ends (np.ndarray) – The coordinates of the other keypoint in the corresponding limbs (of multiple persons).

  • sigma (float) – The sigma of generated gaussian.

  • start_values (np.ndarray) – The max values of one keypoint in the corresponding limbs.

  • end_values (np.ndarray) – The max values of the other keypoint in the corresponding limbs.

Returns

The generated pseudo heatmap.

Return type

np.ndarray

generate_heatmap(img_h, img_w, kps, sigma, max_values)[source]

Generate pseudo heatmap for all keypoints and limbs in one frame (if needed).

Parameters
  • img_h (int) – The height of the heatmap.

  • img_w (int) – The width of the heatmap.

  • kps (np.ndarray) – The coordinates of keypoints in this frame.

  • sigma (float) – The sigma of generated gaussian.

  • max_values (np.ndarray) – The confidence score of each keypoint.

Returns

The generated pseudo heatmap.

Return type

np.ndarray

class mmaction.datasets.pipelines.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]

Load and decode images.

Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.

  • kwargs (dict, optional) – Arguments for FileClient.

class mmaction.datasets.pipelines.ImageToTensor(keys)[source]

Convert image type to torch.Tensor type.

Parameters

keys (Sequence[str]) – Required keys to be converted.

class mmaction.datasets.pipelines.Imgaug(transforms)[source]

Imgaug augmentation.

Adds custom transformations from the imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html for more information. Two demo configs can be found in the tsn and i3d config folders.

It’s better to use uint8 images as inputs since imgaug works best with numpy dtype uint8 and isn’t well tested with other dtypes. It should be noted that not all of the augmenters have the same input and output dtype, which may cause unexpected results.

Required keys are “imgs”, “img_shape”(if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.

It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc. So when using Imgaug along with other mmaction2 pipelines, we should pay more attention to required keys.

Two steps to use the Imgaug pipeline:

1. Create the initialization parameter transforms. There are three ways to create transforms.

   1) string: only support 'default' for now, e.g. transforms='default'.

   2) list[dict]: create a list of augmenters from a list of dicts, where each dict corresponds to one augmenter. Every dict MUST contain a key named type, which should be a string (an iaa.Augmenter's name) or an iaa.Augmenter subclass, e.g. transforms=[dict(type='Rotate', rotate=(-20, 20))] or transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))].

   3) iaa.Augmenter: create an imgaug.Augmenter object, e.g. transforms=iaa.Rotate(rotate=(-20, 20)).

2. Add Imgaug to the dataset pipeline. It is recommended to insert the Imgaug step before Normalize. A demo pipeline is listed as follows.

pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]

Parameters

transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create imgaug augmenter.

static default_transforms()[source]

Default transforms for imgaug.

Implement RandAugment by imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.

Augmenters and hyper parameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py # noqa

The augmenter SolarizeAdd is missing since imgaug does not support it.

Returns

The constructed RandAugment transforms.

Return type

dict

imgaug_builder(cfg)[source]

Import a module from imgaug.

It follows the logic of build_from_cfg(). Use a dict object to create an iaa.Augmenter object.

Parameters

cfg (dict) – Config dict. It should at least contain the key “type”.

Returns

The constructed imgaug augmenter.

Return type

iaa.Augmenter

class mmaction.datasets.pipelines.JointToBone(dataset='nturgb+d')[source]

Convert the joint information to bone information.

Required keys are “keypoint” , added or modified keys are “keypoint”.

Parameters

dataset (str) – Define the type of dataset: ‘nturgb+d’, ‘openpose-18’, ‘coco’. Default: ‘nturgb+d’.

class mmaction.datasets.pipelines.LoadAudioFeature(pad_method='zero')[source]

Load offline extracted audio features.

Required keys are “audio_path”, added or modified keys are “length” and “audios”.

class mmaction.datasets.pipelines.LoadHVULabel(**kwargs)[source]

Convert the HVU label from dictionaries to torch tensors.

Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.

class mmaction.datasets.pipelines.LoadKineticsPose(io_backend='disk', squeeze=True, max_person=100, keypoint_weight={'face': 1, 'limb': 3, 'torso': 2}, source='mmpose', **kwargs)[source]

Load Kinetics Pose given filename (The format should be pickle)

Required keys are “filename”, “total_frames”, “img_shape”, “frame_inds”, “anno_inds” (for mmpose source, optional), added or modified keys are “keypoint”, “keypoint_score”.

Parameters
  • io_backend (str) – IO backend where frames are stored. Default: ‘disk’.

  • squeeze (bool) – Whether to remove frames with no human pose. Default: True.

  • max_person (int) – The max number of persons in a frame. Default: 100.

  • keypoint_weight (dict) – The weight of keypoints. We set the confidence score of a person as the weighted sum of confidence scores of each joint. Persons with low confidence scores are dropped (if exceed max_person). Default: dict(face=1, torso=2, limb=3).

  • source (str) – The sources of the keypoints used. Choices are ‘mmpose’ and ‘openpose-18’. Default: ‘mmpose’.

  • kwargs (dict, optional) – Arguments for FileClient.

class mmaction.datasets.pipelines.LoadLocalizationFeature(raw_feature_ext='.csv')[source]

Load Video features for localizer with given video_name list.

Required keys are “video_name” and “data_prefix”, added or modified keys are “raw_feature”.

Parameters

raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.

class mmaction.datasets.pipelines.Loa