mmaction.apis¶
- mmaction.apis.inference_recognizer(model, video, outputs=None, as_tensor=True, **kwargs)[source]¶
Run inference on a video with the recognizer.
- Parameters
model (nn.Module) – The loaded recognizer.
video (str | dict | ndarray) – The video file path / url or the rawframes directory path / results dictionary (the input of pipeline) / a 4D array T x H x W x 3 (The input video).
outputs (list(str) | tuple(str) | str | None) – Names of layers whose outputs need to be returned. Default: None.
as_tensor (bool) – Same as that in OutputHook. Default: True.
- Returns
Top-5 recognition result dict. If outputs is given, output feature maps (dict[torch.tensor | np.ndarray]) from the layers specified in outputs are also returned.
- Return type
dict[tuple(str, float)]
- mmaction.apis.init_random_seed(seed=None, device='cuda', distributed=True)[source]¶
Initialize random seed.
If the seed is not set, the seed will be automatically randomized and then broadcast to all processes to prevent some potential bugs.
- Parameters
seed (int, optional) – The seed. Default: None.
device (str) – The device where the seed will be put on. Default: ‘cuda’.
distributed (bool) – Whether to use distributed training. Default: True.
- Returns
Seed to be used.
- Return type
int
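A minimal usage sketch (single-process case; in a distributed run the returned seed would then be applied on every rank, e.g. via mmaction.apis.set_random_seed):

```python
# Minimal sketch: draw a seed once; with distributed=True the rank-0 seed
# is broadcast to all processes before being returned.
from mmaction.apis import init_random_seed

seed = init_random_seed(None, device='cuda', distributed=False)
```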
- mmaction.apis.init_recognizer(config, checkpoint=None, device='cuda:0', **kwargs)[source]¶
Initialize a recognizer from a config file.
- Parameters
config (str | mmcv.Config) – Config file path or the config object.
checkpoint (str | None, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Default: None.
device (str | torch.device) – The desired device of the returned recognizer. Default: ‘cuda:0’.
- Returns
The constructed recognizer.
- Return type
nn.Module
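A minimal end-to-end sketch combining init_recognizer and inference_recognizer; the config and checkpoint paths below are placeholders:

```python
from mmaction.apis import inference_recognizer, init_recognizer

# Placeholder paths; substitute a real config file and checkpoint.
config = 'configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py'
checkpoint = 'checkpoints/tsn_r50.pth'

model = init_recognizer(config, checkpoint, device='cuda:0')
results = inference_recognizer(model, 'demo/demo.mp4')  # top-5 results
```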
- mmaction.apis.multi_gpu_test(model: torch.nn.modules.module.Module, data_loader: torch.utils.data.dataloader.DataLoader, tmpdir: Optional[str] = None, gpu_collect: bool = False) → Optional[list][source]¶
Test model with multiple gpus.
This method tests the model with multiple gpus and collects the results under two different modes: gpu and cpu. By setting gpu_collect=True, it encodes results to gpu tensors and uses gpu communication for result collection. In cpu mode, it saves the results from different gpus to tmpdir and collects them by the rank 0 worker.
- Parameters
model (nn.Module) – Model to be tested.
data_loader (nn.Dataloader) – Pytorch data loader.
tmpdir (str) – Path of directory to save the temporary results from different gpus under cpu mode.
gpu_collect (bool) – Option to use either gpu or cpu to collect results.
- Returns
The prediction results.
- Return type
list
- mmaction.apis.single_gpu_test(model: torch.nn.modules.module.Module, data_loader: torch.utils.data.dataloader.DataLoader) → list[source]¶
Test model with a single gpu.
This method tests the model with a single gpu and displays a test progress bar.
- Parameters
model (nn.Module) – Model to be tested.
data_loader (nn.Dataloader) – Pytorch data loader.
- Returns
The prediction results.
- Return type
list
- mmaction.apis.train_model(model, dataset, cfg, distributed=False, validate=False, test={'test_best': False, 'test_last': False}, timestamp=None, meta=None)[source]¶
Train model entry function.
- Parameters
model (nn.Module) – The model to be trained.
dataset (Dataset) – Train dataset.
cfg (dict) – The config dict for training.
distributed (bool) – Whether to use distributed training. Default: False.
validate (bool) – Whether to do evaluation. Default: False.
test (dict) – The testing option, with two keys: test_last & test_best. The value is True or False, indicating whether to test the corresponding checkpoint. Default: dict(test_best=False, test_last=False).
timestamp (str | None) – Local time for runner. Default: None.
meta (dict | None) – Meta dict to record some important information. Default: None.
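A hedged sketch of how train_model is typically wired up in a training script; the config path is a placeholder, and build_model/build_dataset come from mmaction.models and mmaction.datasets:

```python
import mmcv
from mmaction.apis import train_model
from mmaction.datasets import build_dataset
from mmaction.models import build_model

cfg = mmcv.Config.fromfile('configs/recognition/tsn/my_config.py')  # placeholder
model = build_model(cfg.model, train_cfg=cfg.get('train_cfg'),
                    test_cfg=cfg.get('test_cfg'))
dataset = build_dataset(cfg.data.train)
train_model(model, dataset, cfg, distributed=False, validate=True)
```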
mmaction.core¶
optimizer¶
- class mmaction.core.optimizer.CopyOfSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, *, maximize=False, foreach: Optional[bool] = None, differentiable=False)[source]¶
A clone of torch.optim.SGD.
A customized optimizer could be defined like CopyOfSGD. You may derive from built-in optimizers in torch.optim, or directly implement a new optimizer.
- class mmaction.core.optimizer.TSMOptimizerConstructor(optimizer_cfg: Dict, paramwise_cfg: Optional[Dict] = None)[source]¶
Optimizer constructor in TSM model.
This constructor builds optimizer in different ways from the default one.
Parameters of the first conv layer have default lr and weight decay.
Parameters of BN layers have default lr and zero weight decay.
If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.
Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.
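For reference, a config snippet in the pattern used by TSM configs (hyper-parameter values are illustrative):

```python
# fc_lr5=True enables the 5x lr / 10x weight-decay multipliers on the
# last fc layer of cls_head, as described above.
optimizer = dict(
    type='SGD',
    constructor='TSMOptimizerConstructor',
    paramwise_cfg=dict(fc_lr5=True),
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0001)
```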
evaluation¶
- class mmaction.core.evaluation.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[source]¶
Class to evaluate detection results on ActivityNet.
- Parameters
ground_truth_filename (str | None) – The filename of groundtruth. Default: None.
prediction_filename (str | None) – The filename of action detection results. Default: None.
tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).
verbose (bool) – Whether to print verbose logs. Default: False.
- mmaction.core.evaluation.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]¶
Compute average precision (in the detection task) between ground truth and predicted data frames. If multiple predictions match the same ground truth segment, only the one with the highest score is matched as a true positive. This code is greatly inspired by the Pascal VOC devkit.
- Parameters
ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.
prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).
- Returns
1D array of average precision score.
- Return type
np.ndarray
- mmaction.core.evaluation.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]¶
Computes the average recall given an average number (percentile) of proposals per video.
- Parameters
ground_truth (dict) – Dict containing the ground truth instances.
proposals (dict) – Dict containing the proposal instances.
total_num_proposals (int) – Total number of proposals in the proposal dict.
max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).
- Returns
(recall, average_recall, proposals_per_video, auc). In recall, recall[i, j] is the recall at the i-th temporal_iou threshold and the j-th average number (percentile) of proposals per video. average_recall is the recall averaged over the list of temporal_iou thresholds (1D array), which is equivalent to recall.mean(axis=0). proposals_per_video is the average number of proposals per video. auc is the area under the AR@AN curve.
- Return type
tuple([np.ndarray, np.ndarray, np.ndarray, float])
- mmaction.core.evaluation.confusion_matrix(y_pred, y_real, normalize=None)[source]¶
Compute confusion matrix.
- Parameters
y_pred (list[int] | np.ndarray[int]) – Prediction labels.
y_real (list[int] | np.ndarray[int]) – Ground truth labels.
normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.
- Returns
Confusion matrix.
- Return type
np.ndarray
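A toy example; rows of the returned matrix correspond to ground truth classes and columns to predicted classes:

```python
import numpy as np
from mmaction.core.evaluation import confusion_matrix

y_pred = [0, 1, 1, 2, 2]
y_real = [0, 1, 2, 2, 0]
cm = confusion_matrix(y_pred, y_real)                         # raw counts
cm_norm = confusion_matrix(y_pred, y_real, normalize='true')  # row-normalized
```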
- mmaction.core.evaluation.get_weighted_score(score_list, coeff_list)[source]¶
Get weighted score with given scores and coefficients.
Given n predictions by different classifier: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n
- Parameters
score_list (list[list[np.ndarray]]) – List of lists of scores, with shape n (number of predictions) x num_samples x num_classes.
coeff_list (list[float]) – List of coefficients, with shape n.
- Returns
List of weighted scores.
- Return type
list[np.ndarray]
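A toy fusion example with two classifiers, two samples and two classes:

```python
import numpy as np
from mmaction.core.evaluation import get_weighted_score

score_a = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]  # classifier 1
score_b = [np.array([0.6, 0.4]), np.array([0.4, 0.6])]  # classifier 2
fused = get_weighted_score([score_a, score_b], [1.0, 0.5])
# fused[i] == 1.0 * score_a[i] + 0.5 * score_b[i]
```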
- mmaction.core.evaluation.interpolated_precision_recall(precision, recall)[source]¶
Interpolated AP - VOCdevkit from VOC 2011.
- Parameters
precision (np.ndarray) – The precision of different thresholds.
recall (np.ndarray) – The recall of different thresholds.
- Returns
Average precision score.
- Return type
float
- mmaction.core.evaluation.mean_average_precision(scores, labels)[source]¶
Mean average precision for multi-label recognition.
- Parameters
scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.
- Returns
The mean average precision.
- Return type
np.float64
- mmaction.core.evaluation.mean_class_accuracy(scores, labels)[source]¶
Calculate mean class accuracy.
- Parameters
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
- Returns
Mean class accuracy.
- Return type
np.ndarray
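A toy example with two classes:

```python
import numpy as np
from mmaction.core.evaluation import mean_class_accuracy

scores = [np.array([0.9, 0.1]), np.array([0.3, 0.7]), np.array([0.6, 0.4])]
labels = [0, 1, 1]
# Class 0: 1/1 correct; class 1: 1/2 correct -> mean class accuracy 0.75.
acc = mean_class_accuracy(scores, labels)
```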
- mmaction.core.evaluation.mmit_mean_average_precision(scores, labels)[source]¶
Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Times. The difference is that this method calculates average-precision for each sample and averages them among samples.
- Parameters
scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.
- Returns
The MMIT style mean average precision.
- Return type
np.float64
- mmaction.core.evaluation.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[source]¶
Compute intersection over union between segments.
- Parameters
candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end] / [m x 2:=[init, end]].
target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].
calculate_overlap_self (bool) – Whether to calculate overlap_self (union / candidate_length) or not. Default: False.
- Returns
t_iou (np.ndarray): 1-dim array [n] / 2-dim array [n x m] with IoU ratios.
t_overlap_self (np.ndarray, optional): 1-dim array [n] / 2-dim array [n x m] with overlap_self, returned when calculate_overlap_self is True.
- Return type
np.ndarray, or tuple when calculate_overlap_self is True
- mmaction.core.evaluation.softmax(x, dim=1)[source]¶
Compute softmax values for each set of scores in x.
- mmaction.core.evaluation.top_k_accuracy(scores, labels, topk=(1, ))[source]¶
Calculate top k accuracy score.
- Parameters
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).
- Returns
Top k accuracy score for each k.
- Return type
list[float]
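A toy example with three classes:

```python
import numpy as np
from mmaction.core.evaluation import top_k_accuracy

scores = [np.array([0.2, 0.5, 0.3]), np.array([0.8, 0.1, 0.1])]
labels = [2, 0]
top1, top2 = top_k_accuracy(scores, labels, topk=(1, 2))  # 0.5, 1.0
```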
- mmaction.core.evaluation.top_k_classes(scores, labels, k=10, mode='accurate')[source]¶
Calculate the most K accurate (inaccurate) classes.
Given the prediction scores, ground truth label and top-k value, compute the top K accurate (inaccurate) classes.
- Parameters
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int] | np.ndarray) – Ground truth labels.
k (int) – Top-k values. Default: 10.
mode (str) – Comparison mode for Top-k. Options are ‘accurate’ and ‘inaccurate’. Default: ‘accurate’.
- Returns
List of top-K classes in format of (label_id, acc_ratio), sorted from high accuracy to low accuracy for ‘accurate’ mode, and from low accuracy to high accuracy for ‘inaccurate’ mode.
- Return type
list
scheduler¶
mmaction.localization¶
localization¶
- mmaction.localization.eval_ap(detections, gt_by_cls, iou_range)[source]¶
Evaluate average precisions.
- Parameters
detections (dict) – Results of detections.
gt_by_cls (dict) – Information of groundtruth.
iou_range (list) – Ranges of iou.
- Returns
Average precision values of classes at ious.
- Return type
list
- mmaction.localization.generate_bsp_feature(video_list, video_infos, tem_results_dir, pgm_proposals_dir, top_k=1000, bsp_boundary_ratio=0.2, num_sample_start=8, num_sample_end=8, num_sample_action=16, num_sample_interp=3, tem_results_ext='.csv', pgm_proposal_ext='.csv', result_dict=None)[source]¶
Generate Boundary-Sensitive Proposal Feature with given proposals.
- Parameters
video_list (list[int]) – List of video indexes to generate bsp_feature.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’.
tem_results_dir (str) – Directory to load temporal evaluation results.
pgm_proposals_dir (str) – Directory to load proposals.
top_k (int) – Number of proposals to be considered. Default: 1000
bsp_boundary_ratio (float) – Ratio for proposal boundary (start/end). Default: 0.2.
num_sample_start (int) – Num of samples for actionness in start region. Default: 8.
num_sample_end (int) – Num of samples for actionness in end region. Default: 8.
num_sample_action (int) – Num of samples for actionness in center region. Default: 16.
num_sample_interp (int) – Num of samples for interpolation for each sample point. Default: 3.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
pgm_proposal_ext (str) – File extension for proposals. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.
- Returns
A dict containing video_name as keys and bsp_feature as values. If result_dict is not None, the results are saved into it.
- Return type
dict
- mmaction.localization.generate_candidate_proposals(video_list, video_infos, tem_results_dir, temporal_scale, peak_threshold, tem_results_ext='.csv', result_dict=None)[source]¶
Generate Candidate Proposals with given temporal evaluation results. Each proposal file will contain: ‘tmin,tmax,tmin_score,tmax_score,score,match_iou,match_ioa’.
- Parameters
video_list (list[int]) – List of video indexes to generate proposals.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’, ‘duration_frame’, ‘duration_second’, ‘feature_frame’, and ‘annotations’.
tem_results_dir (str) – Directory to load temporal evaluation results.
temporal_scale (int) – The number (scale) on temporal axis.
peak_threshold (float) – The threshold for proposal generation.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.
- Returns
A dict containing video_name as keys and proposal lists as values. If result_dict is not None, the results are saved into it.
- Return type
dict
- mmaction.localization.load_localize_proposal_file(filename)[source]¶
Load the proposal file and split it into many parts which contain one video’s information separately.
- Parameters
filename (str) – Path to the proposal file.
- Returns
List of all videos’ information.
- Return type
list
- mmaction.localization.perform_regression(detections)[source]¶
Perform regression on detection results.
- Parameters
detections (list) – Detection results before regression.
- Returns
Detection results after regression.
- Return type
list
- mmaction.localization.soft_nms(proposals, alpha, low_threshold, high_threshold, top_k)[source]¶
Soft NMS for temporal proposals.
- Parameters
proposals (np.ndarray) – Proposals generated by network.
alpha (float) – Alpha value of Gaussian decaying function.
low_threshold (float) – Low threshold for soft nms.
high_threshold (float) – High threshold for soft nms.
top_k (int) – Top k values to be considered.
- Returns
The updated proposals.
- Return type
np.ndarray
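A hedged sketch; the column layout assumed here is (tmin, tmax, score), with the score in the last column, following the BSN/BMN post-processing convention:

```python
import numpy as np
from mmaction.localization import soft_nms

# Assumed layout: columns 0-1 are (tmin, tmax), last column is the score.
proposals = np.array([[0.0, 1.0, 0.90],
                      [0.1, 1.1, 0.85],
                      [2.0, 3.0, 0.70]])
kept = soft_nms(proposals, alpha=0.4, low_threshold=0.3,
                high_threshold=0.7, top_k=2)
```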
- mmaction.localization.temporal_iop(proposal_min, proposal_max, gt_min, gt_max)[source]¶
Compute IoP score between a groundtruth bbox and the proposals.
Compute the IoP, which is defined as the overlap ratio with the groundtruth proportional to the duration of the proposal.
- Parameters
proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.
- Returns
List of intersection over anchor scores.
- Return type
list[float]
- mmaction.localization.temporal_iou(proposal_min, proposal_max, gt_min, gt_max)[source]¶
Compute IoU score between a groundtruth bbox and the proposals.
- Parameters
proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.
- Returns
List of iou scores.
- Return type
list[float]
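A toy example with two anchors and one groundtruth box:

```python
import numpy as np
from mmaction.localization import temporal_iou

proposal_min = np.array([0.0, 5.0])
proposal_max = np.array([4.0, 9.0])
iou = temporal_iou(proposal_min, proposal_max, gt_min=2.0, gt_max=6.0)
# Anchor [0, 4] vs gt [2, 6]: 2 / 6; anchor [5, 9] vs gt [2, 6]: 1 / 7.
```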
mmaction.models¶
models¶
- class mmaction.models.ACRNHead(in_channels, out_channels, stride=1, num_convs=1, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, **kwargs)[source]¶
ACRN Head: Tile + 1x1 convolution + 3x3 convolution.
This module is proposed in Actor-Centric Relation Network
- Parameters
in_channels (int) – The input channel.
out_channels (int) – The output channel.
stride (int) – The spatial stride.
num_convs (int) – The number of 3x3 convolutions in ACRNHead.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv3d’).
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
kwargs (dict) – Other new arguments, to be compatible with MMDet update.
- forward(x, feat, rois, **kwargs)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The extracted RoI feature.
feat (torch.Tensor) – The context feature.
rois (torch.Tensor) – The regions of interest.
- Returns
The RoI features that have interacted with the context feature.
- Return type
torch.Tensor
- class mmaction.models.AudioRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
Audio recognizer model framework.
- forward(audios, label=None, return_loss=True)[source]¶
Define the computation performed at every call.
- forward_gradcam(audios)[source]¶
Defines the computation performed at every call when using gradcam utils.
- forward_test(audios)[source]¶
Defines the computation performed at every call during evaluation and testing.
- forward_train(audios, labels)[source]¶
Defines the computation performed at every call when training.
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- class mmaction.models.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶
Classification head for TSN on audio.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, focal_gamma=0.0, focal_alpha=1.0, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]¶
Simplest RoI head, with only two fc layers for classification and regression respectively.
- Parameters
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
in_channels (int) – The number of input channels. Default: 2048.
focal_alpha (float) – The hyper-parameter alpha for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 1.
focal_gamma (float) – The hyper-parameter gamma for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 0.
num_classes (int) – The number of classes. Default: 81.
dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.
dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.
topk (int or tuple[int]) – Parameter for evaluating Top-K accuracy. Default: (3, 5)
multilabel (bool) – Whether used for a multilabel task. Default: True.
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- static get_recall_prec(pred_vec, target_vec)[source]¶
Computes the Recall/Precision for both multi-label and single label scenarios.
Note that the computation calculates the micro average.
Note that in both cases, the concept of correct/incorrect is the same.
- Parameters
pred_vec (tensor[N x C]) – Each element is either 0 or 1.
target_vec (tensor[N x C]) – Each element is either 0 or 1; for single label it is expected that only one element is on (1), although this is not enforced.
- class mmaction.models.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]¶
Binary Cross Entropy Loss with logits.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
- class mmaction.models.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]¶
Boundary Matching Network for temporal action proposal generation.
Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network
- Parameters
temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
- forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶
Define the computation performed at every call.
- forward_test(raw_feature, video_meta)[source]¶
Define the computation performed at every call when testing.
- class mmaction.models.BMNLoss[source]¶
BMN Loss.
From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It calculates the loss for the BMN model as a weighted sum of:
1) temporal evaluation loss, based on the confidence scores of start and end positions;
2) proposal evaluation regression loss, based on the confidence scores of candidate proposals;
3) proposal evaluation classification loss, based on the classification results of candidate proposals.
- forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶
Calculate Boundary Matching Network Loss.
- Parameters
pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.
- Returns
(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.
- Return type
tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
- static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶
Calculate Proposal Evaluation Module Classification Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5
- Returns
Proposal evaluation classification loss.
- Return type
torch.Tensor
- static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶
Calculate Proposal Evaluation Module Regression Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.
- Returns
Proposal evaluation regression loss.
- Return type
torch.Tensor
- static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶
Calculate Temporal Evaluation Module Loss.
This function calculates the binary logistic regression loss for start and end respectively, and returns the sum of their losses.
- Parameters
pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
- class mmaction.models.BaseGCN(backbone, cls_head=None, train_cfg=None, test_cfg=None)[source]¶
Base class for GCN-based action recognition.
All GCN-based recognizers should subclass it. All subclasses should overwrite:
Methods: forward_train, supporting forward during training.
Methods: forward_test, supporting forward during testing.
- Parameters
backbone (dict) – Backbone modules to extract feature.
cls_head (dict | None) – Classification head to process feature. Default: None.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
- extract_feat(skeletons)[source]¶
Extract features through a backbone.
- Parameters
skeletons (torch.Tensor) – The input skeletons.
- Returns
The extracted features.
- Return type
torch.tensor
- forward(keypoint, label=None, return_loss=True, **kwargs)[source]¶
Define the computation performed at every call.
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- property with_cls_head¶
whether the recognizer has a cls_head
- Type
bool
- class mmaction.models.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0, topk=(1, 5))[source]¶
Base class for head.
All heads should subclass it. All subclasses should overwrite:
Methods: init_weights, initializing weights in some modules.
Methods: forward, supporting forward for both training and testing.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.
topk (int | tuple) – Top-k accuracy. Default: (1, 5).
- abstract init_weights()[source]¶
Initiate the parameters either from existing checkpoint or from scratch.
- loss(cls_score, labels, **kwargs)[source]¶
Calculate the loss given output cls_score and target labels.
- Parameters
cls_score (torch.Tensor) – The output of the model.
labels (torch.Tensor) – The target output of the model.
- Returns
A dict containing field ‘loss_cls’(mandatory) and ‘topk_acc’(optional).
- Return type
dict
- class mmaction.models.BaseRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
Base class for recognizers.
All recognizers should subclass it. All subclasses should overwrite:
Methods: forward_train, supporting forward during training.
Methods: forward_test, supporting forward during testing.
- Parameters
backbone (dict) – Backbone modules to extract feature.
cls_head (dict | None) – Classification head to process feature. Default: None.
neck (dict | None) – Neck for feature fusion. Default: None.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
- average_clip(cls_score, num_segs=1)[source]¶
Averaging class score over multiple clips.
Using different averaging types (‘score’, ‘prob’ or None, as defined in test_cfg) to compute the final averaged class score. Only called in test mode.
- Parameters
cls_score (torch.Tensor) – Class score to be averaged.
num_segs (int) – Number of clips for each input sample.
- Returns
Averaged class score.
- Return type
torch.Tensor
- extract_feat(imgs)[source]¶
Extract features through a backbone.
- Parameters
imgs (torch.Tensor) – The input images.
- Returns
The extracted features.
- Return type
torch.tensor
- forward(imgs, label=None, return_loss=True, **kwargs)[source]¶
Define the computation performed at every call.
- abstract forward_gradcam(imgs)[source]¶
Defines the computation performed at every call when using gradcam utils.
- abstract forward_test(imgs)[source]¶
Defines the computation performed at every call during evaluation and testing.
- abstract forward_train(imgs, labels, **kwargs)[source]¶
Defines the computation performed at every call when training.
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- property with_cls_head¶
whether the recognizer has a cls_head
- Type
bool
- property with_neck¶
whether the recognizer has a neck
- Type
bool
- class mmaction.models.BinaryLogisticRegressionLoss[source]¶
Binary Logistic Regression Loss.
It will calculate binary logistic regression loss given reg_score and label.
- forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶
Calculate Binary Logistic Regression Loss.
- Parameters
reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
- class mmaction.models.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]¶
C3D backbone.
- Parameters
pretrained (str | None) – Name of pretrained model.
style (str) – ‘pytorch’ or ‘caffe’. If set to ‘pytorch’, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Required keys are type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
out_dim (int) – The dimension of the last layer feature (after flatten). Depends on the input shape. Default: 8192.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization of fc layers. Default: 0.005.
- class mmaction.models.CBFocalLoss(loss_weight=1.0, samples_per_cls=[], beta=0.9999, gamma=2.0)[source]¶
Class Balanced Focal Loss. Adapted from https://github.com/abhinanda-punnakkal/BABEL/. This loss is used in the skeleton-based action recognition baseline for BABEL.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
samples_per_cls (list[int]) – The number of samples per class. Default: [].
beta (float) – Hyperparameter that controls the per class loss weight. Default: 0.9999.
gamma (float) – Hyperparameter of the focal loss. Default: 2.0.
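A hedged config sketch showing how this loss might be plugged into a head config; the head type, channel sizes and class counts are placeholders:

```python
cls_head = dict(
    type='STGCNHead',  # placeholder head for a skeleton-based recognizer
    num_classes=3,
    in_channels=256,
    loss_cls=dict(
        type='CBFocalLoss',
        samples_per_cls=[100, 50, 25],  # one sample count per class
        beta=0.9999,
        gamma=2.0))
```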
- class mmaction.models.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]¶
(2+1)d Conv module for R(2+1)d backbone.
https://arxiv.org/pdf/1711.11248.pdf.
- Parameters
in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (int | tuple[int]) – Same as nn.Conv3d.
stride (int | tuple[int]) – Same as nn.Conv3d.
padding (int | tuple[int]) – Same as nn.Conv3d.
dilation (int | tuple[int]) – Same as nn.Conv3d.
groups (int) – Same as nn.Conv3d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
- class mmaction.models.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]¶
Conv2d module for AudioResNet backbone.
- Parameters
in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (int | tuple[int]) – Same as nn.Conv2d.
op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.
stride (int | tuple[int]) – Same as nn.Conv2d.
padding (int | tuple[int]) – Same as nn.Conv2d.
dilation (int | tuple[int]) – Same as nn.Conv2d.
groups (int) – Same as nn.Conv2d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
- class mmaction.models.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]¶
Cross Entropy Loss.
Support two kinds of labels and their corresponding loss types. It’s worth mentioning that the loss type will be detected by the shapes of cls_score and label.
1) Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft label is supported.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
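A shape sketch of the two supported label kinds (values are illustrative):

```python
import torch

cls_score = torch.randn(4, 10)                     # [batch, num_classes]
hard_label = torch.tensor([1, 0, 9, 3])            # [batch]; num_classes dim removed
soft_label = torch.softmax(torch.randn(4, 10), 1)  # same shape as cls_score
```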
- class mmaction.models.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]¶
Spatial Attention in Divided Space Time Attention.
- Parameters
embed_dims (int) – Dimensions of embedding.
num_heads (int) – Number of parallel attention heads in TransformerCoder.
num_frames (int) – Number of frames in the video.
attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0..
proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0..
dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).
init_cfg (dict | None) – The Config for initialization. Defaults to None.
- forward(query, key=None, value=None, residual=None, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]¶
Temporal Attention in Divided Space Time Attention.
- Parameters
embed_dims (int) – Dimensions of embedding.
num_heads (int) – Number of parallel attention heads in TransformerCoder.
num_frames (int) – Number of frames in the video.
attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0..
proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0..
dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).
init_cfg (dict | None) – The Config for initialization. Defaults to None.
- forward(query, key=None, value=None, residual=None, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]¶
Feature Bank Operator Head.
Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.
- Parameters
lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.
fbo_cfg (Dict) – The config dict for the feature bank operator (FBO). The FBO type is also specified in the config dict; supported FBO types are listed in fbo_dict.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
- forward(x, rois, img_metas, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs)[source]¶
FFN with pre normalization layer.
FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.
FFNWithNorm has one main difference from FFN: it applies one normalization layer before forwarding the input data to the feed-forward networks.
- Parameters
embed_dims (int) – Dimensions of embedding. Defaults to 256.
feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.
num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.
act_cfg (dict) – Config for activate layers. Defaults to dict(type=’ReLU’)
ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0..
add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.
dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.
init_cfg (dict) – The Config for initialization. Defaults to None.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).
- class mmaction.models.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶
Calculate the BCELoss for HVU.
- Parameters
categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: [‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’].
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories, it applies only if loss_type == ‘individual’. The loss weights will be normalized so that the sum equals to 1, so that you can give any positive number as loss weight. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type we calculate, we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, we will not calculate loss for these missing categories; otherwise, these missing categories are treated as negative samples.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
- class mmaction.models.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]¶
Classification head for I3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]¶
Long-Term Feature Bank (LFB).
LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding
The ROI features of videos are stored in the feature bank. The feature bank is generated by inferring with an LFB infer config.
Formally, LFB is a Dict whose keys are video IDs and whose values are also Dicts whose keys are timestamps in seconds. A minimal sketch of this structure is shown below.
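The sketch assumes each timestamp maps to a list of ROI feature tensors; the exact per-timestamp container and feature shapes are assumptions:

```python
import torch

# Outer keys: video IDs; inner keys: timestamps in seconds.
lfb = {
    'video_id_1': {902: [torch.randn(2048), torch.randn(2048)],
                   903: [torch.randn(2048)]},
    'video_id_2': {451: [torch.randn(2048)]},
}
```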
- Parameters
lfb_prefix_path (str) – The storage path of lfb.
max_num_sampled_feat (int) – The max number of sampled features. Default: 5.
window_size (int) – Window size of sampling long term feature. Default: 60.
lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.
dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).
device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.
lmdb_map_size (int) – Map size of lmdb. Default: 4e9.
construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.
- class mmaction.models.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]¶
Long-Term Feature Bank Infer Head.
This head is used to derive and save the LFB without affecting the input.
- Parameters
lfb_prefix_path (str) – The prefix path to store the lfb.
dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.
use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
- forward(x, rois, img_metas, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7, ), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]¶
MobileNetV2 backbone.
- Parameters
pretrained (str | None) – Name of pretrained model. Default: None.
widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).
frozen_stages (int) – Stages to be frozen (all params fixed). Note that the last stage in MobileNetV2 is conv2. Default: -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.
norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’).
act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- make_layer(out_channels, num_blocks, stride, expand_ratio)[source]¶
Stack InvertedResidual blocks to build a layer for MobileNetV2.
- Parameters
out_channels (int) – out_channels of block.
num_blocks (int) – number of blocks.
stride (int) – stride of the first block. Default: 1
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.
- train(mode=True)[source]¶
Sets the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters
mode (bool) – Whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns
self
- Return type
Module
- class mmaction.models.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]¶
MobileNetV2 backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in res layers. Default: True.
shift_div (int) – Number of div for shift. Default: 8.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.
- class mmaction.models.NLLLoss(loss_weight=1.0)[source]¶
NLL Loss.
It will calculate NLL loss given cls_score and label.
- class mmaction.models.OHEMHingeLoss(*args, **kwargs)[source]¶
This class is the core implementation for the completeness loss in the paper.
It computes class-wise hinge loss and performs online hard example mining (OHEM).
- static backward(ctx, grad_output)[source]¶
Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as forward() returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
- static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶
Calculate OHEM hinge loss.
- Parameters
pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.
- Returns
Returned class-wise hinge loss.
- Return type
torch.Tensor
- class mmaction.models.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]¶
Proposals Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.
- forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]¶
Define the computation performed at every call.
- class mmaction.models.Recognizer2D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
2D recognizer model framework.
- forward_dummy(imgs, softmax=False)[source]¶
Used for computing network FLOPs.
See tools/analysis/get_flops.py.
- Parameters
imgs (torch.Tensor) – Input images.
- Returns
Class score.
- Return type
Tensor
- forward_gradcam(imgs)[source]¶
Defines the computation performed at every call when using gradcam utils.
- class mmaction.models.Recognizer3D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
3D recognizer model framework.
- forward_dummy(imgs, softmax=False)[source]¶
Used for computing network FLOPs.
See `tools/analysis/get_flops.py`.
- Parameters
imgs (torch.Tensor) – Input images.
- Returns
Class score.
- Return type
Tensor
- forward_gradcam(imgs)[source]¶
Defines the computation performed at every call when using gradcam utils.
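A hedged usage sketch for `forward_dummy` on the 2D recognizer above: the config values are illustrative, and the input layout (N, num_segs, C, H, W) is an assumption based on how the recognizer reshapes frames.
```python
import torch
from mmaction.models import Recognizer2D

# Illustrative config; assumes a ResNet-50 backbone and a TSN head.
model = Recognizer2D(
    backbone=dict(type='ResNet', depth=50),
    cls_head=dict(type='TSNHead', num_classes=400, in_channels=2048))
model.eval()
imgs = torch.randn(1, 3, 3, 224, 224)  # assumed (N, num_segs, C, H, W)
with torch.no_grad():
    outs = model.forward_dummy(imgs)   # class scores for FLOPs probing
```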
- class mmaction.models.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3, ), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]¶
ResNet backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
dilations (Sequence[int]) – Dilation of each stage.
style (str) – `pytorch` or `caffe`. If set to "pytorch", the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: `pytorch`.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
conv_cfg (dict) – Config for conv layers. Default: dict(type='Conv').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN2d', requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
partial_bn (bool) – Whether to use partial bn. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
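For reference, a minimal forward pass through this backbone; shapes assume the default out_indices=(3, ) and standard ResNet-50 strides.
```python
import torch
from mmaction.models import ResNet

backbone = ResNet(depth=50, pretrained=None)
backbone.init_weights()
x = torch.randn(2, 3, 224, 224)   # (N, C, H, W)
feat = backbone(x)                # expected shape: (2, 2048, 7, 7)
```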
- class mmaction.models.ResNet2Plus1d(*args, **kwargs)[source]¶
ResNet (2+1)d backbone.
This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition
- class mmaction.models.ResNet3d(depth, pretrained, stage_blocks=None, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3, ), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(3, 7, 7), conv1_stride_s=2, conv1_stride_t=1, pool1_stride_s=2, pool1_stride_t=1, with_pool1=True, with_pool2=True, style='pytorch', frozen_stages=-1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]¶
ResNet 3d backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
stage_blocks (tuple | None) – Set number of stages for each res layer. Default: None.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
in_channels (int) – Channel num of input features. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_s (int) – Spatial stride of the first conv layer. Default: 2.
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_s (int) – Spatial stride of the first pooling layer. Default: 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
with_pool2 (bool) – Whether to use pool2. Default: True.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).
inflate_style (str) – `3x1x1` or `3x3x3`, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x1x1'.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
- static make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶
Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
style (str) – `pytorch` or `caffe`. If set to `pytorch`, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: `pytorch`.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.
inflate_style (str) – `3x1x1` or `3x3x3`, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x1x1'.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: 0.
non_local_cfg (dict) – Config for non-local module. Default: dict().
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activation layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
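A minimal sketch of using the 3D backbone directly; pretrained2d=False avoids loading 2D weights, and the output temporal size depends on the pool1/pool2 settings.
```python
import torch
from mmaction.models import ResNet3d

backbone = ResNet3d(depth=50, pretrained=None, pretrained2d=False)
backbone.init_weights()
clip = torch.randn(1, 3, 8, 224, 224)  # (N, C, T, H, W)
feat = backbone(clip)                  # expected: (1, 2048, T', 7, 7)
```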
- class mmaction.models.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]¶
ResNet backbone for CSN.
- Parameters
depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True, eps=1e-3).
inflate_style (str) – `3x1x1` or `3x3x3`, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x3x3'.
bottleneck_mode (str) – Determine which ways to factorize a 3D bottleneck block using channel-separated convolutional networks. If set to 'ip', it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., Interaction-preserved channel-separated bottleneck block. If set to 'ir', it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from the preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., Interaction-reduced channel-separated bottleneck block. Default: 'ir'.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- class mmaction.models.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶
ResNet 3d Layer.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
stage (int) – The index of Resnet stage. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
spatial_stride (int) – The 1st res block's spatial stride. Default: 2.
temporal_stride (int) – The 1st res block's temporal stride. Default: 1.
dilation (int) – The dilation. Default: 1.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
all_frozen (bool) – Whether to freeze all modules in the layer. Default: False.
inflate (int) – Inflate Dims of each block. Default: 1.
inflate_style (str) – `3x1x1` or `3x3x3`, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x1x1'.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- class mmaction.models.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]¶
Slowfast backbone.
This module is proposed in SlowFast Networks for Video Recognition
- Parameters
pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride `resample_rate` on input frames. The actual resample rate is calculated by multiplying the `interval` in `SampleFrames` in the pipeline with `resample_rate`, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of `resample_rate * interval` frames. Default: 8.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.
channel_ratio (int) – Reduce the channel number of fast pathway by `channel_ratio`, corresponding to \(\beta\) in the paper. Default: 8.
slow_pathway (dict) – Configuration of slow branch. It should contain the necessary arguments for building the specific type of pathway and: type (str): type of backbone the pathway is based on. lateral (bool): determine whether to build lateral connection for the pathway. Default: dict(type='ResNetPathway', lateral=True, depth=50, pretrained=None, conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1)).
fast_pathway (dict) – Configuration of fast branch, similar to slow_pathway. Default: dict(type='ResNetPathway', lateral=False, depth=50, pretrained=None, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1).
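A minimal forward sketch, assuming the default two-pathway configuration and that the backbone returns a (slow, fast) feature tuple.
```python
import torch
from mmaction.models import ResNet3dSlowFast

model = ResNet3dSlowFast(pretrained=None)
model.init_weights()
clip = torch.randn(1, 3, 32, 224, 224)  # (N, C, T, H, W); T divisible by 8
slow_feat, fast_feat = model(clip)      # assumed (slow, fast) tuple
```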
- class mmaction.models.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]¶
SlowOnly backbone based on ResNet3dPathway.
- Parameters
*args (arguments) – Arguments same as `ResNet3dPathway`.
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).
**kwargs (keyword arguments) – Keyword arguments for `ResNet3dPathway`.
- class mmaction.models.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=-1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]¶
ResNet 2d audio backbone. Reference:
- Parameters
depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
in_channels (int) – Channel num of input features. Default: 1.
base_channels (int) – Channel num of stem output features. Default: 32.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.
conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.
factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict) – Config for conv layers. Default: dict(type='Conv').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN2d', requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type='ReLU', inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
- static make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]¶
Build residual layer for ResNetAudio.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Stride of the residual block. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
- class mmaction.models.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]¶
ResNet backbone for TIN.
- Parameters
depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.
- class mmaction.models.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]¶
ResNet backbone for TSM.
- Parameters
depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in resnet layers. Default: True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
shift_div (int) – Number of divisions for shift. Default: 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Default: False.
**kwargs (keyword arguments, optional) – Arguments for ResNet.
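A usage sketch: the temporal shift is installed by init_weights(), and frames are stacked along the batch dimension as N * num_segments.
```python
import torch
from mmaction.models import ResNetTSM

backbone = ResNetTSM(depth=50, num_segments=8, shift_div=8)
backbone.init_weights()                    # installs the shift modules
frames = torch.randn(2 * 8, 3, 224, 224)   # (N * num_segments, C, H, W)
feat = backbone(frames)                    # expected: (16, 2048, 7, 7)
```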
- class mmaction.models.SSNLoss[source]¶
- static activity_loss(activity_score, labels, activity_indexer)[source]¶
Activity Loss.
It will calculate activity loss given activity_score and label.
- Args:
activity_score (torch.Tensor): Predicted activity score.
labels (torch.Tensor): Groundtruth class label.
activity_indexer (torch.Tensor): Index slices of proposals.
- Returns
Returned cross entropy loss.
- Return type
torch.Tensor
- static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶
Classwise Regression Loss.
It will calculate classwise_regression loss given class_reg_pred and targets.
- Args:
bbox_pred (torch.Tensor): Predicted interval center and span of positive proposals.
labels (torch.Tensor): Groundtruth class label.
bbox_targets (torch.Tensor): Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor): Index slices of positive proposals.
- Returns
Returned class-wise regression loss.
- Return type
torch.Tensor
- static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶
Completeness Loss.
It will calculate completeness loss given completeness_score and label.
- Args:
completeness_score (torch.Tensor): Predicted completeness score.
labels (torch.Tensor): Groundtruth class label.
completeness_indexer (torch.Tensor): Index slices of positive and incomplete proposals.
positive_per_video (int): Number of positive proposals sampled per video.
incomplete_per_video (int): Number of incomplete proposals sampled per video.
ohem_ratio (float): Ratio of online hard example mining. Default: 0.17.
- Returns
Returned class-wise completeness loss.
- Return type
torch.Tensor
- forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶
Calculate Structured Segment Network (SSN) loss.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.
- Returns
(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.
- Return type
dict([torch.Tensor, torch.Tensor, torch.Tensor])
- class mmaction.models.STGCN(in_channels, graph_cfg, edge_importance_weighting=True, data_bn=True, pretrained=None, **kwargs)[source]¶
Backbone of Spatial temporal graph convolutional networks.
- Parameters
in_channels (int) – Number of channels in the input data.
graph_cfg (dict) – The arguments for building the graph.
edge_importance_weighting (bool) – If `True`, adds a learnable importance weighting to the edges of the graph. Default: True.
data_bn (bool) – If `True`, adds data normalization to the inputs. Default: True.
pretrained (str | None) – Name of pretrained model.
**kwargs (optional) – Other parameters for graph convolution units.
- Shape:
Input: \((N, in\_channels, T_{in}, V_{in}, M_{in})\)
Output: \((N, num\_class)\) where \(N\) is the batch size, \(T_{in}\) is the length of the input sequence, \(V_{in}\) is the number of graph nodes, and \(M_{in}\) is the number of instances in a frame.
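A forward sketch matching the shape convention above; the graph_cfg keys follow the repository's skeleton configs and should be treated as an assumption.
```python
import torch
from mmaction.models import STGCN

# graph_cfg keys ('layout', 'strategy') assumed from the skeleton configs.
model = STGCN(in_channels=3,
              graph_cfg=dict(layout='openpose', strategy='spatial'))
# (N, C, T, V, M): batch, channels, frames, graph nodes, persons.
skeletons = torch.randn(2, 3, 100, 18, 2)
feat = model(skeletons)
```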
- class mmaction.models.STGCNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', num_person=2, init_std=0.01, **kwargs)[source]¶
The classification head for STGCN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
num_person (int) – Number of persons. Default: 2.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.SingleRoIExtractor3D(roi_layer_type='RoIAlign', featmap_stride=16, output_size=16, sampling_ratio=0, pool_mode='avg', aligned=True, with_temporal_pool=True, temporal_pool_mode='avg', with_global=False)[source]¶
Extract RoI features from a single level feature map.
- Parameters
roi_layer_type (str) – Specify the RoI layer type. Default: ‘RoIAlign’.
featmap_stride (int) – Strides of input feature maps. Default: 16.
output_size (int | tuple) – Size or (Height, Width). Default: 16.
sampling_ratio (int) – Number of input samples to take for each output sample. 0 to take samples densely for current models. Default: 0.
pool_mode (str, 'avg' or 'max') – Pooling mode in each bin. Default: 'avg'.
aligned (bool) – If False, use the legacy implementation in MMDetection. If True, align the results more perfectly. Default: True.
with_temporal_pool (bool) – If True, avgpool the temporal dim. Default: True.
with_global (bool) – If True, concatenate the RoI feature with global feature. Default: False.
Note that sampling_ratio, pool_mode, aligned only apply when roi_layer_type is set as RoIAlign.
- forward(feat, rois)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.SkeletonGCN(backbone, cls_head=None, train_cfg=None, test_cfg=None)[source]¶
Spatial temporal graph convolutional networks.
- class mmaction.models.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]¶
The classification head for SlowFast.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.SubBatchNorm3D(num_features, **cfg)[source]¶
Sub BatchNorm3d splits the batch dimension into N splits, and run BN on each of them separately (so that the stats are computed on each subset of examples (1/N of batch) independently). During evaluation, it aggregates the stats from all splits into one BN.
- Parameters
num_features (int) – Dimensions of BatchNorm.
- aggregate_stats()[source]¶
Synchronize running_mean and running_var to self.bn.
Call this before evaluation, i.e. before calling model.eval(). In eval mode, the forward function uses self.bn instead of self.split_bn, and by this point the running_mean and running_var of self.bn have been aggregated from self.split_bn.
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]¶
Temporal Adaptive Module (TAM) for TANet.
This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
- Parameters
in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – `alpha` in the paper; the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.
adaptive_kernel_size (int) – `K` in the paper; the size of the adaptive kernel in the global branch. Default: 3.
beta (int) – `beta` in the paper; set to control the model complexity in the local branch. Default: 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
init_std (float) – Std value for initialization of nn.Linear. Default: 0.001.
- class mmaction.models.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]¶
Temporal Adaptive Network (TANet) backbone.
This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().
**kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.
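A usage sketch, with frames stacked along the batch dimension as N * num_segments.
```python
import torch
from mmaction.models import TANet

backbone = TANet(depth=50, num_segments=8)
# Depending on the version, init_weights() may be needed to embed TAM.
backbone.init_weights()
frames = torch.randn(8, 3, 224, 224)  # one video of 8 segments
feat = backbone(frames)               # expected: (8, 2048, 7, 7)
```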
- class mmaction.models.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]¶
Temporal Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.
- forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶
Define the computation performed at every call.
- forward_test(raw_feature, video_meta)[source]¶
Define the computation performed at every call when testing.
- class mmaction.models.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]¶
TPN neck.
This module is proposed in Temporal Pyramid Network for Action Recognition
- Parameters
in_channels (tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.
temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.
upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in :class:`nn.Upsample`. Default: None.
level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.
aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.
flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.
- forward(x, target=None)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.TPNHead(*args, **kwargs)[source]¶
Class head for TPN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.
- forward(x, num_segs=None, fcn_test=False)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int | None) – Number of segments into which a video is divided. Default: None.
fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
- class mmaction.models.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]¶
Class head for TRN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.
hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x, num_segs)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Useless in TRNHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is automatically generated in the Recognizer forward phase and useless in TRN models. The self.num_segments we need is a hyperparameter used to build TRN models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
- class mmaction.models.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]¶
Class head for TSM.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.001.
is_shift (bool) – Indicating whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x, num_segs)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Useless in TSMHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is automatically generated in the Recognizer forward phase and useless in TSM models. The self.num_segments we need is a hyperparameter used to build TSM models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
- class mmaction.models.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶
Class head for TSN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
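A head-only sketch: features from N * num_segs frames are pooled and averaged by the consensus module (shapes assumed).
```python
import torch
from mmaction.models import TSNHead

head = TSNHead(num_classes=400, in_channels=2048)
head.init_weights()
feat = torch.randn(8, 2048, 7, 7)   # (N * num_segs, C, H, W), N = 1
score = head(feat, num_segs=8)      # expected: (1, 400)
```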
- class mmaction.models.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs)[source]¶
TimeSformer. A PyTorch impl of Is Space-Time Attention All You Need for Video Understanding?
- Parameters
num_frames (int) – Number of frames in the video.
img_size (int | tuple) – Size of input image.
patch_size (int) – Size of one patch.
pretrained (str | None) – Name of pretrained model. Default: None.
embed_dims (int) – Dimensions of embedding. Defaults to 768.
num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.
num_transformer_layers (int) – Number of transformer layers. Defaults to 12.
in_channels (int) – Channel num of input features. Defaults to 3.
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.
transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of the transformer layer in TransformerCoder. If it is an mmcv.ConfigDict, it will be repeated num_transformer_layers times to a list[mmcv.ConfigDict]. Defaults to None.
attention_type (str) – Type of attentions in TransformerCoder. Choices are ‘divided_space_time’, ‘space_only’ and ‘joint_space_time’. Defaults to ‘divided_space_time’.
norm_cfg (dict) – Config for norm layers. Defaults to dict(type=’LN’, eps=1e-6).
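A minimal forward sketch; the returned feature is assumed to be the class-token embedding of size embed_dims.
```python
import torch
from mmaction.models import TimeSformer

backbone = TimeSformer(num_frames=8, img_size=224, patch_size=16)
backbone.init_weights()
clip = torch.randn(1, 3, 8, 224, 224)  # (N, C, T, H, W)
feat = backbone(clip)                  # assumed shape: (1, 768)
```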
- class mmaction.models.TimeSformerHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, init_std=0.02, **kwargs)[source]¶
Classification head for TimeSformer.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Defaults to dict(type=’CrossEntropyLoss’).
init_std (float) – Std value for Initiation. Defaults to 0.02.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=-1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶
X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.
- Parameters
gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
- make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶
Build residual layer for X3D.
- Parameters
block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals to base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals to base_channel * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activation layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
- class mmaction.models.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]¶
Classification head for X3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.
recognizers¶
- class mmaction.models.recognizers.AudioRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
Audio recognizer model framework.
- forward(audios, label=None, return_loss=True)[source]¶
Define the computation performed at every call.
- forward_gradcam(audios)[source]¶
Defines the computation performed at every call when using gradcam utils.
- forward_test(audios)[source]¶
Defines the computation performed at every call during evaluation and testing.
- forward_train(audios, labels)[source]¶
Defines the computation performed at every call when training.
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: `loss`, `log_vars`, `num_samples`. `loss` is a tensor for back propagation, which can be a weighted sum of multiple losses. `log_vars` contains all the variables to be sent to the logger. `num_samples` indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
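A sketch of the dictionary that a train_step() implementation returns, following the key descriptions above (the values here are fabricated).
```python
import torch

total_loss = torch.tensor(1.23, requires_grad=True)  # fabricated value
outputs = dict(
    loss=total_loss,                        # tensor for back propagation
    log_vars=dict(loss=float(total_loss)),  # values sent to the logger
    num_samples=8)                          # batch size on this GPU
```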
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- class mmaction.models.recognizers.BaseRecognizer(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
Base class for recognizers.
All recognizers should subclass it. All subclasses should overwrite:
Methods: `forward_train`, supporting the forward computation during training.
Methods: `forward_test`, supporting the forward computation during testing.
- Parameters
backbone (dict) – Backbone modules to extract feature.
cls_head (dict | None) – Classification head to process feature. Default: None.
neck (dict | None) – Neck for feature fusion. Default: None.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
- average_clip(cls_score, num_segs=1)[source]¶
Averaging class score over multiple clips.
Using different averaging types ('score' or 'prob' or None, which are defined in test_cfg) to compute the final averaged class score. Only called in test mode.
- Parameters
cls_score (torch.Tensor) – Class score to be averaged.
num_segs (int) – Number of clips for each input sample.
- Returns
Averaged class score.
- Return type
torch.Tensor
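The two averaging modes described above can be sketched as follows: 'score' averages raw logits, while 'prob' averages softmax probabilities over the clip dimension.
```python
import torch
import torch.nn.functional as F

cls_score = torch.randn(10, 400)  # 10 clips of one sample, 400 classes
# 'score' mode: average the raw class scores over clips.
score_avg = cls_score.view(1, 10, 400).mean(dim=1)
# 'prob' mode: average the softmax probabilities over clips.
prob_avg = F.softmax(cls_score, dim=1).view(1, 10, 400).mean(dim=1)
```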
- extract_feat(imgs)[source]¶
Extract features through a backbone.
- Parameters
imgs (torch.Tensor) – The input images.
- Returns
The extracted features.
- Return type
torch.Tensor
- forward(imgs, label=None, return_loss=True, **kwargs)[source]¶
Define the computation performed at every call.
- abstract forward_gradcam(imgs)[source]¶
Defines the computation performed at every call when using gradcam utils.
- abstract forward_test(imgs)[source]¶
Defines the computation performed at every call during evaluation and testing.
- abstract forward_train(imgs, labels, **kwargs)[source]¶
Defines the computation performed at every call when training.
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: `loss`, `log_vars`, `num_samples`. `loss` is a tensor for back propagation, which can be a weighted sum of multiple losses. `log_vars` contains all the variables to be sent to the logger. `num_samples` indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- property with_cls_head¶
whether the recognizer has a cls_head
- Type
bool
- property with_neck¶
whether the recognizer has a neck
- Type
bool
- class mmaction.models.recognizers.Recognizer2D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
2D recognizer model framework.
- forward_dummy(imgs, softmax=False)[source]¶
Used for computing network FLOPs.
See `tools/analysis/get_flops.py`.
- Parameters
imgs (torch.Tensor) – Input images.
- Returns
Class score.
- Return type
Tensor
- forward_gradcam(imgs)[source]¶
Defines the computation performed at every call when using gradcam utils.
- class mmaction.models.recognizers.Recognizer3D(backbone, cls_head=None, neck=None, train_cfg=None, test_cfg=None)[source]¶
3D recognizer model framework.
- forward_dummy(imgs, softmax=False)[source]¶
Used for computing network FLOPs.
See `tools/analysis/get_flops.py`.
- Parameters
imgs (torch.Tensor) – Input images.
- Returns
Class score.
- Return type
Tensor
- forward_gradcam(imgs)[source]¶
Defines the computation performed at every call when using gradcam utils.
localizers¶
- class mmaction.models.localizers.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]¶
Boundary Matching Network for temporal action proposal generation.
Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network
- Parameters
temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
- forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶
Define the computation performed at every call.
- forward_test(raw_feature, video_meta)[source]¶
Define the computation performed at every call when testing.
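An instantiation sketch with illustrative values in the spirit of the ActivityNet feature configs; treat them as placeholders, not recommended settings.
```python
from mmaction.models.localizers import BMN

bmn = BMN(
    temporal_dim=100,
    boundary_ratio=0.5,
    num_samples=32,
    num_samples_per_bin=3,
    feat_dim=400,
    soft_nms_alpha=0.4,
    soft_nms_low_threshold=0.5,
    soft_nms_high_threshold=0.9,
    post_process_top_k=100)
```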
- class mmaction.models.localizers.BaseTAGClassifier(backbone, cls_head, train_cfg=None, test_cfg=None)[source]¶
Base class for temporal action proposal classifier.
All temporal action generation classifiers should subclass it. All subclasses should overwrite:
Methods: `forward_train`, supporting the forward computation during training.
Methods: `forward_test`, supporting the forward computation during testing.
- extract_feat(imgs)[source]¶
Extract features through a backbone.
- Parameters
imgs (torch.Tensor) – The input images.
- Returns
The extracted features.
- Return type
torch.Tensor
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: `loss`, `log_vars`, `num_samples`. `loss` is a tensor for back propagation, which can be a weighted sum of multiple losses. `log_vars` contains all the variables to be sent to the logger. `num_samples` indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- class mmaction.models.localizers.BaseTAPGenerator[source]¶
Base class for temporal action proposal generator.
All temporal action proposal generators should subclass it. All subclasses should overwrite:
Methods: `forward_train`, supporting the forward computation during training.
Methods: `forward_test`, supporting the forward computation during testing.
- train_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: `loss`, `log_vars`, `num_samples`. `loss` is a tensor for back propagation, which can be a weighted sum of multiple losses. `log_vars` contains all the variables to be sent to the logger. `num_samples` indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data_batch, optimizer, **kwargs)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- class mmaction.models.localizers.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]¶
Proposals Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.
- forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]¶
Define the computation performed at every call.
- class mmaction.models.localizers.SSN(backbone, cls_head, in_channels=3, spatial_type='avg', dropout_ratio=0.5, loss_cls={'type': 'SSNLoss'}, train_cfg=None, test_cfg=None)[source]¶
Temporal Action Detection with Structured Segment Networks.
- Parameters
backbone (dict) – Config for building backbone.
cls_head (dict) – Config for building classification head.
in_channels (int) – Number of channels for input data. Default: 3.
spatial_type (str) – Type of spatial pooling. Default: ‘avg’.
dropout_ratio (float) – Ratio of dropout. Default: 0.5.
loss_cls (dict) – Config for building loss. Default: dict(type='SSNLoss').
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
- class mmaction.models.localizers.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]¶
Temporal Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference: https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.
- forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶
Define the computation performed at every call.
- forward_test(raw_feature, video_meta)[source]¶
Define the computation performed at every call when testing.
common¶
- class mmaction.models.common.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]¶
(2+1)d Conv module for R(2+1)d backbone.
https://arxiv.org/pdf/1711.11248.pdf.
- Parameters
in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (int | tuple[int]) – Same as nn.Conv3d.
stride (int | tuple[int]) – Same as nn.Conv3d.
padding (int | tuple[int]) – Same as nn.Conv3d.
dilation (int | tuple[int]) – Same as nn.Conv3d.
groups (int) – Same as nn.Conv3d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
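A minimal usage sketch (the import path is assumed; kernel, stride and padding values are illustrative):

import torch
from mmaction.models.common import Conv2plus1d

# Factorized (2+1)d convolution over a 5D clip tensor (N, C, T, H, W):
# a 2D spatial conv followed by a 1D temporal conv, with norm and ReLU between.
conv = Conv2plus1d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
x = torch.randn(1, 3, 8, 112, 112)
out = conv(x)  # (1, 64, 8, 56, 56)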
- class mmaction.models.common.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]¶
Conv2d module for AudioResNet backbone.
- Parameters
in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (int | tuple[int]) – Same as nn.Conv2d.
op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.
stride (int | tuple[int]) – Same as nn.Conv2d.
padding (int | tuple[int]) – Same as nn.Conv2d.
dilation (int | tuple[int]) – Same as nn.Conv2d.
groups (int) – Same as nn.Conv2d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
- class mmaction.models.common.DividedSpatialAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]¶
Spatial Attention in Divided Space Time Attention.
- Parameters
embed_dims (int) – Dimensions of embedding.
num_heads (int) – Number of parallel attention heads in TransformerCoder.
num_frames (int) – Number of frames in the video.
attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.0.
proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.0.
dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).
init_cfg (dict | None) – The Config for initialization. Defaults to None.
- forward(query, key=None, value=None, residual=None, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.common.DividedTemporalAttentionWithNorm(embed_dims, num_heads, num_frames, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.1, 'type': 'DropPath'}, norm_cfg={'type': 'LN'}, init_cfg=None, **kwargs)[source]¶
Temporal Attention in Divided Space Time Attention.
- Parameters
embed_dims (int) – Dimensions of embedding.
num_heads (int) – Number of parallel attention heads in TransformerCoder.
num_frames (int) – Number of frames in the video.
attn_drop (float) – A Dropout layer on attn_output_weights. Defaults to 0.0.
proj_drop (float) – A Dropout layer after nn.MultiheadAttention. Defaults to 0.0.
dropout_layer (dict) – The dropout_layer used when adding the shortcut. Defaults to dict(type=’DropPath’, drop_prob=0.1).
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).
init_cfg (dict | None) – The Config for initialization. Defaults to None.
- forward(query, key=None, value=None, residual=None, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.common.FFNWithNorm(*args, norm_cfg={'type': 'LN'}, **kwargs)[source]¶
FFN with pre normalization layer.
FFNWithNorm is implemented to be compatible with BaseTransformerLayer when using DividedTemporalAttentionWithNorm and DividedSpatialAttentionWithNorm.
FFNWithNorm has one main difference from FFN: it applies one normalization layer before forwarding the input data to the feed-forward networks.
- Parameters
embed_dims (int) – Dimensions of embedding. Defaults to 256.
feedforward_channels (int) – Hidden dimension of FFNs. Defaults to 1024.
num_fcs (int, optional) – Number of fully-connected layers in FFNs. Defaults to 2.
act_cfg (dict) – Config for activate layers. Defaults to dict(type=’ReLU’)
ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Defaults to 0.0.
add_residual (bool, optional) – Whether to add the residual connection. Defaults to True.
dropout_layer (dict | None) – The dropout_layer used when adding the shortcut. Defaults to None.
init_cfg (dict) – The Config for initialization. Defaults to None.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’LN’).
- class mmaction.models.common.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]¶
Long-Term Feature Bank (LFB).
LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding
The ROI features of videos are stored in the feature bank. The feature bank is generated by running inference with an LFB infer config.
Formally, LFB is a Dict whose keys are video IDs and its values are also Dicts whose keys are timestamps in seconds. Example of LFB:
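A hypothetical illustration of that structure (video IDs, timestamps and feature values are made up; real features come from the infer run):

import torch

lfb = {
    'video_id_0': {
        902: [torch.randn(2048), torch.randn(2048)],  # ROI features at second 902
        903: [torch.randn(2048)],
    },
    'video_id_1': {901: [torch.randn(2048)]},
}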
- Parameters
lfb_prefix_path (str) – The storage path of lfb.
max_num_sampled_feat (int) – The max number of sampled features. Default: 5.
window_size (int) – Window size of sampling long term feature. Default: 60.
lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.
dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset, i.e. set dataset_modes = ('train', ). Default: ('train', 'val').
device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.
lmdb_map_size (int) – Map size of lmdb. Default: 4e9.
construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.
- class mmaction.models.common.SubBatchNorm3D(num_features, **cfg)[source]¶
Sub BatchNorm3d splits the batch dimension into N splits and runs BN on each of them separately, so that the stats are computed independently on each subset of examples (1/N of the batch). During evaluation, it aggregates the stats from all splits into one BN.
- Parameters
num_features (int) – Dimensions of BatchNorm.
- aggregate_stats()[source]¶
Synchronize running_mean and running_var to self.bn.
Call this before eval, then call model.eval(); in eval mode, the forward function will call self.bn instead of self.split_bn. By this time, the running_mean and running_var of self.bn have been obtained from self.split_bn.
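A sketch of that calling pattern (the helper name is hypothetical; the import path is assumed):

import torch.nn as nn
from mmaction.models.common import SubBatchNorm3D

def aggregate_sub_bn_stats(module: nn.Module) -> None:
    # Recursively synchronize stats of every SubBatchNorm3D before eval.
    for m in module.modules():
        if isinstance(m, SubBatchNorm3D):
            m.aggregate_stats()

# usage: aggregate split-BN stats, then switch the model to eval mode
# aggregate_sub_bn_stats(model); model.eval()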
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.common.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]¶
Temporal Adaptive Module (TAM) for TANet.
This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
- Parameters
in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – `alpha` in the paper, the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.
adaptive_kernel_size (int) – `K` in the paper, the size of the adaptive kernel in the global branch. Default: 3.
beta (int) – `beta` in the paper, set to control the model complexity in the local branch. Default: 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
init_std (float) – Std value for initiation of nn.Linear. Default: 0.001.
backbones¶
- class mmaction.models.backbones.AGCN(in_channels, graph_cfg, data_bn=True, pretrained=None, **kwargs)[source]¶
Backbone of Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition.
- Parameters
in_channels (int) – Number of channels in the input data.
graph_cfg (dict) – The arguments for building the graph.
data_bn (bool) – If ‘True’, adds data normalization to the inputs. Default: True.
pretrained (str | None) – Name of pretrained model.
**kwargs (optional) – Other parameters for graph convolution units.
- Shape:
Input: \((N, in_channels, T_{in}, V_{in}, M_{in})\)
Output: \((N, num_class)\)
where \(N\) is the batch size, \(T_{in}\) is the length of the input sequence, \(V_{in}\) is the number of graph nodes, and \(M_{in}\) is the number of instances in a frame.
- class mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, out_dim=8192, dropout_ratio=0.5, init_std=0.005)[source]¶
C3D backbone.
- Parameters
pretrained (str | None) – Name of pretrained model.
style (str) – 'pytorch' or 'caffe'. If set to 'pytorch', the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: 'pytorch'.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Required keys are type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
out_dim (int) – The dimension of the last layer feature (after flatten). Depends on the input shape. Default: 8192.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for Initiation of fc layers. Default: 0.005.
- class mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7), frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]¶
MobileNetV2 backbone.
- Parameters
pretrained (str | None) – Name of pretrained model. Default: None.
widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).
frozen_stages (int) – Stages to be frozen (all param fixed). Note that the last stage in MobileNetV2 is conv2. Default: -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.
norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’).
act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- make_layer(out_channels, num_blocks, stride, expand_ratio)[source]¶
Stack InvertedResidual blocks to build a layer for MobileNetV2.
- Parameters
out_channels (int) – out_channels of block.
num_blocks (int) – number of blocks.
stride (int) – stride of the first block. Default: 1
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.
- train(mode=True)[source]¶
Sets the module in training mode.
This has any effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters
mode (bool) – Whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns
self
- Return type
Module
- class mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]¶
MobileNetV2 backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in reset layers. Default: True.
shift_div (int) – Number of div for shift. Default: 8.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.
- class mmaction.models.backbones.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]¶
ResNet backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
dilations (Sequence[int]) – Dilation of each stage.
style (str) – 'pytorch' or 'caffe'. If set to 'pytorch', the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: 'pytorch'.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
conv_cfg (dict) – Config for conv layers. Default: dict(type='Conv').
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
partial_bn (bool) – Whether to use partial bn. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
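As elsewhere in mmaction, the backbone is usually built from a config dict; a minimal sketch (builder import path assumed):

import torch
from mmaction.models import build_backbone

backbone = build_backbone(dict(type='ResNet', depth=50, norm_eval=False))
backbone.init_weights()
imgs = torch.randn(2, 3, 224, 224)  # (N, C, H, W) frames
feat = backbone(imgs)  # feature map of the last stage, per out_indices=(3, )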
- class mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs)[source]¶
ResNet (2+1)d backbone.
This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition
- class mmaction.models.backbones.ResNet3d(depth, pretrained, stage_blocks=None, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(3, 7, 7), conv1_stride_s=2, conv1_stride_t=1, pool1_stride_s=2, pool1_stride_t=1, with_pool1=True, with_pool2=True, style='pytorch', frozen_stages=- 1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]¶
ResNet 3d backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
stage_blocks (tuple | None) – Set number of stages for each res layer. Default: None.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
in_channels (int) – Channel num of input features. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_s (int) – Spatial stride of the first conv layer. Default: 2.
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_s (int) – Spatial stride of the first pooling layer. Default: 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
with_pool2 (bool) – Whether to use pool2. Default: True.
style (str) – 'pytorch' or 'caffe'. If set to 'pytorch', the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: 'pytorch'.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
inflate (Sequence[int]) – Inflate dims of each block. Default: (1, 1, 1, 1).
inflate_style (str) – '3x1x1' or '3x3x3', which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x1x1'.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
- static make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶
Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
style (str) – 'pytorch' or 'caffe'. If set to 'pytorch', the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: 'pytorch'.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.
inflate_style (str) – '3x1x1' or '3x3x3', which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x1x1'.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: 0.
non_local_cfg (dict) – Config for non-local module. Default: dict().
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
- class mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]¶
ResNet backbone for CSN.
- Parameters
depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).
inflate_style (str) – 3x1x1 or 3x3x3. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.
bottleneck_mode (str) – Determine which way to factorize a 3D bottleneck block using channel-separated convolutional networks. If set to 'ip', it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., an Interaction-preserved channel-separated bottleneck block. If set to 'ir', it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from the preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., an Interaction-reduced channel-separated bottleneck block. Default: 'ir'.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- class mmaction.models.backbones.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶
ResNet 3d Layer.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
stage (int) – The index of Resnet stage. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
spatial_stride (int) – The 1st res block’s spatial stride. Default 2.
temporal_stride (int) – The 1st res block’s temporal stride. Default 1.
dilation (int) – The dilation. Default: 1.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
all_frozen (bool) – Frozen all modules in the layer. Default: False.
inflate (int) – Inflate Dims of each block. Default: 1.
inflate_style (str) – '3x1x1' or '3x3x3', which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: '3x1x1'.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- class mmaction.models.backbones.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]¶
Slowfast backbone.
This module is proposed in SlowFast Networks for Video Recognition
- Parameters
pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline with resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Default: 8.
speed_ratio (int) – Speed ratio indicating the ratio between the time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.
channel_ratio (int) – Reduce the channel number of the fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.
slow_pathway (dict) – Configuration of the slow branch. Should contain necessary arguments for building the specific type of pathway, and: type (str): type of backbone the pathway is based on; lateral (bool): determine whether to build lateral connections for the pathway. Default: dict(type='ResNetPathway', lateral=True, depth=50, pretrained=None, conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1)).
fast_pathway (dict) – Configuration of the fast branch, similar to slow_pathway. Default: dict(type='ResNetPathway', lateral=False, depth=50, pretrained=None, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1).
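A forward-pass sketch (builder import path assumed; the output is expected to be a (slow, fast) feature pair consumed by a SlowFast head downstream):

import torch
from mmaction.models import build_backbone

model = build_backbone(dict(type='ResNet3dSlowFast', pretrained=None))
model.init_weights()
clip = torch.randn(1, 3, 32, 224, 224)  # (N, C, T, H, W)
feats = model(clip)  # tuple of slow-pathway and fast-pathway features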
- class mmaction.models.backbones.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]¶
SlowOnly backbone based on ResNet3dPathway.
- Parameters
*args (arguments) – Arguments same as ResNet3dPathway.
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
inflate (Sequence[int]) – Inflate dims of each block. Default: (0, 0, 1, 1).
**kwargs (keyword arguments) – Keyword arguments for ResNet3dPathway.
- class mmaction.models.backbones.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=- 1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]¶
ResNet 2d audio backbone. Reference:
- Parameters
depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
in_channels (int) – Channel num of input features. Default: 1.
base_channels (int) – Channel num of stem output features. Default: 32.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.
conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.
factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict) – Config for norm layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
- static make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]¶
Build residual layer for ResNetAudio.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Stride of the first block. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- class mmaction.models.backbones.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]¶
ResNet backbone for TIN.
- Parameters
depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.
- class mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]¶
ResNet backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in reset layers. Default: True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
shift_div (int) – Number of div for shift. Default: 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Default: False.
**kwargs (keyword arguments, optional) – Arguments for ResNet.
- class mmaction.models.backbones.STGCN(in_channels, graph_cfg, edge_importance_weighting=True, data_bn=True, pretrained=None, **kwargs)[source]¶
Backbone of Spatial temporal graph convolutional networks.
- Parameters
in_channels (int) – Number of channels in the input data.
graph_cfg (dict) – The arguments for building the graph.
edge_importance_weighting (bool) – If True, adds a learnable importance weighting to the edges of the graph. Default: True.
data_bn (bool) – If 'True', adds data normalization to the inputs. Default: True.
pretrained (str | None) – Name of pretrained model.
**kwargs (optional) – Other parameters for graph convolution units.
- Shape:
Input: \((N, in_channels, T_{in}, V_{in}, M_{in})\)
Output: \((N, num_class)\)
where \(N\) is the batch size, \(T_{in}\) is the length of the input sequence, \(V_{in}\) is the number of graph nodes, and \(M_{in}\) is the number of instances in a frame.
- class mmaction.models.backbones.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]¶
Temporal Adaptive Network (TANet) backbone.
This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().
**kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.
- class mmaction.models.backbones.TimeSformer(num_frames, img_size, patch_size, pretrained=None, embed_dims=768, num_heads=12, num_transformer_layers=12, in_channels=3, dropout_ratio=0.0, transformer_layers=None, attention_type='divided_space_time', norm_cfg={'eps': 1e-06, 'type': 'LN'}, **kwargs)[source]¶
TimeSformer. A PyTorch impl of Is Space-Time Attention All You Need for Video Understanding?
- Parameters
num_frames (int) – Number of frames in the video.
img_size (int | tuple) – Size of input image.
patch_size (int) – Size of one patch.
pretrained (str | None) – Name of pretrained model. Default: None.
embed_dims (int) – Dimensions of embedding. Defaults to 768.
num_heads (int) – Number of parallel attention heads in TransformerCoder. Defaults to 12.
num_transformer_layers (int) – Number of transformer layers. Defaults to 12.
in_channels (int) – Channel num of input features. Defaults to 3.
dropout_ratio (float) – Probability of dropout layer. Defaults to 0.0.
transformer_layers (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Config of transformer layers in TransformerCoder. If it is an mmcv.ConfigDict, it will be repeated num_transformer_layers times into a list. Defaults to None.
attention_type (str) – Type of attentions in TransformerCoder. Choices are ‘divided_space_time’, ‘space_only’ and ‘joint_space_time’. Defaults to ‘divided_space_time’.
norm_cfg (dict) – Config for norm layers. Defaults to dict(type=’LN’, eps=1e-6).
- class mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=- 1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶
X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.
- Parameters
gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
- forward(x)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
- make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶
Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals to base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals to base_channel * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
heads¶
- class mmaction.models.heads.ACRNHead(in_channels, out_channels, stride=1, num_convs=1, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, **kwargs)[source]¶
ACRN Head: Tile + 1x1 convolution + 3x3 convolution.
This module is proposed in Actor-Centric Relation Network
- Parameters
in_channels (int) – The input channel.
out_channels (int) – The output channel.
stride (int) – The spatial stride.
num_convs (int) – The number of 3x3 convolutions in ACRNHead.
conv_cfg (dict) – Config for conv layers. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
kwargs (dict) – Other new arguments, to be compatible with MMDet update.
- forward(x, feat, rois, **kwargs)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The extracted RoI feature.
feat (torch.Tensor) – The context feature.
rois (torch.Tensor) – The regions of interest.
- Returns
The RoI features that have interacted with the context feature.
- Return type
torch.Tensor
- class mmaction.models.heads.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶
Classification head for TSN on audio.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, focal_gamma=0.0, focal_alpha=1.0, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]¶
Simplest RoI head, with only two fc layers for classification and regression respectively.
- Parameters
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
in_channels (int) – The number of input channels. Default: 2048.
focal_alpha (float) – The hyper-parameter alpha for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 1.
focal_gamma (float) – The hyper-parameter gamma for Focal Loss. When alpha == 1 and gamma == 0, Focal Loss degenerates to BCELossWithLogits. Default: 0.
num_classes (int) – The number of classes. Default: 81.
dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.
dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.
topk (int or tuple[int]) – Parameter for evaluating Top-K accuracy. Default: (3, 5)
multilabel (bool) – Whether used for a multilabel task. Default: True.
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- static get_recall_prec(pred_vec, target_vec)[source]¶
Computes the Recall/Precision for both multi-label and single label scenarios.
Note that the computation calculates the micro average.
Note that in both cases, the concept of correct/incorrect is the same.
- Parameters
pred_vec (tensor[N x C]) – Each element is either 0 or 1.
target_vec (tensor[N x C]) – Each element is either 0 or 1. For single label it is expected that only one element is on (1), although this is not enforced.
- class mmaction.models.heads.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0, topk=(1, 5))[source]¶
Base class for head.
All heads should subclass it. All subclasses should overwrite:
- Methods: init_weights, initializing weights in some modules.
- Methods: forward, supporting forward for both training and testing.
, supporting to forward both for training and testing.- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.
topk (int | tuple) – Top-k accuracy. Default: (1, 5).
- abstract init_weights()[source]¶
Initiate the parameters either from existing checkpoint or from scratch.
- loss(cls_score, labels, **kwargs)[source]¶
Calculate the loss given output cls_score and target labels.
- Parameters
cls_score (torch.Tensor) – The output of the model.
labels (torch.Tensor) – The target output of the model.
- Returns
A dict containing the field 'loss_cls' (mandatory) and 'topk_acc' (optional).
- Return type
dict
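A minimal subclass sketch (DummyHead is illustrative, not part of mmaction; the import path is assumed):

import torch.nn as nn
from mmaction.models.heads import BaseHead

class DummyHead(BaseHead):
    # Toy head: a single fc layer over pooled features.
    def __init__(self, num_classes, in_channels, **kwargs):
        super().__init__(num_classes, in_channels, **kwargs)
        self.fc = nn.Linear(in_channels, num_classes)

    def init_weights(self):
        nn.init.normal_(self.fc.weight, std=0.01)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        # x: (N, in_channels) pooled feature -> (N, num_classes) scores
        return self.fc(x)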
- class mmaction.models.heads.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]¶
Feature Bank Operator Head.
Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.
- Parameters
lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.
fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of fbo is also in the config dict and supported fbo type is fbo_dict.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
- forward(x, rois, img_metas, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.heads.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]¶
Classification head for I3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max', pretrained=None)[source]¶
Long-Term Feature Bank Infer Head.
This head is used to derive and save the LFB without affecting the input.
- Parameters
lfb_prefix_path (str) – The prefix path to store the lfb.
dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.
use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
- forward(x, rois, img_metas, **kwargs)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmaction.models.heads.SSNHead(dropout_ratio=0.8, in_channels=1024, num_classes=20, consensus={'num_seg': (2, 5, 2), 'standalong_classifier': True, 'stpp_cfg': (1, 1, 1), 'type': 'STPPTrain'}, use_regression=True, init_std=0.001)[source]¶
The classification head for SSN.
- Parameters
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
in_channels (int) – Number of channels for input data. Default: 1024.
num_classes (int) – Number of classes to be classified. Default: 20.
consensus (dict) – Config of segmental consensus.
use_regression (bool) – Whether to perform regression or not. Default: True.
init_std (float) – Std value for Initiation. Default: 0.001.
- class mmaction.models.heads.STGCNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', num_person=2, init_std=0.01, **kwargs)[source]¶
The classification head for STGCN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
num_person (int) – Number of persons. Default: 2.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]¶
The classification head for SlowFast.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.TPNHead(*args, **kwargs)[source]¶
Class head for TPN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.
- forward(x, num_segs=None, fcn_test=False)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int | None) – Number of segments into which a video is divided. Default: None.
fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
- class mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]¶
Class head for TRN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.
hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x, num_segs)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Useless in TRNHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is automatically generated in Recognizer forward phase and useless in TRN models. The self.num_segments we need is a hyper parameter to build TRN models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
- class mmaction.models.heads.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]¶
Class head for TSM.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.001.
is_shift (bool) – Indicating whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- forward(x, num_segs)[source]¶
Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Useless in TSMHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is automatically generated in Recognizer forward phase and useless in TSM models. The self.num_segments we need is a hyper parameter to build TSM models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
- class mmaction.models.heads.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶
Class head for TSN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.TimeSformerHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, init_std=0.02, **kwargs)[source]¶
Classification head for TimeSformer.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Defaults to dict(type=’CrossEntropyLoss’).
init_std (float) – Std value for Initiation. Defaults to 0.02.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
- class mmaction.models.heads.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]¶
Classification head for X3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for Initiation. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.
necks¶
- class mmaction.models.necks.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]¶
TPN neck.
This module is proposed in Temporal Pyramid Network for Action Recognition
- Parameters
in_channels (tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.
temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.
upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in nn.Upsample. Default: None.
downsample_cfg (dict | None) – Config for downsample layers. Default: None.
level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.
aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.
flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.
- forward(x, target=None)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
losses¶
- class mmaction.models.losses.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]¶
Binary Cross Entropy Loss with logits.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
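A minimal sketch of calling BCELossWithLogits on dummy multi-label targets (assuming mmaction2 and torch are installed; shapes are illustrative):
```
import torch
from mmaction.models.losses import BCELossWithLogits

loss_fn = BCELossWithLogits()
cls_score = torch.randn(4, 5)                # raw logits, no sigmoid applied
label = torch.randint(0, 2, (4, 5)).float()  # binary multi-label targets
print(loss_fn(cls_score, label))             # scalar loss, scaled by loss_weight
```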
- class mmaction.models.losses.BMNLoss[source]¶
BMN Loss.
From the paper https://arxiv.org/abs/1907.09702; code: https://github.com/JJBOY/BMN-Boundary-Matching-Network. It calculates the loss for the BMN model, which is a weighted sum of:
1) temporal evaluation loss, based on the confidence scores of start and end positions;
2) proposal evaluation regression loss, based on the confidence scores of candidate proposals;
3) proposal evaluation classification loss, based on the classification results of candidate proposals.
- forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶
Calculate Boundary Matching Network Loss.
- Parameters
pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.
- Returns
(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.
- Return type
tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
- static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶
Calculate Proposal Evaluation Module Classification Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5
- Returns
Proposal evaluation classification loss.
- Return type
torch.Tensor
- static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶
Calculate Proposal Evaluation Module Regression Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.
- Returns
Proposal evaluation regression loss.
- Return type
torch.Tensor
- static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶
Calculate Temporal Evaluation Module Loss.
This function calculates the binary logistic regression loss for start and end respectively, and returns the sum of the two losses.
- Parameters
pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
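A minimal sketch of calling the static tem_loss with dummy tensors (shapes are illustrative; scores are assumed to lie in (0, 1)):
```
import torch
from mmaction.models.losses import BMNLoss

# (batch, temporal_scale) confidence scores for start/end positions.
pred_start, pred_end = torch.rand(2, 100), torch.rand(2, 100)
gt_start, gt_end = torch.rand(2, 100), torch.rand(2, 100)
print(BMNLoss.tem_loss(pred_start, pred_end, gt_start, gt_end))
```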
- class mmaction.models.losses.BaseWeightedLoss(loss_weight=1.0)[source]¶
Base class for loss.
All subclasses should overwrite the _forward() method, which returns the normal loss without loss weights.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
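A minimal sketch of subclassing BaseWeightedLoss: the subclass implements _forward() to return the unweighted loss, and the base class scales it by loss_weight. The L1Loss subclass here is hypothetical, purely for illustration:
```
import torch
import torch.nn.functional as F
from mmaction.models.losses import BaseWeightedLoss

class L1Loss(BaseWeightedLoss):  # hypothetical subclass for illustration
    def _forward(self, pred, target):
        # Return the unweighted loss; the base class multiplies by loss_weight.
        return F.l1_loss(pred, target)

loss_fn = L1Loss(loss_weight=0.5)
print(loss_fn(torch.rand(4, 10), torch.rand(4, 10)))  # l1 loss * 0.5
```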
- class mmaction.models.losses.BinaryLogisticRegressionLoss[source]¶
Binary Logistic Regression Loss.
It will calculate binary logistic regression loss given reg_score and label.
- forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶
Calculate Binary Logistic Regression Loss.
- Parameters
reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
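A minimal sketch of calling BinaryLogisticRegressionLoss on dummy scores (scores are assumed to lie in (0, 1)):
```
import torch
from mmaction.models.losses import BinaryLogisticRegressionLoss

loss_fn = BinaryLogisticRegressionLoss()
reg_score = torch.rand(16)              # predicted scores in (0, 1)
label = (torch.rand(16) > 0.5).float()  # binary groundtruth labels
print(loss_fn(reg_score, label, threshold=0.5))
```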
- class mmaction.models.losses.CBFocalLoss(loss_weight=1.0, samples_per_cls=[], beta=0.9999, gamma=2.0)[source]¶
Class Balanced Focal Loss. Adapted from https://github.com/abhinanda-punnakkal/BABEL/. This loss is used in the skeleton-based action recognition baseline for BABEL.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
samples_per_cls (list[int]) – The number of samples per class. Default: [].
beta (float) – Hyperparameter that controls the per class loss weight. Default: 0.9999.
gamma (float) – Hyperparameter of the focal loss. Default: 2.0.
- class mmaction.models.losses.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]¶
Cross Entropy Loss.
Support two kinds of labels and their corresponding loss types. It’s worth mentioning that the loss type will be detected by the shape of cls_score and label.
1) Hard label: this label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): this label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft labels are supported.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
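A minimal sketch showing both label kinds: a hard-label call (label shape = cls_score shape minus the class dimension) and a soft-label call (label shape = cls_score shape):
```
import torch
from mmaction.models.losses import CrossEntropyLoss

loss_fn = CrossEntropyLoss()
cls_score = torch.randn(4, 10)           # (batch, num_classes)

hard_label = torch.randint(0, 10, (4,))  # integers in [0, num_classes - 1]
print(loss_fn(cls_score, hard_label))

soft_label = torch.softmax(torch.randn(4, 10), dim=1)  # rows sum to 1
print(loss_fn(cls_score, soft_label))
```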
- class mmaction.models.losses.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶
Calculate the BCELoss for HVU.
- Parameters
categories (tuple[str]) – Names of tag categories; tags are organized in this order. Default: (‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’).
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories; they apply only if loss_type == ‘individual’. The loss weights will be normalized so that they sum to 1, so you can give any positive number as a loss weight. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type we calculate, we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, we will not calculate loss for these missing categories; otherwise, these missing categories are treated as negative samples. Default: False.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
- class mmaction.models.losses.NLLLoss(loss_weight=1.0)[source]¶
NLL Loss.
It will calculate NLL loss given cls_score and label.
- class mmaction.models.losses.OHEMHingeLoss(*args, **kwargs)[source]¶
This class is the core implementation of the completeness loss in the paper.
It computes the class-wise hinge loss and performs online hard example mining (OHEM).
- static backward(ctx, grad_output)[source]¶
Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as forward() returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs the gradient computed w.r.t. the output.
- static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶
Calculate OHEM hinge loss.
- Parameters
pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.
- Returns
Returned class-wise hinge loss.
- Return type
torch.Tensor
- class mmaction.models.losses.SSNLoss[source]¶
- static activity_loss(activity_score, labels, activity_indexer)[source]¶
Activity Loss.
It will calculate activity loss given activity_score and label.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
labels (torch.Tensor) – Groundtruth class label.
activity_indexer (torch.Tensor) – Index slices of proposals.
- Returns
Returned cross entropy loss.
- Return type
torch.Tensor
- static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶
Classwise Regression Loss.
It will calculate classwise_regression loss given class_reg_pred and targets.
- Parameters
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor) – Index slices of positive proposals.
- Returns
Returned class-wise regression loss.
- Return type
torch.Tensor
- static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶
Completeness Loss.
It will calculate completeness loss given completeness_score and label.
- Parameters
completeness_score (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.
positive_per_video (int) – Number of positive proposals sampled per video.
incomplete_per_video (int) – Number of incomplete proposals sampled per video.
ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.
- Returns
Returned class-wise completeness loss.
- Return type
torch.Tensor
- forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶
Calculate SSN loss.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.
- Returns
(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.
- Return type
dict([torch.Tensor, torch.Tensor, torch.Tensor])
mmaction.datasets¶
datasets¶
- class mmaction.datasets.AVADataset(ann_file, exclude_file, pipeline, label_file=None, filename_tmpl='img_{:05}.jpg', start_index=0, proposal_file=None, person_det_score_thr=0.9, num_classes=81, custom_classes=None, data_prefix=None, test_mode=False, modality='RGB', num_max_proposals=1000, timestamp_start=900, timestamp_end=1800, fps=30)[source]¶
AVA dataset for spatial temporal detection.
Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.
This dataset can load information from the following files:
ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt / ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl
Particularly, the proposal_file is a pickle file whose keys are img_key (in the format {video_id},{timestamp}). Example of a pickle file:
{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011, 0.157, 0.655, 0.983, 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054, 0.088, 0.91, 0.998, 0.068273],
               [0.016, 0.161, 0.519, 0.974, 0.984025],
               [0.493, 0.283, 0.981, 0.984, 0.983621]]),
    ...
}
- Parameters
ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.
exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.
pipeline (list[dict | callable]) – A sequence of data transforms.
label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Default: None.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 0.
proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Default: None.
person_det_score_thr (float) – The threshold of person detection scores; bboxes with scores above the threshold will be used. Note that 0 <= person_det_score_thr <= 1. If no proposal has a detection score larger than the threshold, the one with the largest detection score will be used. Default: 0.9.
num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes; another 1-dim is added for potential usage.)
custom_classes (list[int]) – A subset of class ids from the origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.
data_prefix (str) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
num_max_proposals (int) – Max proposals number to store. Default: 1000.
timestamp_start (int) – The start point of included timestamps. The default value is referred from the official website. Default: 900.
timestamp_end (int) – The end point of included timestamps. The default value is referred from the official website. Default: 1800.
fps (int) – Overrides the default FPS for the dataset. Default: 30.
- evaluate(results, metrics=('mAP'), metric_options=None, logger=None)[source]¶
Evaluate the prediction results and report mAP.
- class mmaction.datasets.ActivityNetDataset(ann_file, pipeline, data_prefix=None, test_mode=False)[source]¶
ActivityNet dataset for temporal action localization.
The dataset loads raw features and apply specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a json file with multiple objects; each object’s key is the name of a video, and its value contains the total frames of the video, the total seconds of the video, the annotations of the video, the feature frames (frames covered by features) of the video, fps and rfps. Example of an annotation file:
{ "v_--1DO2V4K74": { "duration_second": 211.53, "duration_frame": 6337, "annotations": [ { "segment": [ 30.025882995319815, 205.2318595943838 ], "label": "Rock climbing" } ], "feature_frame": 6336, "fps": 30.0, "rfps": 29.9579255898 }, "v_--6bJUbfpnQ": { "duration_second": 26.75, "duration_frame": 647, "annotations": [ { "segment": [ 2.578755070202808, 24.914101404056165 ], "label": "Drinking beer" } ], "feature_frame": 624, "fps": 24.0, "rfps": 24.1869158879 }, ... }
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
- dump_results(results, out, output_format, version='VERSION 1.3')[source]¶
Dump data to json/csv files.
- evaluate(results, metrics='AR@AN', metric_options={'AR@AN': {'max_avg_proposals': 100, 'temporal_iou_thresholds': array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])}}, logger=None, **deprecated_kwargs)[source]¶
Evaluation in feature dataset.
- Parameters
results (list[dict]) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘AR@AN’.
metric_options (dict) – Dict for metric options. Options are max_avg_proposals and temporal_iou_thresholds for AR@AN. Default: {'AR@AN': dict(max_avg_proposals=100, temporal_iou_thresholds=np.linspace(0.5, 0.95, 10))}.
logger (logging.Logger | None) – Training logger. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results for evaluation metrics.
- Return type
dict
- static proposals2json(results, show_progress=False)[source]¶
Convert all proposals to a final dict(json) format.
- Parameters
results (list[dict]) – All proposals.
show_progress (bool) – Whether to show the progress bar. Defaults: False.
- Returns
The final result dict. E.g.
dict(video-1=[dict(segment=[1.1, 2.0], score=0.9), dict(segment=[50.1, 129.3], score=0.6)])
- Return type
dict
- class mmaction.datasets.AudioDataset(ann_file, pipeline, suffix='.wav', **kwargs)[source]¶
Audio dataset for video recognition. Extracts the audio feature on-the-fly. Annotation file can be that of the rawframe dataset, or:
some/directory-1.wav 163 1
some/directory-2.wav 122 1
some/directory-3.wav 258 2
some/directory-4.wav 234 2
some/directory-5.wav 295 3
some/directory-6.wav 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
suffix (str) – The suffix of the audio file. Default: ‘.wav’.
kwargs (dict) – Other keyword args for BaseDataset.
- class mmaction.datasets.AudioFeatureDataset(ann_file, pipeline, suffix='.npy', **kwargs)[source]¶
Audio feature dataset for video recognition. Reads the features extracted off-line. Annotation file can be that of the rawframe dataset, or:
some/directory-1.npy 163 1
some/directory-2.npy 122 1
some/directory-3.npy 258 2
some/directory-4.npy 234 2
some/directory-5.npy 295 3
some/directory-6.npy 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
suffix (str) – The suffix of the audio feature file. Default: ‘.npy’.
kwargs (dict) – Other keyword args for BaseDataset.
- class mmaction.datasets.AudioVisualDataset(ann_file, pipeline, audio_prefix, **kwargs)[source]¶
Dataset that reads both audio and visual data, supporting both rawframes and videos. The annotation file is the same as that of the rawframe dataset, such as:
some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
audio_prefix (str) – Directory of the audio files.
kwargs (dict) – Other keyword args for RawframeDataset. video_prefix is also allowed if pipeline is designed for videos.
- class mmaction.datasets.BaseDataset(ann_file, pipeline, data_prefix=None, test_mode=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=0, dynamic_length=False)[source]¶
Base class for datasets.
All datasets to process video should subclass it. All subclasses should overwrite:
- Methods: load_annotations, supporting to load information from an annotation file.
- Methods: prepare_train_frames, providing train data.
- Methods: prepare_test_frames, providing test data.
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
multi_class (bool) – Determines whether the dataset is a multi-class dataset. Default: False.
num_classes (int | None) – Number of classes of the dataset, used in multi-class datasets. Default: None.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 1.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’, ‘Audio’. Default: ‘RGB’.
sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.
power (float) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: 0.
dynamic_length (bool) – If the dataset length is dynamic (used by ClassSpecificDistributedSampler). Default: False.
- evaluate(results, metrics='top_k_accuracy', metric_options={'top_k_accuracy': {'topk': (1, 5)}}, logger=None, **deprecated_kwargs)[source]¶
Perform evaluation for common datasets.
- Parameters
results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘top_k_accuracy’.
metric_options (dict) – Dict for metric options. Options are topk for top_k_accuracy. Default: dict(top_k_accuracy=dict(topk=(1, 5))).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results dict.
- Return type
dict
- class mmaction.datasets.ConcatDataset(datasets, test_mode=False)[source]¶
A wrapper of concatenated dataset.
The length of concatenated dataset will be the sum of lengths of all datasets. This is useful when you want to train a model with multiple data sources.
- Parameters
datasets (list[dict]) – The configs of the datasets.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
- class mmaction.datasets.CutmixBlending(num_classes, alpha=0.2)[source]¶
Implementing Cutmix in a mini-batch.
This module is proposed in CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Code Reference https://github.com/clovaai/CutMix-PyTorch
- Parameters
num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.
- class mmaction.datasets.HVUDataset(ann_file, pipeline, tag_categories, tag_category_nums, filename_tmpl=None, **kwargs)[source]¶
HVU dataset, which supports the recognition tags of multiple categories. It accepts both video annotation files and rawframe annotation files.
The dataset loads videos or raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a json file with multiple dictionaries, and each dictionary indicates a sample video with the filename and tags, the tags are organized as different categories. Example of a video dictionary:
{
    'filename': 'gD_G1b0wV5I_001015_001035.mp4',
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}
Example of a rawframe dictionary:
{
    'frame_dir': 'gD_G1b0wV5I_001015_001035',
    'total_frames': 61,
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}
- Parameters
ann_file (str) – Path to the annotation file, should be a json file.
pipeline (list[dict | callable]) – A sequence of data transforms.
tag_categories (list[str]) – List of category names of tags.
tag_category_nums (list[int]) – List of number of tags in each category.
filename_tmpl (str | None) – Template for each filename. If set to None, video dataset is used. Default: None.
**kwargs – Keyword arguments for BaseDataset.
- evaluate(results, metrics='mean_average_precision', metric_options=None, logger=None)[source]¶
Evaluation in HVU Video Dataset. We only support evaluating mAP for each tag category. Since some tag categories are missing for some videos, we cannot evaluate mAP for all tags.
- Parameters
results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘mean_average_precision’.
metric_options (dict | None) – Dict for metric options. Default: None.
logger (logging.Logger | None) – Logger for recording. Default: None.
- Returns
Evaluation results dict.
- Return type
dict
- class mmaction.datasets.ImageDataset(ann_file, pipeline, **kwargs)[source]¶
Image dataset for action recognition, used in the Project OmniSource.
The dataset loads an image list and applies specified transforms to return a dict containing the image tensors and other information.
For the ImageDataset, the ann_file is a text file with multiple lines, and each line indicates the image path and the image label, which are split with a whitespace. Example of an annotation file:
path/to/image1.jpg 1
path/to/image2.jpg 1
path/to/image3.jpg 2
path/to/image4.jpg 2
path/to/image5.jpg 3
path/to/image6.jpg 3
Example of a multi-class annotation file:
path/to/image1.jpg 1 3 5
path/to/image2.jpg 1 2
path/to/image3.jpg 2
path/to/image4.jpg 2 4 6 8
path/to/image5.jpg 3
path/to/image6.jpg 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
**kwargs – Keyword arguments for
BaseDataset
.
- class mmaction.datasets.MixupBlending(num_classes, alpha=0.2)[source]¶
Implementing Mixup in a mini-batch.
This module is proposed in mixup: Beyond Empirical Risk Minimization. Code Reference https://github.com/open-mmlab/mmclassification/blob/master/mmcls/models/utils/mixup.py
- Parameters
num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.
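A minimal sketch of applying batch blending to dummy clips and hard labels (CutmixBlending shares the same call interface; shapes here are illustrative):
```
import torch
from mmaction.datasets import MixupBlending

blending = MixupBlending(num_classes=10, alpha=0.2)
imgs = torch.rand(4, 3, 8, 224, 224)  # a mini-batch of clip tensors
label = torch.randint(0, 10, (4,))    # hard labels
mixed_imgs, mixed_label = blending(imgs, label)
print(mixed_label.shape)  # torch.Size([4, 10]): blended one-hot distributions
```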
- class mmaction.datasets.PoseDataset(ann_file, pipeline, split=None, valid_ratio=None, box_thr=None, class_prob=None, **kwargs)[source]¶
Pose dataset for action recognition.
The dataset loads pose data and applies specified transforms to return a dict containing pose information.
The ann_file is a pickle file containing a list of annotations; the fields of an annotation include frame_dir (video_id), total_frames, label, kp, kpscore.
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
split (str | None) – The dataset split used. Only applicable to UCF or HMDB. Allowed choices are ‘train1’, ‘test1’, ‘train2’, ‘test2’, ‘train3’, ‘test3’. Default: None.
valid_ratio (float | None) – The valid_ratio for videos in KineticsPose. For a video with n frames, it is a valid training sample only if n * valid_ratio frames have human pose. None means not applicable (only applicable to Kinetics Pose). Default: None.
box_thr (str | None) – The threshold for human proposals. Only boxes with a confidence score larger than box_thr are kept. None means not applicable (only applicable to Kinetics Pose [ours]). Allowed choices are ‘0.5’, ‘0.6’, ‘0.7’, ‘0.8’, ‘0.9’. Default: None.
class_prob (dict | None) – The per class sampling probability. If not None, it will override the class_prob calculated in BaseDataset.__init__(). Default: None.
**kwargs – Keyword arguments for
BaseDataset
.
- class mmaction.datasets.RawVideoDataset(ann_file, pipeline, clipname_tmpl='part_{}.mp4', sampling_strategy='positive', **kwargs)[source]¶
RawVideo dataset for action recognition, used in the Project OmniSource.
The dataset loads clips of raw videos and applies specified transforms to return a dict containing the frame tensors and other information. Note that for this dataset, multi_class should be False.
The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath (without suffix), label, number of clips and index of positive clips (starting from 0), which are split with a whitespace. Raw videos should be first trimmed into 10 second clips, organized in the following format:
some/path/D32_1gwq35E/part_0.mp4
some/path/D32_1gwq35E/part_1.mp4
......
some/path/D32_1gwq35E/part_n.mp4
Example of an annotation file:
some/path/D32_1gwq35E 66 10 0 1 2
some/path/-G-5CJ0JkKY 254 5 3 4
some/path/T4h1bvOd9DA 33 1 0
some/path/4uZ27ivBl00 341 2 0 1
some/path/0LfESFkfBSw 186 234 7 9 11
some/path/-YIsNpBEx6c 169 100 9 10 11
The first line indicates that the raw video some/path/D32_1gwq35E has action label 66 and consists of 10 clips (from part_0.mp4 to part_9.mp4), of which the 1st, 2nd and 3rd clips are positive clips.
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
sampling_strategy (str) – The strategy to sample clips from raw videos. Choices are ‘random’ or ‘positive’. Default: ‘positive’.
clipname_tmpl (str) – The template of clip name in the raw video. Default: ‘part_{}.mp4’.
**kwargs – Keyword arguments for
BaseDataset
.
- class mmaction.datasets.RawframeDataset(ann_file, pipeline, data_prefix=None, test_mode=False, filename_tmpl='img_{:05}.jpg', with_offset=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=0.0, dynamic_length=False, **kwargs)[source]¶
Rawframe dataset for action recognition.
The dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines, and each line indicates the directory to the frames of a video, the total frames of the video and the label of the video, which are split with a whitespace. Example of an annotation file:
some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
Example of a multi-class annotation file:
some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3
Example of a with_offset annotation file (clips from long videos), each line indicates the directory to frames of a video, the index of the start frame, total frames of the video clip and the label of a video clip, which are split with a whitespace.
some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.
with_offset (bool) – Determines whether the offset information is in ann_file. Default: False.
multi_class (bool) – Determines whether it is a multi-class recognition dataset. Default: False.
num_classes (int | None) – Number of classes in the dataset. Default: None.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.
power (float) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: 0.
dynamic_length (bool) – If the dataset length is dynamic (used by ClassSpecificDistributedSampler). Default: False.
- class mmaction.datasets.RepeatDataset(dataset, times, test_mode=False)[source]¶
A wrapper of repeated dataset.
The length of the repeated dataset will be times larger than the original dataset. This is useful when the data loading time is long but the dataset is small. Using RepeatDataset can reduce the data loading time between epochs.
- Parameters
dataset (dict) – The config of the dataset to be repeated.
times (int) – Repeat times.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
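A sketch of how RepeatDataset is typically wrapped around a dataset config in a config file (the annotation path, data prefix and empty pipeline are placeholders):
```
data = dict(
    train=dict(
        type='RepeatDataset',
        times=10,  # each epoch iterates the underlying dataset 10 times
        dataset=dict(
            type='VideoDataset',
            ann_file='data/train_list.txt',  # placeholder path
            data_prefix='data/videos',       # placeholder path
            pipeline=[]),                    # fill in with real transforms
    ))
```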
- class mmaction.datasets.SSNDataset(ann_file, pipeline, train_cfg, test_cfg, data_prefix, test_mode=False, filename_tmpl='img_{:05d}.jpg', start_index=1, modality='RGB', video_centric=True, reg_normalize_constants=None, body_segments=5, aug_segments=(2, 2), aug_ratio=(0.5, 0.5), clip_len=1, frame_interval=1, filter_gt=True, use_regression=True, verbose=False)[source]¶
Proposal frame dataset for Structured Segment Networks.
Based on proposal information, the dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines, and each video’s information takes up several lines. This file can be a normalized file with percentages or a standard file with specific frame indexes. If the file is a normalized file, it will be converted into a standard file first.
Template information of a video in a standard file:
# index
video_id
num_frames
fps
num_gts
label, start_frame, end_frame
label, start_frame, end_frame
...
num_proposals
label, best_iou, overlap_self, start_frame, end_frame
label, best_iou, overlap_self, start_frame, end_frame
...
Example of a standard annotation file:
# 0
video_validation_0000202
5666
1
3
8 130 185
8 832 1136
8 1303 1381
5
8 0.0620 0.0620 790 5671
8 0.1656 0.1656 790 2619
8 0.0833 0.0833 3945 5671
8 0.0960 0.0960 4173 5671
8 0.0614 0.0614 3327 5671
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
train_cfg (dict) – Config for training.
test_cfg (dict) – Config for testing.
data_prefix (str) – Path to a directory where videos are held.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05d}.jpg’.
start_index (int) – Specify a start index for frames in consideration of different filename format. Default: 1.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
video_centric (bool) – Whether to sample proposals just from this video or sample proposals randomly from the entire dataset. Default: True.
reg_normalize_constants (list) – Regression target normalized constants, including mean and standard deviation of location and duration.
body_segments (int) – Number of segments in course period. Default: 5.
aug_segments (list[int]) – Number of segments in starting and ending period. Default: (2, 2).
aug_ratio (int | float | tuple[int | float]) – The ratio of the length of augmentation to that of the proposal. Default: (0.5, 0.5).
clip_len (int) – Frames of each sampled output clip. Default: 1.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
filter_gt (bool) – Whether to filter videos with no annotation during training. Default: True.
use_regression (bool) – Whether to perform regression. Default: True.
verbose (bool) – Whether to print full information or not. Default: False.
- construct_proposal_pools()[source]¶
Construct positive proposal pool, incomplete proposal pool and background proposal pool of the entire dataset.
- evaluate(results, metrics='mAP', metric_options={'mAP': {'eval_dataset': 'thumos14'}}, logger=None, **deprecated_kwargs)[source]¶
Evaluation in SSN proposal dataset.
- Parameters
results (list[dict]) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘mAP’.
metric_options (dict) – Dict for metric options. Options are eval_dataset for mAP. Default: dict(mAP=dict(eval_dataset='thumos14')).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results for evaluation metrics.
- Return type
dict
- static get_negatives(proposals, incomplete_iou_threshold, background_iou_threshold, background_coverage_threshold=0.01, incomplete_overlap_threshold=0.7)[source]¶
Get negative proposals, including incomplete proposals and background proposals.
- Parameters
proposals (list) – List of proposal instances (SSNInstance).
incomplete_iou_threshold (float) – Maximum threshold of overlap of incomplete proposals and groundtruths.
background_iou_threshold (float) – Maximum threshold of overlap of background proposals and groundtruths.
background_coverage_threshold (float) – Minimum coverage of background proposals in video duration. Default: 0.01.
incomplete_overlap_threshold (float) – Minimum percent of incomplete proposals’ own span contained in a groundtruth instance. Default: 0.7.
- Returns
(incompletes, backgrounds), where incompletes and backgrounds are lists comprised of incomplete proposal instances and background proposal instances.
- Return type
list[SSNInstance]
- static get_positives(gts, proposals, positive_threshold, with_gt=True)[source]¶
Get positive/foreground proposals.
- Parameters
gts (list) – List of groundtruth instances (SSNInstance).
proposals (list) – List of proposal instances (SSNInstance).
positive_threshold (float) – Minimum threshold of overlap of positive/foreground proposals and groundtruths.
with_gt (bool) – Whether to include groundtruth instances in positive proposals. Default: True.
- Returns
(positives), where positives is a list comprised of positive proposal instances.
- Return type
list[SSNInstance]
- class mmaction.datasets.VideoDataset(ann_file, pipeline, start_index=0, **kwargs)[source]¶
Video dataset for action recognition.
The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath and label, which are split with a whitespace. Example of an annotation file:
some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 0.
**kwargs – Keyword arguments for BaseDataset.
- mmaction.datasets.build_dataloader(dataset, videos_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, drop_last=False, pin_memory=True, persistent_workers=False, **kwargs)[source]¶
Build PyTorch DataLoader.
In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.
- Parameters
dataset (Dataset) – A PyTorch dataset.
videos_per_gpu (int) – Number of videos on each GPU, i.e., batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
num_gpus (int) – Number of GPUs. Only used in non-distributed training. Default: 1.
dist (bool) – Distributed training/test or not. Default: True.
shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.
seed (int | None) – Seed to be used. Default: None.
drop_last (bool) – Whether to drop the last incomplete batch in epoch. Default: False
pin_memory (bool) – Whether to use pin_memory in DataLoader. Default: True
persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers’ Dataset instances alive. The argument only takes effect when PyTorch >= 1.8.0. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize DataLoader.
- Returns
A PyTorch dataloader.
- Return type
DataLoader
- mmaction.datasets.build_dataset(cfg, default_args=None)[source]¶
Build a dataset from config dict.
- Parameters
cfg (dict) – Config dict. It should at least contain the key “type”.
default_args (dict | None, optional) – Default initialization arguments. Default: None.
- Returns
The constructed dataset.
- Return type
Dataset
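A minimal sketch chaining the two builders (the annotation path, data prefix and empty pipeline are placeholders):
```
from mmaction.datasets import build_dataloader, build_dataset

dataset_cfg = dict(
    type='VideoDataset',
    ann_file='data/train_list.txt',  # placeholder annotation file
    data_prefix='data/videos',       # placeholder video root
    pipeline=[])                     # fill in with real transforms
dataset = build_dataset(dataset_cfg)
data_loader = build_dataloader(
    dataset, videos_per_gpu=2, workers_per_gpu=2, dist=False, shuffle=True)
```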
pipelines¶
- class mmaction.datasets.pipelines.ArrayDecode[source]¶
Load and decode frames with given indices from a 4D array.
Required keys are “array” and “frame_inds”; added or modified keys are “imgs”, “img_shape” and “original_shape”.
- class mmaction.datasets.pipelines.AudioAmplify(ratio)[source]¶
Amplify the waveform.
Required keys are “audios”, added or modified keys are “audios”, “amplify_ratio”.
- Parameters
ratio (float) – The ratio used to amplify the audio waveform.
- class mmaction.datasets.pipelines.AudioDecode(fixed_length=32000)[source]¶
Sample the audio w.r.t. the frames selected.
- Parameters
fixed_length (int) – As the audio clip selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 32000.
Required keys are “frame_inds”, “num_clips”, “total_frames”, “length”, added or modified keys are “audios”, “audios_shape”.
- class mmaction.datasets.pipelines.AudioDecodeInit(io_backend='disk', sample_rate=16000, pad_method='zero', **kwargs)[source]¶
Using librosa to initialize the audio reader.
Required keys are “audio_path”, added or modified keys are “length”, “sample_rate”, “audios”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
sample_rate (int) – Audio sampling times per second. Default: 16000.
pad_method (str) – Padding method. Default: ‘zero’.
- class mmaction.datasets.pipelines.AudioFeatureSelector(fixed_length=128)[source]¶
Sample the audio feature w.r.t. the frames selected.
Required keys are “audios”, “frame_inds”, “num_clips”, “length”, “total_frames”, added or modified keys are “audios”, “audios_shape”.
- Parameters
fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 128.
- class mmaction.datasets.pipelines.BuildPseudoClip(clip_len)[source]¶
Build pseudo clips with one single image by repeating it n times.
Required key is “imgs”; added or modified keys are “imgs”, “num_clips”, “clip_len”.
- Parameters
clip_len (int) – Frames of the generated pseudo clips.
- class mmaction.datasets.pipelines.CenterCrop(crop_size, lazy=False)[source]¶
Crop the center area from images.
Required keys are “img_shape”, “imgs” (optional), “keypoint” (optional), added or modified keys are “imgs”, “keypoint”, “crop_bbox”, “lazy” and “img_shape”. Required keys in “lazy” is “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters
crop_size (int | tuple[int]) – (w, h) of crop size.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
- class mmaction.datasets.pipelines.Collect(keys, meta_keys=('filename', 'label', 'original_shape', 'img_shape', 'pad_shape', 'flip_direction', 'img_norm_cfg'), meta_name='img_metas', nested=False)[source]¶
Collect data from the loader relevant to the specific task.
This keeps the items in keys as they are, and collects items in meta_keys into a meta item called meta_name.
This is usually the last stage of the data loader pipeline. For example, when keys=’imgs’, meta_keys=(‘filename’, ‘label’, ‘original_shape’), meta_name=’img_metas’, the results will be a dict with keys ‘imgs’ and ‘img_metas’, where ‘img_metas’ is a DataContainer of another dict with keys ‘filename’, ‘label’, ‘original_shape’.
- Parameters
keys (Sequence[str]) – Required keys to be collected.
meta_name (str) – The name of the key that contains meta information. This key is always populated. Default: “img_metas”.
meta_keys (Sequence[str]) – Keys that are collected under meta_name. The contents of the meta_name dictionary depend on meta_keys. By default this includes:
- ”filename”: path to the image file
- ”label”: label of the image file
- ”original_shape”: original shape of the image as a tuple (h, w, c)
- ”img_shape”: shape of the image input to the network as a tuple (h, w, c). Note that images may be zero padded on the bottom/right if the batch tensor is larger than this shape.
- ”pad_shape”: image shape after padding
- ”flip_direction”: a str in (“horizontal”, “vertical”) to indicate if the image is flipped horizontally or vertically.
- ”img_norm_cfg”: a dict of normalization information: mean (per channel mean subtraction), std (per channel std divisor), to_rgb (bool indicating if bgr was converted to rgb)
nested (bool) – If set as True, will apply data[x] = [data[x]] to all items in data. The arg is added for compatibility. Default: False.
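A sketch of the typical tail of a training pipeline, where Collect gathers the keys the model consumes before conversion to tensors:
```
pipeline_tail = [
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label']),
]
```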
- class mmaction.datasets.pipelines.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1)[source]¶
Perform ColorJitter to each img.
Required keys are “imgs”, added or modified keys are “imgs”.
- Parameters
brightness (float | tuple[float]) – The jitter range for brightness, if set as a float, the range will be (1 - brightness, 1 + brightness). Default: 0.5.
contrast (float | tuple[float]) – The jitter range for contrast, if set as a float, the range will be (1 - contrast, 1 + contrast). Default: 0.5.
saturation (float | tuple[float]) – The jitter range for saturation, if set as a float, the range will be (1 - saturation, 1 + saturation). Default: 0.5.
hue (float | tuple[float]) – The jitter range for hue, if set as a float, the range will be (-hue, hue). Default: 0.1.
- class mmaction.datasets.pipelines.Compose(transforms)[source]¶
Compose a data pipeline with a sequence of transforms.
- Parameters
transforms (list[dict | callable]) – Either config dicts of transforms or transform objects.
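A minimal sketch composing a small decode pipeline from config dicts and applying it to a results dict (the video path is a placeholder; decord must be installed):
```
from mmaction.datasets.pipelines import Compose

pipeline = Compose([
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=4, num_clips=1),
    dict(type='DecordDecode'),
])
results = dict(filename='some/path/video.mp4', start_index=0, modality='RGB')
results = pipeline(results)  # adds 'imgs', 'frame_inds', 'original_shape', ...
```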
- class mmaction.datasets.pipelines.DecordDecode(mode='accurate')[source]¶
Using decord to decode the video.
Decord: https://github.com/dmlc/decord
Required keys are “video_reader”, “filename” and “frame_inds”, added or modified keys are “imgs” and “original_shape”.
- Parameters
mode (str) – Decoding mode. Options are ‘accurate’ and ‘efficient’. If set to ‘accurate’, it will decode videos into accurate frames. If set to ‘efficient’, it will adopt fast seeking but only return key frames, which may be duplicated and inaccurate, and more suitable for large scene-based video datasets. Default: ‘accurate’.
- class mmaction.datasets.pipelines.DecordInit(io_backend='disk', num_threads=1, **kwargs)[source]¶
Using decord to initialize the video_reader.
Decord: https://github.com/dmlc/decord
Required keys are “filename”, added or modified keys are “video_reader” and “total_frames”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
num_threads (int) – Number of thread to decode the video. Default: 1.
kwargs (dict) – Args for file client.
- class mmaction.datasets.pipelines.DenseSampleFrames(*args, sample_range=64, num_sample_positions=10, **kwargs)[source]¶
Select frames from the video by dense sample strategy.
Required keys are “filename”, added or modified keys are “total_frames”, “frame_inds”, “frame_interval” and “num_clips”.
- Parameters
clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
num_clips (int) – Number of clips to be sampled. Default: 1.
sample_range (int) – Total sample range for dense sample. Default: 64.
num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Default: 10. That is to say, by default, there are at least 10 clips for one input sample in test mode.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
- class mmaction.datasets.pipelines.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, left_kp=None, right_kp=None, lazy=False)[source]¶
Flip the input images with a probability.
Reverse the order of elements in the given imgs with a specific direction. The shape of the imgs is preserved, but the elements are reordered.
Required keys are “img_shape”, “modality”, “imgs” (optional), “keypoint” (optional); added or modified keys are “imgs”, “keypoint”, “lazy” and “flip_direction”. Required keys in “lazy” are None; added or modified keys are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations to make sure crop_quadruple is calculated properly.
- Parameters
flip_ratio (float) – Probability of implementing flip. Default: 0.5.
direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.
flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.
left_kp (list[int]) – Indexes of left keypoints, used to flip keypoints. Default: None.
right_kp (list[int]) – Indexes of right keypoints, used to flip keypoints. Default: None.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
- class mmaction.datasets.pipelines.FormatAudioShape(input_format)[source]¶
Format final audio shape to the given input_format.
Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.
- Parameters
input_format (str) – Define the final imgs format.
- class mmaction.datasets.pipelines.FormatGCNInput(input_format, num_person=2)[source]¶
Format final skeleton shape to the given input_format.
Required keys are “keypoint” and “keypoint_score”(optional), added or modified keys are “keypoint” and “input_shape”.
- Parameters
input_format (str) – Define the final skeleton format.
num_person (int) – The maximum number of people. Default: 2.
- class mmaction.datasets.pipelines.FormatShape(input_format, collapse=False)[source]¶
Format final imgs shape to the given input_format.
Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.
- Parameters
input_format (str) – Define the final imgs format.
collapse (bool) – To collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set as True when training and testing detectors. Default: False.
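A minimal sketch of FormatShape reshaping a stack of frames into NCTHW (shapes are illustrative):
```
import numpy as np
from mmaction.datasets.pipelines import FormatShape

fmt = FormatShape(input_format='NCTHW')
results = dict(
    imgs=np.random.rand(8, 224, 224, 3).astype(np.float32),  # (M, H, W, C)
    num_clips=1,
    clip_len=8)
results = fmt(results)
print(results['input_shape'])  # (1, 3, 8, 224, 224)
```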
- class mmaction.datasets.pipelines.Fuse[source]¶
Fuse lazy operations.
- Fusion order:
crop -> resize -> flip
Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.
- class mmaction.datasets.pipelines.GenerateLocalizationLabels[source]¶
Load video label for localizer with given video_name list.
Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.
- class mmaction.datasets.pipelines.GeneratePoseTarget(sigma=0.6, use_score=True, with_kp=True, with_limb=False, skeletons=((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), double=False, left_kp=(1, 3, 5, 7, 9, 11, 13, 15), right_kp=(2, 4, 6, 8, 10, 12, 14, 16))[source]¶
Generate pseudo heatmaps based on joint coordinates and confidence.
Required keys are “keypoint”, “img_shape”, “keypoint_score” (optional), added or modified keys are “imgs”.
- Parameters
sigma (float) – The sigma of the generated gaussian map. Default: 0.6.
use_score (bool) – Use the confidence score of keypoints as the maximum of the gaussian maps. Default: True.
with_kp (bool) – Generate pseudo heatmaps for keypoints. Default: True.
with_limb (bool) – Generate pseudo heatmaps for limbs. At least one of ‘with_kp’ and ‘with_limb’ should be True. Default: False.
skeletons (tuple[tuple]) – The definition of human skeletons. Default: ((0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (5, 7), (7, 9), (0, 6), (6, 8), (8, 10), (5, 11), (11, 13), (13, 15), (6, 12), (12, 14), (14, 16), (11, 12)), which is the definition of COCO-17p skeletons.
double (bool) – Output both original heatmaps and flipped heatmaps. Default: False.
left_kp (tuple[int]) – Indexes of left keypoints, which is used when flipping heatmaps. Default: (1, 3, 5, 7, 9, 11, 13, 15), which is left keypoints in COCO-17p.
right_kp (tuple[int]) – Indexes of right keypoints, which is used when flipping heatmaps. Default: (2, 4, 6, 8, 10, 12, 14, 16), which is right keypoints in COCO-17p.
- gen_an_aug(results)[source]¶
Generate pseudo heatmaps for all frames.
- Parameters
results (dict) – The dictionary that contains all info of a sample.
- Returns
The generated pseudo heatmaps.
- Return type
list[np.ndarray]
- generate_a_heatmap(img_h, img_w, centers, sigma, max_values)[source]¶
Generate pseudo heatmap for one keypoint in one frame.
- Parameters
img_h (int) – The height of the heatmap.
img_w (int) – The width of the heatmap.
centers (np.ndarray) – The coordinates of corresponding keypoints (of multiple persons).
sigma (float) – The sigma of generated gaussian.
max_values (np.ndarray) – The max values of each keypoint.
- Returns
The generated pseudo heatmap.
- Return type
np.ndarray
- generate_a_limb_heatmap(img_h, img_w, starts, ends, sigma, start_values, end_values)[source]¶
Generate pseudo heatmap for one limb in one frame.
- Parameters
img_h (int) – The height of the heatmap.
img_w (int) – The width of the heatmap.
starts (np.ndarray) – The coordinates of one keypoint in the corresponding limbs (of multiple persons).
ends (np.ndarray) – The coordinates of the other keypoint in the corresponding limbs (of multiple persons).
sigma (float) – The sigma of generated gaussian.
start_values (np.ndarray) – The max values of one keypoint in the corresponding limbs.
end_values (np.ndarray) – The max values of the other keypoint in the corresponding limbs.
- Returns
The generated pseudo heatmap.
- Return type
np.ndarray
- generate_heatmap(img_h, img_w, kps, sigma, max_values)[source]¶
Generate pseudo heatmap for all keypoints and limbs in one frame (if needed).
- Parameters
img_h (int) – The height of the heatmap.
img_w (int) – The width of the heatmap.
kps (np.ndarray) – The coordinates of keypoints in this frame.
sigma (float) – The sigma of the generated Gaussian.
max_values (np.ndarray) – The confidence score of each keypoint.
- Returns
The generated pseudo heatmap.
- Return type
np.ndarray
- class mmaction.datasets.pipelines.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]¶
Load and decode images.
Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.
kwargs (dict, optional) – Arguments for FileClient.
- class mmaction.datasets.pipelines.ImageToTensor(keys)[source]¶
Convert image type to torch.Tensor type.
- Parameters
keys (Sequence[str]) – Required keys to be converted.
- class mmaction.datasets.pipelines.Imgaug(transforms)[source]¶
Imgaug augmentation.
Adds custom transformations from the imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html for more information. Two demo configs can be found in the tsn and i3d config folders.
It is better to use uint8 images as inputs, since imgaug works best with numpy dtype uint8 and is not well tested with other dtypes. Note that not all augmenters have the same input and output dtypes, which may cause unexpected results.
Required keys are “imgs”, “img_shape” (if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.
It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc., so when using Imgaug together with other mmaction2 pipelines, pay extra attention to the required keys.
Two steps to use the Imgaug pipeline:
1. Create the initialization parameter transforms. There are three ways to create transforms:
1) string: only 'default' is supported for now, e.g. transforms='default'.
2) list[dict]: create a list of augmenters from a list of dicts, where each dict corresponds to one augmenter. Every dict MUST contain a key named type, which should be a string (an iaa.Augmenter’s name) or an iaa.Augmenter subclass, e.g. transforms=[dict(type='Rotate', rotate=(-20, 20))] or transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))].
3) iaa.Augmenter: pass an imgaug.Augmenter object directly, e.g. transforms=iaa.Rotate(rotate=(-20, 20)).
2. Add Imgaug to the dataset pipeline. It is recommended to insert the Imgaug step before Normalize. A demo pipeline is listed as follows:
```
pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,
    ),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```
- Parameters
transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create the imgaug augmenter.
- static default_transforms()[source]¶
Default transforms for imgaug.
Implement RandAugment with imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.
Augmenters and hyperparameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py
One augmenter, SolarizeAdd, is missing since imgaug does not support it.
- Returns
The constructed RandAugment transforms.
- Return type
dict
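To make the dict format concrete, a hedged sketch of what such a config could look like follows; the operator pool, names and magnitudes here are assumptions for illustration, not the actual defaults:
```
# Hypothetical RandAugment-style config in the dict format consumed by
# imgaug_builder (illustrative only, not the library's real defaults):
default_transforms_sketch = [
    dict(
        type='SomeOf',
        n=(1, 3),
        children=[
            dict(type='Rotate', rotate=(-30, 30)),
            dict(type='ShearX', shear=(-20, 20)),
            dict(type='GammaContrast', gamma=(0.5, 2.0)),
        ])
]
```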
- imgaug_builder(cfg)[source]¶
Import a module from imgaug.
It follows the logic of build_from_cfg(). Use a dict object to create an iaa.Augmenter object.
- Parameters
cfg (dict) – Config dict. It should at least contain the key “type”.
- Returns
The constructed imgaug augmenter.
- Return type
iaa.Augmenter
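For intuition, the build logic can be sketched as follows; the recursive handling of nested children is an assumption for illustration, not a verbatim copy of the method:
```
import imgaug.augmenters as iaa

def build_augmenter(cfg):
    # Sketch: turn a config dict with a 'type' key into an iaa.Augmenter.
    args = dict(cfg)
    obj_type = args.pop('type')
    if isinstance(obj_type, str):
        obj_type = getattr(iaa, obj_type)  # e.g. 'Rotate' -> iaa.Rotate
    # Assumed handling of nested augmenters, e.g. children of iaa.SomeOf.
    if 'children' in args:
        args['children'] = [build_augmenter(c) for c in args['children']]
    return obj_type(**args)

# e.g. build_augmenter(dict(type='Rotate', rotate=(-20, 20)))
```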
- class mmaction.datasets.pipelines.JointToBone(dataset='nturgb+d')[source]¶
Convert the joint information to bone information.
Required keys are “keypoint”, added or modified keys are “keypoint”.
- Parameters
dataset (str) – The type of dataset: ‘nturgb+d’, ‘openpose-18’ or ‘coco’. Default: ‘nturgb+d’.
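Conceptually, each bone is the vector from a parent joint to its child joint; a minimal sketch with a hypothetical (child, parent) pair list for a COCO-style layout follows:
```
import numpy as np

# Hypothetical (child, parent) pairs for a COCO-style 17-keypoint layout;
# the class defines its own pairs per dataset.
PAIRS = ((0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (5, 0), (6, 0), (7, 5),
         (8, 6), (9, 7), (10, 8), (11, 0), (12, 0), (13, 11), (14, 12),
         (15, 13), (16, 14))

def joint_to_bone(keypoint, pairs=PAIRS):
    # keypoint: (..., num_joints, coord_dim) array of joint coordinates.
    bone = np.zeros_like(keypoint)
    for child, parent in pairs:
        bone[..., child, :] = keypoint[..., child, :] - keypoint[..., parent, :]
    return bone
```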
- class mmaction.datasets.pipelines.LoadAudioFeature(pad_method='zero')[source]¶
Load offline extracted audio features.
Required keys are “audio_path”, added or modified keys are “length” and “audios”.
- class mmaction.datasets.pipelines.LoadHVULabel(**kwargs)[source]¶
Convert the HVU label from dictionaries to torch tensors.
Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.
- class mmaction.datasets.pipelines.LoadKineticsPose(io_backend='disk', squeeze=True, max_person=100, keypoint_weight={'face': 1, 'limb': 3, 'torso': 2}, source='mmpose', **kwargs)[source]¶
Load Kinetics Pose given a filename (the file should be in pickle format).
Required keys are “filename”, “total_frames”, “img_shape”, “frame_inds”, “anno_inds” (for mmpose source, optional), added or modified keys are “keypoint”, “keypoint_score”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
squeeze (bool) – Whether to remove frames with no human pose. Default: True.
max_person (int) – The max number of persons in a frame. Default: 100.
keypoint_weight (dict) – The weight of keypoints. We set the confidence score of a person as the weighted sum of the confidence scores of its joints. Persons with low confidence scores are dropped if the number of persons exceeds max_person. Default: dict(face=1, torso=2, limb=3).
source (str) – The sources of the keypoints used. Choices are ‘mmpose’ and ‘openpose-18’. Default: ‘mmpose’.
kwargs (dict, optional) – Arguments for FileClient.
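To make the keypoint_weight behavior concrete, a hedged sketch of the person-scoring idea follows; the part-to-index grouping is an assumption for illustration:
```
import numpy as np

# Assumed part-to-keypoint-index grouping (COCO-style); the pipeline defines
# its own grouping internally.
PART_INDS = dict(face=[0, 1, 2, 3, 4], torso=[5, 6, 11, 12],
                 limb=[7, 8, 9, 10, 13, 14, 15, 16])
PART_WEIGHT = dict(face=1, torso=2, limb=3)

def rank_persons(keypoint_score, max_person=100):
    # keypoint_score: (num_person, num_frame, num_keypoint) confidences.
    scores = np.zeros(keypoint_score.shape[0])
    for part, inds in PART_INDS.items():
        scores += PART_WEIGHT[part] * keypoint_score[:, :, inds].sum(axis=(1, 2))
    # Keep only the indices of the max_person highest-scoring persons.
    return np.argsort(-scores)[:max_person]
```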
- class mmaction.datasets.pipelines.LoadLocalizationFeature(raw_feature_ext='.csv')[source]¶
Load video features for localizer with given video_name list.
Required keys are “video_name” and “data_prefix”, added or modified keys are “raw_feature”.
- Parameters
raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.
- class mmaction.datasets.pipelines.Loa