Prepare Dataset

MMAction2 supports many existing datasets. In this chapter, we will guide you through preparing datasets for MMAction2.

Notes on Video Data Format

MMAction2 supports two types of data formats: raw frames and video. The former was widely used in earlier projects such as TSN. It is fast when SSDs are available but does not scale to fast-growing datasets (for example, the newest edition of Kinetics has 650K videos, and the total frames would take up several TBs). The latter saves much space but has to perform computation-intensive video decoding at run time. To make video decoding faster, we support several efficient video loading libraries, such as decord and PyAV.
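For instance, a minimal sketch of on-the-fly decoding with decord (assuming decord is installed and the path points to an existing video file):

import decord

# Only the frames that are indexed get decoded, which keeps random access cheap.
video = decord.VideoReader('some/path/000.mp4')
print(len(video))            # total number of frames
frame = video[0].asnumpy()   # decode the first frame into an (H, W, 3) uint8 array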

Use built-in datasets

MMAction2 already supports many datasets. We provide shell scripts for data preparation under $MMACTION2/tools/data/; please refer to the supported datasets for details on preparing specific datasets.

Use a custom dataset

The simplest way is to convert your dataset to an existing dataset format, such as RawframeDataset or VideoDataset.

After the data pre-processing, you need to further modify the config files to use the dataset. Here is an example of using a custom dataset in rawframe format.

In configs/task/method/:

# dataset settings
dataset_type = 'RawframeDataset'
data_root = 'path/to/your/root'
data_root_val = 'path/to/your/root_val'
ann_file_train = 'data/custom/custom_train_list.txt'
ann_file_val = 'data/custom/custom_val_list.txt'
ann_file_test = 'data/custom/custom_val_list.txt'
data = dict(
    # dataloader / dataset settings (train/val/test) that reference the
    # annotation files and data roots defined above
    ...)

Action Recognition

There are two kinds of annotation files for action recognition.

  • rawframe annotation for RawframeDataset

    The annotation of a rawframe dataset is a text file with multiple lines; each line gives the frame_directory (relative path) of a video, the total_frames of the video, and the label of the video, separated by whitespace (a sketch for generating such a file follows the examples below).

    Here is an example.

    some/directory-1 163 1
    some/directory-2 122 1
    some/directory-3 258 2
    some/directory-4 234 2
    some/directory-5 295 3
    some/directory-6 121 3
  • video annotation for VideoDataset

    The annotation of a video dataset is a text file with multiple lines; each line gives the filepath (relative path) of a sample video and its label, separated by whitespace.

    Here is an example.

    some/path/000.mp4 1
    some/path/001.mp4 1
    some/path/002.mp4 2
    some/path/003.mp4 2
    some/path/004.mp4 3
    some/path/005.mp4 3
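A minimal sketch for generating a rawframe annotation file (the directory layout, label mapping, and output file name below are assumptions; adapt them to your data):

import os

# Assume rawframes are stored as data/custom/rawframes/<class_name>/<video_id>/img_00001.jpg ...
data_root = 'data/custom/rawframes'
class_names = sorted(os.listdir(data_root))   # map class names to integer labels
lines = []
for label, class_name in enumerate(class_names):
    class_dir = os.path.join(data_root, class_name)
    for video_id in sorted(os.listdir(class_dir)):
        frame_dir = os.path.join(class_name, video_id)   # relative path of the video
        total_frames = len(os.listdir(os.path.join(class_dir, video_id)))
        lines.append(f'{frame_dir} {total_frames} {label}')
with open('data/custom/custom_train_list.txt', 'w') as f:
    f.write('\n'.join(lines) + '\n')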

Skeleton-based Action Recognition

This task recognizes the action class based on a skeleton sequence (a time series of keypoints). We provide several ways to build a custom skeleton dataset.

  • Build from RGB video data

    You need to extract keypoint data from the videos and convert it to a supported format; we provide a tutorial with detailed instructions.

  • Build from existing keypoint data

    Assuming that you already have keypoint data in COCO format, you can gather it into a pickle file.

    Each pickle file corresponds to an action recognition dataset. The content of a pickle file is a dictionary with two fields: split and annotations.

    1. Split: The value of the split field is a dictionary: the keys are the split names, while the values are lists of video identifiers that belong to the specific split.

    2. Annotations: The value of the annotations field is a list of skeleton annotations, each skeleton annotation is a dictionary, containing the following fields:

      • frame_dir (str): The identifier of the corresponding video.

      • total_frames (int): The number of frames in this video.

      • img_shape (tuple[int]): The shape of a video frame, a tuple with two elements, in the format of (height, width). Only required for 2D skeletons.

      • original_shape (tuple[int]): Same as img_shape.

      • label (int): The action label.

      • keypoint (np.ndarray, with shape [M x T x V x C]): The keypoint annotation.

        • M: number of persons;

        • T: number of frames (same as total_frames);

        • V: number of keypoints (25 for NTURGB+D 3D skeleton, 17 for COCO, 18 for OpenPose, etc.);

        • C: number of dimensions for keypoint coordinates (C=2 for 2D keypoint, C=3 for 3D keypoint).

      • keypoint_score (np.ndarray, with shape [M x T x V]): The confidence score of keypoints. Only required for 2D skeletons.

    Here is an example:

    {
        'split': {
            'xsub_train': ['S001C001P001R001A001', ...],
            'xsub_val': ['S001C001P003R001A001', ...]
        },
        'annotations': [
            {
                'frame_dir': 'S001C001P001R001A001',
                'label': 0,
                'img_shape': (1080, 1920),
                'original_shape': (1080, 1920),
                'total_frames': 103,
                'keypoint': array([[[[1032. ,  334.8], ...]]]),
                'keypoint_score': array([[[0.934 , 0.9766, ...]]])
            },
            {
                'frame_dir': 'S001C001P003R001A001',
                ...
            },
            ...
        ]
    }

    Supporting other keypoint formats requires further modification; please refer to customize dataset.
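    A minimal sketch (not the official converter) of gathering existing 2D keypoint data into the pickle layout above; the output file name, video identifier, and split names are illustrative:

    import pickle
    import numpy as np

    def make_annotation(frame_dir, keypoints, scores, label, img_shape):
        # keypoints: (M, T, V, 2) array, scores: (M, T, V) array; see the field list above
        return dict(
            frame_dir=frame_dir,
            label=label,
            img_shape=img_shape,
            original_shape=img_shape,
            total_frames=keypoints.shape[1],
            keypoint=keypoints,
            keypoint_score=scores)

    annotations = [
        make_annotation('video_000', np.zeros((1, 103, 17, 2)),
                        np.ones((1, 103, 17)), label=0, img_shape=(1080, 1920)),
    ]
    data = dict(split=dict(train=['video_000'], val=[]), annotations=annotations)
    with open('my_skeleton_dataset.pkl', 'wb') as f:
        pickle.dump(data, f)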

Audio-based Action Recognition

MMAction2 provides support for audio-based action recognition tasks utilizing the AudioDataset. This task employs mel spectrogram features as input. An example annotation file format is as follows:

ihWykL5mYRI.npy 300 153
lumzQD42AN8.npy 240 321
sWFRmD9Of4s.npy 250 250
w_IpfgRsBVA.npy 300 356

Each line represents a training sample. Taking the first line as an example, ihWykL5mYRI.npy is the filename of the mel spectrogram feature, 300 is the total number of frames of the original video corresponding to this feature, and 153 is the class label. We take the following two steps to prepare the mel spectrogram feature data:

First, extract audios from videos:

python tools/data/extract_audio.py ${ROOT} ${DST_ROOT} [--ext ${EXT}] [--num-workers ${N_WORKERS}] \
    [--level ${LEVEL}]
  • ROOT: The root directory of the videos.

  • DST_ROOT: The destination root directory of the audios.

  • EXT: Extension of the video files. e.g., mp4.

  • N_WORKERS: Number of processes to be used.

Next, generate the mel spectrogram features offline from the audios:

python tools/data/build_audio_features.py ${AUDIO_HOME_PATH} ${SPECTROGRAM_SAVE_PATH} [--level ${LEVEL}] \
    [--ext $EXT] [--num-workers $N_WORKERS] [--part $PART]
  • AUDIO_HOME_PATH: The root directory of the audio files.

  • SPECTROGRAM_SAVE_PATH: The destination root directory of the audio features.

  • EXT: Extension of the audio files. e.g., m4a.

  • N_WORKERS: Number of processes to be used.

  • PART: Determines how many parts to split all files into and which part to run. E.g., 2/5 means splitting all files into 5 parts and executing the 2nd part. This is useful if you have several machines.
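A quick sanity check on one generated feature file (the file name is taken from the example annotation above; the exact shape depends on the extraction settings):

import numpy as np

# Load one mel spectrogram feature and inspect its shape and dtype.
spec = np.load('ihWykL5mYRI.npy')
print(spec.shape, spec.dtype)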

Spatio-temporal Action Detection

MMAction2 supports this task based on AVADataset. The annotation contains ground-truth bboxes and proposal bboxes.

  • groundtruth bbox

    groundtruth bbox is a csv file with multiple lines, and each line is a detection sample of one frame, in the following format:

    video_identifier, time_stamp, lt_x, lt_y, rb_x, rb_y, label, entity_id

    Each field means:

    • video_identifier: The identifier of the corresponding video
    • time_stamp: The time stamp of the current frame
    • lt_x: The normalized x-coordinate of the left-top point of the bounding box
    • lt_y: The normalized y-coordinate of the left-top point of the bounding box
    • rb_x: The normalized x-coordinate of the right-bottom point of the bounding box
    • rb_y: The normalized y-coordinate of the right-bottom point of the bounding box
    • label: The action label
    • entity_id: A unique integer allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video

    Here is an example.
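    The line below is only illustrative (made-up values following the field order above):

    some_video_id,0902,0.011,0.218,0.949,0.980,12,0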

  • proposal bbox

    proposal bbox is a pickle file generated by a person detector, and usually needs to be fine-tuned on the target dataset. The pickle file contains a dict with the following data structure:

    {'video_identifier,time_stamp': bbox_info}

    • video_identifier (str): The identifier of the corresponding video
    • time_stamp (int): The time stamp of the current frame
    • bbox_info (np.ndarray, with shape [n, 5]): Detected bboxes, <x1> <y1> <x2> <y2> <score>. x1, y1, x2, y2 are normalized with respect to the frame size, so they lie between 0.0 and 1.0.
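    A minimal sketch for inspecting such a proposal file (the file name and the dictionary key are illustrative):

    import pickle

    # The dict maps 'video_identifier,time_stamp' strings to detected boxes.
    with open('proposals.pkl', 'rb') as f:
        proposals = pickle.load(f)
    bboxes = proposals['some_video_id,0902']  # hypothetical key
    print(bboxes.shape)                       # (n, 5): x1, y1, x2, y2, score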

Temporal Action Localization

We support Temporal Action Localization based on ActivityNetDataset. The annotation of the ActivityNet dataset is a JSON file. Each key is a video name, and the corresponding value is the metadata and annotation for the video.

Here is an example.

  "video1": {
      "duration_second": 211.53,
      "duration_frame": 6337,
      "annotations": [
              "segment": [
              "label": "Rock climbing"
      "feature_frame": 6336,
      "fps": 30.0,
      "rfps": 29.9579255898
  "video2": {...

Use mixed datasets for training

MMAction2 also supports mixing datasets for training. Currently it supports repeating a dataset.

Repeat dataset

We use RepeatDataset as a wrapper to repeat a dataset. For example, suppose the original dataset is Dataset_A; to repeat it, the config looks like the following:

dataset_A_train = dict(
    type='RepeatDataset',
    times=N,  # the number of times to repeat Dataset_A
    dataset=dict(...))  # This is the original config of Dataset_A

Browse dataset

coming soon…
