Prepare Dataset
MMAction2 supports many existing datasets. In this chapter, we will guide you through preparing datasets for MMAction2.
Notes on Video Data Format
MMAction2 supports two types of data formats: raw frames and video. The former is widely used in previous projects such as TSN. It is fast when SSDs are available but fails to scale to the fast-growing datasets. (For example, the newest edition of Kinetics has 650K videos and the total frames will take up several TBs.) The latter saves much space but has to do computation-intensive video decoding at execution time. To make video decoding faster, we support several efficient video loading libraries, such as decord, PyAV, etc.
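As a minimal sketch of what on-the-fly decoding looks like with decord (the file name demo.mp4 is illustrative; MMAction2 itself performs decoding through its data pipeline):

from decord import VideoReader

vr = VideoReader('demo.mp4')                  # open the video container
print(len(vr), vr[0].shape)                   # number of frames, (H, W, 3) of the first frame
frames = vr.get_batch([0, 10, 20]).asnumpy()  # decode a few sampled frames into a numpy array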
Use built-in datasets
MMAction2 already supports many datasets. We provide shell scripts for data preparation under the path $MMACTION2/tools/data/. Please refer to supported datasets for details on preparing specific datasets.
Use a custom dataset
The simplest way is to convert your dataset to an existing dataset format:

- RawframeDataset and VideoDataset for Action Recognition
- PoseDataset for Skeleton-based Action Recognition
- AVADataset for Spatio-temporal Action Detection
- ActivityNetDataset for Temporal Action Localization
After the data pre-processing, users need to further modify the config files to use the dataset. Here is an example of using a custom dataset in rawframe format.

In configs/task/method/my_custom_config.py:
...
# dataset settings
dataset_type = 'RawframeDataset'
data_root = 'path/to/your/root'
data_root_val = 'path/to/your/root_val'
ann_file_train = 'data/custom/custom_train_list.txt'
ann_file_val = 'data/custom/custom_val_list.txt'
ann_file_test = 'data/custom/custom_val_list.txt'
...
data = dict(
videos_per_gpu=32,
workers_per_gpu=2,
train=dict(
type=dataset_type,
ann_file=ann_file_train,
...),
val=dict(
type=dataset_type,
ann_file=ann_file_val,
...),
test=dict(
type=dataset_type,
ann_file=ann_file_test,
...))
...
Action Recognition
There are two kinds of annotation files for action recognition.
rawframe annotation for RawframeDataset

The annotation of a rawframe dataset is a text file with multiple lines, where each line indicates the frame_directory (relative path) of a video, the total_frames of the video and the label of the video, separated by whitespace. Here is an example.
some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
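If your frames are already extracted to disk, a small script can generate this list. The sketch below assumes frames live under root/<class_name>/<video_name>/ and that label_map maps class names to integer labels; these names and paths are illustrative, not part of MMAction2.

import os

def build_rawframe_list(root, label_map, out_file):
    """Write 'frame_dir total_frames label' lines for every video folder."""
    lines = []
    for class_name, label in label_map.items():
        class_dir = os.path.join(root, class_name)
        for video_name in sorted(os.listdir(class_dir)):
            frame_dir = os.path.join(class_name, video_name)  # path relative to data_root
            total_frames = len(os.listdir(os.path.join(root, frame_dir)))
            lines.append(f'{frame_dir} {total_frames} {label}')
    with open(out_file, 'w') as f:
        f.write('\n'.join(lines) + '\n')

build_rawframe_list('data/custom/rawframes',
                    {'climb': 0, 'jump': 1},
                    'data/custom/custom_train_list.txt')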
video annotation for VideoDataset

The annotation of a video dataset is a text file with multiple lines, where each line indicates a sample video with the filepath (relative path) and label, separated by whitespace. Here is an example.
some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
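Such a list can likewise be generated by scanning the video folder. The sketch below assumes videos are stored under root/<class_name>/*.mp4 and that label_map maps class names to integer labels; all paths and names are illustrative.

import glob
import os

root = 'data/custom/videos'
label_map = {'climb': 0, 'jump': 1}

with open('data/custom/custom_train_list.txt', 'w') as f:
    for class_name, label in label_map.items():
        for path in sorted(glob.glob(os.path.join(root, class_name, '*.mp4'))):
            rel_path = os.path.relpath(path, root)  # path relative to data_root
            f.write(f'{rel_path} {label}\n')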
Skeleton-based Action Recognition
The task recognizes the action class based on the skeleton sequence (time sequence of keypoints). We provide some methods to build your custom skeleton dataset.
Build from RGB video data

You need to extract keypoint data from videos and convert it to a supported format. We provide a tutorial with detailed instructions.
Build from existing keypoint data

Assuming that you already have keypoint data in COCO format, you can gather it into a pickle file.
Each pickle file corresponds to an action recognition dataset. The content of a pickle file is a dictionary with two fields: split and annotations.

- Split: The value of the split field is a dictionary: the keys are the split names, while the values are lists of video identifiers that belong to the specific split.
- Annotations: The value of the annotations field is a list of skeleton annotations; each skeleton annotation is a dictionary containing the following fields:
  - frame_dir (str): The identifier of the corresponding video.
  - total_frames (int): The number of frames in this video.
  - img_shape (tuple[int]): The shape of a video frame, a tuple with two elements, in the format of (height, width). Only required for 2D skeletons.
  - original_shape (tuple[int]): Same as img_shape.
  - label (int): The action label.
  - keypoint (np.ndarray, with shape [M x T x V x C]): The keypoint annotation.
    - M: number of persons;
    - T: number of frames (same as total_frames);
    - V: number of keypoints (25 for NTURGB+D 3D skeleton, 17 for COCO, 18 for OpenPose, etc.);
    - C: number of dimensions for keypoint coordinates (C=2 for 2D keypoints, C=3 for 3D keypoints).
  - keypoint_score (np.ndarray, with shape [M x T x V]): The confidence score of keypoints. Only required for 2D skeletons.
Here is an example:
{ "split": { 'xsub_train': ['S001C001P001R001A001', ...], 'xsub_val': ['S001C001P003R001A001', ...], ... } "annotations: [ { { 'frame_dir': 'S001C001P001R001A001', 'label': 0, 'img_shape': (1080, 1920), 'original_shape': (1080, 1920), 'total_frames': 103, 'keypoint': array([[[[1032. , 334.8], ...]]]) 'keypoint_score': array([[[0.934 , 0.9766, ...]]]) }, { 'frame_dir': 'S001C001P003R001A001', ... }, ... } ] }
Supporting other keypoint formats needs further modification; please refer to customize dataset.
Spatio-temporal Action Detection
MMAction2 supports the task based on AVADataset. The annotation consists of groundtruth bboxes and proposal bboxes.
groundtruth bbox

The groundtruth bbox annotation is a csv file with multiple lines, and each line is a detection sample of one frame, with the following format:

video_identifier, time_stamp, lt_x, lt_y, rb_x, rb_y, label, entity_id

Each field means:

- video_identifier: The identifier of the corresponding video
- time_stamp: The time stamp of the current frame
- lt_x: The normalized x-coordinate of the top-left point of the bounding box
- lt_y: The normalized y-coordinate of the top-left point of the bounding box
- rb_x: The normalized x-coordinate of the bottom-right point of the bounding box
- rb_y: The normalized y-coordinate of the bottom-right point of the bounding box
- label: The action label
- entity_id: A unique integer allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video

Here is an example.
_-Z6wFjXtGQ,0902,0.063,0.049,0.524,0.996,12,0
_-Z6wFjXtGQ,0902,0.063,0.049,0.524,0.996,74,0
...
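A hypothetical reader (not part of MMAction2; the file name is illustrative) that groups such a csv by frame could look like this:

import csv
from collections import defaultdict

gt_boxes = defaultdict(list)
with open('custom_gt.csv') as f:
    for video_id, timestamp, lt_x, lt_y, rb_x, rb_y, label, entity_id in csv.reader(f):
        # collect every box belonging to the same (video, time_stamp) key
        gt_boxes[(video_id, int(timestamp))].append(
            dict(bbox=[float(lt_x), float(lt_y), float(rb_x), float(rb_y)],
                 label=int(label),
                 entity_id=int(entity_id)))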
proposal bbox

The proposal bbox is a pickle file generated by a person detector, which usually needs to be fine-tuned on the target dataset. The pickle file contains a dict with the following data structure:

{'video_identifier,time_stamp': bbox_info}

- video_identifier (str): The identifier of the corresponding video
- time_stamp (int): The time stamp of the current frame
- bbox_info (np.ndarray, with shape [n, 5]): Detected bboxes in the format <x1> <y1> <x2> <y2> <score>. x1, y1, x2, y2 are normalized with respect to the frame size and lie between 0.0 and 1.0.
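For example, a minimal sketch of writing such a proposal file (the detection values and file name here are made-up placeholders):

import pickle

import numpy as np

proposals = {
    # key is 'video_identifier,time_stamp'; value has shape [n, 5]: x1, y1, x2, y2, score
    '_-Z6wFjXtGQ,0902': np.array([[0.063, 0.049, 0.524, 0.996, 0.98]], dtype=np.float32),
}

with open('custom_proposals.pkl', 'wb') as f:
    pickle.dump(proposals, f)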
Temporal Action Localization
We support Temporal Action Localization based on ActivityNetDataset. The annotation of the ActivityNet dataset is a json file. Each key is a video name and the corresponding value is the metadata and annotations for the video.
Here is an example.
{
"video1": {
"duration_second": 211.53,
"duration_frame": 6337,
"annotations": [
{
"segment": [
30.025882995319815,
205.2318595943838
],
"label": "Rock climbing"
}
],
"feature_frame": 6336,
"fps": 30.0,
"rfps": 29.9579255898
},
"video2": {...
}
...
}
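A small sketch of consuming such a file (the file name is illustrative), converting segment boundaries from seconds to frame indices with the stored fps:

import json

with open('custom_activitynet_anno.json') as f:
    database = json.load(f)

for video_name, info in database.items():
    for anno in info['annotations']:
        start_s, end_s = anno['segment']            # segment boundaries in seconds
        start_frame = int(start_s * info['fps'])    # convert to frame indices
        end_frame = int(end_s * info['fps'])
        print(video_name, anno['label'], start_frame, end_frame)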
Use mixed datasets for training
MMAction2 also supports mixing datasets for training. Currently it supports repeating a dataset.
Repeat dataset
We use RepeatDataset as a wrapper to repeat the dataset. For example, suppose the original dataset is Dataset_A; to repeat it, the config looks like the following:
dataset_A_train = dict(
type='RepeatDataset',
times=N,
dataset=dict( # This is the original config of Dataset_A
type='Dataset_A',
...
pipeline=train_pipeline
)
)
Browse dataset
coming soon…