Update

  • Oct 18, 2022:   OakInk public v2.1 -- a new version of the OakInk-Image annotation has been released!
    This update: 1) fixes several annotation artifacts, including a) incorrect hand and object poses, b) time delays, and c) contact-surface mismatches; 2) adds a "release" tag for hand-over sequences to the annotation.
    NOTE:   If you downloaded the OakInk dataset before 11:00 AM, October 18, 2022 (UTC), you only need to replace the previous anno.zip (3.4G) with the newly released anno_v2.1.zip (3.6G), unzip it, and keep the same file structure as before. Don't forget to download and install the latest OakInk Toolkit.

  • Jul 26, 2022:   Tink has been made public.
  • Jun 28, 2022:   OakInk public v2 has been released.
    OakInk Toolkit -- a Python package that provides data loading -- has been released.
  • Mar 03, 2022:   :sparkles: OakInk was accepted to CVPR 2022.

Abstract

Learning how humans manipulate objects requires machines to acquire knowledge from two perspectives: one for understanding object affordances and the other for learning human interactions based on those affordances. Even though these two knowledge bases are crucial, we find that current databases lack a comprehensive awareness of them. In this work, we propose a multi-modal and richly annotated knowledge repository, OakInk, for visual and cognitive understanding of hand-object interactions. We start by collecting 1,800 common household objects and annotating their affordances to construct the first knowledge base: Oak. Given the affordances, we record rich human interactions with 100 selected objects in Oak. Finally, we transfer the interactions on the 100 recorded objects to their virtual counterparts through a novel method: Tink. The recorded and transferred hand-object interactions constitute the second knowledge base: Ink. As a result, OakInk contains 50,000 distinct affordance-aware and intent-oriented hand-object interactions. We benchmark OakInk on pose estimation and grasp generation tasks. Moreover, we propose two practical applications of OakInk: intent-based interaction generation and handover generation.

[Figure: OakInk overview]

Download

For researchers in China, the dataset can also be downloaded from the following mirror: 百度云盘 / Baidu Pan (extraction code: hrt9).

OakInk-Image -- Image-based subset

To make sure you have the latest version, use sha256sum to compute the checksum of anno_v2.1.zip. You should get:

dc64402d65cff3c1e2dd40fb560fcc81e3757e1936f44d353c381874489d71ea
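If you prefer to check from Python, here is a minimal sketch using hashlib (the file path is an assumption about where you placed the archive):

    # Minimal sketch: verify the checksum of anno_v2.1.zip with Python's hashlib.
    # The file path below is an assumption -- point it at your local copy.
    import hashlib

    EXPECTED = "dc64402d65cff3c1e2dd40fb560fcc81e3757e1936f44d353c381874489d71ea"

    def sha256sum(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    digest = sha256sum("image/anno_v2.1.zip")
    print("OK" if digest == EXPECTED else f"Mismatch: {digest}")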

OakInk-Shape -- Geometry-based subset

After downloading all the above .zip files, arrange them in the following structure (a small sketch for checking the layout follows the tree):
 $OAKINK_DIR
    ├── image
    │   ├── anno_v2.1.zip
    │   ├── obj.zip
    │   └── stream_zipped
    │       ├── oakink_image_v2.z01
    │       ├── ...
    │       ├── oakink_image_v2.z10
    │       └── oakink_image_v2.zip
    └── shape
        ├── metaV2.zip
        ├── OakInkObjectsV2.zip
        ├── oakink_shape_v2.zip
        └── OakInkVirtualObjectsV2.zip
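A minimal sketch for sanity-checking this layout (reading $OAKINK_DIR from the environment is an assumption about how you store the path):

    # Minimal sketch: check that the downloaded archives sit in the expected layout.
    # $OAKINK_DIR is read from the environment; adjust to however you store the path.
    import os

    OAKINK_DIR = os.environ.get("OAKINK_DIR", ".")

    expected = [
        "image/anno_v2.1.zip",
        "image/obj.zip",
        "shape/metaV2.zip",
        "shape/OakInkObjectsV2.zip",
        "shape/oakink_shape_v2.zip",
        "shape/OakInkVirtualObjectsV2.zip",
        "image/stream_zipped/oakink_image_v2.zip",
    ] + [f"image/stream_zipped/oakink_image_v2.z{i:02d}" for i in range(1, 11)]

    missing = [p for p in expected if not os.path.exists(os.path.join(OAKINK_DIR, p))]
    print("All files in place." if not missing else f"Missing: {missing}")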
      

Prepare

For image/anno_v2.1.zip, image/obj.zip, and shape/*.zip, simply unzip them.

For the 11 split zip files in image/stream_zipped/, cd into the image/ directory and run:
zip -F ./stream_zipped/oakink_image_v2.zip --out single-archive.zip
This will combine the split zip files into a single .zip file at image/single-archive.zip.

Finally, unzip the combined archive:
unzip single-archive.zip

After all extractions are finished, your $OAKINK_DIR will have the following structure:
 $OAKINK_DIR
    .
    ├── image
    │   ├── anno
    │   ├── obj
    │   └── stream_release_v2
    │       ├── A01001_0001_0000
    │       ├── A01001_0001_0001
    │       ├── A01001_0001_0002
    │       ├── ....
    │
    └── shape
        ├── metaV2
        ├── OakInkObjectsV2
        ├── oakink_shape_v2
        └── OakInkVirtualObjectsV2
      

Datasheet & Explanation

OakInk-Image -- Image-based subset

Dataset Structure

  • Image sequences: resolution 848x480.
  • Annotation: 2D/3D positions of 21 hand keypoints; MANO pose & shape parameters; MANO vertex 3D locations; object vertex 3D locations; object .obj models; camera calibration (intrinsics & extrinsics); subject ID; intent ID; data split files (train/val/test).
  • Visualization Code: viz_oakink_image.py

OakInk-Image provides data splits for two categories of tasks: Hand Mesh Recovery and Hand-Object Pose Estimation. The dataset contains 314,404 frames in total if no filtering is applied, of which 157,600 frames are from two-hand sequences. For single-view tasks, we filter out frames in which fewer than 50% of the joints fall within the image bounds. Note that these frames may still be useful for multi-view tasks. Refer to the oikit repo for the usage of these split files.
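For illustration, here is a minimal sketch of this kind of visibility filter, assuming the 21 hand keypoints of a frame are available as a (21, 2) pixel-coordinate array (the 848x480 resolution comes from the annotation description above; the array layout is an assumption):

    # Minimal sketch of a joint-visibility filter: keep a frame only if at least
    # 50% of the 21 hand keypoints fall inside the 848x480 image bounds.
    # The (21, 2) pixel-coordinate array is an assumed input format.
    import numpy as np

    IMG_W, IMG_H = 848, 480

    def keep_frame(joints_2d: np.ndarray, min_ratio: float = 0.5) -> bool:
        inside = (
            (joints_2d[:, 0] >= 0) & (joints_2d[:, 0] < IMG_W)
            & (joints_2d[:, 1] >= 0) & (joints_2d[:, 1] < IMG_H)
        )
        return inside.mean() >= min_ratio

    # Example: a frame with all joints near the image center passes the filter.
    print(keep_frame(np.tile([424.0, 240.0], (21, 1))))  # True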


Splits for Hand Mesh Recovery

For the Hand Mesh Recovery task, we offer three different split modes. The details of each split mode are described below.

SP0. Default split (split by views)

We randomly select one view per sequence and mark all images from this view as the test sequence, while the remaining three views form the train/val sequences.
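A minimal sketch of such a view-based partition, assuming four views numbered 0-3 per sequence (the sequence IDs below are illustrative):

    # Minimal sketch of a view-based split: hold out one random view per sequence
    # as test, keep the other three views for train/val. The 4-view assumption and
    # sequence IDs are illustrative.
    import random

    def split_by_view(sequence_ids, num_views=4, seed=0):
        rng = random.Random(seed)
        trainval, test = [], []
        for seq in sequence_ids:
            held_out = rng.randrange(num_views)
            for view in range(num_views):
                (test if view == held_out else trainval).append((seq, view))
        return trainval, test

    trainval, test = split_by_view(["A01001_0001_0000", "A01001_0001_0001"])
    print(len(trainval), len(test))  # 6 2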

Train+Val set
* Train+Val: 232,049 frames, of which 114,697 are from two-hand sequences.

Test set
* Test: 77,330 frames, of which 38,228 are from two-hand sequences.

We also provide an example train/val split on the train+val set. The val set is randomly sampled from the train+val set.

Train set
* Train: 216,579 frames, of which 107,043 are from two-hand sequences.

Val set
* Val: 15,470 frames, of which 7,654 are from two-hand sequences.

SP1. Subject split (split by subjects)

We select five subjects and mark all images containing these subjects as the test sequences, while the images not containing these subjects form the train/val sequences. Note that some sequences involving two-hand interactions between a test-set subject and a train/val-set subject are dropped.

Train+Val set
* Train+Val: 192,532 frames, of which 82,539 are from two-hand sequences.

Test set
* Test: 83,503 frames, of which 37,042 are from two-hand sequences.

We also provide an example train/val split on the train+val set. We select one subject to form the val set, and the remaining subjects form the train set. Similar to the test split, sequences with overlapping subjects are dropped.

Train set
* Train: 177,490 frames, of which 73,658 are from two-hand sequences.

Val set
* Val: 6,151 frames; no frames from two-hand sequences, since only one subject is included.

SP2. Object split (split by objects)

We randomly select 25 objects (out of 100 objects in total) and mark all sequences that contain these objects as the test sequences, while the sequences that contain the remaining 75 objects form the train/val sequences.

Train+Val set
* Train+Val: 230,832 frames, of which 116,501 are from two-hand sequences.

Test set
* Test: 78,547 frames, of which 36,424 are from two-hand sequences.

We also provide an example train/val split on the train+val set. We randomly select 5 objects (out of the 75 objects) to form the val set, and the remaining objects form the train set.

Train set
* Train: 214,630 frames, of which 107,767 are from two-hand sequences.

Val set
* Val: 16,202 frames, of which 8,734 are from two-hand sequences.

Splits for Hand-Object Pose Estimation

For the Hand-Object Pose Estimation task, we offer one split mode based on views. The details of the split mode are described below.

SP0. Default split (split by views)

We randomly select one view per sequence and mark all images from this view as the test sequence, while the remaining three views form the train/val sequences. We filter out frames in which the minimum distance between hand and object surface vertices is greater than 5 mm.
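A minimal sketch of this contact filter, assuming hand and object vertices are given as (N, 3) arrays in meters (the 5 mm threshold is from the text; the array layout is an assumption):

    # Minimal sketch of the contact filter: keep a frame only if the minimum
    # hand-to-object surface-vertex distance is at most 5 mm. The (N, 3) arrays
    # in meters are an assumed input format.
    import numpy as np
    from scipy.spatial import cKDTree

    def in_contact(hand_verts: np.ndarray, obj_verts: np.ndarray, thresh_m: float = 0.005) -> bool:
        dists, _ = cKDTree(obj_verts).query(hand_verts, k=1)
        return dists.min() <= thresh_m

    # Example with random vertices far apart (not in contact, prints False).
    rng = np.random.default_rng(0)
    print(in_contact(rng.normal(size=(778, 3)), rng.normal(size=(1000, 3)) + 5.0))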

Train+Val set
* Train+Val: 145,589 frames, of which 61,256 are from two-hand sequences.

Test set
* Test: 48,538 frames, of which 20,413 are from two-hand sequences.

We also provide an example train/val split on the train+val set. The val set is randomly sampled from the train+val set.

Train set
* Train: 135,883 frames, of which 57,161 are from two-hand sequences.

Val set
* Val: 9,706 frames, of which 4,095 are from two-hand sequences.


OakInk-Shape -- Geometry-based subset

Dataset Structure

  • Annotation: object .obj models in their canonical systems; MANO pose & shape parameters and vertex 3D locations in the object's canonical system; subject ID; intent ID; origin sequence ID; alternative hand pose, shape, and vertices (if any, for the hand-over pair).
  • Visualization Code: viz_oakink_shape.py
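Since the annotation stores MANO pose and shape parameters, a hand mesh can be recovered with a MANO layer. Below is a rough sketch using manotorch's ManoLayer (one PyTorch MANO implementation); the asset path, argument names, and the (1, 48) axis-angle pose / (1, 10) shape layout are assumptions, so consult the oikit repo for the authoritative loading code.

    # Rough sketch: recover a hand mesh from stored MANO pose & shape parameters
    # with manotorch's ManoLayer. The asset path, argument names, and parameter
    # layouts are assumptions; MANO asset files must be downloaded separately.
    import torch
    from manotorch.manolayer import ManoLayer

    mano_layer = ManoLayer(mano_assets_root="assets/mano", use_pca=False)  # assumed asset path

    pose = torch.zeros(1, 48)   # root orientation + 15 joint rotations (axis-angle)
    shape = torch.zeros(1, 10)  # MANO shape coefficients

    output = mano_layer(pose, shape)
    verts, joints = output.verts, output.joints  # expected (1, 778, 3) and (1, 21, 3)
    print(verts.shape, joints.shape)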

OakInk-Shape provides data splits for the Grasp Generation, Intent-based Interaction Generation, and Handover Generation tasks. These three tasks share one data split, detailed below.


Split for Grasp Generation

We use int(object ID's hash code) mod 10 as the split separator (see the sketch after this list):

  • obj_id_hash % 10 < 8 in train split
  • obj_id_hash % 10 == 8 in val split
  • obj_id_hash % 10 == 9 in test split
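The hash function itself is not specified here; the sketch below illustrates the rule with a stable md5-based hash over the object ID string, which is an assumption for illustration only (the released split may use a different hash):

    # Minimal sketch of the hash-based split rule: obj_id_hash % 10 < 8 -> train,
    # == 8 -> val, == 9 -> test. The md5-based hash is only an illustration; the
    # hash used for the released split may differ.
    import hashlib

    def obj_id_hash(obj_id: str) -> int:
        return int(hashlib.md5(obj_id.encode("utf-8")).hexdigest(), 16)

    def split_of(obj_id: str) -> str:
        r = obj_id_hash(obj_id) % 10
        if r < 8:
            return "train"
        return "val" if r == 8 else "test"

    print(split_of("o12345"))  # hypothetical object ID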

* Train set
1,308 objects with 49,302 grasping hand poses. 
Including 5 intents: 11,804 use, 9,165 hold, 9,425 lift-up, 9,454 hand-out, and 9,454 receive.

* Val set
166 objects with 6,522 grasping hand poses. 
Including 1,561 use, 1,239 hold, 1,278 lift-up, 1,222 hand-out, and 1,222 receive.

* Test set
183 objects with 6,222 grasping hand poses. 
Including 1,473 use, 1,115 hold, 1,122 lift-up, 1,256 hand-out, and 1,256 receive.

* Total set
We release 1,801 object CAD models, of which 1,657 have corresponding grasping hand poses. The total number of grasping poses is 62,046.
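As a quick consistency check, the per-split grasp counts above sum to the released total:

    # Quick consistency check: per-split grasp counts sum to the released total.
    train, val, test = 49_302, 6_522, 6_222
    assert train + val + test == 62_046
    print(train + val + test)  # 62046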

Considerations for Using the Data

  • Licensing Information: the code is released under the MIT license; the dataset is released under the CC BY-NC-ND 4.0 license.
  • IRB approval: the third-party crowd-sourcing company warrants that appropriate IRB approval (or an equivalent, depending on local government requirements) has been obtained.
  • Portrait Usage: all subjects involved in data collection were required to sign a contract with the third-party crowd-sourcing company covering permission for portrait usage, acknowledgment of data usage, and the payment policy. We desensitized all samples in the dataset by blurring the subjects' faces (if any), tattoos, rings, or any other accessories that may be offensive or reveal a subject's identity.

Maintenance

Acknowledgements

If you find our work useful in your research, please cite:
@InProceedings{YangCVPR2022OakInk,
    author = {Yang, Lixin and Li, Kailin and Zhan, Xinyu and Wu, Fei and Xu, Anran and Liu, Liu and Lu, Cewu},
    title = {{OakInk}: A Large-Scale Knowledge Repository for Understanding Hand-Object Interaction},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2022},
}
              


The website template was borrowed from Michaël Gharbi.