Synthetic datasets for optical flow in computer vision

Written By Elie Raad

Data scientist

Technological advances have catalyzed a paradigm shift in how computer vision problems are approached. Since the rise of deep learning, effort has shifted toward creating extensive training data, while interest in hand-engineering algorithms for specific vision tasks has diminished.

A recent study, Scaling Laws for Neural Language Models, posits that performance scales as a power law with model size, dataset size, and the amount of compute used for training. Given that deep learning’s potential can only be realized with large datasets, computer vision datasets need to grow accordingly.

To accommodate the demand for ever-larger datasets, and the growing difficulty of producing pixel-accurate ground truth, synthetic data is considered a promising direction for many computer vision challenges. Synthetic data has been gaining traction in the domain of low-level computer vision tasks, and optical and scene flow estimation are examples of low-level tasks where it has made its way through the years.

This article provides an overview of how synthetic datasets are beneficial for optical flow and explores a number of synthetic datasets for optical flow. First, we’ll explain with an example what optical flow means, and then we’ll describe a variety of optical flow works that turned to the generation of synthetic data. As presented in what follows, synthetic data for optical flow is mainly used for evaluation and when a rich ground truth of annotated images is required.

Synthetic data for optical flow: An introduction

Many areas of computer vision have witnessed a growing interest in creating artificially generated data as a way to overcome domain-specific challenges and push forward the state-of-the-art. Optical flow, stereo, and image decomposition are all typical low-level vision problems.

What does optical flow mean?

Optical flow is the pattern of apparent motion of objects between two consecutive frames in a visual scene where either the object or camera is moving. More specifically, it is the apparent motion of the brightness patterns between image sequences.
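To make this concrete, here is a minimal numpy sketch of a dense flow field for a simple translation. The frame size, square object, and shift values are arbitrary illustrations; the point is that when we generate the motion ourselves, the flow at every pixel is known by construction.

```python
import numpy as np

# Frame 1: a bright 8x8 square on a dark 64x64 background.
frame1 = np.zeros((64, 64), dtype=np.float32)
frame1[20:28, 20:28] = 1.0

# Frame 2: the same square shifted 3 pixels right and 2 pixels down.
dx, dy = 3, 2
frame2 = np.roll(np.roll(frame1, dy, axis=0), dx, axis=1)

# Because we generated the motion ourselves, the ground-truth flow is
# known exactly: every pixel of the square moved by (u, v) = (dx, dy).
flow = np.zeros((64, 64, 2), dtype=np.float32)
flow[frame1 > 0] = (dx, dy)
```

No manual annotation is involved in producing `flow`, which is precisely what makes synthetic data attractive for this task.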

An illustrative example of a face moving across multiple frames is shown in the image below. As the face moves, the position of each pixel changes. These changes can be observed between frame 1 and frame 2, and again between frame 2 and frame 3. The boundary of the moving face is traced by the pixels whose color changes. For more details on this optical flow example, refer here.

optical flow face pixel example

The previous example is a simple optical flow vector field for black and white frames. Optical flow tasks are much more challenging in real-life scenarios where the algorithmic solutions need to cope with varying environmental conditions, dynamic scenes, and large displacement between consecutive frames.

The optical flow is a velocity field in the image that transforms one image into the next image in a sequence. As such it is not uniquely determined; the motion field, on the other hand, is a purely geometric concept, without any ambiguity—it is the projection into the image of three-dimensional motion vectors.

Determining optical flow: A retrospective (1993)

What is optical flow used for?

Optical flow is a fundamental computer vision sub-field that has practical applications in autonomous driving, robot navigation, medical imaging, visual question answering, as well as in multiple visual effects areas. Optical flow plays an important role in video editing, analysis, compression, and restoration algorithms. It is used across various application fields like digital remastering and film restoration in the media industry, and for fabric defect detection in many manufacturing industries.

What is synthetic data for optical flow?

As data is the driving force of machine learning, the new wave of solutions faces challenges related to data size, generation, and ground-truth availability. Robust and relevant optical flow datasets are essential to ensure continued progress and to evaluate algorithms. For producing annotated datasets and evaluating the performance of optical flow algorithms, synthetic data has emerged as an attractive solution. Indeed, optical flow was one of the first computer vision fields to embrace synthetic data, relying on it for evaluation since the early 1990s.
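Evaluation against such ground truth usually reports the average endpoint error (EPE): the mean Euclidean distance, in pixels, between estimated and ground-truth flow vectors. A small sketch of the metric (the array shapes in the comment are illustrative assumptions):

```python
import numpy as np

def endpoint_error(flow_est, flow_gt):
    """Average endpoint error (AEE/EPE): mean Euclidean distance between
    estimated and ground-truth flow vectors, e.g. for (H, W, 2) arrays
    holding (u, v) at each pixel."""
    return float(np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1)))
```

An estimate that is off by (3, 4) pixels everywhere, for instance, has an EPE of exactly 5 pixels.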

Various low-level computer vision tasks on a synthetic image from the Multi-Object Video (MOVi) datasets

A list of synthetic datasets for optical flow

Synthetic datasets have been used for benchmark evaluation and are a necessity in highly dynamic or interactive scenes. The following table summarizes a list of synthetic datasets that can be used for optical flow. The table provides the year of publication and the size of each dataset. A closer look at these cited datasets and information on where to find them are given in the following sections of this article.

Dataset         | Published in | Frames
----------------|--------------|----------
Virtual KITTI   | 2016         | 21,260
Falling Things  | 2018         | 60,000
Flying Chairs   | 2015         | 21,818
SceneNet RGB-D  | 2017         | 5,000,000
Kubric (MOVi-F) | 2022         | 10,000

A summary table of a number of synthetic datasets for optical flow.

The first synthetic datasets for optical flow

Since the introduction of the famous Yosemite dataset (1994), synthetic image sequences have proven their worth as an alternative to real-world images for optical flow evaluation. The Yosemite sequence was one of the first synthetic image datasets ever created. It contains synthetic images of a fly-through of the Yosemite valley. The dataset has been used extensively to evaluate and optimize many optical flow algorithms. The data is hosted here.

In the context of vision-based driver assistance, the “.enpeda.. Image Sequence Analysis Test Site”, EISATS (2011), was made publicly available. The dataset is divided into multiple sets; set 2 consists of synthetic image sequences with ground-truth information. The ultimate goal of the dataset is to enable performance evaluation for various computer vision tasks such as stereo vision, optical flow, and motion analysis.

UCL-Flow (2013) is composed of a number of image pairs that feature moving objects and moving cameras (panning left and right, and rotating). This ground-truth dataset was initially generated to measure optical flow confidence. By learning the confidence, it becomes possible to estimate whether a flow algorithm is likely to fail and, consequently, to select the best-performing algorithm. The dataset can be downloaded from the project page, along with the source code to generate additional data.

The following two images are synthetically generated optical flow sequences from the UCL-Flow dataset. The two images represent the intensity image and a rendering of the ground truth flow field.

Widely used datasets for optical flow

A selection of the Yosemite sequences is part of the Middlebury dataset (2011), a widely used benchmark that provides a collection of synthetic images for optical flow algorithms. In addition to the Yosemite gray-scale sequences, Middlebury contains urban and natural outdoor synthetic sequences. These synthetic images are colored and generated under controlled conditions (lighting effects with shadows and rich textures). The dataset provides only eight frame pairs with ground truth. Find out more about the Middlebury optical flow datasets.

MPI-Sintel (2012) is a flow dataset for the evaluation of optical flow algorithms. It is derived from the animated film, Sintel. The dataset contains a variety of sequences rich in image degradations and effects. While the scenes remain the same between the dataset and the original movie, the rendering settings are modified to make the data suitable to construct an optical flow evaluation dataset. Blender is used to generate the frames of the dataset and to simulate different aspects of image creation. The MPI-Sintel dataset and its source code are available for download. The MPI-Sintel dataset is composed of three sets of image sequences with different effects. The three passes are:

  • Albedo: the simplest pass.
  • Clean: exhibits smooth shading, self-shadowing, and other more complex illumination and reflection effects.
  • Final: the most challenging pass; it adds atmospheric effects, depth of field, and various blur effects.
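Both Middlebury and MPI-Sintel distribute ground-truth flow in the binary Middlebury .flo format: a float32 magic tag (202021.25), the int32 width and height, then interleaved (u, v) float32 values in row-major order. A small reader/writer sketch of that layout (function names are our own):

```python
import numpy as np

TAG_FLOAT = 202021.25  # .flo magic number; reads as the bytes "PIEH"

def write_flo(path, flow):
    """Write an (H, W, 2) flow field in the Middlebury .flo format."""
    h, w, _ = flow.shape
    with open(path, "wb") as f:
        np.array([TAG_FLOAT], dtype=np.float32).tofile(f)
        np.array([w, h], dtype=np.int32).tofile(f)
        flow.astype(np.float32).tofile(f)  # row-major, interleaved (u, v)

def read_flo(path):
    """Read a .flo file back into an (H, W, 2) float32 array."""
    with open(path, "rb") as f:
        tag = np.fromfile(f, dtype=np.float32, count=1)[0]
        assert tag == np.float32(TAG_FLOAT), "not a valid .flo file"
        w, h = np.fromfile(f, dtype=np.int32, count=2)
        data = np.fromfile(f, dtype=np.float32, count=2 * int(w) * int(h))
    return data.reshape(int(h), int(w), 2)
```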

Virtual KITTI (2016) is a generated dataset that contains photo-realistic synthetic videos. It also includes automatically generated ground-truth labels for a multitude of video-understanding tasks such as object detection and instance segmentation. Virtual KITTI works by cloning a few seed real-world video sequences; the original real-world KITTI dataset was used as the seed for this real-to-virtual-world cloning method. In the last few years, improvements to the original Virtual KITTI dataset have resulted in a more photo-realistic and better-featured version that is available on the website of Naver Labs.

Comparing images from original KITTI (top row), Virtual KITTI (middle row), and Virtual KITTI 2 (bottom row). Image source: Announcing Virtual KITTI 2

Indoor and outdoor synthetic datasets

VIPER (2016), which stands for VIsual PERception benchmark, is a synthetic dataset with ground-truth data. As described in Playing for Benchmarks, the annotated data enables a wide range of low-level and high-level vision tasks including optical flow, semantic instance segmentation, object detection, and object-level 3D scene layout. The dataset is published online. Its 250,000 video frames were collected while interacting with the virtual world simulated in the popular computer game Grand Theft Auto V.

UnrealStereo (2018) is a synthetic dataset that contains synthetic images with hazardous regions. These regions exhibit properties related to transparency, specularity, lack of texture, and thin objects, and many algorithms fail when processing them. UnrealStereo therefore allows evaluating the robustness of stereo algorithms to changes in material and other scene parameters. Built on Unreal Engine 4 (UE4), UnrealStereo features six virtual scenes and a total of 10k image pairs that can be downloaded from here.

Falling Things (2018) is a synthetic dataset that seeks to provide photorealistic images that can be used to solve 3D detection and pose estimation vision tasks. Using UE4, the data contains indoor and outdoor scenes. Specifically, three environments and five locations with various lighting conditions were used along with 21 household objects. With its 60,000 annotated photos, the dataset is divided into two sets: 1) images with single objects, and 2) images with mixed objects (from 2 to 10 objects). The dataset can be downloaded from this link.

EDEN (2021) stands for Enclosed garDEN scenes and features more than 300,000 synthetic images of parks and gardens. It is a multimodal dataset with annotations that cover both low-level and high-level vision modalities. These annotations make this large-scale synthetic dataset a good fit for numerous vision tasks including optical flow. The dataset and other related materials are on the project’s page.

The different modalities, including forward and backward flow, from the EDEN dataset.

Human optical flow datasets

Unlike previously mentioned datasets, the RefRESH (2018) dataset focuses on the optical flow of humans. RefRESH is an abbreviation for “REal 3D from REconstruction with Synthetic Humans”, a semi-synthetic dataset with RGB-D dynamic scenes. Human motion is a fundamental characteristic of the RefRESH dataset. Its main purpose is to identify the rigid regions in dynamic scenes observed by a moving camera. In this dataset, the scene backgrounds are real-world, static, and rigid while the foreground objects are moving synthetic humans. 68,000 images from the RefRESH dataset are used for training a machine learning model. A link to the dataset can be found on Github along with its creation tool which is available here.

The Human Optical Flow (2020) dataset is composed of two parts: the Single-Human Optical Flow dataset (SHOF) and the Multi-Human Optical Flow dataset (MHOF). The synthetically generated image sequences combine images from both parts of the dataset. The dataset-building approach is designed to produce photorealistic artificial images, and the dataset was developed to overcome data limitations, insufficient images and annotations, so that models can be trained in a scalable and reproducible way. Here is the link to the human flow dataset.

Human motion estimation using data from the MHOF dataset

Data for convolutional neural networks

Convolutional neural networks (CNNs) have built their success on the availability of large amounts of labeled data. Popular datasets like ImageNet, MS-COCO, and Cityscapes are used across many high-level computer vision tasks such as object detection and classification. While these labeled datasets are widely adopted for deep learning vision tasks, one major limitation is that they are manually annotated. Manual labeling is a time-consuming and expensive process that cannot easily be generalized to different domains and to various downstream tasks.

How does optical flow estimation differ from other CNN tasks?

Optical flow estimation differs from other vision tasks where CNNs have proved successful. A CNN must not only learn feature representations but also match them at different locations across two images; this matching is a fundamental characteristic of optical flow estimation. At the same time, the lack of large ground-truth datasets for training CNNs is a limitation for optical flow, as it is for other prediction tasks. Therefore, finding efficient solutions to automatically generate labels is key to circumventing the scaling challenge.
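This matching step is often expressed as a cost volume, in the spirit of FlowNetC's correlation layer. The sketch below is illustrative, not the paper's exact configuration: it compares each feature vector in one map against displaced feature vectors in the other within a small window.

```python
import numpy as np

def correlation_volume(feat1, feat2, max_disp=3):
    """Build a naive cost volume: dot-product similarity between each
    location in feat1 and displaced locations in feat2, within a
    +/- max_disp pixel window. feat1, feat2: (H, W, C) feature maps.
    Returns an (H, W, D, D) array with D = 2 * max_disp + 1
    displacement candidates per axis (zero-padded at the borders)."""
    h, w, _ = feat1.shape
    d = 2 * max_disp + 1
    cost = np.zeros((h, w, d, d), dtype=np.float32)
    padded = np.pad(feat2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    for i in range(d):        # vertical displacement dy = i - max_disp
        for j in range(d):    # horizontal displacement dx = j - max_disp
            window = padded[i:i + h, j:j + w, :]
            cost[:, :, i, j] = (feat1 * window).sum(axis=-1)
    return cost
```

Taking the argmax of `cost[y, x]` recovers the most similar displacement at each pixel: a coarse integer flow estimate that a real network refines further.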

CNN-based approaches that use synthetic datasets

FlowNet (2015) pioneered the use of CNNs for optical flow estimation, building on their power to extract complex feature representations from the data, such as correspondences between two images. In this context, synthetically generating ground-truth information becomes an appealing solution for optical flow.

FlowNet, the approach behind the creation of Flying Chairs, inspired the generation of additional synthetic datasets. FlyingThings3D, Monkaa, and Driving are three synthetic stereo video datasets with over 35,000 image pairs. These datasets are of sufficient magnitude to enable the training of convolutional neural networks. All three are freely available online, and their image pairs are realistic and contain scene flow ground-truth information. In detail:

  • FlyingThings3D, a more sophisticated version of Flying Chairs, consists of everyday flying objects.
  • Monkaa is another synthetic dataset that features nonrigid, softly articulated motion and large disparities. Similar to MPI-Sintel, it is derived from an animated movie, Monkaa.
  • Driving is a synthetic dataset that features dynamic street scenes from the viewpoint of a driving car.

SceneNet RGB-D (2017) is a large synthetic dataset with 5 million indoor RGB-D images from 16,000 different room configurations. The dataset is split into a train and a test set with images of bedrooms, offices, kitchens, living rooms, etc. In the original work, these images feed an approach built on top of the U-Net architecture. This dataset is a great resource for numerous applications, from scene understanding and 3D labeling tasks to optical flow and camera pose estimation. SceneNet RGB-D data can be downloaded from here.

Flying Chairs is a synthetic dataset created from publicly available 3D chair models as well as images collected from Flickr. Multiple transformations are applied to the original data to generate ground truth image pairs. Since the introduction of Flying Chairs and its 22,872 image pairs, a number of similar synthetic datasets have been derived and created to suit specific use cases.

  • Flying Chairs 2 is a sequel to the initial Flying Chairs dataset that exhibits additional modalities including backward flows, motion boundaries, and occlusions.
  • ChairsSDHom is a synthetic dataset that is designed for untextured regions. It is also a good candidate to train networks on small displacements.
  • FlyingChairsOcc (2019) is another synthetic dataset that features bi-directional optical flow ground truth and two occlusion maps for each image pair.

Flying Chairs, Flying Chairs 2, and ChairsSDHom are available for download here and FlyingChairsOcc is available here.
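The Flying Chairs recipe, rendered objects composited over background photographs, each layer moving with its own transformation, can be sketched as follows. The real dataset uses rendered chair models over Flickr images with affine motions; this toy version (our own function and values) uses a square mask and plain translations, and ignores wraparound at the borders.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_pair(h=64, w=64):
    """Composite a moving 'object' over a moving background while keeping
    the exact forward flow of frame 1. Returns (frame1, frame2, flow)."""
    background = rng.random((h, w)).astype(np.float32)
    obj = np.zeros((h, w), dtype=bool)
    obj[24:40, 24:40] = True                 # square "chair" mask
    bg_motion, fg_motion = (1, 0), (-2, 3)   # (dy, dx) per layer

    def translate(img, dy, dx):
        return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

    frame1 = background.copy()
    frame1[obj] = 1.0                        # paint the object
    # Move each layer independently, foreground occluding background.
    frame2 = translate(background, *bg_motion)
    obj2 = translate(obj, *fg_motion)
    frame2[obj2] = 1.0

    flow = np.empty((h, w, 2), dtype=np.float32)
    flow[...] = bg_motion[::-1]              # (u, v) = (dx, dy)
    flow[obj] = fg_motion[::-1]
    return frame1, frame2, flow
```

Applying random transformations per sample, as Flying Chairs does, turns this into an endless supply of annotated pairs.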

Fine-tuning optical flow models with synthetic data

Most advances in neural network architectures exploit the concept of pre-training followed by fine-tuning. Typically, pre-training is performed on a large training set such as Flying Chairs or FlyingThings3D, then the model is fine-tuned on limited training data specific to the target domain. Here’s a quick overview of some popular and advanced optical flow models.
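The pre-train-then-fine-tune recipe can be illustrated with a deliberately tiny stand-in model: a linear regressor fit on plentiful "synthetic" data, then adapted with a few gradient steps on a scarce "target-domain" set. All weights and sample counts below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearby domains: the target weights differ slightly from the
# synthetic-domain weights, mimicking a domain gap.
w_synth = np.array([1.0, -2.0])      # "synthetic domain" ground truth
w_target = np.array([1.2, -1.7])     # shifted "target domain"

X_synth = rng.normal(size=(10_000, 2))
y_synth = X_synth @ w_synth
X_target = rng.normal(size=(20, 2))  # only a handful of target samples
y_target = X_target @ w_target

# Pre-training: closed-form least squares on the large synthetic set.
w, *_ = np.linalg.lstsq(X_synth, y_synth, rcond=None)

# Fine-tuning: small gradient steps on the scarce target data.
lr = 0.05
for _ in range(500):
    grad = 2 * X_target.T @ (X_target @ w - y_target) / len(X_target)
    w -= lr * grad
```

After fine-tuning, `w` sits close to the target-domain weights even though only 20 target samples were available: the synthetic pre-training supplied a good starting point.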

Optical flow models

FlowNet 2.0 is an extension of FlowNet that became the state-of-the-art CNN for optical flow estimation. It inherits all of the original’s benefits while improving on its accuracy. With FlowNet 2.0, CNN models for optical flow started to outperform classical approaches.

Subsequent CNN models for optical flow propose simplified architectures, as is the case with LiteFlowNet, SPyNet, and PWC-Net. Besides being faster and smaller than FlowNet, these architectures also improve accuracy when benchmarked on public datasets such as Middlebury, KITTI, Flying Chairs, and Sintel.

Another deep neural network architecture is Recurrent All-Pairs Field Transforms (RAFT). It achieves state-of-the-art accuracy with a significantly reduced error rate, strong cross-dataset generalization, and high efficiency in terms of training speed and inference time. A full description of these models is beyond the scope of this blog, but they are important to mention as they are transforming the domain of optical flow.

Powerful frameworks for optical flow synthetic datasets

OmniFlow (2021) is a synthetic omnidirectional human optical flow dataset that renders indoor environments with characters. The data-generation pipeline is composed of a rendering engine, a camera, rooms, animations, characters, objects, illumination, and motion. OmniFlow data has been used to fine-tune a RAFT model that was pre-trained on the FlyingChairs and FlyingThings datasets. The data can be found on the project’s page.

AutoFlow (2021) addresses the question of how to build an effective optical flow dataset while optimizing model performance on a target dataset. The work presents a methodology for generating synthetic optical flow data, applying data augmentation to 2D rendered layers. The motion, shape, and appearance of each layer are controlled and randomly generated according to a set of hyperparameters. The code and the dataset of 40,000 training images are available on GitHub.
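The idea of hyperparameter-controlled generation can be sketched as follows. The hyperparameters, rectangular layers, and value ranges here are our own simplification; AutoFlow's actual search space covers much richer shapes, appearance, and augmentations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters controlling the generated data.
hparams = {
    "num_layers": (1, 4),   # min/max foreground layers per sample
    "size": (4, 16),        # min/max layer size in pixels
    "max_motion": 8.0,      # largest translation magnitude in pixels
}

def sample_layer(h, w):
    """Sample one moving rectangular layer: its mask and its 2D motion."""
    lo, hi = hparams["size"]
    size = int(rng.integers(lo, hi + 1))
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    mask = np.zeros((h, w), dtype=bool)
    mask[y:y + size, x:x + size] = True
    motion = rng.uniform(-hparams["max_motion"], hparams["max_motion"], size=2)
    return mask, motion

def sample_flow(h=64, w=64):
    """Compose layers back-to-front over a static background; layers
    drawn later occlude the flow of earlier ones."""
    flow = np.zeros((h, w, 2), dtype=np.float32)
    n = int(rng.integers(*hparams["num_layers"], endpoint=True))
    for _ in range(n):
        mask, motion = sample_layer(h, w)
        flow[mask] = motion
    return flow
```

AutoFlow's contribution is then to search over such hyperparameters so that models trained on the generated data perform well on the target dataset.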

Flow360 (2022) highlights the importance of combining 360° panoramic images and optical flow estimation in self-driving and robotics systems. Optical flow estimation on 2D images enables an understanding of the world by capturing motion and trajectories; with spherical cameras that capture panoramic images, it becomes interesting to extend optical flow to full panoramic 360° understanding. Based on the CARLA simulator, Flow360 is a synthetic panoramic dataset that aims to foster development in the field of flow estimation. The dataset contains 6,400 images of street scenes with diverse environmental conditions and pixel-level ground-truth annotations. Flow360 is a first step toward overcoming the lack of available panoramic datasets. Its code and data can be accessed online.

Synthetic images from the Flow360 dataset

Kubric (2022) is a framework that allows the generation of photo-realistic synthetic scenes with rich annotations. The data-generation pipeline is built on top of PyBullet to enable physics simulations and Blender for scene rendering. With its rich set of annotations, the generated datasets can then be used for myriad vision tasks. To support the requirements of a range of vision use cases, a number of synthetic datasets of varying levels of complexity have been created. These datasets are managed through a dedicated front-end called SunDs (Scene Understanding Datasets). Among the many ready-to-use datasets through SunDs we find:

  • Five Multi-Object Video (MOVi) datasets (MOVi-A to -E)
  • Synthetic scenes containing flat surfaces
  • Salient object detection datasets
  • Scene semantic understanding datasets

Kubric’s MOVi-F, an adapted version of MOVi-E, is used for optical flow prediction. Kubric’s creators published the source code of the framework as well as the source for SunDs, the datasets’ front-end.

Towards intrinsic processes and interleaved workflows

Synthetic datasets have been widely used for low-level and high-level computer vision algorithms. In this article, we covered how synthetic datasets are beneficial for optical flow estimation tasks and presented a number of synthetic datasets. Synthetic data generation makes it possible to produce a large amount of labeled data required for various computer vision tasks. Put simply, synthetic generation provides a reliable way to produce labeled data where manual labeling fails or cannot scale.

The scale and pixel-perfect ground-truth quality of these datasets open the door to a promising perspective where generating training data and training models are tightly interleaved within the same workflow. Indeed, this is the case with AutoFlow, where the synthetically generated data is optimized for MPI-Sintel’s Final pass. Furthermore, understanding what makes good synthetic data for specific tasks remains an important challenge; solving it would make it possible to fine-tune pre-trained models with task-specific synthetic data that reflects the characteristics of a defined downstream application. Such processes, harnessing synthetic data, would enable machine learning models to better learn and understand the real world.
