Synthetic Data: The Early Days and Onwards

Written By elie

Data scientist

Real-world data has a long history in artificial intelligence (AI). Collecting, processing, or distributing real-world datasets often comes with gathering costs, quality problems, and privacy concerns. So how can modern AI systems be trained when they require large amounts of high-quality data? Data may be widely available, yet it can be unlabeled, highly biased, legally protected, or mostly of low quality. To overcome these issues, companies and researchers have turned to synthetic data as a way to meet both the challenges posed by real-world data and the requirements of recent AI technologies.

The case for synthetic data

Data challenges remain the major obstacle to leveraging AI in industrial enterprises. Data scarcity, privacy, bias, gathering costs, and provisioning are among the most common. This holds regardless of the data source: 1) enterprise data, 2) user-generated data, 3) IoT data, and 4) web data.

Synthetic data promises to unlock the true potential of AI by removing many of the limitations of real-world data. Instead of being collected, the data is generated, and this is a game-changer across different sectors.

What is synthetic data?

Synthetic data is information that has been generated through an artificial process rather than produced or measured in the real world. To be useful within a specific domain, the algorithmically generated data should reflect and look like real-world data.

Synthetic data: what are its application domains?

Across various industries and research fields, synthetic data has many application areas that range from computer vision and simulated environments to privacy protection and data augmentation. Synthetic data is widely used in robotics, automation, finance, and healthcare, among other domains. There are many benefits to opting for synthetic data such as:

  • Build complete training datasets for AI (e.g., computer vision, simulated environments)
  • Create datasets free of sensitive or personally identifiable information (e.g., healthcare, finance)
  • Balance datasets by generating additional samples for a specific class or population (e.g., fraud detection)
  • Accelerate and facilitate the generation of datasets without the need for expensive simulations (optical inverse-design)
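
To make the dataset-balancing benefit concrete, here is a minimal sketch of one possible approach: creating extra minority-class samples by interpolating between pairs of real ones, a simplified variant of the idea behind SMOTE. The function name, the toy fraud records, and their two features are all hypothetical and only serve as an illustration.

```python
import random

def oversample_minority(samples, n_new, seed=0):
    """Create n_new synthetic points by interpolating between random pairs
    of existing minority-class samples (a simplified, SMOTE-like idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)   # two distinct real samples
        t = rng.random()                # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical toy data: (amount, hour_of_day) for a few fraudulent transactions.
fraud = [(950.0, 2.0), (1200.0, 3.0), (870.0, 1.0), (1500.0, 4.0)]
extra_fraud = oversample_minority(fraud, n_new=6)
balanced_fraud = fraud + extra_fraud
print(len(balanced_fraud), "fraud samples after oversampling")
```

Each synthetic point lies on the segment between two real points, so it stays inside the range of the observed data. Real projects would typically interpolate toward nearest neighbors and validate that the new samples are plausible.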

Why is synthetic data so important?

Synthetic data helps solve many issues related to the availability and limitations of real-world data in AI. It is often cited as a crucial privacy-enhancing technology (PET). In addition, its application areas are expected to keep growing: according to Gartner, 60% of AI training data is expected to be synthetically generated by 2024.

The early days of synthetic data

The use of synthetic data dates back to the 1970s and the early days of computing. Many of the first systems and algorithms needed data to operate. Limited computational power, the difficulty of gathering large amounts of data, and privacy disclosure issues steered efforts toward generating synthetic data and, as a result, turned this domain into a key strategic advantage.

Synthetic data for computer vision

The idea of synthetic data sets isn’t new. It has been around since the beginning of computer vision and much earlier than the current golden era of deep learning. The Early Days of Synthetic Data, a book chapter, traces the journey of synthetic data and how computer vision researchers turned to synthetic data as a way to provide easier test problems for computers.

Artificial drawings are all over the early days of computer vision, with many works dating back to the 1960s and 1970s. Examples include M. B. Clowes, On Seeing Things (1971), and D. A. Huffman, Impossible Objects as Nonsense Sentences (1971). Both works aimed to teach a computer to interpret line drawings of polyhedra as 3D scenes. Carrying out edge detection, 3D modeling, or line labeling on real images would have created too many problems to solve given the limited computational power of the time.

Note that these artificially generated drawings did not serve to build training datasets. Instead, they were used as inputs to computer vision algorithms for edge and shape detection.

Synthetic data for autonomous vehicles

In some industries, such as autonomous vehicles, synthetic data is valuable because it makes it easy to generate data depicting dangerous driving conditions (snow, ice, night driving, or unexpected pedestrian or animal positions). Such conditions can be hard to find or dangerous to reconstruct in the real world. One of the first attempts to build an autonomous vehicle was the ALV (Autonomous Land Vehicle) project, funded by DARPA in the 1980s. Its vision system, VITS (Vision Task Sequencer), used custom-made algorithms.

The use of synthetic data for training autonomous vehicles (AV) can be traced back to a project called ALVINN (An Autonomous Land Vehicle In a Neural Network). In the research paper, the author, Dean Pomerleau, acknowledges the difficulty of collecting a large number of training exemplars depicting roads under a wide variety of conditions. The paper was submitted to the NIPS conference in 1989.

“To avoid these difficulties, we have developed a simulated road generator which creates road images to be used as training exemplars for the network.”

Dean Pomerleau, 1989

The proposed solution was a road generator that creates road images to be used in training data. This is one of the earliest mentions of synthetic data to be used as training input in the context of a project to develop an autonomous vehicle.

Synthetic data for data confidentiality

Privacy protection has been a focus of attention for several decades, but it has become the subject of particular attention with the advent of big data. The protection of private and sensitive information is governed by multiple directives and laws that are country- and region-specific.

Privacy preservation is a much-debated research topic, and multiple techniques and statistical disclosure control processes have been developed and refined over time.

The idea originates with Donald Rubin, who in his comment Discussion: Statistical Disclosure Limitation (1993) proposed generating a fully synthetic dataset that contains no identifiable information about the dataset it was generated from. The idea is to treat all units of the population that were not selected in the sample as missing data, and then impute these records using the framework of multiply imputed synthetic datasets (MISDs).

The proposal offered here is to release no actual microdata but only synthetic microdata constructed using multiple imputation so that they can be validly analyzed using standard statistical software.

Donald Rubin, Discussion Statistical Disclosure Limitation
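
A heavily simplified sketch of the fully synthetic idea: fit a model to the confidential sample, then release several datasets drawn from that model instead of the real records. This is not Rubin's actual procedure. The normal model, the bootstrap step used as a stand-in for drawing model parameters from a posterior, the function name, and the toy incomes are all assumptions made for illustration.

```python
import random
import statistics

def fully_synthetic(observed, m=3, size=None, seed=0):
    """Return m synthetic datasets drawn from a normal model fitted to
    `observed`. Values are sampled from the model, not copied from the data."""
    rng = random.Random(seed)
    size = size or len(observed)
    datasets = []
    for _ in range(m):
        # Bootstrap resample so each release reflects uncertainty in the fit
        # (an assumed stand-in for a proper posterior draw of the parameters).
        boot = [rng.choice(observed) for _ in observed]
        mu = statistics.mean(boot)
        sigma = statistics.stdev(boot)
        datasets.append([rng.gauss(mu, sigma) for _ in range(size)])
    return datasets

# Hypothetical confidential incomes from a survey sample.
incomes = [31_000, 42_500, 38_200, 55_000, 47_300, 29_800, 61_000, 44_100]
releases = fully_synthetic(incomes, m=3)
for i, data in enumerate(releases, start=1):
    print(f"release {i}: mean={statistics.mean(data):.0f}")
```

Releasing several datasets rather than one lets an analyst combine the results and account for the extra variability introduced by the synthesis.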

A closely related approach was proposed by Roderick Joseph Alexander Little in the same year. Given a dataset, only part of the sensitive information is replaced by multiple imputations. For instance, only the subset of records that need to be protected is altered. This is a trade-off: keeping more of the original values preserves data utility but can increase the risk of disclosure. Today this method is known as generating partially synthetic datasets.
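
The partially synthetic idea can be sketched in a few lines: only the records flagged as needing protection get a model-based replacement value, and everything else is released as is. The record layout, the protection rule, and the simple normal model are hypothetical, and a single draw is used here where the method described above relies on multiple imputations.

```python
import random
import statistics

def partially_synthetic(records, field, needs_protection, seed=0):
    """Replace `field` with a model-based draw only for records flagged by
    `needs_protection`; all other records are returned unchanged."""
    rng = random.Random(seed)
    values = [r[field] for r in records]
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    released = []
    for r in records:
        r = dict(r)  # copy so the confidential originals are not modified
        if needs_protection(r):
            r[field] = round(rng.gauss(mu, sigma), 2)
        released.append(r)
    return released

# Hypothetical records; only very high earners are treated as at risk here.
people = [
    {"id": 1, "income": 31_000},
    {"id": 2, "income": 250_000},
    {"id": 3, "income": 42_500},
    {"id": 4, "income": 38_200},
]
released = partially_synthetic(people, "income", lambda r: r["income"] > 100_000)
for row in released:
    print(row)
```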

These two early works laid the foundations for using synthetic data for data confidentiality and established their inventors as the founding fathers of synthetic data in this field.

Modern applications of synthetic data

Today synthetic data is approaching the mainstream, with a large number of applications relying on it as a viable way to alleviate many of their data-related issues and to generate data at scale. Training data for AI systems should be large, accurate, relevant, and diverse.

Synthetic data on the web

With the data currently available on the web, it’s possible to build AI systems that let users generate faces of people who don’t exist, craft new artworks inspired by various styles, or create virtual but realistic landscapes. In other words, it’s possible to generate realistic-looking fake images of things, locations, and persons that don’t exist in the real world.

Several technologies power the generation of synthetic data. Generative adversarial networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, are one of them. GANs can model and generate high-dimensional data: they are trained to produce realistic synthetic samples that are nearly indistinguishable from real-world data.
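
For reference, GAN training is usually written as a two-player minimax game between a generator G, which turns random noise z into synthetic samples, and a discriminator D, which estimates the probability that a sample is real. The notation below follows the original 2014 paper, with p_data the real data distribution and p_z the noise prior:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to maximize this value by telling real from generated samples, while the generator tries to minimize it by producing samples the discriminator accepts as real.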

Synthetic data to supplement numerical simulations

In research, the AI transformation has had a tremendous impact. Over the last few decades, computer-based simulation tools have been created to assist researchers. Instead of preparing and running real lab experiments, researchers can run a computer-based simulation. This turned out not to be enough, as some simulations can take weeks or more to finish. Furthermore, an expert is still needed to guide the progression of numerical simulations, because scanning multidimensional parameter spaces becomes an inefficient task.

Without physically accurate simulation to generate the data we need for these AIs, there’s no way we’re going to progress.

Rev Lebaredian, Vice President, Omniverse & Simulation Technology at Nvidia

Synthetic data takes this one step further. Researchers run the simulation tool to generate a subset of the data. An AI system is then trained on the numerically generated data. Once the AI is trained, covering the rest of the multidimensional parameter space becomes a matter of prediction. Synthetic data generation thus opens the door to replacing a large portion of time-consuming numerical simulations with predictions. For instance, for tasks like studying how light interacts with matter and inverse-designing optical devices, these neural network-based AI systems are orders of magnitude faster than numerical simulations.
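
The workflow described above (simulate a subset, train a model, predict the rest) can be sketched as follows. To stay dependency-free, this sketch swaps the neural network for a simple piecewise-linear interpolator, and `expensive_simulation` is a hypothetical stand-in for a slow numerical solver over a single parameter.

```python
import math

def expensive_simulation(x):
    """Hypothetical stand-in for a slow numerical solver."""
    return math.sin(x) + 0.1 * x

def build_surrogate(xs, ys):
    """Return a predictor that linearly interpolates the simulated points."""
    def predict(x):
        for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        raise ValueError("x is outside the simulated range")
    return predict

# Step 1: run the simulator on a coarse subset of the parameter space.
coarse_x = [i / 2 for i in range(13)]            # 0.0, 0.5, ..., 6.0
coarse_y = [expensive_simulation(x) for x in coarse_x]

# Step 2: "train" the surrogate on the simulated data.
surrogate = build_surrogate(coarse_x, coarse_y)

# Step 3: predict the rest of the space without calling the simulator.
dense_x = [i / 20 for i in range(121)]           # 0.0, 0.05, ..., 6.0
predictions = [surrogate(x) for x in dense_x]

# The simulator is called again here only to measure the surrogate's error.
worst_error = max(
    abs(p - expensive_simulation(x)) for p, x in zip(predictions, dense_x)
)
print(f"worst absolute error: {worst_error:.4f}")
```

Only 13 simulator calls are needed to cover 121 parameter values. In a real setting the surrogate would be a trained model over many parameters, and its error would be checked on held-out simulations.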

Replicating the real world with synthetic data

There are several initiatives to build tools that reconstruct the real world, such as Nvidia’s Omniverse project, or that generate virtual human models, such as Unity’s PeopleSansPeople project. Both projects provide toolkits and templates to lower the barrier to entry for the community.

The Omniverse Replicator is a powerful tool to build AI training datasets for perception networks using synthetic data. In particular, the Omniverse Replicator enables the generation of 3D synthetic data that is as physically accurate as possible so it can match the real world well enough. More information about this project can be found in this IEEE Spectrum article.

PeopleSansPeople, on the other hand, consists of a set of tools, virtual lighting, and a camera system that enable the generation of 2D or 3D scenes populated by human assets in a variety of poses, on top of a number of background objects with natural textures. Its simulator provides control over data generation and makes it easier to tune datasets in order to bridge the simulation-to-real (sim2real) gap in transfer learning.

Preparing the future with synthetic data

When useful real-world data isn’t available in large amounts, lacks diversity, or doesn’t mirror reality, AI systems are likely to fail. Synthetic data can simulate the real world closely enough to transform our day-to-day life, accelerate research, and prepare AI systems for real-world situations.

Looking back to the past and looking ahead to the future, synthetic data is already one of the key technologies that can considerably shape the future of AI and transform the real/physical world. The use of synthetic data is key to amplifying AI innovation and at the same time controlling AI proliferation.
