Synthetic data is not a new phenomenon. Rules-based synthetic data has been around longer than most people realize and is commonly used in analytics for data augmentation, conjoint analysis, and simulation testing. Rules-based methods, however, lack flexibility and struggle with complex data distributions: the assumptions baked into hand-crafted rules don’t always hold universally, and manually defining rules becomes impractical as datasets grow. Generative AI (genAI) models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) learn complex distributions directly from real data, making it faster and easier to produce realistic, high-quality synthetic data that can then train better-performing AI models.
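To make the contrast concrete, here is a minimal, purely illustrative sketch of the rules-based approach in Python. Every field name, value range, and correlation rule below is a hypothetical assumption rather than something drawn from a real dataset; a GAN or VAE would instead learn these relationships directly from real records rather than relying on hand-written rules.

```python
import random

# Minimal sketch of rules-based synthetic data generation (illustrative only).
# The fields, ranges, and the age/income rule are hypothetical assumptions.

def generate_customer(rng: random.Random) -> dict:
    age = rng.randint(18, 90)
    # Hand-written rule: income loosely tied to age. Rules like this encode
    # assumptions that rarely hold universally and must be maintained by hand
    # as the schema and the data grow.
    income = 20_000 + (age - 18) * rng.uniform(500, 1_500)
    return {"age": age, "income": round(income, 2)}

rng = random.Random(42)
synthetic_rows = [generate_customer(rng) for _ in range(5)]
for row in synthetic_rows:
    print(row)
```

A learned generative model replaces the hand-coded rule with parameters fitted to real data, which is why it scales to distributions too complex to describe explicitly.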
Forrester defines synthetic data as:
Generated data of any type (e.g., structured, transactional, image, audio) that duplicates, mimics, or extrapolates from the real world but maintains no direct link to it, particularly for scenarios where real-world data is unavailable, unusable, or strictly regulated.
GenAI-based synthetic data is becoming the unsung hero of AI development. For example, we have synthetic data to thank for Microsoft’s Phi-1 base model, which was trained on a curated, “textbook-quality” synthetic dataset rather than exclusively on traditional web data, an approach that appears to help mitigate toxic and biased content generation. These smaller models will continue to play a crucial role in scaling genAI implementation for industry-specific use cases.
Synthetic data is also likely to grow in popularity because it can speed up AI model training by generating large, clean, relevant datasets. NVIDIA claims its NVIDIA Isaac Sim simulation application can help “train [computer vision models] 100 times faster.” Synthetic data providers are emerging to democratize AI training, and their solutions are not limited to computer vision systems. Synthetic data provider Gretel, for example, released the world’s largest open-source text-to-SQL synthetic dataset to help developers train models that work with tabular data.
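For developers who want to experiment with such a dataset, a minimal sketch using the Hugging Face datasets library might look like the following. The dataset identifier and column names are assumptions based on Gretel’s public release and should be verified against the actual dataset card before use.

```python
# Minimal sketch: pulling an open-source text-to-SQL synthetic dataset into a
# training pipeline. The dataset ID and column names below are assumed, not
# confirmed; check the dataset card before relying on them.
from datasets import load_dataset

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")  # assumed ID

# Each record pairs a natural-language prompt with the SQL it should produce.
example = dataset[0]
print(example.get("sql_prompt"), "->", example.get("sql"))
```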
One of the most salient advantages of using synthetic data for AI model training is data privacy. Because the generated data maintains no direct link to the original dataset, it cannot readily be traced back to its source. This matters most in sensitive domains such as healthcare, medical research, and financial services, where the use of data for AI training is highly regulated and requires strict adherence to privacy laws and regulations.
As the field of AI continues its rapid expansion, the demand for training data escalates in tandem, and regulators are responding with increasingly robust frameworks governing how real data can be used. Synthetic data offers a viable path forward: it enables faster model training to meet market demand while remaining fully compliant with those regulatory constraints.
If you’re curious to hear more about how to best leverage synthetic data, please join me at Forrester’s Technology & Innovation Summit North America in Austin, Texas, on September 9–12, 2024. I’ll be presenting a session on synthetic data use cases, and there will be a variety of other sessions on related topics, so definitely check out the agenda here.