Skip to main content

Synthetic data v the real thing

ITS and smart cities thrive on data: but does all the data need to be real? Steve Harris of Mindtech explains why the answer could lie in combining elements of the real world with the synthetic
By Steve Harris January 9, 2023 Read time: 6 mins
© Jasonjung | Dreamstime.com

Data is at the heart of innovation, and real-world data sources have always been the driving force of technological advancements. However, the time for change has arrived. Smart cities have been a topic of prolonged discussion, yet challenges we hope to resolve remain pervasive in our society. Traffic, air pollution, accidents – all continue to pose problems. Reflecting on what is lacking from current artificial intelligence deployments suggests the need for better training data solutions.

Real-world training data has its place and there is no taking away from its integral contribution to training AI systems. However, it can also be flawed. When it comes to training computer vision systems for smart cities, real-world data has been the key focus but has notably lacked in addressing corner cases. If we wish to continue to enhance AI innovation and cover applicability to any edge case, then there is a solution: synthetic data.

What is synthetic data?

Training AI systems requires large volumes of carefully labelled datasets. Compiling enough compliant, annotated real-world data to support these use cases can be incredibly labour-intensive and time-consuming. It can be impractical to spend the time collecting and labelling datasets that can consist of thousands or even millions of objects. Additionally, some use cases may consist of variables for which gathering data is not feasible or challenging to create by traditional means. As a result, synthetic data has emerged as a resource for businesses to solve these problems.

Synthetic data is generated from computer systems with the intention of resembling real data both statistically and structurally. Privacy issues can be a major setback of real data. Synthetic data avoids these issues by delivering privacy and GDPR-compliant training data.

A common misunderstanding is that synthetic data is not accurate or reliable, but in actuality it can be closer to the real world than gathered real-world data itself. Datasets can be created for any situation. Issues with geographical location, camera capture and deployment systems are not an issue. The process is constantly evolving and, as more data is artificially generated, the potential for creating more refined data sets that better resemble real-world use cases gets better. This approach offers an alternative that provides the opportunity to collect greater volumes at a quicker rate, and more inexpensively.

This does not eliminate real data

Preconceptions about real-world data being more reliable or accurate are highly prevalent. In comparison, synthetic data is faster, more cost-effective and more flexible, with greater potential for scaling. It provides data scientists with the ability to innovate and break barriers in place with the use of real data alone.

However, this does not eliminate the need for real-world data. Whilst there are evident advantages of artificially generating information to train AI models, if the aim is to optimise and build a truly intelligent system, a combination of the two is necessary. Real data combined with synthetic data delivers robust AI models covering the key use cases while being thoroughly tested against edge cases. The two go hand in hand and it is important not to dismiss the relevance of both.

The future of smart cities

Smart cities are intended to be safer and on the whole, better. This means technology is implemented to optimise everyday processes and improve citizens’ quality of life. Improved road safety and travel experience are high on the digital transformation agenda. Before adopting these digital solutions, relevant pain points must be identified in order to offer a better user experience. Often, when trained on real data, systems can fail to deal with infrequent scenarios - the ones that typically characterise the need to build smarter and safer cities.

Real data combined with synthetic data delivers robust AI models covering the key use cases while being thoroughly tested against edge cases
Real data combined with synthetic data delivers robust AI models covering the key use cases while being thoroughly tested against edge cases

Synthetic data can be generated for any edge case needed, meaning AI systems can be prepared for these situations. It is ultimately an essential tool in driving the adoption of AI for smart cities. It can supply local authorities with information and valuable data about their infrastructure, particularly their mobility and travel routes. For example, it can help computer vision systems to predict where certain accidents are more likely to happen within the city's current infrastructure, or where potential cyclists will be at risk of injury. A lack of data for a given class, in this case cyclists, can lead to failures in detection, which can result in fatalities. Targeting these class imbalances through generating synthetic data for use cases like this within autonomous vehicles can make roads safer for all users.

Driving simulators are a critical part of developing autonomous vehicles. While automotive companies have trained vehicles for a long time on real-world data, certain circumstances can be difficult to account for. The collection and labelling of the data used for training this software is also expensive and time consuming. Corner cases reflecting real-world conditions like extreme weather or accidents create a gap, increasing the potential for error. This is where synthetic data comes in to fill gaps in the dataset and also reduce incumbent bias. Expanding the data pool for edge cases stretches the capabilities of an AI model, allowing for the development of solutions in the form of automated tolling, automated vehicles or intelligent traffic lights which will be integral to smart cities. Regardless of how this technology is deployed, synthetic data is guaranteeing these systems are truly intelligent and safe.

Being able to easily find a parking spot or preventing constant traffic jams is all in the pursuit of cities that are better to live in. The deployment of smart technology should come with considerations of how it will impact the lives of those living in the city. A human focus should be maintained throughout the process.

Aligning citizens’ and officials’ needs

Given synthetic data is not subject to the same privacy issues that real-world data is, there is also a greater opportunity for transparency. The lack of need for tracking personally identifiable information allows for models to be developed for meaningful causes, with safety in mind, without breaching these concerns. Its machine-generated nature saves valuable time and energy, whilst also providing the opportunity to counteract bias in the base data.  Offering a viable solution for preserving data privacy means that datasets can be more openly published and analysed without having to mask or omit information. This further pushes for service-building that prioritises citizen wellbeing.

City officials should have the needs of citizens at the top of their agendas and so, optimising city functions for the sake of elevating quality of life and wellbeing should also be in their best interests. However, decisions will always come down to economic implications too. Promoting economic growth is understandably an important driver of implementing smart technologies. Smart cities come at a cost, but with the use of synthetic data this can be a more affordable process. As mentioned, real-world data is expensive to collect, both in terms of resources and time. Synthetic data can be collected inexpensively and can be readily available for whatever edge case may be needed. Therefore, leveraging the use of synthetic data can bring the needs of citizens and city officials into alignment to develop smart cities that benefit everyone.

Steve HarrisUltimately, AI systems are essential in the move towards smarter cities all over the world and currently, there is a need to revise training methods to allow the technology to work as effectively as it can. Synthetic data may be the answer to addressing the shortcomings of the system, pushing us closer to making this a reality in the near future. It is more flexible, cheaper and more convenient to collect. However, it is important to acknowledge that a diversity of datasets will always be the more optimal approach to accurately training models. Therefore, jointly training with synthetic data will pave the way for developing globally smarter and safer cities.

About the Author:
Steve Harris is CEO of Mindtech

For more information on companies in this article

Related Content

  • “Gas tax hasn't gone up since 1993: that's where tolling can come in”
    March 14, 2025
    IBTTA president James Hofmann talks to Adam Hill about new beginnings plus the need for tolling to get the user experience right, streamlining digital experiences - and what to expect from the IBTTA Technology Summit in Dallas
  • ITF Corporate Partnership Board projects highlight ways forward
    October 29, 2014
    The findings of the first four projects launched by the ITF Corporate Partnership Board (CPB), the organisation's platform for engaging with the private sector, have been announced. CPB projects are designed to enrich policy discussion with a business perspective. They are launched in areas where CPB member companies identify an emerging issue in transport policy or an innovation challenge to the transport system. Led by ITF, work is carried out in collaborative fashion in working groups consisting of CP
  • Need for secure approach to connected vehicle technology
    January 7, 2013
    Accidental or malicious issue of false messages to connected vehicles could result in dire consequences, so secure systems of authentication and certification are likely to be necessary, write Paul Avery and Sandra Dykes. Connectivity among vehicles in urban traffic systems will provide opportunity for beneficial impacts such as congestion reduction and greater safety. However, it also creates security risks with the potential for targeted disruption. Security algorithms, protocols and procedures must take
  • Machine vision needs standards to fulfil ITS demands
    May 28, 2014
    No-one should expect the enabling qualities of machine vision to come free of charge but Jason Barnes finds there is still much that ITS stakeholders can do to help reduce costs. After many years of application in high-end solutions for the enforcement and tolling sectors, machine vision is gaining traction in more general areas of traffic management. Nevertheless, those OEMs producing transport-oriented solutions which incorporate machine vision and looking to increase the technology’s share of the ITS mar