I'm tired of writing about AI, but the money just keeps flowing. This week, OpenAI made headlines with its latest funding round, which pegs the company at a $157 billion valuation, one of the largest ever for a private company. The market wants AI, and it wants Sam Altman to lead the way, even as he keeps losing people left and right, the latest exit being CTO Mira Murati. But behind the flashy numbers and the hype, there's a deeper story about why developing these large language models (LLMs) costs so much.
Developing new AI models is expensive, and OpenAI's $6.6 billion raise reflects this. First, there's the hardware required to train these models: server farms' worth of GPUs crunching data day and night, plus the energy costs to run them. Then there's the cost of talent. The competition for the brightest minds in AI has driven annual salaries into the millions, and there's no sign of that slowing down. Finally, a major hidden cost in developing LLMs is the data itself. High-quality, diverse training data is the only way to make these models better. They've already been trained on most of the internet, and finding new, real data requires licensing deals and data collection. But to give an answer for any scenario, you need data on every scenario. If I'm trying to decide on the optimal number of cups to buy for my coffee shop, how do I account for extreme cases? Say there's a conference nearby, and suddenly, instead of a slow Tuesday, there's an influx of 50+ customers ordering lattes. Enter synthetic data: artificially generated data that mimics the patterns of the real world.
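To make the coffee-shop example concrete, here's a minimal sketch in Python (the parameters, like the 2% spike probability, are all invented for illustration) of how you might generate synthetic demand data that includes those rare conference-day spikes, so a downstream model sees extreme cases it might never encounter in a year of real sales logs:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def synthetic_daily_demand(n_days, base_mean=80, spike_prob=0.02, spike_size=50):
    """Simulate daily cup demand, including rare 'conference nearby' spikes."""
    # Ordinary days: demand fluctuates around a baseline (Poisson noise).
    demand = rng.poisson(lam=base_mean, size=n_days)
    # Rare days: a nearby event adds a burst of extra customers.
    spikes = rng.random(n_days) < spike_prob
    demand[spikes] += spike_size + rng.poisson(lam=10, size=spikes.sum())
    return demand

# Generate a year of synthetic history, then size inventory against
# the 99th percentile rather than the average day.
history = synthetic_daily_demand(365)
print("average day:", round(history.mean(), 1))
print("stock for the 99th percentile:", int(np.percentile(history, 99)))
```

The point of the sketch is the last two lines: averaged over real data alone, you'd buy for the typical Tuesday; with the synthetic tail events mixed in, you can plan for the conference crowd too.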
Imagine if, instead of scraping the web or licensing datasets, we could create our own perfect training data on demand. That’s what synthetic data promises: a way to generate large amounts of data tailored specifically to train AI models without the privacy, bias, and quality issues that often plague real-world datasets. You're effectively simulating reality.
For example, if you want to train an AI model to understand how to brew the perfect cup of coffee, you could generate millions of hypothetical brewing scenarios, each adjusted to provide the exact conditions the model needs to learn from. You’re not waiting for the perfect barista to come along—you’re creating that ideal training moment yourself. This also means that synthetic data can help overcome one of the biggest challenges in AI today: the fact that real-world data often has inconsistencies or biases. By generating balanced, representative scenarios, companies like OpenAI can try to ensure their models are learning in a fair and inclusive way.
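Here's what "creating the ideal training moment" could look like in code, as a toy Python sketch. The brewing parameters and the labeling function are entirely made up; in practice the labels would come from a simulator, an expert rubric, or a teacher model. The idea is that every combination appears exactly once, so the dataset is balanced by construction rather than skewed toward whatever real baristas happened to do:

```python
import itertools
import random

random.seed(0)

# Hypothetical brewing parameters for illustration.
grind_sizes = ["fine", "medium", "coarse"]
water_temps_c = [88, 92, 96]
brew_times_s = [120, 180, 240]

def label_quality(grind, temp, time):
    """Stand-in labeling function; a real one would come from a
    simulator, an expert rubric, or a teacher model."""
    score = 0
    score += 1 if grind == "medium" else 0
    score += 1 if temp == 92 else 0
    score += 1 if 150 <= time <= 210 else 0
    return score  # 0 (bad) to 3 (ideal)

# Every scenario appears exactly once: balanced by construction.
dataset = [
    {"grind": g, "temp_c": t, "time_s": s, "quality": label_quality(g, t, s)}
    for g, t, s in itertools.product(grind_sizes, water_temps_c, brew_times_s)
]

print(len(dataset), "scenarios, e.g.:", random.choice(dataset))
```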
However, the problem with saying things like 'fair and inclusive' is that there's always the question of who decides what those words mean. A generator with a biased model of the world will inadvertently produce synthetic data that reflects, and amplifies, those biases; how can we be sure the model creating this data isn't shaped by its own blind spots? It's like training for a marathon on a treadmill: it's helpful, but it's not quite the same as running on the open road. In this case, investors are not only relying on OpenAI to be this perfect model of the world, they're agreeing to stay monogamous:
"OpenAI told investors in the new round that it doesn’t want them to put money into its biggest private competitors. Those include Anthropic, which was founded by several ex-OpenAI employees; Safe Superintelligence, which was co-founded by OpenAI’s former chief scientist Ilya Sutskever; and xAI. Before starting xAI, Musk was an OpenAI cofounder and its first major source of funding." — OpenAI Nearly Doubles Valuation to $157 Billion in Funding Round, WSJ
Business is business, but the implications are significant. If synthetic data is used to train LLMs, and those LLMs are deployed in critical applications such as customer service, medical diagnosis, or financial advice, then the biases of the original model get baked into every one of them. Those biases aren't just preserved; they can be amplified each time synthetic data is generated and reused without addressing the foundational issues. The result is a feedback loop where bias begets more bias, which makes the case for multiple models from multiple companies even stronger.
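To see why that feedback loop is dangerous, consider a toy simulation in Python. The "model" here is just the observed rate of some binary attribute, and the invented 3% skew stands in for a generator's bias. Fit the model, generate synthetic data from it, refit on that output, and repeat:

```python
import random

random.seed(1)

def train_and_regenerate(data, skew=0.03, n=10_000):
    """Fit a trivial 'model' (the observed rate of a binary attribute),
    then generate new synthetic data from it. The small systematic skew
    stands in for a bias in the generating model."""
    rate = sum(data) / len(data)
    biased_rate = min(1.0, rate + skew)  # model over-represents the attribute
    return [1 if random.random() < biased_rate else 0 for _ in range(n)]

# Start from balanced real data, then repeatedly train on synthetic output.
data = [1 if random.random() < 0.5 else 0 for _ in range(10_000)]
for generation in range(10):
    data = train_and_regenerate(data)
    print(f"generation {generation + 1}: attribute rate = {sum(data)/len(data):.2f}")
```

After ten rounds, an attribute that started at 50% prevalence sits near 80%. No single generation looks alarming on its own, which is the insidious part; independent models trained on independent data wouldn't all share the same skew.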
Thank you
I want to joke about how this is the top, but I’m not so sure anymore. As always, if you have any questions, want more explanations, or strongly disagree, comment below, follow me on Twitter (X), follow me on Instagram, or shoot me an email.
Disclaimer: These views are my own, and do not necessarily reflect the views of any organization with which I am affiliated. This article is written with AI assistance.