Part IX · Chapter 55

The ChatGPT Detonation

OpenAI launches ChatGPT (Nov 30, 2022); GPT-3.5/4 era; scaling laws as industry doctrine; the transformer-era inflection that made AI compute-bound for the first time. → Why chips became the binding constraint of intelligence itself.

On the morning of Wednesday, November 30, 2022, the people inside the eight-story Pioneer Building on the corner of 18th and Folsom in San Francisco were nervous. They were nervous about the wrong things. They were nervous that the model would say something offensive in its first hour of public exposure and that the press would write about that one exchange instead of the thousand merely interesting ones. They were nervous that the inference servers would not handle the load. They were nervous, on a more existential register, that the launch would simply land flat. The team had been told for weeks to keep expectations low. The product they were releasing today was a chat interface wrapped around a fine-tuned variant of a model in OpenAI’s GPT-3.5 series, internally called, until almost the last minute, Chat With GPT-3.5. They had decided, mercifully, to call it ChatGPT. They were planning to release it as what Sandhini Agarwal would later describe to MIT Technology Review as a research preview, a tease of a more polished version, not something they wanted to oversell as a fundamental advance.

In the early afternoon Pacific time, Sam Altman, the chief executive, posted, from his @sama account on Twitter, the most consequential nine words of his career. “today we launched ChatGPT,” he wrote. “try talking with it here.” He attached a link to a free signup page at chat.openai.com. The post had no marketing graphics, no embedded video, no list of capabilities, no quote from a famous beta tester. It was a flat, lowercased declaration of availability. In the Pioneer Building, the engineers refreshed dashboards and watched the queue depths. Greg Brockman, the president and the company’s most relentless shipping engineer, kept his laptop open through the afternoon. Mira Murati, the chief technology officer, who had spent the autumn pushing for a product that ordinary people could touch, watched the feedback channel fill up. Ilya Sutskever, the chief scientist, by all later accounts, looked at the early traffic and was quiet. The OpenAI board, several of its members would much later testify, learned about the launch from Twitter.

By Friday, December 2, the model had, by Altman’s count, crossed a hundred thousand users. By Sunday it had crossed five hundred thousand. On the following Monday, December 5, five days after launch, Altman posted again. “ChatGPT launched on wednesday,” he wrote, again in his deliberate lowercase. “today it crossed 1 million users.” The note had the flat affect of a man trying to announce something privately while shouting it. Inside the company the engineers were not sleeping. The servers were not in fact handling the load. The team was scaling capacity hour by hour, throwing more GPUs at the inference fleet, watching the cost per query climb, debating internally whether to put up rate limits or paywalls or login walls and concluding, one by one, that the right answer was to let the wave keep coming and bear the cost. By the end of January, according to an estimate UBS analysts published in early February, ChatGPT had crossed a hundred million monthly users. UBS, accustomed to writing measured prose about consumer internet adoption, allowed itself a sentence the analysts would have to defend in client calls for months afterward. In twenty years of following the internet space, they wrote, they could not recall a faster ramp in a consumer internet app. TikTok had taken nine months to reach a hundred million users. Instagram had taken two and a half years. Facebook, four. ChatGPT had done it in two months, with no marketing budget, no app, and a product that for many of its early users could not yet handle a sustained load without timing out.

What the world had been handed, on the day after the U.S. Thanksgiving holiday and a week before its end-of-year news cycle, was not a product. It was a demonstration. The demonstration was that the line on a graph that a small group of researchers had been staring at for the better part of three years had not, in fact, started to bend. The graph plotted, on its horizontal axis, the amount of computation thrown at the training of a particular kind of neural network, and on its vertical axis, the loss the network achieved on a held-out body of text. In January 2020, a team at OpenAI led by a theoretical physicist named Jared Kaplan, including Sam McCandlish, Tom Henighan, Tom Brown, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and a Princeton-trained researcher named Dario Amodei, had published a paper on the arXiv preprint server titled Scaling Laws for Neural Language Models. The paper was, in form, dry and empirical. It reported, in clean log-log plots, a straight line. The loss declined as a power law in compute, in the number of parameters in the model, and in the size of the training dataset, across more than seven orders of magnitude. Other architectural choices, the paper found, mattered remarkably little within wide bands. Width did not matter much. Depth did not matter much. Specific recipes did not matter much. What mattered was the scale.
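
The line itself is compact enough to state. As the paper reported it, held-out loss falls as a power law in each resource taken alone; a sketch of the headline fits follows, with exponents and constants quoted from the arXiv version and best treated as approximate:

```latex
% Kaplan et al. (2020): test loss as a power law in non-embedding
% parameters N, dataset tokens D, and optimally allocated compute C_min.
% Constants approximate, quoted from the paper's published fits.
\begin{align}
  L(N) &= \left(\tfrac{N_c}{N}\right)^{\alpha_N},
    & \alpha_N &\approx 0.076, & N_c &\approx 8.8 \times 10^{13} \\
  L(D) &= \left(\tfrac{D_c}{D}\right)^{\alpha_D},
    & \alpha_D &\approx 0.095, & D_c &\approx 5.4 \times 10^{13} \\
  L(C_{\min}) &= \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C},
    & \alpha_C &\approx 0.050, & C_c &\approx 3.1 \times 10^{8}~\text{PF-days}
\end{align}
```

The small exponents are the whole story: with an exponent near 0.05, cutting the loss meaningfully requires multiplying compute, not adding to it.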

The line had a shape that was easy to understand and difficult to internalize. If a researcher could afford to spend a hundred times more compute next year than this year, the loss would decline by a predictable amount. If the loss declined by that amount, capabilities the smaller model lacked would, often, suddenly come within reach. The pattern had been documented, before the Kaplan paper, in the work of physicists studying the asymptotic behavior of statistical-learning systems, and in scattered observations across DeepMind, Google Brain, and OpenAI itself. Kaplan and his coauthors made it the explicit doctrine. They drew the line and dared the field to find a place where it bent. For most of the next three years, no one did. In March 2022, a team at DeepMind led by Jordan Hoffmann published a follow-up that adjusted the recipe in an important way. They called the model they trained Chinchilla. The paper, titled Training Compute-Optimal Large Language Models, argued that existing large models, including the 175-billion-parameter GPT-3 that OpenAI had trained in 2020, were undertrained on data: for every doubling of model size, the optimal allocation called for a doubling of training tokens, a ratio that earlier papers had let drift. Their compute-matched 70-billion-parameter Chinchilla, trained on 1.4 trillion tokens, beat all the larger competitors. The paper revised the slope of the line. It did not bend it. The doctrine survived.
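
The arithmetic behind the revision is short enough to check by hand, using the standard approximation that training a dense transformer costs about six floating-point operations per parameter per token. A minimal sketch with the published headline figures (the 6ND rule and the figures themselves are both approximations):

```python
# Compare GPT-3's and Chinchilla's training budgets and data ratios,
# using the common C ~ 6*N*D approximation for dense transformers.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute, in FLOPs, for a dense transformer."""
    return 6 * params * tokens

gpt3 = train_flops(175e9, 300e9)        # GPT-3 (2020): ~3.2e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)  # Chinchilla (2022): ~5.9e23 FLOPs

print(f"GPT-3:      {gpt3:.1e} FLOPs, {300e9 / 175e9:4.1f} tokens/param")
print(f"Chinchilla: {chinchilla:.1e} FLOPs, {1.4e12 / 70e9:4.1f} tokens/param")

# Hoffmann et al.'s allocation rule: for a fixed budget, grow parameters
# and tokens together (each as roughly the square root of compute), which
# lands near 20 tokens per parameter -- about ten times GPT-3's ratio.
```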

The substrate beneath the doctrine was an architecture that had been published five years earlier in a paper from a small team at Google Brain. The paper, posted to arXiv in June 2017 by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin, was titled Attention Is All You Need. It introduced a network architecture the authors called the Transformer, built around a mechanism called self-attention, and it dispensed with the recurrent and convolutional structures that had defined the previous generation of sequence models. The Transformer’s defining property, beyond its accuracy on machine translation, was that it parallelized cleanly across many GPUs. The recurrent architectures it replaced had to process a sequence largely token by token, which meant the computation could not, in general, be spread across many chips at once. The Transformer’s could. A modern data-center GPU was an enormous parallel arithmetic engine. The Transformer was a model for which that engine was, in effect, a perfect substrate. If the scaling laws were the doctrine, the Transformer was the engine, and the relationship between them, in the years between the Vaswani paper and Kaplan’s, looked increasingly like a single technological wager. Compute was the input. Capability was the output. Engineering was, in the limit, the work of converting one into the other.
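
The hardware affinity is visible in the shapes of the matrices. In one head of self-attention, every token’s query is compared against every other token’s key in a single batched matrix multiply, so a whole sequence is processed at once rather than step by step. A minimal single-head sketch follows (illustrative only; real implementations add causal masking, multiple heads, and fused GPU kernels):

```python
# Single-head self-attention from Attention Is All You Need, reduced to
# three matrix multiplies and a softmax. No loop over tokens appears
# anywhere: the sequence dimension rides along inside the matmuls, which
# is exactly the shape of work a GPU is built for.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings. Returns (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project all tokens at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (seq_len, seq_len) affinities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # mix values, one more matmul

rng = np.random.default_rng(0)
d = 64
X = rng.standard_normal((128, d))  # a 128-token sequence
out = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)  # (128, 64)
```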

The wager required enormous amounts of compute, and the reason ChatGPT was an OpenAI product on November 30, 2022, rather than a Google or DeepMind or Meta one, came back to a deal Altman and Satya Nadella had begun negotiating in 2019. Microsoft’s first investment in OpenAI, a billion dollars announced that summer, came with a clause whose strategic significance was only later widely understood: Azure would be the exclusive cloud provider for OpenAI’s workloads, and in return Microsoft would build the supercomputer OpenAI needed. The first version of that machine, announced at the Build conference in May 2020, was a cluster of more than ten thousand Nvidia V100 GPUs lashed together with InfiniBand. It would, Microsoft said, have ranked among the five most powerful supercomputers in the world, and it was the cluster on which GPT-3 was trained. By the time of GPT-4, whose training ran for roughly a hundred days on a successor cluster of around twenty-five thousand Nvidia A100 GPUs, the compute bill for the run alone was, by outside estimates, on the order of sixty million dollars, and the strategic logic of the OpenAI-Microsoft pairing had become unmistakable: OpenAI did the research; Microsoft owned the only Western cloud that could train the result.
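
Those reported figures can be sanity-checked on the back of an envelope. The sketch below assumes the press-reported cluster size and duration, the A100’s published BF16 peak, and a guessed hardware utilization; none of these numbers are official, and the output is an order of magnitude, not a measurement:

```python
# Rough estimate of the GPT-4 training run's total compute, under stated
# assumptions: ~25,000 A100s for ~100 days (press reporting, not OpenAI),
# 312 TFLOP/s peak per A100 at BF16, and ~35% model-FLOPs utilization,
# a typical figure for large distributed training runs.
gpus = 25_000
days = 100
peak_flops_per_sec = 312e12  # A100 dense BF16 peak
mfu = 0.35                   # assumed utilization

total_flops = gpus * days * 86_400 * peak_flops_per_sec * mfu
print(f"~{total_flops:.0e} FLOPs")  # on the order of 2e25
```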

The path from the underlying GPT-3.5 to ChatGPT itself was a separate technical story whose central innovation had been published in March 2022 in another arXiv paper, “Training Language Models to Follow Instructions with Human Feedback.” The paper described a technique known across the field as RLHF, reinforcement learning from human feedback, that OpenAI had been developing in stages since 2017. RLHF took a base model trained to predict the next token across a vast corpus of internet text and refined it in three stages: supervised fine-tuning on human-written demonstrations, training of a separate reward model on human comparisons of paired responses, and reinforcement learning against that reward model via the Proximal Policy Optimization algorithm John Schulman had developed at OpenAI, building on his graduate work at Berkeley. The result was a model that no longer talked like the internet. It talked like an anxious, helpful assistant that refused certain requests and apologized for the limits of its knowledge. That voice was the artifact of an alignment procedure. It was not the voice of the underlying network. It was the voice the network had been taught to wear.
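
Of the three stages, the reward model is the most compact to illustrate. Given a human judgment that one response beats another, the model is trained so the preferred response scores higher, through the pairwise loss -log σ(r_chosen − r_rejected) described in the Ouyang et al. paper. A toy sketch with a linear stand-in for the reward network (a real reward model is a full transformer with a scalar head, and the inputs here are synthetic):

```python
# Miniature RLHF reward-model training: learn a scalar score such that
# "chosen" responses outrank "rejected" ones under the pairwise
# preference loss -log(sigmoid(r_chosen - r_rejected)).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16                          # toy feature dimension
reward = torch.nn.Linear(d, 1)  # stand-in for a transformer + scalar head
opt = torch.optim.Adam(reward.parameters(), lr=1e-2)

# Synthetic "embeddings" of paired responses; chosen ones are shifted so
# there is a signal for the model to find.
chosen = torch.randn(256, d) + 0.5
rejected = torch.randn(256, d)

for step in range(200):
    loss = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss {loss.item():.3f}")  # falls well below log(2) ~= 0.693
```

The PPO stage then optimizes the language model against this learned score, with a penalty that keeps the policy from drifting too far from the supervised model.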

Inside OpenAI, the launch decision had been a near-run thing. The original plan had been to ship a more capable model, the one that would later become known as GPT-4. The team running that model’s evaluations had been finding throughout the autumn that it was not yet stable enough for public release. Altman and Brockman, watching engineers make growing internal use of an instruction-tuned chat interface that had been running for months on the API playground, decided in mid-November to put a smaller, faster, less capable thing in front of the public first. The decision was made over the objections of much of the company. Many of the team thought the model was not ready. Some worried that the press cycle would be unfavorable. The board was not consulted in any meaningful way. Altman would later say that he believed the launch would do well; the rest of the company, he recalled, had told him “we don’t think you should launch this.” He launched it. The team gave themselves about thirteen days to integrate the chat front end, harden the moderation pipeline, and stand up enough capacity to hold the line for a few hours of curiosity traffic. Then the curiosity traffic arrived, and it was not curiosity traffic.

The difference between this product and previous AI products, for the people who tried it, was not that it was strictly more accurate. It was that it would do anything. A user could ask it for a poem about the Federal Reserve in the style of Bob Dylan, and it would write one. A user could paste in twelve hundred words of a doctoral thesis and ask it for a brutal three-paragraph critique, and it would write one. A user could ask it to write a Python script that scraped a webpage, debug the script when it failed, suggest a refactor in the manner of a senior engineer, and explain the underlying protocol, HTTP, in language a high-school student could understand, and it would do all four in a single conversation. Andrej Karpathy, the former OpenAI researcher who had recently left Tesla, posted long technical observations on Twitter about how the model produced its tokens and how its psychology differed from anything researchers had previously had to think about. Programmers who had spent fifteen years developing intuitions about which tasks computers could do and which they could not found the line moving inside a single weekend. Bill Gates, who that summer had hosted Altman, Brockman, and Satya Nadella at his home outside Seattle to watch a more advanced internal model answer Advanced Placement biology questions, would later write that he had witnessed two technological demonstrations in his life that struck him as revolutionary. The first had been the Xerox Alto’s graphical user interface in 1980. The second was the model OpenAI was building. ChatGPT, in the public’s hands, was the smaller, older sibling of the model that had stunned him.

What ChatGPT did to the people who had been working on transformers was harder to articulate. They had known, as a matter of empirical observation, that the line on the Kaplan plot kept going. They had not known, until the world responded, what a particular point on that line meant in human terms. The GPT-3.5 model behind ChatGPT was, by frontier-research standards, modest, trained on compute that was no longer cutting edge. The reason it caused a public detonation was not that it was the largest model anyone had built but that it was the first model anyone had built that had been wrapped in an interface ordinary people could find and use. Underneath the chat box was a three-year extrapolation of a power law. That power law had been the private thesis of a small community of researchers, including Dario Amodei before he left OpenAI in late 2020 to found Anthropic with his sister Daniela and a dozen colleagues; the authors of Kaplan’s paper; and Gwern Branwen, a pseudonymous essayist who had spent the pandemic years writing long expositions of what the scaling hypothesis implied. They had been arguing, for the better part of three years, that intelligence was less a software problem than a compute problem, that the algorithms were largely figured out, that the binding constraint was no longer ideas but flops. ChatGPT was the moment that argument moved from a private thesis to a public fact.

In Mountain View, the response was immediate and panicked. Sundar Pichai, the chief executive of Google, declared an internal Code Red. The phrase, by Christmas, had escaped into the press. Larry Page and Sergey Brin, who had spent the previous five years almost entirely absent from the day-to-day operation of the company, were summoned back. Engineers across Google Research, Trust and Safety, and the Brain and DeepMind teams were redirected onto AI product work. Google had had, internally, a chatbot called LaMDA, built on a Transformer architecture descended from the very paper its own researchers had published in 2017, and by some internal benchmarks more capable than ChatGPT. It had not been released because the company’s safety review processes had judged it too risky for the brand. By February 2023, the calculus had inverted. Google rushed out a product called Bard, promoted it with a demo in which the model made a confident factual error about the James Webb Space Telescope, and watched its market capitalization fall by roughly a hundred billion dollars in a single trading session. The lesson of the Bard demo, beyond the obvious one about haste, was that the company that had invented the architecture under the floor of the entire Transformer revolution had not been the company that had figured out how to package it. Years of caution had produced, in the public’s perception, a single embarrassing minute of footage.

In Redmond, the response was the opposite. Satya Nadella had spent four years patiently building the substrate on which OpenAI’s models could run. On January 23, 2023, fewer than eight weeks after the ChatGPT launch, Microsoft announced an additional ten-billion-dollar investment in OpenAI, structured as a multi-year commitment a substantial portion of which would be paid in the form of Azure compute credits. The terms, reconstructed from filings and reporting, included a profit-sharing arrangement that gave Microsoft three-quarters of OpenAI’s profits until it had recouped its investment, a roughly forty-nine-percent equity stake for Microsoft, with other investors holding a similar share and the nonprofit parent the remainder, and a continuation of the exclusive-cloud clause from 2019. Microsoft’s stock rose. Bing, on which Microsoft had lost money for years, was rapidly retrofitted with a chat interface based on a more capable OpenAI model. By February, Nadella was telling reporters that he wanted to make Google “dance.” For Microsoft, the bet that had begun in 2019 with a billion-dollar investment in a company most people had never heard of was now the largest single strategic bet on its books. For OpenAI, the deal converted its compute pipeline from a question of fundraising into a question of consumption. The company would not, at least for the foreseeable future, be limited by money. It would be limited by the rate at which Nvidia, TSMC, and the supply chain could build the things it had to consume.

The technological centerpiece of the consumption story, by an accident of timing that would shape the rest of the decade, had been launched eight months before ChatGPT. In March 2022, at its GTC conference, Nvidia had introduced the Hopper architecture and its flagship data-center accelerator, the H100. The chip’s defining feature was a hardware unit called the Transformer Engine, designed in close consultation with the model researchers at OpenAI, Google, and elsewhere who were already running into the limits of the previous-generation A100. The Transformer Engine could do mixed-precision arithmetic at FP8, an eight-bit floating-point format that two years earlier had been considered too lossy for serious training and that the Hopper team had shown, with appropriate scaling tricks, could be used for production runs of the largest models. Compared to the A100, an H100 ran transformer training and inference workloads up to nine times and thirty times faster respectively, depending on the precise mix. The H100 had been designed for a workload that, on the day it shipped, had no widely known consumer-product instantiation. By the end of December 2022, that situation had reversed. ChatGPT was the workload. Every hyperscaler in the world placed orders. Lead times stretched from weeks to months to a year. The H100, by the second quarter of 2023, was the most-coveted single product in the technology industry, and Jensen Huang, the chief executive of Nvidia, who had spent two decades arguing that GPUs were general-purpose accelerators and not a graphics product, found the argument finally settling itself. Nvidia’s data-center revenue, which had been around fifteen billion dollars for the fiscal year that ended in January 2023, would more than triple in the year that followed and more than double again in the year after that. The company’s market capitalization, which had crossed a trillion dollars for the first time in May 2023, would cross two trillion in early 2024 and three trillion in June of that year.
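
What made FP8 workable was scaling: before a tensor is cast down to eight bits, it is rescaled so its values fill the format’s narrow range, and the scale factor is carried alongside and divided back out after the matrix multiply. A crude simulation of the idea follows; it models only per-tensor scaling and mantissa rounding, not the real E4M3 format or the Transformer Engine’s per-layer recipe:

```python
# Per-tensor scaling for low-precision matmuls, in miniature. FP8 E4M3
# tops out near 448, so each tensor is scaled to fill that range, rounded
# to ~4 significant bits (E4M3's implicit 1 plus 3 mantissa bits), and
# the scales are divided back out after a high-precision accumulation.
import numpy as np

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def to_fp8_scaled(x):
    scale = E4M3_MAX / np.abs(x).max()      # per-tensor scale factor
    m, e = np.frexp(x * scale)              # mantissa in [0.5, 1), exponent
    q = np.ldexp(np.round(m * 16) / 16, e)  # round mantissa to ~4 bits
    return q, scale

rng = np.random.default_rng(0)
x, y = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
a, sa = to_fp8_scaled(x)
b, sb = to_fp8_scaled(y)
c = (a @ b) / (sa * sb)  # accumulate in high precision, then unscale
print(f"max abs error vs full precision: {np.abs(c - x @ y).max():.3f}")
```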

In the months after the launch, Altman began describing compute as the new oil. The phrase oversimplified the way most aphorisms do. Oil was fungible; compute was a deeply differentiated stack of fabs, packaging, memory, networking, power, and software. Oil came out of the ground; compute had to be designed, fabbed, racked, networked, and cooled. The aphorism nonetheless caught something correct. For the first time since the early days of the personal computer, the binding constraint on the most economically important new technology in the world was the rate at which a physical industrial supply chain could deliver units of a particular product, and the supply chain ran from a handful of Dutch lithography systems through a handful of Taiwanese fabs and Korean memory makers and American GPU designers into the hyperscale data centers of California, Oregon, Virginia, Iowa, and Texas. The detonation in San Francisco on November 30, 2022, was, for that supply chain, an order signal of a magnitude no one had modeled.

For the people who had been watching the Kaplan plot, the consequences were less mysterious than they were enormous. If the line continued, the next model would cost ten times more compute than the last, the one after that ten times more again, and the ceiling would be set by the world’s collective ability to manufacture the inputs. Anthropic, by then eighteen months old, was raising the first of what would become a long sequence of multi-billion-dollar rounds aimed almost entirely at compute. Google was reorienting internally from research lab to product company. Meta, which had open-sourced its first generation of large language models earlier in 2022, was rethinking its compute roadmap. Inflection, Adept, Mistral, Cohere, Character, and a dozen other companies, several of them not yet founded, would, over the following twenty-four months, raise capital on a scale the venture industry had never deployed for software, almost all of which would flow eventually to Nvidia, to TSMC, to ASML, to SK hynix, to Samsung. The scaling laws had, over a single weekend, been promoted from a research curiosity into a bill of materials. Intelligence, in the operational sense the field had begun to mean, was now priced in dollars per token, capped in gigawatts, and contested in fab capacity.

Eight weeks earlier, on October 7, the Bureau of Industry and Security at the U.S. Department of Commerce had published a set of export controls aimed at the same workload OpenAI was about to release into the wild. The rule had landed in a press environment focused on midterm elections, inflation, and Ukraine, and most American consumers had no concrete sense of what an advanced AI accelerator did or why one would matter. Eight weeks later, on the afternoon of November 30, Altman announced a research preview, the world went to talk to it, and the meaning of those controls began to consolidate without anyone in Washington having to add a word. They were no longer a preemptive measure aimed at a future technology. They were a contemporary measure aimed at the substrate of the most important consumer-internet product release in twenty years.

In the Pioneer Building, on the morning after the launch, the engineers had not slept. The queue depths had still not stabilized. The cost per query was rising. The capacity team was begging for more H100s. The model was telling jokes, writing essays, refusing requests, hallucinating, apologizing, and, in the weeks to come, redrawing the boundary of what most people meant when they said the word computer. None of those engineers, that morning, were thinking about the export controls of October 7. They were thinking about the next hour of traffic. The two things were, by then, the same thing. They had not yet been told that they were.