The Fab
Part IX · Chapter 59

The Hyperscaler Silicon Pivot

Google TPU comes of age (TPU v4 → v5p → Trillium / v6). AWS Trainium and Inferentia. Microsoft Maia. Meta MTIA. Apple's Project ACDC and the Baltra chip. Broadcom and Marvell as the picks-and-shovels designers. Custom silicon as a margin and dependency strategy. → Custom silicon goes from exotic to essential at hyperscaler scale.

In the fall of 2013, an Israeli engineer named Nafea Bshara walked into the Virginia Inn at the corner of First and Virginia in Seattle, slid into a booth across from James Hamilton, and arranged a stack of paper slides face-down on the table so the other patrons would not see what was on them. Bshara had printed the deck that afternoon at a UPS Store on the way over from his hotel. He was the co-founder of an obscure two-year-old fabless chip startup based in Yokne’am, Israel, called Annapurna Labs, named for the Himalayan ridge that he and his co-founder, the Bosnian-born Hrvoje Bilic, had been planning to climb when they decided to start a company instead. The Annapurna pitch, on those slides, was a custom system-on-chip aimed at storage controllers and networking. Hamilton, by then Amazon’s vice president and distinguished engineer, the unassuming Canadian-born systems thinker who had been telling audiences for years, most visibly at re:Invent, that the future of cloud computing belonged to whoever owned as much of their own infrastructure as possible, was the kind of person an Israeli chip founder would want to find at a Pacific Northwest happy hour. He had a standing reputation for taking meetings at pubs and listening longer than he talked.

Hamilton had not come into the Virginia Inn looking for a chip company. He had come, by his own later telling, because a mutual friend had asked him to listen to a startup. By the second beer he had begun to understand that the slides Bshara was hiding from the rest of the restaurant were not really a storage controller pitch. They were a description of an end-to-end ARM-based design and verification team, smaller and faster and cheaper than anything Amazon could build in Seattle, that already knew how to ship silicon. The cost of buying Intel Xeons at list price was eating an unacceptable share of EC2’s gross margin. The Nitro program, an internal effort to peel network and storage virtualization off the host CPU onto purpose-built cards, was just getting underway. The slides on Bshara’s table looked, with squinting, like the team that could build the chips Nitro was going to need. Within fifteen months, AWS would purchase Annapurna outright for an estimated three hundred and fifty million dollars, and the team in Yokne’am would become the engineering core of every custom processor Amazon would ship for the next decade.

Around the same period, on a different continent, a Stanford-trained computer architect in his late fifties named Norman Jouppi was being recruited away from HP Labs to join a chip team that did not yet exist, at a company that had never publicly described itself as a chip company. Jouppi had spent his career at the temple of microprocessor design. He had been one of the principal architects of Stanford’s MIPS chip in the early 1980s, then had crossed to Palo Alto’s Western Research Laboratory under DEC, then under Compaq after the 1998 merger, then under HP after the 2002 merger, designing the pipelines and caches and memory hierarchies that the rest of the industry was studying in textbooks. The pitch his recruiter at Google made was that the company’s chief scientist, Jeff Dean, had a problem the rest of the industry had never had to solve, and the Stanford-MIPS-DEC bloodline was the place to find someone who could solve it. Jouppi took the meeting. By the end of 2013 he was on the badge.

The problem Dean had described to him was not, technically, a chip problem. It was a calculator problem. Sometime in 2013, Dean had sat down with a sheet of paper and worked out what would happen if every Android user in the world started talking to their phone for three minutes a day, dictating short queries to a deep neural network for speech recognition. The arithmetic, on the assumption that the neural network would have to run on the same general-purpose CPUs that ran Search and Gmail and YouTube, came back as a number Dean did not at first believe. Google would have to double the size of its global datacenter footprint, the largest privately operated computing infrastructure on earth, just to keep up with people talking to their phones. Doubling the datacenter footprint was not a path Larry Page would approve. The alternative, the one Dean started turning over in early 2013 inside Google Brain, was to build a chip designed not for general-purpose computation but specifically for the matrix multiplications that lived at the heart of every neural network. The chip did not exist. The team to build it did not exist. The fabrication arrangements with TSMC did not exist. The schedule the project would need to meet, if Dean’s three-minutes-of-voice problem was going to be solvable, was about fifteen months. This was the project Jouppi had been recruited to lead.
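A rough reconstruction of that envelope shows the shape of the problem. Every input below is an illustrative assumption rather than a figure from Google; the point is only that plausible numbers land at a fleet-sized answer.

```python
# Back-of-envelope in the spirit of Dean's 2013 calculation.
# Every number here is an illustrative assumption, not a figure from Google.

android_users       = 1.0e9   # assumed active Android users
speech_min_per_day  = 3       # minutes of dictation per user per day
flops_per_audio_sec = 30e9    # assumed DNN inference cost per second of audio (FLOPs)
flops_per_core      = 20e9    # assumed sustained FLOP/s of one CPU core on this workload
cores_per_server    = 16      # assumed cores per datacenter server

audio_seconds_per_day = android_users * speech_min_per_day * 60
flops_per_day         = audio_seconds_per_day * flops_per_audio_sec

# Servers needed if the load were spread perfectly evenly across the day.
server_flops_per_day = flops_per_core * cores_per_server * 86_400
servers_needed       = flops_per_day / server_flops_per_day

print(f"~{servers_needed:,.0f} servers running flat out, all day, just for speech")
```

Under these assumptions the answer comes out near two hundred thousand servers dedicated to nothing but dictation; change any input and the exact figure moves, but not the conclusion that it is a datacenter-scale quantity rather than a rack-scale one.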

The first version of what Google would later call the Tensor Processing Unit was, by the standards of advanced chip design, almost willfully crude. The team picked a 28-nanometer process node from TSMC that had been in volume production since 2011 and was, by 2014, considered last-generation. They picked a clock rate of seven hundred megahertz, far below what a modern microprocessor of the same year would have run at, because slow clocks meant the design’s timing closure could be done quickly and the wafers could be back inside Google datacenters as fast as possible. They built the heart of the chip around a single 256-by-256 grid of eight-bit multiply-accumulate units, sixty-five thousand of them in total, arranged in a systolic array of the kind H. T. Kung had described in academic papers in the late 1970s but that no commercial silicon vendor had ever found a profitable reason to build at that scale. The architecture could do essentially one thing, the dense matrix multiply, and it could do that one thing at a throughput that made the math of large neural network inference tractable. The first TPUs went into Google datacenters in 2015, fifteen months after the project began, on a schedule the chip industry had not seen since the original microprocessor era. By the time AlphaGo defeated Lee Sedol in Seoul in March 2016, the inference for AlphaGo’s policy and value networks was running on Jouppi’s TPUs in Google’s data centers. The world watched a Korean Go champion lose to a machine. Almost no one outside of Google’s hardware organization knew that the machine had been running on a chip that did not officially exist.
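The geometry in that description fixes the chip’s headline number, and the arithmetic can be checked directly from the figures above; the ISCA paper quotes the result as 92 tera-operations per second.

```python
# Peak 8-bit throughput implied by the TPU v1 geometry described above.
macs        = 256 * 256   # 65,536 multiply-accumulate units in the systolic array
ops_per_mac = 2           # each cycle a MAC performs one multiply and one add
clock_hz    = 700e6       # 700 MHz clock

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"peak ~{peak_tops:.1f} tera-operations per second")  # ~91.8, quoted as 92 TOPS
```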

Jouppi made the chip official in the summer of 2017, when he and a group of co-authors that included Cliff Young and David Patterson presented at the Forty-Fourth International Symposium on Computer Architecture in Toronto a paper called “In-Datacenter Performance Analysis of a Tensor Processing Unit.” The paper, dense with measurements drawn from real Google production workloads, showed that the TPU was on average fifteen to thirty times faster than contemporary CPUs and GPUs at the same inference jobs, and that its performance per watt was thirty to eighty times better. Inside Google, the paper was a polite victory lap. Outside Google, it was a mortar shell. Every other hyperscaler suddenly understood that the most expensive line item in their datacenter capex budgets, the merchant silicon they were buying from Nvidia and Intel, was not technically necessary. A custom chip, designed in-house against the workloads the company actually ran, could do better. The argument was no longer hypothetical. It was published, at ISCA, with measurements.

The 2017 paper was also a recruiting flyer. Within Google, a junior engineer named Jonathan Ross, who had begun the original TPU effort as a twenty-percent project before Jouppi arrived, had already grown restless; in 2016 he had left, with a handful of colleagues from the early TPU team, to found a startup called Groq. Outside Google, the senior chip architects at Microsoft and Amazon and Facebook read the Jouppi paper and arrived at the same conclusion by their own routes. The hyperscaler-as-chip-designer model was not, after Jouppi’s measurements, a strategic option to be weighed against staying on merchant silicon. It was the only path that made economic sense at the volumes the hyperscalers operated at, if the talent and the foundry access could be assembled.

The talent, in 2017, was reachable. The foundry access was the harder problem. Putting a chip into a Google or AWS or Microsoft datacenter at scale required not just designing the silicon but securing leading-edge wafer slots at TSMC, designing the substrate and the package, qualifying the high-bandwidth memory, validating, debugging, ramping, and then producing tens of thousands of accelerators a quarter. None of the hyperscalers, in 2017, had that pipeline in-house. The classical chip industry did. By the late 2010s, an unexpected pair of design-services companies stepped quietly into the gap and began to define the second wave of hyperscaler silicon.

The first was Broadcom. After Hock Tan’s Avago Technologies had acquired the original Broadcom Corporation in 2016 and taken its name, Tan had quietly built one of the largest custom-silicon design-services practices in the industry, serving hyperscalers who wanted bespoke ASICs. Google had been Broadcom’s first big custom-ASIC partner; the original TPU and every TPU since had passed through Broadcom’s design-services arm on the way to TSMC, with Google supplying the architecture and Broadcom supplying the back-end physical design, the SerDes, the packaging, and the foundry interface. By the early 2020s, Meta had joined the list with the chips that became MTIA, ByteDance was reported to be on the list, and so, in time, was Apple. Broadcom did not market itself as a hyperscaler ASIC firm; the contracts were under aggressive non-disclosure agreements. The numbers were nevertheless visible in the financials. Broadcom’s AI revenue, almost entirely custom-ASIC and AI networking sales to a small handful of named hyperscaler customers, ran at roughly three point eight billion dollars in fiscal 2023; in fiscal 2024 it more than tripled to over twelve billion.

The second design-services firm was Marvell, founded in 1995 by the Indonesian-American Sutardja brothers and built out through acquisitions, including Cavium, into a competing custom-ASIC practice. Its biggest customer was AWS. The Trainium chips Annapurna designed in Israel were, in many of their most demanding subsystems, brought to silicon in collaboration with Marvell engineers in Santa Clara. Microsoft’s Maia program would also list Marvell as a partner. Together, by the mid-2020s, Broadcom and Marvell would account for most of the world’s hyperscaler ASIC volume, the two picks-and-shovels firms whose names rarely appeared on the keynote slides at re:Invent or Google Cloud Next but whose engineers wrote the physical design files that determined whether the keynote claims would be deliverable.

Inside AWS, the Annapurna acquisition produced a cadence of chip launches that became a quiet annual ritual. The first line, the Nitro cards, peeled network and storage virtualization off the host CPU. The second, Graviton, was an ARM server CPU placed underneath EC2 instances at a significantly lower price than Intel’s Xeons. Graviton 1 launched in 2018 as a curiosity, Graviton 2 in 2019 was credible, Graviton 3 in 2021 was for many workloads the default; by the early 2020s, AWS engineers were saying internally that more than half of new EC2 capacity was being deployed on ARM rather than x86. Intel’s largest customer for server CPUs in absolute volume terms was, by then, building its own.

The third Annapurna product line was the AI accelerators. Inferentia was unveiled by AWS chief executive Andy Jassy at the November 2018 re:Invent in Las Vegas. Jassy framed the chip in language that explicitly named the gap Jouppi’s 2017 paper had measured: GPU vendors, he told the audience, had focused on training at the expense of inference, and AWS had decided to design better inference chips itself. Inf1 instances, built around as many as sixteen Inferentia chips per server, became generally available in late 2019 and ran the workloads that powered Alexa at, AWS claimed, up to seventy percent lower cost per inference than GPU equivalents. The training-side counterpart, Trainium, was unveiled at re:Invent in December 2020, in what would be Jassy’s last AWS keynote before he moved up to succeed Jeff Bezos as Amazon chief executive. The first Trn1 instances were previewed in 2021 and reached general availability in 2022 under Adam Selipsky. The Trainium die contained roughly fifty-five billion transistors on TSMC’s seven-nanometer process with custom NeuronCore tensor engines designed against the matrix-multiply patterns of Transformer training. The hardware was good. The compiler stack, by widespread internal acknowledgment, was rougher than CUDA, and AWS spent the years that followed quietly closing the software gap, with Anthropic later becoming the anchor customer that turned Trainium from an interesting experiment into the load-bearing infrastructure of a frontier laboratory’s training program.

Microsoft arrived at the same conclusion through a different door. An internal AI silicon project codenamed Athena had been quietly underway since 2019, originally targeted at running OpenAI’s models more cheaply on Azure than on the Nvidia GPUs Microsoft was buying in increasingly uncomfortable quantities. By 2022, the OpenAI partnership had grown into a multi-billion-dollar investment commitment with no obvious ceiling, and the unit cost of every prompt OpenAI’s models served had become a Microsoft P&L problem. Athena went public on November 15, 2023, at the Microsoft Ignite conference in Seattle, where Satya Nadella unveiled it under the production name Azure Maia 100. Maia was a one-hundred-and-five-billion-transistor chip on TSMC’s five-nanometer process, with a die approaching eight hundred and twenty square millimeters and four high-bandwidth-memory stacks on a CoWoS-S interposer, designed to be deployed in Microsoft-built liquid-cooled racks running OpenAI’s inference workloads at scale. Microsoft simultaneously announced Cobalt 100, a one-hundred-and-twenty-eight-core ARM Neoverse server CPU built in-house to compete with Graviton. The strategic logic was identical to Amazon’s. Owning the chip recovered margin that would otherwise flow to Nvidia and Intel; owning the chip also reduced the supply-chain dependency that, by 2023, had become the sharpest constraint on Azure’s rate of expansion.

Meta arrived last, in May 2023, when its AI infrastructure team published a low-key blog post describing what it called the Meta Training and Inference Accelerator, MTIA v1. The chip was a seven-nanometer TSMC part with sixty-four processing elements arranged on an eight-by-eight grid, an INT8 throughput around 102 teraOps per second, and a thermal design power of just twenty-five watts. It was modest by the standards of the Nvidia H100, which by 2023 was running at hundreds of teraflops on much larger thermal envelopes. It was also, deliberately, not aimed at the H100’s market. MTIA v1 was an inference and ranking chip, designed to accelerate the deep-learning recommendation models that drove Facebook and Instagram’s news-feed ranking and ad targeting, a workload Meta had been running on CPUs and GPUs for years and that, at Meta’s volumes, had become as expensive as anything else in the company’s data center bill of materials. The chip was developed in collaboration with Broadcom, in a partnership that the two companies would later make public and would expand across multiple chip generations. Like Google before it, like Amazon, like Microsoft, Meta had concluded that at its specific scale, paying merchant-silicon margins on its largest single workload no longer made sense.
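One way to read that spec sheet is in throughput per watt rather than raw throughput, since ranking inference runs around the clock at fleet scale; a quick check using only the numbers quoted above makes the efficiency point explicit.

```python
# Efficiency implied by the MTIA v1 figures quoted above.
mtia_int8_tops = 102   # INT8 tera-operations per second, as quoted in the text
mtia_tdp_watts = 25    # thermal design power, watts

print(f"~{mtia_int8_tops / mtia_tdp_watts:.1f} INT8 TOPS per watt")  # ~4.1
```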

The economics underneath all four programs were, by the early 2020s, identical and brutal. Nvidia’s data-center GPUs ran at gross margins north of seventy-five percent, which meant that a hyperscaler paying Nvidia’s price was, in cost-of-goods terms, paying roughly four dollars for every dollar of silicon Nvidia had actually spent. At AWS’s, Google’s, Microsoft’s, and Meta’s annual capex levels, that ratio represented several billion dollars per company per year that was, in principle, recoverable through vertical integration. The recovery was not free. A custom chip required design teams in the high hundreds, software stacks that took years to mature, foundry slots at TSMC negotiated against Apple’s and Nvidia’s standing claims, and tape-out costs in the high tens of millions per generation. But the math, at hyperscaler volumes, worked. Industry analysts, especially Dylan Patel at SemiAnalysis, whose hyperscaler ASIC reports had become the closest thing the field had to a public scoreboard, would later calculate total cost of ownership advantages for hyperscaler ASICs of forty percent and more on the workloads they had been designed for. The number depended on assumptions and on the workload. The direction of the number did not.
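The two numbers in that paragraph, the four-to-one ratio and the forty-percent advantage, are both short arithmetic. A sketch, taking the seventy-five percent margin from the text and treating every other figure as a labeled assumption:

```python
# Why a 75% gross margin reads as "four dollars per dollar of silicon",
# and how a slower in-house ASIC can still win on cost per unit of work.
# The margin comes from the text above; every other figure is an illustrative assumption.

nvidia_gross_margin   = 0.75
price_per_cogs_dollar = 1 / (1 - nvidia_gross_margin)        # = 4.0

gpu_price        = 30_000   # assumed merchant GPU price per accelerator, dollars
asic_unit_cost   = 10_000   # assumed fully loaded cost of an in-house ASIC, dollars
asic_perf_vs_gpu = 0.6      # assumed: the ASIC delivers 60% of the GPU's throughput
                            # on the one workload it was designed for

gpu_cost_per_unit_work  = gpu_price                          # GPU throughput normalized to 1.0
asic_cost_per_unit_work = asic_unit_cost / asic_perf_vs_gpu  # adjust for lower throughput

tco_advantage = 1 - asic_cost_per_unit_work / gpu_cost_per_unit_work
print(f"price paid per dollar of Nvidia COGS: {price_per_cogs_dollar:.1f}x")
print(f"illustrative cost advantage of the ASIC: {tco_advantage:.0%}")  # ~44% under these assumptions
```

A real total-cost-of-ownership model would fold in power, networking, software engineering, and utilization; the sketch only shows why the direction of the number is insensitive to those details.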

Apple, by the early 2020s, was the last hyperscaler-scale buyer of merchant inference silicon for its own services that had not yet committed to a custom data-center chip. The company had spent fifteen years, since Steve Jobs’s 2008 acquisition of P.A. Semi, building one of the deepest in-house silicon design organizations in the consumer electronics industry, but its silicon had remained almost entirely confined to devices that fit in a pocket or on a desk. By 2024, reports in The Information and Bloomberg would describe an internal Apple effort under the project name ACDC, sometimes elaborated as Apple Chips in Data Center, to extend the M-series silicon lineage into a server inference accelerator, with Broadcom as a development partner and, in later reporting, a first chip under the codename Baltra. The same Cupertino silicon team that Johny Srouji had built one quiet hire at a time over a decade and a half was preparing to extend itself into the part of the stack the company had until then been content to outsource.

By the middle of 2023, every one of the four largest cloud infrastructure operators in the United States, plus the largest consumer electronics company in the world, was either shipping or actively building its own custom AI accelerators, fabricated almost entirely at TSMC and aimed at displacing Nvidia from the workloads each company ran most heavily. Google had TPU v4 in volume, v5e and v5p arriving, Trillium under development as v6. Amazon had Inferentia in its second generation and a second-generation Trainium in development. Microsoft had Maia and Cobalt. Meta had MTIA. Apple’s chips were still under wraps. Each program had concluded that the interesting question was no longer whether to design its own AI silicon but how fast it could ramp the next generation and how much of its workload it could move onto its own silicon before Nvidia’s roadmap pulled ahead again.

The transformation, at a distance, was a return rather than a revolution. The semiconductor industry Robert Noyce had founded at Fairchild had been vertically integrated by default; the fabless revolution of the 1980s and 1990s had broken that integration apart on the argument that no firm could afford to do everything. The hyperscaler era had begun to put it back together on the design side. The manufacturing layer remained in Hsinchu and Tainan at the end of someone else’s contract, but the rest of the stack was once again being assembled inside single companies. The companies were no longer chip companies. They were cloud companies that had concluded, after Jouppi’s 2017 paper and Annapurna’s first Trainium shipments, that operating at hyperscaler scale meant being a chip company in everything but corporate description.

The strategic premise was margin and dependency. The margin was Nvidia’s seventy-five percent gross. The dependency was the TSMC queue. By owning their own designs, the hyperscalers were attempting to claim a share of the first and negotiate from a stronger position against the second. They were not, in 2023, trying to displace Nvidia from the frontier of training the largest models. The H100 and its successors were still, on every honest engineering reading, the best general-purpose AI silicon money could buy. The hyperscaler ASICs were aimed at the workloads the hyperscalers ran for themselves and their largest tenants: inference at scale, recommendation systems, and the increasingly specific corners of training where co-design with a particular model architecture could pay back the tape-out cost. The frontier of training still belonged to Jensen Huang. The volume underneath it was beginning, quietly and at significant capital cost, to belong to the people who ran the data centers in which the frontier was defined.