The Fab visualizations
Part IX · Chapter 63

DeepSeek Day

Jan 27, 2025: DeepSeek-R1 / V3 release; Nvidia loses roughly $600B in market cap in a session. The accusations: smuggled H100s, training on banned hardware. The rebuttals: efficient training, MoE architecture, MLA, FP8 PTX-level kernels, real software discipline. The Jevons-Paradox debate. The export-control thesis put on trial in real time. → The day the chip war's strategic premise was challenged in public.

On Sunday afternoon, January 26, 2025, somewhere in the United States, a free iPhone application built by a company almost no one outside Chinese AI circles had heard of climbed into the number one slot on Apple’s App Store, displacing ChatGPT for the first time in the two years and two months since OpenAI’s chatbot had defined the category. The app’s logo was a pale blue cartoon whale. Its name, in English, was DeepSeek. Its parent was a Hangzhou hedge fund. By Sunday evening Eastern time, screenshots of the App Store top-ten chart had been pinned to the front page of Hacker News, retweeted across the AI corners of X, and forwarded into the private Slacks of every major American AI lab. By midnight in New York, the trading desks at Goldman Sachs and Morgan Stanley had begun fielding emails from portfolio managers asking what, exactly, was about to happen to Nvidia’s stock at the open.

What was about to happen, in the four sessions that followed, was the largest single-day market-capitalization loss in the history of American capitalism, and the first occasion on which the strategic premise of the chip war the United States had been waging on China since October 2022 was put on public trial. The trial was held by a market that had no jurisdiction to convene one. It nevertheless produced a verdict: that the premise was at least partially wrong, although neither the prosecution nor the defense could agree on which part.

The defendant was the H100. The witness for the prosecution was a 671-billion-parameter mixture-of-experts model called DeepSeek-V3, released on December 26, 2024, with a technical report on arXiv that almost no one in the West read at the time, and a reasoning successor called DeepSeek-R1, released on January 20, 2025, with weights on Hugging Face under an MIT license that made the model freely usable for commercial purposes. The prosecution’s argument was that a Chinese laboratory denied access to leading-edge American GPUs had nevertheless trained a model competitive with OpenAI’s o1 on mathematical and coding benchmarks for what its own paper claimed was 2.788 million H800 GPU hours, roughly $5.576 million in compute cost, against an industry baseline of nine-figure budgets at OpenAI and Anthropic. The defense was that the number was a sleight of hand: the underlying GPU count was much larger than DeepSeek admitted, some unknown share had reached Hangzhou through smuggling routes the Bureau of Industry and Security could not fully police, and the model’s existence therefore vindicated rather than refuted the case for tightening the controls further. By the time both sides had filed their briefs, in op-eds and tweet threads and conference calls and a single devastating SemiAnalysis teardown, the closing argument had already been written by the tape.
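The prosecution's headline number deserves to be made explicit, because it is nothing more than a multiplication: the V3 report's reported GPU-hours times the paper's own assumed rental rate of $2 per H800-hour. A minimal sketch of that arithmetic:

```python
# The V3 report's own cost accounting: reported pre-training GPU-hours
# times an assumed $2/hour H800 rental rate. Nothing else is included.
gpu_hours = 2.788e6        # H800 GPU-hours reported for the V3 run
rate_usd_per_hour = 2.00   # the paper's assumed rental price per H800-hour
cost = gpu_hours * rate_usd_per_hour
print(f"${cost:,.0f}")     # → $5,576,000
```

The product matches the paper's quoted figure exactly, which is why the number was defensible as a marginal cost and misleading as a total one.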

The man at the center of the case had granted exactly two interviews in the previous eighteen months, both to a Chinese tech outlet called 36Kr’s Anyong vertical, and had declined every approach from the international press. Liang Wenfeng had been born in Zhanjiang, in the southern coastal province of Guangdong, in 1985, and had read electronic engineering at Zhejiang University in Hangzhou through the late 2000s, graduating with a master’s into a Chinese economy whose stock market had just been remade by the 2008 financial crisis. The crisis, as he would later describe it, had given him and two classmates the idea that there was money in algorithmic trading, and in 2015, working out of a Hangzhou apartment, the three of them had started experimenting with quantitative strategies for the Chinese A-share market. In February 2016 they incorporated the resulting partnership as High-Flyer, the firm that would within a decade become one of the largest quantitative hedge funds in China, managing in the range of one hundred billion yuan, two of whose vehicles posted 2024 returns that put Liang on Bloomberg’s list of the top-performing China-based hedge-fund principals.

What Liang did with High-Flyer’s profits, beginning quietly in 2021, was buy GPUs. The first purchase, by SemiAnalysis’s later reconstruction, was an order of around ten thousand Nvidia A100 accelerators, placed with the Asia-Pacific channel before any export restriction prevented him from doing so. The order made no obvious sense to the AI research community of the time. The A100 was a training accelerator. Hedge funds did not train large language models. Liang, in his Anyong interviews, described the purchase as a hedge against a hardware regime he believed was about to tighten and as an investment in a research program he had been thinking about since well before his peers in finance had taken AI seriously. In May 2023 he formalized the program by spinning out a separate entity, Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., funded entirely by High-Flyer and staffed almost exclusively by Chinese engineers under thirty, many of them recent graduates of Tsinghua, Peking University, and Zhejiang. Outside investors had shown what one of his colleagues described as limited interest. Liang did not need them. The hedge fund was profitable enough to absorb the cost of training models for as long as it took.

The cost was not, as the headlines would later imply, six million dollars. The cost was, by SemiAnalysis’s late-January estimate, in the range of $1.6 billion in cumulative server CapEx and around $944 million in cumulative operating expense across DeepSeek and the High-Flyer trading infrastructure that shared its hardware. The fleet, the same teardown estimated, ran on roughly fifty thousand Hopper-class GPUs at peak, of which around ten thousand were H800s, the China-export variant that Nvidia had introduced after October 2022 with the same Hopper silicon and a 55 percent reduction in NVLink interconnect bandwidth, around ten thousand were H100s acquired before the controls or via routes the analyst team did not attempt to map in detail, and the balance were a mix of pre-export-control A100s and the newer H20 inference chips Nvidia had begun shipping to China in 2024. The pre-training run for V3 had used a fraction of that fleet, and the $5.576 million figure the technical report quoted was a defensible accounting for the incremental compute hours of the run itself, conducted in FP8 mixed precision over 14.8 trillion tokens, and not for the research, infrastructure, salary, and prior-experiment costs that would have to be amortized to produce a true total cost of the model. The number was real. The framing was not.

Inside Hangzhou, the engineers who had produced V3 and R1 had also produced something the Western labs had been promising for years and rarely shipping. The architectural innovations described in the V3 paper, by the consensus of every researcher who read it carefully, were real. The model used a sparse mixture-of-experts design that activated only 37 billion of its 671 billion parameters per token, sharply reducing the per-token compute cost relative to a dense model of comparable capability. A load-balancing scheme called auxiliary-loss-free routing fixed one of the chronic instabilities of MoE training. A multi-token prediction objective improved data efficiency by predicting two tokens ahead instead of one. The most consequential innovation was an attention mechanism DeepSeek had introduced in V2 the previous summer: multi-head latent attention compressed the key-value cache into a low-rank latent representation and reduced the memory bandwidth requirements of inference by roughly an order of magnitude. The mechanism mattered because the H800s on which DeepSeek had been forced to train had been crippled exactly there, on the bandwidth side. The constraint Washington had imposed in 2022 to slow Chinese frontier training had become, in the Hangzhou engineering culture Liang had built, an axis of optimization. The whole architecture was bent toward making bandwidth-starved hardware run as if it were not bandwidth-starved.
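The sparse-activation idea can be seen in miniature. The sketch below is a generic top-k router in NumPy, not DeepSeek's actual design (V3 adds shared experts and the auxiliary-loss-free balancing scheme); the sizes, 8 active out of 256 routed experts per token, follow the rough shape of the V3 configuration, and everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_route(h, gate_w, k):
    """Toy top-k MoE router: each token picks the k experts whose
    gate scores are highest; only those experts' FFNs run for it."""
    scores = h @ gate_w                          # (tokens, n_experts)
    return np.argsort(scores, axis=-1)[:, -k:]   # indices of the top-k experts

n_experts, k, d = 256, 8, 32
h = rng.standard_normal((4, d))                  # hidden states for 4 tokens
gate_w = rng.standard_normal((d, n_experts))     # gating projection
chosen = topk_route(h, gate_w, k)

print(chosen.shape)                              # 8 expert indices per token
print(f"{k / n_experts:.1%} of routed experts active per token")  # 8/256 ≈ 3%
```

The per-token compute of the expert layers scales with k, not with n_experts, which is how a 671-billion-parameter model can cost roughly what a 37-billion-parameter dense model costs to run per token.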

Below the architecture sat a software stack that Western researchers reading the paper would describe over the following week with a mixture of admiration and disquiet. DeepSeek had implemented its training kernels not in the standard CUDA C++ that the rest of the field used, but in significant part directly in PTX, the lower-level pseudo-assembly that the CUDA toolchain compiled to and that almost no one outside Nvidia’s own internal teams wrote by hand. They had built a custom communication scheduler called DualPipe that overlapped forward and backward passes across pipeline stages to hide cross-node latency. They had validated FP8 training on a model of unprecedented scale, with a relative loss error against BF16 below 0.25 percent. They had, in the words of one Anthropic engineer who read the paper on Christmas afternoon, done the kind of low-level systems work the American labs had stopped doing because Nvidia’s compilers and Nvidia’s libraries had been good enough that no one needed to. DeepSeek had needed to. The fact that they had done it, on the constrained hardware they were known to have, was the part of the story that the more thoughtful readers of the V3 paper found genuinely unsettling.
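What FP8 gives up at the level of a single number is easy to quantify. The sketch below rounds a value to the 3-bit stored mantissa an e4m3 FP8 value carries; it is a toy model only, ignoring e4m3's exponent range and saturation and the fine-grained scaling factors a real FP8 training recipe, DeepSeek's included, layers on top.

```python
import math

def round_e4m3_mantissa(x: float) -> float:
    """Round x to a 3-bit stored mantissa, as an e4m3 FP8 value carries.
    Toy model: ignores exponent range, saturation, and scaling factors."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    m, e = math.frexp(abs(x))     # abs(x) = m * 2**e, with m in [0.5, 1)
    step = 2.0 ** -4              # grid for 1 implicit + 3 stored bits
    return sign * math.ldexp(round(m / step) * step, e)

x = 3.14159
q = round_e4m3_mantissa(x)
rel_err = abs(q - x) / x
print(q, f"{rel_err:.2%}")        # within the 2**-4 = 6.25% worst-case bound
```

Per-value rounding error of a few percent is enormous by FP32 standards; the engineering content of an FP8 recipe is arranging the accumulations and scalings so that, over trillions of tokens, it washes out to a loss gap below a quarter of a percent.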

R1, released four weeks later, made the cost question into a capability question. Its companion paper, posted to arXiv on January 20, 2025, described a reasoning model trained primarily through reinforcement learning, with a reward signal based only on the correctness of final answers and no human-written reasoning trajectories at all. The starting point, called R1-Zero, was V3 fine-tuned through pure RL on math and code tasks. R1-Zero had developed, the paper claimed, emergent self-reflection: at some point during training, the model had begun spontaneously rechecking its own work, identifying errors in earlier reasoning steps, and producing revised solutions. The DeepSeek researchers, in a passage that would be screenshotted and circulated for weeks afterward, called this the model’s “aha moment.” It was, on any honest reading, the most striking demonstration since OpenAI’s o1 that something like deliberate problem-solving could be coaxed out of a transformer with the right reinforcement signal. The capabilities of R1, on the public benchmarks, sat in a band roughly comparable to o1’s, behind in some places and ahead in others, at a small fraction of the inference cost OpenAI was charging for o1’s API. And the weights, unlike o1’s, were on Hugging Face under a permissive license, ready to be downloaded by anyone with a workstation and run on commodity hardware. By Sunday, January 26, half a million people had downloaded them.
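The spareness of that reward signal is the point: no learned reward model for correctness, no human reasoning traces, just a rule-based check on the final answer. A minimal sketch of the outcome-reward idea, using an illustrative boxed-answer convention rather than DeepSeek's actual parsing rules:

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Rule-based accuracy reward: 1.0 if the completion's last
    \\boxed{...} answer matches the reference string, else 0.0.
    The boxed-answer convention here is illustrative only."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

print(outcome_reward(r"Recheck: 6*7 = 42, so \boxed{42}", "42"))  # → 1.0
print(outcome_reward(r"Therefore \boxed{41}", "42"))              # → 0.0
```

Everything between the prompt and the final answer is unscored, which is exactly what leaves the model free to discover rechecking and revision on its own.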

The American AI community had, for the most part, not been paying attention. Christmas week had absorbed the V3 release. The new American president, sworn in on January 20, had spent the day after R1’s release announcing a $500 billion private-sector AI infrastructure venture called Stargate, a joint venture between OpenAI, SoftBank, Oracle, and the Emirati investor MGX, that was supposed to absorb a sizable fraction of the data-center capital expenditure of the next four years. Sam Altman had stood in the East Room of the White House with Larry Ellison and Masayoshi Son and described a future in which the Stargate buildout would single-handedly cure cancer, restore American manufacturing, and put the United States permanently ahead in the AI race. The framing, which had been months in the writing, treated frontier AI as something that scaled with capital and chips, and that therefore could be won by the country willing to pour the most of both into it. On the day Stargate was announced, OpenAI’s market position was the central premise of American AI policy. Six days later, on the morning of Monday, January 27, that premise was about to be priced.

The selling started in Tokyo. Asia-session tape on Sunday afternoon Eastern time had already shown weakness in TSMC’s Taipei-listed shares; by the time European markets opened on Monday morning, ASML, the Dutch lithography monopolist, was down sharply, eventually closing the session off nearly 8 percent. By the time the New York pre-market opened at four in the morning Eastern, Nvidia futures were down more than 10 percent. By the cash open at nine-thirty, the floor of the New York Stock Exchange had a particular quality of silence the older traders associated with the dot-com unwind of March 2000. By eleven, Nvidia had blown through $130, then $125, then $120. By the close at four o’clock, the stock had ended the session at $118.58, down 16.97 percent on the day. The market capitalization erased was approximately $589 billion. It was, by every standard accounting, the largest single-day dollar loss in the history of any company on any U.S. exchange. Apple, which had spent most of the prior six months in second place, retook the top of the S&P 500 by default. Broadcom, the picks-and-shovels designer that had ridden the AI compute trade through 2024, finished the day off 19 percent. The U.S.-listed ADRs of TSMC closed off more than 15. The Nasdaq Composite, weighted heavily toward semiconductors and AI, closed off 3.07 percent. The CBOE Volatility Index, the VIX, which had ended the prior Friday at 14.85, spiked above 22 intraday and ended the day above 18. The cumulative market-cap loss across the AI-adjacent complex, by Bloomberg’s tally that evening, was something north of one trillion dollars.

The reasons being given on the financial cable channels by mid-morning had narrowed to a single hypothesis. If a Chinese hedge fund could train a frontier-class reasoning model for what its own paper described as $5.6 million, then the entire investment thesis underlying Nvidia’s twenty-times-forward-earnings multiple, underlying Stargate’s $500 billion budget, and underlying the data-center capital expenditure plans of the four American hyperscalers, was suspect. If frontier AI did not require ever-larger training runs on ever-more-expensive accelerators, the demand curve for H100s and their successors flattened, the moat around CUDA narrowed, and the AI compute trade that had carried the S&P 500 through 2023 and 2024 was a bubble whose pin had just been pushed in. The hypothesis was simple, available, and intuitive. It was also, on any careful reading of the V3 and R1 papers, only partially correct. By Monday afternoon the more sophisticated voices on the supply side were already pushing back.

The pushback came in three waves. The first wave, fronted by Scale AI’s Alexandr Wang in a CNBC interview that aired on Monday morning and was reposted by Elon Musk with the single-word comment “obviously”, was that DeepSeek’s reported hardware was a fraction of its actual hardware. Wang asserted, with no public evidence offered, that the company possessed approximately fifty thousand H100s acquired through channels that violated U.S. export controls, and that the supposedly cheap training run had been done on banned silicon DeepSeek could not legally acknowledge. The second wave, articulated most carefully in a SemiAnalysis essay called “DeepSeek Debates” that Dylan Patel and his team published the week of January 27, was more nuanced: the fifty-thousand figure was probably right in aggregate, but the chips were a mix of pre-control A100s, China-spec H800s, an unknown number of H100s acquired by means SemiAnalysis declined to characterize, and a growing inventory of H20 inference accelerators that Nvidia had been producing legally for the Chinese market at a rate of more than a million units in the prior nine months. The training had been done on H800s, as DeepSeek’s paper described. The marginal compute cost had been real. The total cost of building the company that could do that training, including the original 2021 A100 buy and the High-Flyer infrastructure on which the early experiments had run, was a long way north of six million dollars and a long way north of any plausible reading of DeepSeek’s own published numbers. 
The third wave, articulated three days later in a long blog post by Anthropic’s Dario Amodei, argued that DeepSeek’s accomplishment was less impressive than the headlines suggested, that V3 and R1 were roughly where American models such as Claude 3.5 Sonnet had been six to twelve months earlier rather than at parity, and that the appropriate policy response was not to retreat from export controls but to tighten them, on the grounds that the next generation of Chinese frontier models, if given access to true H100-equivalent compute, would be much harder to keep in second place.

There was a fourth wave, more politically loaded than the first three, articulated by David Sacks, the venture capitalist Donald Trump had just appointed as White House AI and crypto czar. In a Fox News interview on January 28, Sacks asserted that there was “substantial evidence” DeepSeek had used a technique called distillation, in which one model is trained on the outputs of another, to lift capability directly from OpenAI’s API in violation of OpenAI’s terms of service. The implication, never spelled out explicitly, was that R1 was less an indigenous Chinese frontier model than a clever Chinese repackaging of an American one. OpenAI confirmed that week that it was investigating possible distillation but declined to specify what evidence it possessed. The accusation acquired a particular irony when set against OpenAI’s own legal exposure for training on copyrighted material the company had not licensed, an irony that several technical commentators noted in real time and that the Fox News audience was generally not invited to consider. Whether or not distillation had occurred at the margins, no serious researcher who read the V3 architecture paper believed that distillation alone could account for the model’s existence. The MLA mechanism, the FP8 stack, the PTX-level kernels, and the RL training pipeline were not the kind of thing one stole from an API. They were the kind of thing one designed.

The market, as it digested all four waves through Monday and Tuesday and Wednesday, settled on a more interesting question than any of them, articulated most cleanly that week by Microsoft’s Satya Nadella. On Sunday evening, Nadella had posted to X the kind of message a CEO posts when a story he believes the market has misread is about to take tens of billions of dollars off his company’s market capitalization. “Jevons paradox strikes again,” he wrote. “As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can’t get enough of.” William Stanley Jevons was a Victorian economist who had observed in 1865 that the introduction of the more efficient Watt steam engine, which used less coal per unit of mechanical work than its predecessors, had not reduced England’s coal consumption. It had increased it, because the cheaper price had unlocked applications no one had previously been able to afford. The paradox, in Jevons’s framing, was that efficiency in the use of a resource often expanded rather than contracted demand for it. The relevance to DeepSeek, in Nadella’s reading, was direct. If frontier AI capability had just become an order of magnitude cheaper to produce, the universe of applications that could be built on top of it had just expanded by a comparable factor, and the demand for inference compute in the long run, far from contracting, was about to explode. Nadella’s framing, which was also Nvidia’s official framing in the corporate statement Jensen Huang’s office released that afternoon, treated DeepSeek as a tailwind for the picks-and-shovels providers. The market, over the following weeks, would come around to something like that view. Nvidia would recover most of the lost ground inside two months. Apple’s brief return to the top of the S&P would last about as long.
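Nadella's claim has a simple quantitative form. Under a constant-elasticity demand curve, an efficiency gain cuts both the price per unit of capability and the compute per unit by the same factor, so total compute consumed rises exactly when demand's price elasticity exceeds 1. A toy calculation, with the elasticity values purely assumptions:

```python
def total_compute(base_demand: float, efficiency_gain: float,
                  elasticity: float) -> float:
    """Compute consumed under constant-elasticity demand: price per unit
    falls by efficiency_gain, demand scales as price**(-elasticity), and
    each unit of demand now needs 1/efficiency_gain as much compute."""
    price_ratio = 1.0 / efficiency_gain
    demand = base_demand * price_ratio ** (-elasticity)
    return demand * price_ratio

# 10x efficiency gain under two assumed elasticities:
print(total_compute(1.0, 10.0, 1.5))   # elastic demand: consumption rises ~3.2x
print(total_compute(1.0, 10.0, 0.5))   # inelastic demand: consumption falls ~3.2x
```

Which side of the elasticity-of-one line AI demand sits on was, in effect, the question the market spent the following weeks answering, and its answer was Nadella's.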

Trump, traveling in Florida on January 27, was asked about DeepSeek during a press gaggle at the House Republican Conference meeting in Doral. He said the news “should be a wake-up call for our industries that we need to be laser focused on competing to win.” Then, in a comment his audience had not anticipated, he added that he considered the development “very much a positive” because, in his phrasing, “instead of spending billions and billions, you’ll spend less, and you’ll come up with, hopefully, the same solution.” It was the most sanguine response any senior American politician offered that day. Sam Altman, posting to X overnight, called R1 “an impressive model, particularly around what they’re able to deliver for the price,” promised that OpenAI would “obviously deliver much better models” soon, and added a line that would be parsed by every commentator on every side of the export-control debate: “it’s legit invigorating to have a new competitor!” Within a week, in a separate Reddit AMA, Altman would acknowledge that OpenAI had been “on the wrong side of history” on open source, the first concession of its kind from any major American AI executive.

Liang Wenfeng, in Hangzhou, said almost nothing publicly. His most quoted line of the period was a sentence he had given to the 36Kr interview the previous July, before any of his models had been frontier and before any American journalist had pretended to know who he was. “Money has never been the problem for us,” he had told the interviewer. “Bans on shipments of advanced chips are the problem.” The line, taken in isolation, sounded like a complaint. Read against the V3 paper and the R1 paper and the SemiAnalysis teardown and the chart of Nvidia’s stock price on January 27, it sounded like something else. The export controls had not stopped DeepSeek. They had shaped it. The bandwidth-starved H800 had selected for engineers who wrote PTX. The capital constraints High-Flyer had imposed had selected for architectures that activated 37 billion of 671 billion parameters per token instead of the dense alternatives. The ban on the very best Western chips had selected, at least in this one case, for a Chinese laboratory that did the kind of low-level systems work the Western labs had stopped doing. Whether this generalized, or whether DeepSeek was an outlier whose existence depended on a hedge-fund founder with a hobby and ten thousand pre-control A100s, was a question the next two years would answer.

What the markets had been asked to price, in the four sessions following January 26, was the strategic premise that the chip war’s architects had been running on since the original Bureau of Industry and Security action of October 7, 2022. The premise was that denying China access to leading-edge accelerators would slow its frontier AI development by a generation or more, that the gap between American and Chinese frontier models would therefore widen, and that the United States would retain decisive AI advantage for as long as the gap remained. The prosecution had not, on January 27, refuted the premise. It had qualified it. China could not, on the evidence of DeepSeek-R1, build the scaled training clusters its American competitors built. It could, on the same evidence, train models that were less than a generation behind those clusters’ outputs, on hardware the American policy had been designed to prevent it from using effectively, by means the American labs had not deployed. The chip ban remained in force. The chip ban had become, in the same gesture, a piece of evidence in an argument it had not been designed to anticipate. Dario Amodei, writing on January 29, would call this evidence that the controls were working. Ben Thompson, writing on Stratechery on January 27, would call it evidence that the controls had produced the very phenomenon they had been designed to prevent. Both readings could be defended on the same data. Neither could be falsified short of running the counterfactual no one could run.

In Hangzhou, the engineers under thirty went back to their next paper. The High-Flyer servers continued clearing A-share trades at four-millisecond latency by day and training the next iteration of the model overnight. Liang, declining the cover of every Western magazine, gave no interviews. Jensen Huang’s communications team prepared a series of follow-on statements emphasizing that “inference requires significant numbers of NVIDIA GPUs and high-performance networking,” a point that was both true and beside the point of the Monday tape. Howard Lutnick was confirmed as Commerce Secretary in late February, and the new administration began drafting what would become the AI Diffusion Rule, then revising it, then rescinding it. ASML’s order book for high-NA EUV machines was undisturbed. The wafers of N3 and below kept shipping at the rates the contracts demanded. The clean rooms ran. The export controls held. The strategic premise had been challenged but not broken, and it was no longer the same premise on Tuesday morning that it had been on Sunday evening, as everyone in the rooms where the next decisions would be made knew.