Jekyll2024-02-01T08:03:32+00:00https://josherich.me/feed.xmlJosherich’s Bloghome sweet homeFirst Christianity in Korea2024-01-17T00:00:01+00:002024-01-17T00:00:01+00:00https://josherich.me/short/korean-christian<p>from Unfabling the East Chapter 4</p>
<p><img src="/images/unfabling-east.jpg" alt="korea-christianity" /></p>
<p>from Wikipedia</p>
<blockquote>
<p>In 1603, Yi Su-gwang, a Korean politician, returned from Beijing carrying several theological books written by Matteo Ricci, an Italian Jesuit missionary to China.[4] He began disseminating the information in the books, introducing Christianity to Korea.</p>
</blockquote>
<p>from ChatGPT</p>
<blockquote>
<p>When and how was Christianity first introduced to Korea?
Christianity was first introduced to Korea in the late 18th century. The initial contact with Christianity came through Western missionaries. The first known encounter was in 1784 when Yi Seung-hun, a Korean diplomat, returned from Beijing with religious books, including Catholic texts.</p>
</blockquote>
<p>from Perplexity</p>
<blockquote>
<p>When and how was Christianity first introduced to Korea?
Christianity was first introduced to Korea in the 17th and 18th centuries. The initial variety of Christianity in the peninsula, Nestorianism, spread to Korea in the Middle Ages by way of China via Middle Eastern adherents. In the 1700s, a group of young local scholars who had studied Jesuit missionary Matteo Ricci’s work introduced Catholicism to Korea. The first Western missionaries to enter Korea officially were Horace Grant Underwood and Henry Gerhard Appenzeller, both from the United States. The spread of Christianity in Korea was also influenced by the decline of Confucianism and the impact of Christian ideas on the Korean elite5.</p>
</blockquote>
<p>from you.com</p>
<blockquote>
<p>When and how was Christianity first introduced to Korea?
The first introduction of Christianity to Korea can be traced back to the early 17th century. In 1603, Yi Gwang-jeong, a Korean diplomat, returned from Beijing carrying several theological books written by Matteo Ricci, an Italian Jesuit missionary to China. Yi Gwang-jeong began disseminating the information in these books, thus introducing Christianity to Korea</p>
</blockquote>
<p>from Claude</p>
<blockquote>
<p>Christianity was first introduced to Korea in the late 18th century. Some key events and figures:</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In 1784, a Korean diplomat named Suh Jun-gan came across Catholic writings in Beijing and took them back to Korea. He began propagating Catholic teachings, making him the first Korean Catholic.
In 1794, the first Catholic priest arrived - a Chinese priest named Zhou Wen-mo who ministered to Catholics in Korea in secret. Catholicism was seen as dangerous and was persecuted by Korean authorities.
In 1836, the first Western Catholic priest, Fr. Maubant from France, arrived in Korea to further facilitate the spread of Catholicism.
By the 1860s, there were about 23,000 Korean Catholics. However, persecution intensified, culminating in the deadly 1866 Byeong-in Persecution, where thousands of Korean Catholics were killed.
In 1884, the first Protestant missionary, Horace Underwood from the U.S., arrived in Korea under the banner of the Presbyterian Church. Other denominations soon followed.
</code></pre></div></div>
<blockquote>
<p>So in summary, Catholicism arrived first in the late 1800s, but it initially spread slowly and clandestinely due to persecution. Protestantism arrived slightly later in the 1880s, but began more public evangelization efforts. Together, they laid the groundwork for Christianity to gradually take root and grow in Korea over the next century.</p>
</blockquote>Real World LLM Obfuscation2023-10-01T00:00:01+00:002023-10-01T00:00:01+00:00https://josherich.me/llm/real-world-llm-obfuscate<p>I’ve heard about LLMs’ deobfuscation ability for a while. Like many previous claims of this kind, real-world performance can be far from what’s extrapolated from the demo (usually a social media post). I’m curious, and I’d love to see it flop.</p>
<h3 id="the-experiment">The Experiment</h3>
<p>Decode minified JavaScript code of a Chrome extension <a href="https://chrome.google.com/webstore/detail/chatgpt-summary-for-chrom/mikcekmbahpbehdpakenaknkkedeonhf">ChatGPT Summary for Chrome</a></p>
<p>The CRX source can be unpacked with <a href="https://crxviewer.com/">https://crxviewer.com/</a>.</p>
<h3 id="challenges">Challenges</h3>
<p>People with a good mental model of LLMs will find these quite obvious. The one thing that works consistently in these experiments is localized code snippets with hints. The biggest trouble, however, is that the model refuses to give the code verbatim; it likes to skip details and emit only comments, or function names with empty bodies. I suspect the bias comes from RLHF training.</p>
<ul>
<li>Obviously one-shot doesn’t work due to the context limit.</li>
  <li>A single response (including the first response and continuing responses) refuses to give all the details, which means simply saying “please continue” won’t work.</li>
  <li>Given the above, there is no high-level algorithm to guide the decoding.</li>
  <li>Most importantly, the naive prompt-response approach does not leverage signals from program (static) analysis.</li>
</ul>
<p>Planning sounds fancy, but I suspect it is simpler: perform a topological sort on the call graph to get a list of functions in the order they should be deobfuscated. An obvious heuristic is to start with globals and propagate decoded names each time a localized snippet is decoded.</p>
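<p>That ordering can be sketched in a few lines. Below is a minimal version (the call graph shape and function names are made up for illustration):</p>

```javascript
// Kahn's algorithm over a call graph: decode callees before callers, so
// recovered names can be propagated upward as each snippet is decoded.
function deobfuscationOrder(callGraph) {
  const pending = {}; // per function: number of callees not yet decoded
  const callers = {}; // reverse edges: callee -> list of its callers
  for (const [fn, callees] of Object.entries(callGraph)) {
    pending[fn] = callees.length;
    for (const callee of callees) {
      (callers[callee] = callers[callee] || []).push(fn);
    }
  }
  // start from leaves: functions that call nothing (globals, utilities)
  const queue = Object.keys(callGraph).filter((fn) => pending[fn] === 0);
  const order = [];
  while (queue.length > 0) {
    const fn = queue.shift();
    order.push(fn);
    for (const caller of callers[fn] || []) {
      if (--pending[caller] === 0) queue.push(caller);
    }
  }
  return order; // functions in a cycle (mutual recursion) are left out
}
```

<p>Functions involved in mutual recursion never reach zero pending callees and are simply left out of the order; a real tool would have to break such cycles explicitly.</p>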
<p>Not surprised people have tried <a href="https://arxiv.org/abs/2306.02546">this</a>.</p>
<h3 id="what-efficient-human-in-the-loop-looks-like">What Efficient Human in the Loop Looks Like</h3>
<p>Among the prompts in a single experiment, high-level planning is probably the most important contribution from me as the human in the loop. I have to decide which parts contain all the information the model needs to produce a complete response.</p>
<h3 id="what-about-reverse-engineering">What about Reverse Engineering?</h3>
<p>Among the four steps of reverse engineering, people find that the last one requires the most human-like “thinking”, and naturally try to automate it:</p>
<ul>
<li>data carving and the parsing of filesystems and files</li>
<li>disassembly</li>
<li>decompilation</li>
<li>annotation and commentary</li>
</ul>
<p>The <a href="https://medium.com/tenable-techblog/g-3po-a-protocol-droid-for-ghidra-4b46fa72f1ff">G-3PO</a> project includes a script that essentially uses the following prompt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Below is some C code that Ghidra decompiled from a binary that I'm trying to
reverse engineer.
```
{c_code}
```
Please provide a detailed explanation of what this code does, in {style},
that might be useful to a reverse engineer. Explain your reasoning as much
as possible. {extra}
Finally, suggest suitable names for this function and its parameters.
</code></pre></div></div>
<h3 id="what-does-it-mean-to-open-source-code-in-the-long-run">What does it mean to “open source code” in the long run?</h3>
<p>Note I use the term “open source code”, which is completely different from “open source”.</p>
<p>While I’m optimistic about the knowledge sharing aspect, how this particular ability is adopted is probably dictated more by organizational limit. In recent years, source code breach has become so frequent and yet created little existential threat to those companies. This probably reveals how little code without human resource matters. It’s not even controversial nowadays to say most code in big companies are glued together and need an army to make it sort of work.</p>
<h3 id="appendix">Appendix</h3>
<p>summarize prompt</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Q:
Please provide a detailed explanation of what this obfucated code does, the code is from a Chrome Extension
A:
This obfuscated code appears to be implementing an AI chatbot functionality in a Chrome extension. Here is what I can tell about what it is doing:
- It is using a library called uuid to generate unique IDs for messages.
- It has functions for connecting to the OpenAI API to send messages and receive responses. It supports both using an API key and session tokens for authentication.
- It has code for storing user settings like API key and login type in chrome.storage.
- It opens a connection to the background page when the extension icon is clicked. It will send the contents of the current page to the background to summarize.
- The background page will call OpenAI to summarize the page content. It limits the text to 14000 characters before summarizing.
- The summarized text is sent back to the content script and displayed.
- It also has code to handle OpenAI errors and message feedback.
- There is functionality to open the options page on install and when clicked.
So in summary, this appears to allow summarizing web pages using OpenAI through a chat interface in the Chrome extension. It handles the authentication, API calls, background processing and communication between the different extension components needed to implement this functionality.
</code></pre></div></div>
<p>deobfuscate prompt</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deobfuscate the following code for a Chrome Extension
{code}
give the obfuscated source of the function {xxx}
please continue
please give a list of original function names and obfuscated names
</code></pre></div></div>
<table>
<thead>
<tr>
<th>obfuscated</th>
<th>deobfuscated</th>
</tr>
</thead>
<tbody>
<tr>
<td>780</td>
<td>SessionCache</td>
</tr>
<tr>
<td>918</td>
<td>cleanupMap</td>
</tr>
<tr>
<td>931</td>
<td>createPromise</td>
</tr>
<tr>
<td>l</td>
<td>uuidToStr</td>
</tr>
<tr>
<td>p</td>
<td>createMessagingPipeline</td>
</tr>
<tr>
<td>v</td>
<td>PROMPT_STORAGE_KEY</td>
</tr>
<tr>
<td>g</td>
<td>getSummaryPrompt</td>
</tr>
<tr>
<td>ze</td>
<td>summarizePage</td>
</tr>
<tr>
<td>Je</td>
<td>getAccessToken</td>
</tr>
<tr>
<td>Ke</td>
<td>chatGptRequest</td>
</tr>
</tbody>
</table>
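<p>Once such a mapping exists, propagating decoded names into the remaining snippets can be mechanized. A naive sketch (the helper below is hypothetical; a real tool should rename through an AST, since whole-word string replacement can still clobber string literals and property names):</p>

```javascript
// Naive name propagation: replace whole-word occurrences of obfuscated
// identifiers with their decoded names. AST-based renaming is safer.
function applyRenames(code, renames) {
  let out = code;
  for (const [obfuscated, readable] of Object.entries(renames)) {
    out = out.replace(new RegExp("\\b" + obfuscated + "\\b", "g"), readable);
  }
  return out;
}
```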
<h3 id="workerjs">worker.js</h3>
<div id="container-worker" style="width: 960px; height: 600px; border: 1px solid grey"></div>
<h3 id="scriptjs">script.js</h3>
<div id="container-script" style="width: 960px; height: 600px; border: 1px solid grey"></div>
<h3 id="settingjs">setting.js</h3>
<div id="container-setting" style="width: 960px; height: 600px; border: 1px solid grey"></div>
<script src="../../js/vs/loader.js"></script>
<script>
require.config({ paths: { vs: '../../js/vs' } });
  require(['vs/editor/editor.main'], function () {
    // the three diff editors differ only by file name, so build them in a loop
    ['worker', 'script', 'setting'].forEach(function (name) {
      var diffEditor = monaco.editor.createDiffEditor(document.getElementById('container-' + name));
      Promise.all([
        xhr('../../js/real-world-llm-obfuscate/' + name + '.js'),
        xhr('../../js/real-world-llm-obfuscate/' + name + '-deobf.js')
      ]).then(function (r) {
        diffEditor.setModel({
          original: monaco.editor.createModel(r[0].responseText, 'javascript'),
          modified: monaco.editor.createModel(r[1].responseText, 'javascript')
        });
      });
    });
  });
</script>
<script>
function xhr(url) {
    // The native Promise constructor takes a single executor; the original
    // code passed a second "cancel" callback, which is silently ignored,
    // so it has been dropped here.
    return new Promise(function (resolve, reject) {
      var req = new XMLHttpRequest();
      req.onreadystatechange = function () {
        if (req.readyState === 4) {
          if ((req.status >= 200 && req.status < 300) || req.status === 1223) {
            resolve(req);
          } else {
            reject(req);
          }
          req.onreadystatechange = function () {};
        }
      };
      req.open('GET', url, true);
      req.responseType = '';
      req.send(null);
    });
  }
</script>Running Prometheus promtool on Web with WASM2023-09-24T00:00:01+00:002023-09-24T00:00:01+00:00https://josherich.me/short/prometheus-promtool-wasm<p><a href="https://github.com/josherich/prometheus-wasm">prometheus-wasm</a></p>
<p>WASM is great but the tooling looks terrifying for no reason. I went into this quick hack worrying about being overwhelmed by the tooling. It turned out to be surprisingly easy.</p>
<p>The main hurdle is, of course, knowing the build process itself. I always appreciate a <code class="language-plaintext highlighter-rouge">--verbose</code> flag reporting what is going on underneath, which is important when a separate binary, <a href="https://github.com/prometheus/promu"><code class="language-plaintext highlighter-rouge">promu</code></a>, is used to build the Go project. There is even a public <a href="https://docs.google.com/document/d/1Ql-f_aThl-2eB5v3QdKV_zgBdetLLbdxxChpy-TnWSE/edit#heading=h.24x0bg1hyuak">RFC</a> doc for it.</p>
<p>The next part is removing unsupported features and the associated packages: fs.watcher in fsnotify, mmap, syscall.SIGUSR1 in go-metrics, and sockets in go-connections.</p>
<p>The final part is the wasm file and the glue code on both sides. None of this looks right:</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">go</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">Go</span><span class="p">();</span>
<span class="nx">WebAssembly</span><span class="p">.</span><span class="nx">instantiateStreaming</span><span class="p">(</span><span class="nx">fetch</span><span class="p">(</span><span class="dl">"</span><span class="s2">promtool.wasm</span><span class="dl">"</span><span class="p">),</span> <span class="nx">go</span><span class="p">.</span><span class="nx">importObject</span><span class="p">).</span><span class="nx">then</span><span class="p">((</span><span class="nx">result</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">go</span><span class="p">.</span><span class="nx">run</span><span class="p">(</span><span class="nx">result</span><span class="p">.</span><span class="nx">instance</span><span class="p">);</span>
<span class="p">});</span>
<span class="c1">// it should really just be</span>
<span class="kd">const</span> <span class="nx">go</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">Go</span><span class="p">(</span><span class="dl">"</span><span class="s2">promtool.wasm</span><span class="dl">"</span><span class="p">);</span>
<span class="nx">go</span><span class="p">.</span><span class="nx">run</span><span class="p">();</span>
</code></pre></div></div>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">js</span><span class="o">.</span><span class="n">Global</span><span class="p">()</span><span class="o">.</span><span class="n">Set</span><span class="p">(</span><span class="s">"checkRules"</span><span class="p">,</span> <span class="n">CheckRulesWebWrapper</span><span class="p">())</span>
<span class="o"><-</span><span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="kt">bool</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// makes no sense for app code to do this</span>
</code></pre></div></div>
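<p>Purely as a sketch of what the nicer API could look like: a tiny wrapper over the real glue (the <code class="language-plaintext highlighter-rouge">GoApp</code> name is hypothetical and not part of wasm_exec.js, which must already have defined the global <code class="language-plaintext highlighter-rouge">Go</code> class):</p>

```javascript
// Hypothetical wrapper giving the one-liner the post wishes for:
//   new GoApp("promtool.wasm").run()
// Assumes wasm_exec.js has already defined the global `Go` class.
class GoApp {
  constructor(wasmUrl) {
    this.wasmUrl = wasmUrl;
    this.go = new Go(); // provided by wasm_exec.js
  }
  async run() {
    const result = await WebAssembly.instantiateStreaming(
      fetch(this.wasmUrl),
      this.go.importObject
    );
    return this.go.run(result.instance);
  }
}
```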
<section>
<textarea name="" id="rules" cols="30" rows="10"></textarea>
<button id="check_rules" onclick="checkRules()">promtool check rules</button>
</section>
<section>
<textarea name="" id="config" cols="30" rows="10"></textarea>
<button id="check_config" onclick="checkConfig()">promtool check config</button>
</section>
<script src="../../js/wasm_exec.js"></script>
<script>
const go = new Go();
WebAssembly.instantiateStreaming(fetch("../../js/promtool.wasm"), go.importObject).then((result) => {
go.run(result.instance);
const ruleExample = `groups:
- name: example
rules:
- record: code:prometheus_http_requests_total:sum
expr: sum by (code) (prometheus_http_requests_total)`;
const res = checkRules(ruleExample);
console.log(`Result of "promtool check rules"`, res);
document.getElementById("rules").value = ruleExample;
document.getElementById("check_rules").onclick = () => {
const rules = document.getElementById("rules").value;
const res = checkRules(rules);
window.alert(res);
};
const configExample = `# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label job=<job_name> to any timeseries scraped from this config.
- job_name: "prometheus"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9090"]`;
document.getElementById("config").value = configExample;
document.getElementById("check_config").onclick = () => {
const config = document.getElementById("config").value;
const res = checkConfig(config, false, false, 'all');
window.alert(res);
};
});
</script>Software Doesn’t Work For the Happiest Path2023-09-18T00:00:01+00:002023-09-18T00:00:01+00:00https://josherich.me/short/kernel-panic-4k<p>Anyone with the slightest amount of software experience knows that it’s always better NOT to upgrade unless you absolutely have to. I paid the price for not strictly following this advice and naively believing it would be okay since I was supposedly on the happiest path. Here’s what happened:</p>
<p>Ten minutes after agreeing to upgrade Ubuntu 22.04 to 23.04, the monitor went completely black. I had to manually reboot the system, and it kept doing nothing. Not much can be inferred from a completely black screen and an unresponsive keyboard and mouse. My hunch was that the upgrade had been interrupted by the manual reboot, and that the black screen could be related to a 4k monitor driving issue; both turned out to be right. After connecting a 1080p monitor, the following text showed once booting the new 6.x kernel failed:</p>
<blockquote>
<p>Kernel panic - not syncing: VFS: Unable to mount root fs on unknown block(0,0)</p>
</blockquote>
<p>The rest was straightforward: after successfully booting from the old 5.x kernel, running <code class="language-plaintext highlighter-rouge">dpkg --configure -a</code> redid and finished the upgrade.</p>
<p>The same lesson just repeats over and over: software is nowadays written in a way that no robustness is guaranteed, even for the happiest path. I got the most popular motherboard, the most popular graphics card, the most popular Linux distro, the most popular monitor category. When combined, the reliability is embarrassing.</p>Puppeteer PDF Generation Performance and Size Over Time2023-08-02T00:00:01+00:002023-08-02T00:00:01+00:00https://josherich.me/short/puppeteer-perf-over-time<p><img src="/images/puppeteer-perf.png" alt="puppeteer perf history" /></p>
<p>The test environment is WSL 1.2.5 on a Windows laptop (i7-11850H, 32GB RAM).
I’m going to post the results from an M1 laptop soon, and they don’t look good.</p>Zelda ToTK Kudanisar Shrine Orb Recall2023-07-13T00:00:01+00:002023-07-13T00:00:01+00:00https://josherich.me/short/zelda-recall<p>I find it very confusing that the orb in Kudanisar shrine ends up being recalled, but there is no cue telling players that the orb’s recall duration is not 20s but infinite.</p>
<p>Another interesting design choice about recall is its near-infinite range. There’s also almost no limitation on the recall target.</p>
<p>The 20s limit seems to be the main factor preventing recall from breaking the game mechanics.</p>
<p><img src="/images/zelda-kudanisar-shrine-orb.jpg" alt="zelda" /></p>My Two Cents on the new Bard2023-07-13T00:00:01+00:002023-07-13T00:00:01+00:00https://josherich.me/short/two-cents-bard<ol>
<li>diffing text</li>
</ol>
<p>None of the four is in the correct answer!
<img src="/images/bard-2023-07-13-3.png" alt="diffing text" /></p>
<ol>
<li>(very) basic text processing</li>
</ol>
<p>All previous versions refuse to do the task, maybe due to some legal terms. All non-OpenAI models miss the “new list” part.
<img src="/images/bard-2023-07-13-1.jpg" alt="text processing" /></p>
<p>Not from Bard, but an interesting <a href="https://www.youtube.com/watch?v=-cAB5FG4bXI&t=993s">example</a> about lead acetate solution being cloudy
<img src="/images/bard-2023-07-13-2.png" alt="lead acetate cloudy" /></p>2 Millions WeakMap Keys2022-06-18T00:00:01+00:002022-06-18T00:00:01+00:00https://josherich.me/pl/million-times<p>I was told multiple times that infrastructure should be “upgraded frequently so that upgrading gets easier and easier”, and that’s what my colleagues did. They decided to bump the Meteor version after staying on a very old version for a long time. It did not go well.</p>
<p>The build got super slow. People seemed not to notice the slower dev build because it has always been slow: without cache it can take 2 hours and almost always exceeds the Node memory limit. The not-noticing part is funny, but the root cause is funnier, and my path to finding it was longer than it should have been. What’s most interesting is that nothing stops these tools and methods from being automated, yet neither Babel nor Meteor cares enough to do that.</p>
<ul>
<li>The tl;dr version: a WeakMap with more than 2 million keys is slow</li>
  <li>The tl;dr lesson: Babel and Meteor do not run performance regression tests</li>
  <li>Another tl;dr lesson, illustrated in the following revised <a href="https://xkcd.com/2347/">xkcd</a>: the dependency flaw is recursive</li>
</ul>
<p><img src="/images/millions-of-times/recursive-deps.png" alt="resursive-deps" /></p>
<h4 id="slow-meteor-building">Slow Meteor building</h4>
<p>Any developer who has upgraded libraries or frameworks probably won’t be surprised to hit a major performance downgrade. That’s why “not upgrading unless one has to” was believed, for good reasons, to be a best practice before the above claim about upgrading frequently. So that’s what you get: <a href="https://github.com/meteor/meteor/issues/11905">build times increased significantly since upgrade to 2.5.6</a>, where ‘significantly’ really means orders of magnitude.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Babel.compile..........................................603,897 ms (25639)
linker File#computeAssignedVariables...................488,136 ms (27564)
other ImportScanner#findImportedModuleIdentifiers.......43,497 ms (26261)
</code></pre></div></div>
<p>It’s not hard to infer what that means for the dev pipelines. The build is such a universal bottleneck that these pipelines become meaningless, because any node in the sequential jobs can time out and break the flow.</p>
<h4 id="babel-parser">Babel parser</h4>
<p>Surprisingly, it’s not hard to figure out the slow steps, thanks to Meteor’s profiling. I was first misled by the above thread, which suggested the cache for parser results might need a resize because, statistically, the order in which package files get processed (linked, in Meteor’s context) might change. As always, the truth is simpler than it appears. Two particular versions of the parser have a major performance hit. What’s more interesting is that the hit only happens when a WeakMap has more than about two million keys, which is very common for that use case in the parser.</p>
<p>What happened next was tedious; I ended up just replacing the parser code <code class="language-plaintext highlighter-rouge">tools/node_modules/@babel/parser/lib/index.js</code> with that of an older version, and the fast build was back.</p>
<h4 id="dive-deeper">Dive Deeper</h4>
<p>The performance <a href="https://github.com/babel/parser_performance">test</a> written by the Babel team turns out to be quite revealing. It’s also ironic that it is one tiny step away from being automated.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────┬────────────────────────────────┬──────────────────────────────┐
│ fixture │ babel_parser_7.16.12 │ babel_parser_7.17.12 │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es5/angular.js │ 10.29 ops/sec ±22.89% (97ms) │ 70.5 ops/sec ±10.83% (14ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es5/ember.debug.js │ 1.28 ops/sec ±134.41% (779ms) │ 25.55 ops/sec ±6.51% (39ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es5/babylon-dist.js │ 2.35 ops/sec ±129.79% (426ms) │ 122 ops/sec ±3.96% (8.193ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es5/jquery.js │ 2.98 ops/sec ±102.73% (335ms) │ 146 ops/sec ±8.2% (6.864ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es5/backbone.js │ 16.61 ops/sec ±96.49% (60ms) │ 703 ops/sec ±1.46% (1.422ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es5/react-with-addons.js │ 1.75 ops/sec ±138.36% (572ms) │ 73.69 ops/sec ±2.3% (14ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es6/angular-compiler.js │ 0.52 ops/sec ±158.51% (1921ms) │ 35.22 ops/sec ±12.04% (28ms) │
├──────────────────────────┼────────────────────────────────┼──────────────────────────────┤
│ es6/material-ui-core.js │ 1.05 ops/sec ±173.55% (956ms) │ 43.82 ops/sec ±7.19% (23ms) │
└──────────────────────────┴────────────────────────────────┴──────────────────────────────┘
</code></pre></div></div>
<p>Another general way to dig out the root cause of a performance hit is a flame chart:</p>
<p><img src="/images/millions-of-times/7.16.8.png" alt="7.16.8" />
<img src="/images/millions-of-times/7.16.10.png" alt="7.16.10" /></p>
<p>It’s immediately obvious that <code class="language-plaintext highlighter-rouge">curPosition</code> is the single source of slowness. Its only job is to create a Position object that tells where an error happens. In a suggestion <a href="https://github.com/babel/babel/pull/14130#discussion_r785454366">comment</a>, a WeakMap is used to cache and expose the AST node offset instead of a plain object property. To be fair, it’s too much to ask to foresee the consequence, since the performance hit of WeakMap only shows up when the number of keys exceeds 2 million.</p>
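<p>To make the trade-off concrete, here is a heavily simplified sketch of the two shapes (identifiers are hypothetical, not Babel’s actual code): caching one position object per node in a WeakMap versus reading an index stored on the node itself. With millions of AST nodes, the first variant is the one that creates millions of WeakMap keys:</p>

```javascript
// Hypothetical, heavily simplified sketch -- not Babel's actual code.
// Variant 1: cache one position object per AST node in a WeakMap.
const positionCache = new WeakMap();
function getPositionViaWeakMap(node) {
  let pos = positionCache.get(node);
  if (pos === undefined) {
    pos = { index: node.start };
    positionCache.set(node, pos); // one WeakMap entry per node
  }
  return pos;
}

// Variant 2 (the direction of the eventual fix): keep the index on the
// node itself, so no WeakMap grows along with the AST.
function getPositionViaProperty(node) {
  return { index: node.index };
}
```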
<p>Eventually the <a href="https://github.com/babel/babel/pull/14174/commits/40475e0dea53ce7ef064df30cea7b559a3349f30#diff-6a9848ed0c6fa07e549e2c093dc65a0390484d710088036c465b925fa0e7f4a4">fix</a> was made and was concluded with this claim:</p>
<blockquote>
<p>There may exist some performance argument for switching over to .index as well, but from what I’ve seen so far it doesn’t seem to be too substantial.</p>
</blockquote>
<p>The thread also tells us Prettier and one of the Netlify projects got this performance hit.</p>
<h4 id="2-millions-keys">2 Million Keys</h4>
<p>Of course the first thing I did was ask the internet, specifically the V8 dev discussions. The most relevant <a href="https://bugs.chromium.org/p/v8/issues/detail?id=4086">thread</a> was again concluded with this statement:</p>
<blockquote>
<p>The initially observed behavior seems to be fixed</p>
</blockquote>
<p>No, it’s not.</p>
<p>Running the exact same test suggested by the OP gives you this graph</p>
<p><img src="/images/millions-of-times/weakmap-perf.png" alt="weakmap-perf" /></p>
<p>The <code class="language-plaintext highlighter-rouge">set</code> op gets 10 times slower at almost exactly 2 million keys and then keeps getting linearly slower, with spikes at 2.7 million and 5.5 million. Most surprisingly, the test stalls at about 12 million keys and never finishes, no matter how much more memory is assigned.</p>
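<p>For reference, the general shape of such a micro-benchmark (a sketch, not the OP’s exact script; reproducing the graph requires pushing <code class="language-plaintext highlighter-rouge">totalKeys</code> well past the 2 million mark):</p>

```javascript
// Time WeakMap#set in batches as the key count grows; plotting `timings`
// is what produces a graph like the one above. Keys are retained in an
// array so entries are never garbage-collected mid-run.
function benchWeakMapSet(totalKeys, batchSize) {
  const map = new WeakMap();
  const keys = [];
  const timings = [];
  for (let n = 0; n < totalKeys; n += batchSize) {
    const start = Date.now();
    for (let i = 0; i < batchSize; i++) {
      const key = {};
      keys.push(key);
      map.set(key, i);
    }
    timings.push({ keys: n + batchSize, ms: Date.now() - start });
  }
  return timings;
}
```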
<h4 id="next">Next</h4>
<p>I would love to spend more time digging into V8 to figure out the spikes and the halting.</p>
<p>On the other hand, the experience summarizes very well what has gone wrong with software dev. Assuming the spikes come from a compromise made in V8, it flows through the cracks of numerous tests and processes, all the way to ruining the physical world.</p>On Year Review and Blogging2022-01-02T00:00:01+00:002022-01-02T00:00:01+00:00https://josherich.me/reading/on-year-review<p>I’ve been enjoying reading year reviews from my blog RSS feed and douban feed. Since 2021 was quite a boring year for me, I’d rather write a review about reviews.</p>
<p>The most common thing people talk about in year reviews is plans and execution. It is common for SaaS companies to talk about business execution in public posts, but it’s even more fun to read about how personal projects or businesses play out. My favorite on this topic is people explaining the thought process behind their signature work <a href="https://jvns.ca/blog/2021/12/31/2021--year-in-review/">https://jvns.ca/blog/2021/12/31/2021--year-in-review/</a>:</p>
<blockquote>
<p>The explanation looks something like “the reason many people struggle with TOPIC is because they don’t understand X, here’s what you need to know”</p>
</blockquote>
<p>These secret recipes are usually quite easy to spot: they look over-simplified and biased because it’s what we use to remind ourselves of the most important part of our thinking process. It’s a recent popular statement that you can’t really rely on being creative to repeatedly publish good content, rather people figure out practices and frameworks that serve as the limit and container for content at the same time. Despite the fact that these practices are tailored to its purpose, one should still find them easy to transform. The Bartok and Coltrane example <a href="https://youtu.be/QCwqnjxqfmY?t=934">https://youtu.be/QCwqnjxqfmY?t=934</a> says it very well: a problem/constraint is proposed for the author to solve, a design for performers/authors.</p>
<p>Another category of year review is a summary of one’s work <a href="https://www.scattered-thoughts.net/writing/2021/">https://www.scattered-thoughts.net/writing/2021/</a>. Although the listing can be too dry to read in detail, it can be a perfect index page for revisiting the blog. It’s not a convention for bloggers to write reviews of their past posts and work, but I find these posts far more effective than About pages. It’s sad that the blog, which was invented quite recently and has had plenty of opportunities to evolve with the internet, still lacks the basic conveniences of the ebook or even the physical book: bookmarks, good search, indexes and glossaries. If there are things we need to pick up to make blogs great again, making the content more accessible is one of them.</p>
<p>There are also more personal posts <a href="https://yufree.cn/cn/2021/12/31/34/">https://yufree.cn/cn/2021/12/31/34/</a>, <a href="http://bowarrowstreet.blogspot.com/2021/12/blog-post.html">http://bowarrowstreet.blogspot.com/2021/12/blog-post.html</a> where people talk about moving, traveling, cooking and career choices. These are the most unique topics, ones that can’t really be found anywhere else. I always appreciate that the authors are willing to share their personal lives in a way that I usually only get from close friends when we visit each other in person, which has become even rarer since the pandemic. If blogs were invented today by big tech, these would definitely be exclusive paid content. There’s a small chance that Substack keeps growing and everyone gets on board; if so, it’ll be sad when people start to put these personal experiences behind a $1 pay wall.</p>
<h3 id="booksreading-review">Books/Reading review</h3>
<p>Sadly but naturally, book reviews are still the best source for randomly picking a book to read, unless one believes ML has achieved superhuman intelligence and relevancy matters more than quality. On the other hand, I’m surprised how badly douban’s effort on its book community ended up. They basically abandoned the recommendation algorithm (<a href="https://book.douban.com/recommended?icn=index-nav">https://book.douban.com/recommended?icn=index-nav</a>) and a few social media features (<a href="https://book.douban.com/updates?icn=index-nav">https://book.douban.com/updates?icn=index-nav</a>) in the book section. The rating system was also rigged by paid reviews once publishers figured out how huge an influence reviews have on sales numbers.</p>
<p>Tyler’s book reviews <a href="https://marginalrevolution.com/marginalrevolution/2021/12/what-ive-been-reading-210.html?utm_source=rss&utm_medium=rss&utm_campaign=what-ive-been-reading-210">https://marginalrevolution.com/marginalrevolution/2021/12/what-ive-been-reading-210.html</a> are quite good; for Chinese books, I got a few good ones from douban <a href="https://m.douban.com/note/822990754/">https://m.douban.com/note/822990754/</a>, <a href="https://www.douban.com/note/823105912/?dt_dapp=1&utm_source=pocket_mylist">https://www.douban.com/note/823105912/</a>, <a href="https://www.douban.com/note/822755021/?_i=1177663QBK6wW1,1177933QBK6wW1">https://www.douban.com/note/822755021/</a></p>
<p>Douban’s user base and activity started to shrink probably five years ago. It’s a shame that fewer and fewer people write year reviews there, because its book and movie integration and unique culture are perfect for encouraging people to write year reviews about books and movies.</p>
<h3 id="video">Video</h3>
<p>In 2021 I started to watch a lot more YouTube videos on various topics that I hadn’t consumed in video form before: cars (Alex on Autos, Engineering Explained), electronics (Applied Science, GreatScott!), OS dev (Andreas Kling), farm business (Gold Shaw Farm) and even chip making (Sam Zeloof). It’s amazing that nowadays you can search quite technical terms on YouTube and get either personal projects with fewer than 1k views or popular short explainers.</p>
<p>Although year reviews are not really a thing for YouTubers, a few still showed up in my subscription feed, and they are all quite interesting <a href="https://www.youtube.com/watch?v=Okqrbwf4tjA">https://www.youtube.com/watch?v=Okqrbwf4tjA</a>, <a href="https://www.youtube.com/watch?v=p0demxxnon0">https://www.youtube.com/watch?v=p0demxxnon0</a></p>
<p>Aside from the parts that are exclusive to YouTube, or more specifically to making videos, video year reviews don’t offer much that has to be presented visually. I’d rather read a blog post with plenty of data and analysis about the practice of making videos or fighting with platforms.</p>
<h3 id="reflecting-predictions">Reflecting Predictions</h3>
<p><a href="https://www.youtube.com/watch?v=46f-YuSyAvU">https://www.youtube.com/watch?v=46f-YuSyAvU</a>, <a href="https://twitter.com/VitalikButerin/status/1477402749994156036">https://twitter.com/VitalikButerin/status/1477402749994156036</a></p>
<p>It’s not a popular idea to make predictions about the near future (i.e. the coming year) and reflect on them, probably because the feedback loop is too long and, in general, prediction is super hard <a href="https://sites.google.com/site/steveyegge2/ten-predictions">https://sites.google.com/site/steveyegge2/ten-predictions</a>. The good news is that predictions offer even more insight when proved wrong. The difficult part about reflecting on them is assessing them properly, after removing black swans and information that wasn’t available when they were made. Hence it’s probably worth stating more assumptions than appears necessary, just for the convenience of reflecting.</p>
<h3 id="blog">Blog</h3>
<p>For people who are very conscious of social media’s toxicity, blogs are still a good alternative where good-quality content can be shared and found, although obviously an order of magnitude below social media in terms of quantity and popularity. Fortunately, for certain topics such as computer science, programming, economics and electrical engineering, blogging is still sort of at its peak, mainly because people in these areas share a nerd culture and still appreciate a few not-so-obvious advantages of blogging <a href="https://danluu.com/programming-blogs/">https://danluu.com/programming-blogs/</a>.</p>
<p>Blogs are in many ways better than institutional media; their two major issues are noise and discovery, which social media is good at. I find anything that combines the elements of the two interesting in different ways. Tyler’s assorted links <a href="https://marginalrevolution.com/marginalrevolution/2021/12/most-popular-mr-posts-of-the-year-2.html">https://marginalrevolution.com/marginalrevolution/2021/12/most-popular-mr-posts-of-the-year-2.html</a> don’t hesitate to present these relatively raw characteristics. Although the comment section doesn’t appear too informative to me, it does seem to provide meaningful feedback to Tyler. While MR is perfect in its own domain, it’s not really reproducible, given how rare it is for a domain expert to commit to writing blog posts on such a regular basis.</p>
<p>While platforms like Substack and Patreon offer a lightweight solution to noise and discovery, their corporate nature has a huge long-term influence on the kind of content creators choose to publish. To minimize that influence, I believe the payment and discovery mechanisms must be completely independent components.</p>
<h3 id="monetization-patreon-and-substack">Monetization, Patreon and Substack</h3>
<p>I realize there is now a spectrum of how far monetization can go. Next to fully open and free, Sam Harris <a href="https://www.samharris.org/">https://www.samharris.org/</a> only puts an extra step in front of otherwise free content ⇒ Patreon allows creators to set quite a few different tiers ⇒ Substack seems to only allow monthly and annual subscriptions with a fixed charge ⇒ institutional media with a pay wall.</p>
<p>Looking at Substack’s top paid publications in technology <a href="https://substack.com/discover/category/technology/paid">https://substack.com/discover/category/technology/paid</a>, specifically the category of posts that sit behind the pay wall, makes me feel it’s no different from institutional media, where the pay wall is designed to maximize revenue and nothing more. I’m certainly not against pay walls or any mechanism that allows authors to monetize their work; it’s the choice of what to charge for and how to get the audience to pay that matters most. If creators had full control over how to monetize, I doubt they would put a constraint on the granularity of paying for content.</p>
<p>However, creator platforms’ decisions are definitely highly influenced by the recent success of monthly subscriptions, probably under the disguise of “support”, judging by how Patreon argues that per-creation payment is almost a legacy feature:</p>
<blockquote>
<p>Instead of thinking of your paid posts as things that your patrons have “purchased” try to think about your paid posts as content you’ve created– and your patrons have membership that unlocks access to your content. Patrons value that content at different levels of support.</p>
</blockquote>GitHub Enters Everyday Life2019-12-25T00:00:01+00:002019-12-25T00:00:01+00:00https://josherich.me/reading/github-more-than-code<p>Whether by chance or by a flash of inspiration, GitHub has increasingly come to represent a way of speaking out, presenting, and getting feedback. At the tool level, it is not essentially different from the classic BBS: a non-live medium for presenting and co-editing text. Behind the text presentation, its most important features, Git version control and Pull Requests, can be seen as a kind of parliamentary process. Whether <a href="https://github.com/drop-ice/dear-github-2.0">open letters</a>, <a href="https://github.com/996icu/996.ICU">petitions</a>, or <a href="https://github.com/sindresorhus/awesome">encyclopedic lists</a>, each embodies some intersection of a group. This intersection also fits the fragmented life of modern people: through the watch feature, participants don’t need to check in every day like BBS users did; they can follow a project and its specific issues continuously.</p>
<p>Just like early internet users, participants in such projects have a clearly libertarian leaning. In my view, though, this actually limits participants’ imagination. Neither in the types of issues nor in the modes of operation should GitHub be confined to petitions and appeals.</p>
<p>In every declining field, we can perhaps find a better projection of it here: magazines, news, publishing, writing, marketing, legal aid, community issues…</p>
<p><a href="https://github.com/drop-ice/dear-github-2.0">Open letter to ICE</a></p>
<p><a href="https://github.com/timqian/chinese-independent-blogs">Chinese independent blogs</a></p>
<p><a href="https://github.com/996icu/996.ICU">996 ICU</a></p>
<p><a href="https://github.com/formulahendry/955.WLB">work life balance</a></p>
<p><a href="https://github.com/Jinnrry/getAwayBSG">Escape Beijing, Shanghai, Guangzhou</a></p>
<p><a href="https://github.com/HuaweiJoke/Huawei-Joke">Huawei jokes</a></p>
<p><a href="https://github.com/evil-huawei/evil-huawei">evil huawei</a></p>
<p><a href="https://github.com/facert/beijing_house_knowledge">Buying a home in Beijing</a></p>
<p><a href="https://github.com/renjie-feng-trash/fengrenjie">Feng Renjie</a></p>
<p><a href="https://github.com/CNwoman-bot/evil-man">Gender violence</a></p>
<p><a href="https://github.com/JadaGates/ShadowsocksBio/blob/master/readme.md">Shadowsocks history notes</a></p>
<p><a href="https://github.com/HDYA/constitution-of-fudan-university/pull/1/files">Fudan University charter</a></p>
<p><a href="https://github.com/Pratitya/wuhan2020-timeline">Epidemic and public opinion: Wuhan COVID-19 TIMELINE</a></p>