In an article earlier today about Waymo announcing that it’s going to enter four more cities — Philadelphia, Pittsburgh, Baltimore, and St. Louis — one of our readers provided a couple of brilliant comments. Now, let me say that I didn’t mention Tesla in the article, but there’s been a long, long debate between the two companies’ approaches and which is better. The reader, “Matthew2312,” did a great job, though, of explaining some core differences between the two approaches — far beyond “cameras vs. lidar + cameras.” Below are his two comments. —Zach Shahan
People misunderstand the scaling challenge. They think it is about vehicles, but that is the easier part. Google’s proprietary TPU training infrastructure is the primary technological engine behind Waymo’s sudden ability to scale.
Waymo has adopted the Gemini training playbook for autonomous driving. Using the same Google TPU pods that trained Gemini, Waymo has shifted to a single large “Waymo Foundation Model.” This model is “generalizable”: instead of training a model specifically for Phoenix and then re-training it for months to understand Miami, they train one massive model on all their data. Waymo rebuilt its entire training stack on JAX, the exact same machine learning framework Google DeepMind built to create Gemini. Because they share the JAX/TPU ecosystem, optimizations made for Gemini can be quickly ported to the Waymo Driver.
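To illustrate the “one generalizable model” idea the reader describes, here is a minimal, hypothetical sketch of a standard JAX training step in which pooled data from every city updates a single set of parameters. The model, dimensions, and loss are toy placeholders invented for this example, not anything from Waymo’s actual stack:

```python
# Minimal sketch (not Waymo code): one generalizable model trained on pooled
# data from every city, using the standard JAX jit/grad training pattern.
import jax
import jax.numpy as jnp

def init_params(key, in_dim=64, hidden=128, out_dim=16):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) * 0.01,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, out_dim)) * 0.01,
        "b2": jnp.zeros(out_dim),
    }

def model(params, scene_features):
    # Stand-in for a much larger driving "foundation model".
    h = jax.nn.relu(scene_features @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def loss_fn(params, batch):
    pred = model(params, batch["features"])
    return jnp.mean((pred - batch["targets"]) ** 2)

@jax.jit  # jit compiles through XLA, the same compiler path used on TPU pods
def train_step(params, batch, lr=1e-3):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

# The key idea: batches from Phoenix, Miami, Houston, etc. all flow through
# the SAME parameters, instead of one model per city.
key = jax.random.PRNGKey(0)
params = init_params(key)
batch = {
    "features": jax.random.normal(key, (32, 64)),  # pooled multi-city scenes
    "targets": jax.random.normal(key, (32, 16)),   # e.g. future-motion labels
}
params, loss = train_step(params, batch)
```

Because the same jit/grad pattern (and the same compiler) underpins Gemini training, this is the sense in which infrastructure improvements made for one model can carry over to the other.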
The simulation advantage.
Scaling to new cities relies heavily on simulation. An internal and vastly more complex version of the publicly available Waymax simulation engine runs on the Gemini TPU infrastructure. This allows them to “validate” a city like Houston for safety before they deploy a large fleet there, cutting the physical testing timeline by months. There are other things at play here as well, like the optical-switch “Jupiter” network that allows thousands of chips to talk to each other incredibly fast.
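For a sense of why a JAX-style simulator maps so well onto that hardware, here is a toy sketch (not the real Waymax API) of a pure, jittable step function being batched across thousands of scenarios with vmap. The point-mass dynamics and all names here are made up for illustration:

```python
# Toy sketch (not the Waymax API): a pure, jittable step function lets
# thousands of driving scenarios be simulated in parallel on one accelerator.
import jax
import jax.numpy as jnp

def step(state, action, dt=0.1):
    # Hypothetical point-mass dynamics: state = [x, y, vx, vy], action = [ax, ay]
    pos = state[:2] + state[2:] * dt
    vel = state[2:] + action * dt
    return jnp.concatenate([pos, vel])

def rollout(state, actions):
    # Unroll one scenario for len(actions) timesteps with lax.scan.
    def body(s, a):
        s_next = step(s, a)
        return s_next, s_next
    _, trajectory = jax.lax.scan(body, state, actions)
    return trajectory

# vmap turns the single-scenario rollout into a batched one; jit compiles it.
batched_rollout = jax.jit(jax.vmap(rollout))

key = jax.random.PRNGKey(0)
n_scenarios, horizon = 4096, 80                       # 8 s at 10 Hz per scenario
init_states = jax.random.normal(key, (n_scenarios, 4))
actions = jax.random.normal(key, (n_scenarios, horizon, 2)) * 0.1

trajectories = batched_rollout(init_states, actions)  # shape (4096, 80, 4)
```

Running validation scenarios by the thousand in parallel, rather than one at a time, is what compresses months of physical testing into compute time.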
Then there is the completely different approach that Tesla takes vs. Waymo. Google is now using MotionLM as a central feature of its Foundation Model. Instead of predicting physics (x, y, z coordinates), this model treats driving like a language. [As an aside, Google’s next step is a model design called EMMA (End-to-End Multimodal Model for Autonomous Driving), which is still experimental.]
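The public MotionLM paper describes motion forecasting as autoregressive generation over discrete motion tokens. Below is a toy sketch of just the tokenization step, with bin counts and ranges invented for illustration rather than taken from Waymo:

```python
# Toy sketch of the MotionLM idea: quantize continuous motion into discrete
# "words" so a language-model-style decoder can predict them. Bin sizes and
# vocabulary layout here are illustrative, not Waymo's actual choices.
import jax.numpy as jnp

N_BINS = 13                       # discrete levels per axis
MAX_DELTA = 3.0                   # max displacement (m) covered per timestep

def tokenize(deltas):
    # deltas: (T, 2) continuous (dx, dy) displacements per timestep.
    bins = jnp.linspace(-MAX_DELTA, MAX_DELTA, N_BINS)
    dx_idx = jnp.argmin(jnp.abs(deltas[:, :1] - bins), axis=-1)
    dy_idx = jnp.argmin(jnp.abs(deltas[:, 1:] - bins), axis=-1)
    return dx_idx * N_BINS + dy_idx           # one token per step, vocab = 13*13

def detokenize(tokens):
    bins = jnp.linspace(-MAX_DELTA, MAX_DELTA, N_BINS)
    return jnp.stack([bins[tokens // N_BINS], bins[tokens % N_BINS]], axis=-1)

# A gentle right turn becomes a short "sentence" of motion tokens, which an
# autoregressive transformer can model exactly like text.
trajectory_deltas = jnp.array([[1.0, 0.0], [1.0, 0.2], [0.8, 0.5], [0.5, 0.8]])
tokens = tokenize(trajectory_deltas)
recovered = detokenize(tokens)                # approximate reconstruction
```

Once motion is a token sequence, the same transformer machinery (and the same TPU training recipes) used for text can be applied to predicting what every agent in a scene will do next.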
What anyone can see from the above very basic commentary is that Tesla’s and Google’s approaches have completely diverged. This is not about “LiDAR vs. vision” but is instead a deep and profoundly different approach to how the AI treats the self-driving problem. MotionLM vs. the NN reflex system is not just apples vs. pears but apples vs. baseballs. The Dojo failure significantly slowed Tesla’s simulation capability (now backfilling with NVIDIA), while the move to a JAX-based foundation model and Waymax served as a huge accelerant for Waymo. Two years ago, the rapid scaling we see today was not possible.
Just for interest, I asked Gemini to explain the difference between MotionLM and the NN approach. Here is what it said:
You hit the nail on the head regarding the philosophical divergence. Here is why your “Language vs. Reflex” point is the critical differentiator:
A. The “Reflex” Approach (Tesla)
* The Philosophy: “Driving is a biological process of vision and reaction.”
* The Mechanism: A Neural Net (End-to-End) takes video frames and outputs a continuous signal (Steering Angle: 14.2°).
* The Weakness: It is probabilistic and hard to debug. If the car drifts into a barrier, you can’t ask “Why?” The answer is just “The neurons fired that way.” It requires massive data to smooth out the errors, like training a dog by repetition.
B. The “Language” Approach (Waymo / MotionLM)
* The Philosophy: “Driving is a social negotiation.”
* The Mechanism: The system discretizes the world into “Tokens” (Words).
* Input: “Pedestrian at curb.”
* MotionLM Prediction: “Yield.” (Not ‘slow down by 12%’, but the distinct concept of yielding).
* The Strength: It allows for reasoning. The car can effectively “talk to itself” about the scene. “If I <inch forward>, the other car will <stop>.” This makes the system far more robust in complex, multi-agent scenarios (like a busy 4-way stop) where “reflexes” aren’t enough—you need “intent.”
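To make Gemini’s “reflex vs. language” contrast above concrete, here is a toy sketch (neither company’s actual code) of the two kinds of output heads: one regresses a continuous steering angle, the other scores a hypothetical vocabulary of discrete intent tokens:

```python
# Toy contrast of the two philosophies above (neither company's actual code):
# a "reflex" head regresses a continuous control value; a "language" head
# scores a discrete vocabulary of intents that the planner can reason over.
import jax
import jax.numpy as jnp

INTENTS = ["yield", "proceed", "inch_forward", "stop"]   # hypothetical vocabulary

def reflex_head(features, w):
    # End-to-end regression: scene features -> steering angle in degrees.
    return jnp.tanh(features @ w) * 30.0                 # e.g. 14.2°

def language_head(features, w):
    # Classification over discrete intent tokens: scene features -> P(intent).
    return jax.nn.softmax(features @ w)                  # e.g. P("yield") = 0.91

key = jax.random.PRNGKey(0)
features = jax.random.normal(key, (64,))
steer = reflex_head(features, jax.random.normal(key, (64,)))
intent_probs = language_head(features, jax.random.normal(key, (64, len(INTENTS))))
predicted_intent = INTENTS[int(jnp.argmax(intent_probs))]
```

The practical difference is inspectability: a discrete “yield” token can be logged, questioned, and chained into “if I inch forward, the other car will stop” style reasoning, whereas a raw 14.2° output cannot.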
Interesting stuff. Great points. What does all this mean for these two companies’ self-driving expansion? We’ll see…