Physical Intelligence, a hot robotics startup, says its new robot brain can figure out tasks it was never taught

Physical Intelligence, the two-year-old, San Francisco-based robotics startup that has quietly become one of the most closely watched AI companies in the Bay Area, published new research Thursday showing that its latest model can direct robots to perform tasks they were never explicitly trained on — a capability the company’s own researchers say caught them off guard.

The new model, called π0.7, represents what the company describes as an early but meaningful step toward the long-sought goal of a general-purpose robot brain: one that can be pointed at an unfamiliar task, coached through it in plain language, and trusted to actually pull it off. If the findings hold up to scrutiny, they suggest that robotic AI may be approaching an inflection point similar to what the field saw with large language models — where capabilities begin compounding in ways that outpace what the underlying data would seem to predict.

The core claim in the paper is compositional generalization — the ability to combine skills learned in different contexts to solve problems the model has never encountered. Until now, the standard approach to robot training has been essentially rote memorization: collect data on a specific task, train a specialist model on that data, then repeat for every new task. π0.7, Physical Intelligence says, breaks that pattern.
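
To make that contrast concrete, here is a minimal sketch of the two training paradigms. Everything in it (the Episode container, train_specialist, train_generalist) is a hypothetical stand-in for illustration; Physical Intelligence has not published π0.7's training interface.

```python
# Hypothetical sketch; none of these names come from Physical Intelligence.
from dataclasses import dataclass

@dataclass
class Episode:
    task: str           # natural-language label, e.g. "close the air fryer"
    observations: list  # camera frames, joint states, etc. (stubbed out here)
    actions: list       # motor commands (stubbed out here)

def train_specialist(episodes: list[Episode], task: str) -> str:
    """The old pattern: filter to one task, fit a model that does only that task."""
    task_data = [e for e in episodes if e.task == task]
    return f"specialist policy for {task!r} ({len(task_data)} episodes)"

def train_generalist(episodes: list[Episode]) -> str:
    """The claimed new pattern: one language-conditioned policy trained on every
    episode at once, so skills seen in different contexts can recombine later."""
    return f"generalist policy ({len(episodes)} episodes, all tasks)"

episodes = [
    Episode("close the air fryer", [], []),
    Episode("place a bottle in the air fryer", [], []),
    Episode("fold the laundry", [], []),
]

# Specialist: every new task means new data collection and a new training run.
print(train_specialist(episodes, "fold the laundry"))

# Generalist: one model, steered at inference time by instructions it may
# never have seen verbatim, e.g. "cook a sweet potato in the air fryer".
print(train_generalist(episodes))
```

The "more than linearly" scaling Levine describes below is the hope that the second pattern keeps paying off as episodes accumulate, the way it did for language and vision models.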

“Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways,” says Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor focused on AI for robotics, “the capabilities are going up more than linearly with the amount of data. That much more favorable scaling property is something we’ve seen in other domains, like language and vision.”

The paper’s most striking demonstration involves an air fryer the model had essentially never seen in training. When the research team investigated, they found only two relevant episodes in the entire training dataset: one in which a different robot merely pushed the air fryer closed, and one from an open-source dataset in which yet another robot, following a person’s instructions, placed a plastic bottle inside one. The model had somehow synthesized those fragments, plus broader web-based pretraining data, into a functional understanding of how the appliance works.

“It’s very hard to track down where the knowledge is coming from, or where it will succeed or fail,” says Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. Still, with zero coaching, the model made a passable attempt at using the appliance to cook a sweet potato. With step-by-step verbal instructions — essentially, a human walking the robot through the task the way you might explain something to a new employee — it performed successfully.

That coaching capability matters because it suggests robots could be deployed in new environments and improved in real time without additional data collection or model retraining.

So what does it all mean? The researchers aren’t shy about the model’s limitations and are careful not to get ahead of themselves. In at least one case, they point the finger squarely at their own team.

“Sometimes the failure mode is not on the robot or on the model,” Balakrishna says. “It’s on us. Not being good at prompt engineering.” He describes an early air fryer experiment that produced a 5% success rate. After the team spent about half an hour refining how the task was explained to the model, the rate jumped to 95%, he says.

Image Credits: Physical Intelligence

The model also isn’t yet capable of executing complex multi-step tasks autonomously from a single high-level command. “You can’t tell it, ‘Hey, go make me some toast,’” Levine says. “But if you walk it through — ‘for the toaster, open this part, push that button, do this’ — then it actually tends to work pretty well.”
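
A minimal sketch of that coaching loop follows, again under stated assumptions: the policy function here is an illustrative stand-in for a language-conditioned robot policy, not Physical Intelligence's actual interface.

```python
# Hypothetical sketch of step-by-step verbal coaching; `policy` is an
# illustrative stand-in, not Physical Intelligence's actual model API.

def policy(instruction: str, observation: dict) -> str:
    # Stand-in for a language-conditioned policy that maps an instruction
    # plus the current sensor state to an action.
    return f"executing: {instruction!r}"

observation = {"camera": None, "joints": None}  # stubbed sensor state

# Per the article, a single high-level command tends to fail today:
#   policy("go make me some toast", observation)

# What reportedly works: a human decomposes the task into small steps and
# issues them one at a time, watching the robot between steps.
for step in [
    "open the top of the toaster",
    "put the bread in the slot",
    "push the lever down",
]:
    print(policy(step, observation))
```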

The team also acknowledges that standardized benchmarks for robotics don’t really exist, which makes external validation of their claims difficult. Instead, the company measured π0.7 against its own previous specialist models — purpose-built systems trained on individual tasks — and found that the generalist model matched their performance across a range of complex work including making coffee, folding laundry, and assembling boxes.

What may be most notable about the research — if you take the researchers at their word — is not any single demo but the degree to which the results surprised them, people whose job it is to know exactly what is in the training data and therefore what the model should and shouldn’t be able to do.

“My experience has always been that when I deeply know what’s in the data, I can kind of just guess what the model will be able to do,” Balakrishna says. “I’m rarely surprised. But the last few months have been the first time where I’m genuinely surprised. I just bought a gear set randomly and asked the robot, ‘Hey, can you rotate this gear?’ And it just worked.”

Levine recalls the moment researchers first encountered GPT-2 generating a story about unicorns in the Andes. “Where the heck did it learn about unicorns in Peru?” he says. “That’s such a weird combination. And I think that seeing that in robotics is really special.”

Naturally, critics will point to an uncomfortable asymmetry here: Language models had the entire internet to learn from. Robots don’t, and no amount of clever prompting fully closes that gap. But when asked where he expects the skepticism, Levine points somewhere else entirely.

“The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring,” he says. “The robot is not doing a backflip.” He pushes back on that framing, arguing that the distinction between an impressive robot demo and a robotic system that actually generalizes is precisely the point. Generalization, he suggests, will always look less dramatic than a carefully choreographed stunt — but it is considerably more useful.

The paper itself uses careful hedging language throughout, describing π0.7 as showing “early signs” of generalization and “initial demonstrations” of new capabilities. These are research results, not a deployed product, and Physical Intelligence has been restrained from the start about commercial timelines.

When asked directly when a system based on these findings might be ready for real-world deployment, Levine declines to speculate. “I think there’s good reason to be optimistic, and certainly it’s progressing faster than I expected a couple of years ago,” he says. “But it’s very hard for me to answer that question.”

Physical Intelligence has raised over $1 billion to date and was most recently valued at $5.6 billion. A significant part of the investor enthusiasm around the company traces to Lachy Groom, a co-founder who spent years as one of Silicon Valley’s most well-regarded angel investors — backing Figma, Notion, and Ramp, among others — before deciding that Physical Intelligence was the company he’d been looking for. That pedigree has helped the startup attract serious institutional money even as it has refused to offer investors a commercialization timeline.

The company is now said to be in discussions for a new round that would nearly double that figure to $11 billion. The team declined to comment.

Source: techcrunch.com