March 1, 2026 · News

Manifold on Theory of Space: 97% accuracy with structured sensor input, where frontier LLMs collapse to 4-10%

Niva ran Manifold on the Theory of Space (ToS) benchmark, a spatial reasoning evaluation introduced by Stanford and Northwestern at ICLR 2026. The benchmark tests whether AI agents can build cognitive maps through active exploration of partially observable environments and answer nine types of spatial reasoning queries. Until now, every published baseline was a frontier LLM consuming text observations. Manifold is the first non-LLM system evaluated on it.

Headline numbers

With structured sensor input, Manifold scores 97% on spatial reasoning tasks.
With text-equivalent input (apples-to-apples with LLMs), Manifold scores 68%, between GPT-5.2 (72%) and Claude-4.5 Sonnet (66%).
When we gave frontier LLMs the same structured sensor data Manifold uses, accuracy collapsed to 4-10%, a drop of roughly 60 points from their text baselines of 72% and 66%.
Chain-of-thought prompting did not recover LLM performance. In most conditions, it made accuracy worse.

Sensor fidelity is the dominant variable

Within Manifold, holding the reasoning pipeline constant and varying only the observation input produced a 29-point accuracy jump, from 68% with text-equivalent input to 97% with structured input. Same architecture, same code, same evaluation. The only thing that changed was the quality of the sensor data feeding the world model.

This is the core argument for physics-native architecture over LLMs in robotics. Real-world deployment runs on real sensors: depth cameras, LiDAR, IMUs, force-torque sensors. Each carries precise structured data. A system that can use that data is operating in a fundamentally different regime than one that consumes text descriptions of it.

LLMs cannot process precise sensor data

The cross-architecture pilot is the sharper finding. We gave GPT-5.2 and Claude-4.5 Sonnet the exact structured pose data Manifold uses to reach 97%, formatted as natural-language descriptions with precise coordinates, bearings in degrees, and exact distances. Both models scored 4-10% across two prompting conditions. With chain-of-thought (CoT) prompting, accuracy got worse in most conditions. The LLMs attempted geometric reasoning in their CoT traces and produced systematic computational errors in bearing arithmetic and distance comparison.
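
For reference, the bearing arithmetic and distance comparison mentioned above reduce to a few lines of deterministic geometry. The following is a minimal Python sketch, not Manifold's implementation: the function names, the 2D (x, y) pose format, and the compass convention (0 degrees along +y, clockwise positive) are all assumptions made for illustration, and the benchmark's actual conventions may differ.

```python
import math


def distance(a, b):
    """Euclidean distance between two 2D points given as (x, y)."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def bearing_deg(origin, target):
    """Absolute bearing from origin to target, in degrees in [0, 360).

    Assumed convention: 0 degrees points along +y, angles grow clockwise.
    """
    dx = target[0] - origin[0]
    dy = target[1] - origin[1]
    # atan2(dx, dy) measures the angle from the +y axis, clockwise positive.
    return math.degrees(math.atan2(dx, dy)) % 360.0


def relative_bearing_deg(heading_deg, origin, target):
    """Bearing of target relative to the agent's heading, in (-180, 180].

    Negative values are to the agent's left, positive to its right.
    """
    diff = (bearing_deg(origin, target) - heading_deg) % 360.0
    return diff - 360.0 if diff > 180.0 else diff


def closer_object(origin, a, b):
    """Return "a" or "b" depending on which point is nearer to origin."""
    return "a" if distance(origin, a) <= distance(origin, b) else "b"
```

Under these assumptions, an agent at the origin heading 90 degrees sees an object at (3, 4) exactly 5 units away, at a relative bearing of about -53.13 degrees, that is, to its left. Each step is a closed-form computation rather than token-by-token arithmetic.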

This is not a prompting problem or a fine-tuning problem. The autoregressive transformer architecture cannot perform spatial constraint satisfaction over precise coordinates regardless of how the input is framed. It is a fundamental computational limitation of the substrate.

What this means for deployment

Manifold's pipeline is deterministic: the same input always produces the same output, bitwise. It is noise-tolerant: accuracy drops by under one percentage point at 10-20x real sensor noise. It runs in well under 50 ms end-to-end. These are deployment requirements, not research metrics. The Theory of Space result is the strongest external evidence that the architecture meets them on a problem class that frontier LLMs cannot solve at scale.
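
To make "bitwise" concrete, here is one way such a repeatability check could be written. This is a hedged sketch under stated assumptions, not Niva's test harness: `toy_pipeline` is a hypothetical stand-in for a geometry pipeline, and the helper names are invented for illustration. The idea is to compare the raw IEEE-754 bytes of the output across runs rather than rounded or printed values.

```python
import hashlib
import struct


def toy_pipeline(observations):
    """Hypothetical stand-in for a deterministic pipeline.

    Takes a list of (x, y) observations and returns their centroid.
    """
    n = len(observations)
    cx = sum(x for x, _ in observations) / n
    cy = sum(y for _, y in observations) / n
    return (cx, cy)


def output_digest(result):
    """Hash the exact bit pattern of a tuple of floats."""
    # Pack as little-endian IEEE-754 doubles so the comparison is on bits,
    # not on a rounded text representation.
    payload = struct.pack(f"<{len(result)}d", *result)
    return hashlib.sha256(payload).hexdigest()


def is_bitwise_repeatable(pipeline, observations, runs=5):
    """True if every run on the same input yields identical output bits."""
    digests = {output_digest(pipeline(observations)) for _ in range(runs)}
    return len(digests) == 1
```

Calling `is_bitwise_repeatable(toy_pipeline, [(0.1, 0.2), (0.3, 0.4)])` returns `True`, while any pipeline whose output varies between calls on identical input would fail the check.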

The full technical report is available on the Research page: https://www.nivatech.io/research/research-structured-state-with-higher-fidelity-observations-dramatically-improves-spatial-reasoning

One architecture, multiple domains

Manifold wasn't designed for the Theory of Space benchmark. It was built around constitutive physics that runs continuously and a deterministic world model that integrates whatever sensor data is available. The benchmark exercises a slice of that capability, the spatial reasoning slice, and shows what happens when a deterministic geometry pipeline meets precise structured input. The result generalizes: anywhere real sensor data exists and physical reasoning is required, the same advantage applies.