The On-Policy Startup

January 02, 2026

I’ve spent six years doing a robotics PhD, watching GPU simulation transform how we train robot policies. Recently, AI coding tools have started doing the same for software. And I think the emerging lesson is the same one robotics already learned.

Finding product-market fit (PMF) is verifiable. You know whether customers pay for your product. This makes it naturally suited to reinforcement learning: a paying customer equals positive reward; no uptake equals zero.

In RL, an agent takes actions in an environment and receives feedback that guides it toward an optimal policy. A startup does the same. It launches products in the market and receives clear signals: adoption, revenue, churn. Achieving PMF becomes a hill to climb through trial and error.

Like any RL problem, founding a startup faces the exploration-exploitation tradeoff, operates under uncertainty, and learns from experience. Both robotics RL and startup founding have been constrained by the cost of taking actions. In robotics, real-world rollouts are costly and traditional CPU-based simulators were slow. In startups, building products was expensive and slow. This led both fields to favor the same strategy: off-policy learning.

The Off-Policy Approach

In classical RL, off-policy algorithms learn from data generated by policies other than the one being optimized, reusing past experience through replay buffers. This makes them sample-efficient. The downside is added complexity and instability. But when interactions are expensive, off-policy methods make sense.
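
A minimal sketch of that pattern, using only the standard library; `env`, `behavior_policy`, and `q_update` are placeholders rather than any particular framework’s API:

```python
import random
from collections import deque

# Minimal replay buffer: store transitions from any past policy,
# then sample them repeatedly to update the current value estimates.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Off-policy loop: experience is generated once (possibly by an old or purely
# exploratory policy) and reused many times, which is where the sample
# efficiency comes from. `env`, `behavior_policy`, and `q_update` are stand-ins.
def train(env, behavior_policy, q_update, steps=10_000):
    buffer = ReplayBuffer()
    state = env.reset()
    for _ in range(steps):
        action = behavior_policy(state)
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        if len(buffer) >= 64:
            q_update(buffer.sample())   # learn from replayed, off-policy data
        state = env.reset() if done else next_state
```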

Traditional customer discovery is off-policy learning. Interviews, surveys, landing page tests: these are cheap, low-commitment actions. They gather information rather than generate reward. Building features, marketing, selling: these generate reward but cost more. You gather data under one policy to estimate value, then bootstrap into a different policy.

One might ask: isn’t customer discovery about learning the reward model, that is, figuring out what customers want? Not quite. The ultimate reward is already clear: paying customers. What customer discovery learns is a value function: an estimate of which decisions lead to that known reward.
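
In standard RL notation: the reward is given and observed directly, and what gets estimated is the expected return of acting from a given state under a policy π:

```latex
% The reward is observed directly: r_t = 1 if the customer pays, 0 otherwise.
% Customer discovery approximates the value function under a policy \pi:
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0}=s \right]
```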

The On-Policy Approach

What happens when taking an action becomes dramatically cheaper?

In 2021, NVIDIA released Isaac Gym, a GPU-accelerated physics simulator that runs thousands of robot environments in parallel. This delivered a speedup of two to three orders of magnitude. Tasks that took days now took hours.

The robotics community quickly found that on-policy algorithms like PPO began outperforming off-policy methods. With abundant data, simplicity became an advantage. The underlying idea of PPO is simple: collect rollouts with the current policy, estimate the policy gradient from them, and update to improve expected return. No replay buffers, no target networks. When rollouts are cheap, just collect fresh data and update the policy directly.
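
For concreteness, here is a minimal sketch of PPO’s clipped surrogate loss, assuming PyTorch tensors built from a freshly collected rollout; the function and argument names are illustrative, not any specific library’s API:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss on one batch of freshly collected rollout data.

    log_probs:     log pi_theta(a_t | s_t) under the policy being updated
    old_log_probs: the same quantities recorded when the rollout was collected
    advantages:    advantage estimates for each (s_t, a_t) in the batch
    """
    ratio = torch.exp(log_probs - old_log_probs)            # pi_new / pi_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the more pessimistic objective, then negate it to get a loss to minimize.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Training alternates: collect a fresh batch of rollouts with the current policy,
# take a few gradient steps on this loss, throw the data away, and repeat.
```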

Software is undergoing the same shift. AI coding tools have made building MVPs dramatically cheaper and faster. What took a team months can be prototyped by one developer with an AI assistant in days.

The lean startup methodology is losing its edge. Why spend weeks conducting interviews if you can build a basic app in 48 hours and observe user behavior directly? Showing beats telling. Actual payment is the ground truth that off-policy methods only approximate.

This leads to a provocative implication: finding PMF could be done algorithmically, minimizing the role of human intuition.

A startup-agent could try many product variants and directly observe reward. Different features, markets, pricing models. The agent doesn’t need to understand why customers want something. It just detects reward and updates its policy.
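
As a toy illustration only: the idea reduces to something like a multi-armed bandit over product variants. `launch_variant` is a hypothetical stand-in for “build, ship, and measure revenue”; nothing here is existing tooling:

```python
import random

# Hypothetical sketch of the "startup agent" as an epsilon-greedy bandit.
# Each arm is a product variant; the reward is observed revenue per launch.
def find_pmf(variants, launch_variant, rounds=200, epsilon=0.1):
    counts = {v: 0 for v in variants}
    value = {v: 0.0 for v in variants}      # running average revenue per variant
    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.choice(variants)                  # explore a new bet
        else:
            arm = max(variants, key=lambda v: value[v])    # exploit the best so far
        revenue = launch_variant(arm)        # ship the variant, observe payment
        counts[arm] += 1
        value[arm] += (revenue - value[arm]) / counts[arm]  # no need to know *why*
    return max(variants, key=lambda v: value[v])
```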

We already have the pieces. LLMs write code. Analytics track everything. AI can handle customer interactions. An agent could generate product ideas, spin up MVPs, run parallel experiments, and observe results at a scale no human founder could match.

Instead of needing a genius visionary, you throw computing power at the problem. The founders of the future may need to be experiment designers more than visionaries. The vision will emerge from sheer volume of experiments.

The Human Role

Even when rollouts are cheap, some problems remain hard: these become the founder’s real job.

The reward signal for PMF is clear but sparse. Most experiments yield nothing. Random guessing gets you nowhere. This is where reward shaping comes in: designing intermediate signals that guide the search toward the sparse ultimate reward. Leading indicators like engagement, activation, and retention. The founder’s job becomes crafting a reward function dense enough to learn from but still aligned with the true objective.
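
A sketch of what such a shaped reward might look like; the metric names and weights are invented for illustration, and keeping them aligned with actual payment is exactly the part that stays hard:

```python
# Hypothetical shaped reward: the sparse true signal (payment) plus small,
# dense leading indicators. Names and weights are illustrative only.
def shaped_reward(metrics):
    dense = (
        0.05 * metrics["activation_rate"]       # signed up and reached first value
        + 0.10 * metrics["weekly_retention"]    # came back the next week
        + 0.02 * min(metrics["engagement_minutes"] / 60.0, 1.0)
    )
    sparse = 1.0 if metrics["paid_conversion"] else 0.0   # the true objective
    return sparse + dense
```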

There’s also curriculum design. Complex tasks are learned by starting simple and progressively increasing difficulty. You don’t train a robot to run before it can stand. The startup equivalent: find PMF in a narrow niche before expanding. Validate individual modules before optimizing the whole ecosystem.
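
A hypothetical curriculum, sketched as staged objectives with graduation criteria; the stages and thresholds are made up for illustration:

```python
# Each stage is a narrower problem with its own graduation criterion;
# the agent only moves on after clearing it.
CURRICULUM = [
    {"stage": "one niche, one feature",        "graduate_at": 10},    # paying users
    {"stage": "adjacent niches, same feature", "graduate_at": 100},
    {"stage": "broad market, full product",    "graduate_at": 1000},
]

def current_stage(paying_users):
    for step in CURRICULUM:
        if paying_users < step["graduate_at"]:
            return step["stage"]
    return CURRICULUM[-1]["stage"]
```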

And there’s the search space itself. What actions are available? B2B or B2C. Which platforms. What pricing models. These architectural choices constrain what optimization can find. The agent searches within the space. The founder defines the space.
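
Sketching that boundary: the founder’s choices might literally be the enumeration the agent searches over. The dimensions and options below are illustrative, not recommendations:

```python
from itertools import product

# The founder defines the space; the agent searches inside it.
SEARCH_SPACE = {
    "segment":  ["B2B", "B2C"],
    "platform": ["web", "mobile", "API"],
    "pricing":  ["free + ads", "subscription", "usage-based"],
}

variants = list(product(*SEARCH_SPACE.values()))   # 2 * 3 * 3 = 18 candidates
# Everything the optimization can ever find lives inside these combinations;
# changing the dimensions changes what is findable at all.
```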

The human role evolves from “person who figures out what customers want” to “person who structures the problem so optimization can figure it out.”

The Bitter Lesson

Rich Sutton’s “Bitter Lesson” observes that across AI history, general methods leveraging computation consistently outperform human-engineered knowledge. Chess, Go, computer vision, NLP. Scale and learning beat clever priors. The lesson is bitter because it sidelines human expertise.

We may be approaching a similar moment in entrepreneurship.

The human-centric approach is the art of startups: brilliant insight, customer empathy, crafted solutions. The emerging challenger is scale-driven: massive experimentation, letting data reveal what works. Brute-forcing the space of business ideas.

Human-driven customer discovery seems beneficial in the short term. It’s a useful prior when experiments are expensive. But as the cost of building and testing approaches zero, that prior becomes a bottleneck.

The search for product-market fit will become what it always was beneath the surface: an optimization problem. And optimization problems solve themselves.