WarmPrior:
Straightening Flow-Matching Policies with Temporal Priors

Sinjae Kang¹, Chanyoung Kim¹, Kaixin Wang², Li Zhao², Kimin Lee¹
¹KAIST, ²Microsoft Research

TL;DR: Replace the standard N(0, I) source with a temporally grounded prior anchored on the agent's recent action history. The flow becomes shorter and straighter, recovering the path-straightening benefit of optimal-transport couplings without any OT solver.

WarmPrior teaser

WarmPrior. Standard flow matching transports samples from a context-free N(0, I) onto the action manifold (left). WarmPrior anchors the source on the recent past chunk (WP-Past) or on the model's own previous forecast (WP-Preview) (middle, right). The probability path is shorter, straighter, and temporally consistent across consecutive chunks.

Abstract

Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.

Method

WarmPrior modifies only the source distribution. The network, interpolant, and training objective are untouched. We instantiate it in two minimal variants that differ in how the prior mean is anchored to the agent's own action history.

WP-Past

Use the previously executed action chunk as the prior mean. Consecutive chunks tend to be similar, so the flow starts close to the target.

\( a_0 \;=\; \hat{a}^{\mathrm{prev}} + \sigma\,\varepsilon \)

Here ε ∼ N(0, I) and σ sets the noise scale. The policy predicts H actions per decision step.
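As a minimal sketch of the WP-Past source, the snippet below draws a_0 around the previously executed chunk and compares how far it sits from a nearby target chunk against a standard N(0, I) draw. The chunk sizes, the value of σ, and the toy target are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def gaussian_source(shape, rng):
    """Standard context-free source: a_0 ~ N(0, I)."""
    return rng.standard_normal(shape)

def wp_past_source(prev_chunk, sigma, rng):
    """WP-Past source: a_0 = prev_chunk + sigma * eps, with eps ~ N(0, I)."""
    return prev_chunk + sigma * rng.standard_normal(prev_chunk.shape)

rng = np.random.default_rng(0)
H, action_dim, sigma = 8, 2, 0.1            # hypothetical sizes and noise scale
prev_chunk = np.full((H, action_dim), 0.5)  # previously executed chunk (toy values)
target = prev_chunk + 0.05                  # next chunk, assumed close to the previous one

a0_base = gaussian_source(prev_chunk.shape, rng)
a0_warm = wp_past_source(prev_chunk, sigma, rng)

# Length of the source-to-target displacement the flow has to cover.
dist_base = np.linalg.norm(target - a0_base)
dist_warm = np.linalg.norm(target - a0_warm)
print(f"N(0, I): {dist_base:.2f}   WP-Past: {dist_warm:.2f}")
```

In this toy setting the WP-Past draw starts much closer to the target, which is the intuition behind the shorter flow.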

WP-Preview

Predict 2H actions, execute only the first H, and keep the second H as a preview of the next chunk. At the next decision step, that preview becomes the prior mean.

\( a_0[0{:}H] \;=\; \hat{a}^{\mathrm{prev}}[H{:}2H] + \sigma\,\varepsilon \)

Predicts 2H, executes first H.
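The WP-Preview loop can be sketched roughly as follows. This is a hedged illustration: the Euler sampler, the toy straight-line velocity field standing in for the trained network, the N(0, I) fallback at the first step, and the use of plain N(0, I) noise for the second half of the source are all assumptions made for the example, since only the first H steps of the source are specified above.

```python
import numpy as np

def euler_sample(velocity_fn, a0, nfe):
    """Integrate da/dt = v(a, t) from t = 0 to t = 1 with `nfe` Euler steps."""
    a, dt = a0.copy(), 1.0 / nfe
    for k in range(nfe):
        a = a + dt * velocity_fn(a, k * dt)
    return a

def wp_preview_source(prev_forecast, H, sigma, rng):
    """First H steps: previous preview plus noise. Last H steps: N(0, I) (assumption)."""
    a0 = rng.standard_normal(prev_forecast.shape)
    a0[:H] = prev_forecast[H:2 * H] + sigma * a0[:H]
    return a0

rng = np.random.default_rng(0)
H, action_dim, sigma, nfe = 4, 2, 0.1, 3    # hypothetical settings
forecast, executed = None, []
for step in range(3):
    # Toy stand-in for the trained network: a straight-line field toward a fixed target.
    target = np.full((2 * H, action_dim), float(step))
    velocity_fn = lambda a, t, target=target: (target - a) / (1.0 - t)
    if forecast is None:
        a0 = rng.standard_normal((2 * H, action_dim))  # no history yet: fall back to N(0, I)
    else:
        a0 = wp_preview_source(forecast, H, sigma, rng)
    forecast = euler_sample(velocity_fn, a0, nfe)      # 2H predicted actions
    executed.append(forecast[:H])                      # execute only the first H
```

The second half of each forecast is never executed directly; it only seeds the source at the next decision step.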

Main Results

Success rate (%) with the Diffusion Policy (ChiTransformer) backbone.
Parentheses: gain over N(0, I). Green: non-overlapping 1σ seed intervals. Bold: best per (task, NFE).

Task          | NFE = 9                          | NFE = 3                          | NFE = 1
              | Base  WP-Past       WP-Preview   | Base  WP-Past       WP-Preview   | Base  WP-Past       WP-Preview
Robomimic (state observation)
Square-PH     | 86.7  88.1 (+1.4)   88.1 (+1.4)  | 86.2  88.0 (+1.8)   87.9 (+1.7)  | 83.6  86.6 (+3.0)   87.3 (+3.7)
Square-MH     | 65.9  69.2 (+3.3)   72.7 (+6.8)  | 65.4  73.2 (+7.8)   72.9 (+7.5)  | 65.9  70.1 (+4.2)   77.8 (+11.9)
Transport-PH  | 34.1  36.2 (+2.1)   43.3 (+9.2)  | 39.0  44.0 (+5.0)   49.1 (+10.1) | 36.8  39.8 (+3.0)   47.6 (+10.8)
Transport-MH  | 16.3  20.7 (+4.4)   24.3 (+8.0)  | 21.3  30.7 (+9.4)   30.4 (+9.1)  | 23.3  30.2 (+6.9)   34.5 (+11.2)
Tool-Hang-PH  | 79.4  80.6 (+1.2)   82.8 (+3.4)  | 72.3  75.1 (+2.8)   75.8 (+3.5)  | 77.7  78.2 (+0.5)   81.9 (+4.2)
Robomimic (image observation)
Square-PH     | 86.9  88.2 (+1.3)   88.7 (+1.7)  | 87.7  89.2 (+1.4)   89.6 (+1.9)  | 88.7  89.3 (+0.6)   89.1 (+0.4)
Square-MH     | 76.1  78.0 (+1.9)   77.8 (+1.7)  | 73.8  77.9 (+4.1)   77.1 (+3.2)  | 72.4  77.6 (+5.2)   75.1 (+2.7)
Transport-PH  | 92.8  94.5 (+1.7)   94.3 (+1.6)  | 92.1  93.9 (+1.9)   94.9 (+2.9)  | 91.3  93.4 (+2.2)   93.7 (+2.4)
Transport-MH  | 74.8  79.7 (+4.9)   79.8 (+4.9)  | 73.8  80.0 (+6.2)   80.7 (+6.9)  | 74.3  78.6 (+4.3)   79.7 (+5.4)
Tool-Hang-PH  | 43.7  45.8 (+2.1)   56.3 (+12.6) | 36.9  38.4 (+1.4)   50.7 (+13.8) | 41.3  38.9 (−2.4)   54.0 (+12.7)
MimicGen (image observation)
Stack         | 21.4  22.8 (+1.4)   31.6 (+10.2) | 21.3  23.7 (+2.4)   30.7 (+9.4)  | 21.3  22.4 (+1.1)   28.7 (+7.4)
Coffee        | 26.8  29.6 (+2.8)   34.7 (+7.9)  | 23.3  24.1 (+0.8)   33.4 (+10.1) | 16.2  20.4 (+4.2)   29.4 (+13.2)
Threading     | 13.8  15.5 (+1.7)   20.9 (+7.1)  | 16.3  16.6 (+0.3)   22.0 (+5.7)  | 12.5  15.6 (+3.1)   18.0 (+5.5)

Why It Works: Straighter Flows

The baseline paths curve noticeably as the network pulls samples from a random origin onto the action manifold; WarmPrior paths, already starting close to the manifold, are visibly straighter and more parallel. Because fewer flows cross one another, the network spends less capacity realigning samples from the random base distribution and can devote more to refining actions, exactly where it matters for downstream success rate.

Pathwise curvature (lower is straighter)

Task           N(0, I)   WP-Past   WP-Preview
Square-PH      1.000     0.823     0.803
Square-MH      1.000     0.705     0.559
Transport-PH   1.000     0.720     0.692
Transport-MH   1.000     0.695     0.637
Tool-Hang-PH   1.000     0.806     0.807

Tasks with the largest curvature reduction (Square-MH, Transport-MH) also show the largest success-rate gain.
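To make the size of the reduction concrete, the short script below converts the normalized curvature values from the table into percentage reductions relative to the N(0, I) baseline and ranks tasks by the WP-Preview reduction. It only restates the table's numbers; the variable names are illustrative.

```python
# Normalized pathwise curvature from the table: task -> (WP-Past, WP-Preview).
# The N(0, I) baseline is 1.000 for every task.
curvature = {
    "Square-PH": (0.823, 0.803),
    "Square-MH": (0.705, 0.559),
    "Transport-PH": (0.720, 0.692),
    "Transport-MH": (0.695, 0.637),
    "Tool-Hang-PH": (0.806, 0.807),
}

# Percentage reduction relative to the baseline, rounded to one decimal.
reduction = {
    task: tuple(round(100 * (1 - c), 1) for c in values)
    for task, values in curvature.items()
}

# Rank tasks by the WP-Preview reduction, largest first.
ranked = sorted(reduction, key=lambda task: reduction[task][1], reverse=True)
for task in ranked:
    past, preview = reduction[task]
    print(f"{task:<13} WP-Past: {past:4.1f}%   WP-Preview: {preview:4.1f}%")
```

Square-MH and Transport-MH come out on top, with WP-Preview reductions of 44.1% and 36.3% respectively.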

Flow trajectories on Square-MH

Flow paths on Square-MH. Normalized action coordinate vs. denoising time t ∈ [0, 1]. WarmPrior paths are straighter and more parallel than the baseline.

We make this quantitative through a branching cost bound:

\( \mathcal{B}_{\mathcal{W}}(o) \;\le\; \mathbb{E}\bigl[\|P_{\mathcal{W}}(A_1 - \mu)\|^2 \mid o\bigr] + \sigma^2 d_{\mathcal{W}} \)

A better prior mean μ tightens the first term, while the noise scale σ² governs the second. The two interact through a non-trivial trade-off: too small a σ concentrates the source on μ and is fragile when μ is imperfect, while too large a σ leaves the field less tightly straightened. See Appendix B of the paper for the full derivation.
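One way to build intuition for the two terms is a quick Monte Carlo check. The sketch below is not the paper's derivation: it assumes that P_W is an orthogonal projector onto a d_W-dimensional subspace, that the source is A_0 = μ + σε with ε ∼ N(0, I) independent of the target A_1, and it simply verifies that the projected squared source-to-target displacement splits into a mean-mismatch term plus σ² d_W. All dimensions and scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_w, sigma, n = 6, 3, 0.5, 200_000       # hypothetical dimensions and noise scale

# Orthogonal projector onto a random d_w-dimensional subspace (assumed form of P_W).
Q, _ = np.linalg.qr(rng.standard_normal((d, d_w)))
P = Q @ Q.T

mu = rng.standard_normal(d)                     # prior mean
a1 = mu + 0.3 * rng.standard_normal((n, d))     # toy targets scattered around mu
a0 = mu + sigma * rng.standard_normal((n, d))   # WarmPrior-style source samples

# P is symmetric, so right-multiplying each row by P applies the projection.
lhs = np.mean(np.sum(((a1 - a0) @ P) ** 2, axis=1))
mean_term = np.mean(np.sum(((a1 - mu) @ P) ** 2, axis=1))
rhs = mean_term + sigma**2 * d_w

print(f"E||P(A1 - A0)||^2 = {lhs:.3f}")
print(f"E||P(A1 - mu)||^2 + sigma^2 d_W = {rhs:.3f}")
```

Under these assumptions the cross term vanishes because ε has zero mean and is independent of A_1, which leaves exactly the two terms that appear on the right-hand side of the bound.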

A Tunable Knob for Temporal Consistency

Setup. A 1D navigation toy where the observation o is the agent's horizontal position and the action is its vertical height. Demonstrations split evenly between passing above and below each obstacle, so the conditional p(a | o) is multimodal at every position. A regression policy collapses to the mean and drives straight through the obstacle (panel b).

Mode switching. A flow-matching policy recovers both modes at each inference call (panel c), but the standard objective places no constraint linking consecutive chunks, so the policy can flip between modes at every chunk boundary. We call this pathology mode switching. Action chunking commits to a mode within a chunk but does not stop the oscillation across chunks. Naive history conditioning (panel d) eliminates the flips only by pinning each rollout to a single mode, which sacrifices multimodality, and it is also fragile when the action history seen at inference time drifts from the training distribution.

σ as a knob. Because the WarmPrior mean is correlated with the previous chunk, a small σ keeps the new chunk in the nearby mode's basin and prevents crossings between distant modes (panel e); a large σ broadens the source and recovers the multimodal distribution at every step (panel f). σ thus becomes a continuous regulator between temporal consistency and multimodal expressiveness, an implicit alternative to history conditioning that does not pin the policy to a single mode.
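The effect of σ on mode switching can be illustrated with a deliberately crude simulation. It assumes two action modes at +1 and −1 and an idealized flow that sends each source sample to the nearest mode; neither assumption comes from the paper, and the function name and σ values are hypothetical.

```python
import numpy as np

def switch_rate(sigma, n_chunks, rng, warm=True):
    """Fraction of chunk boundaries at which the selected mode flips."""
    mode, switches = 1.0, 0
    for _ in range(n_chunks):
        if warm:
            a0 = mode + sigma * rng.standard_normal()  # source anchored on the previous mode
        else:
            a0 = rng.standard_normal()                 # context-free N(0, 1) source
        new_mode = 1.0 if a0 >= 0 else -1.0            # idealized flow: snap to the nearest mode
        switches += int(new_mode != mode)
        mode = new_mode
    return switches / n_chunks

rng = np.random.default_rng(0)
rate_small = switch_rate(0.3, 20_000, rng)             # small sigma: stays in the current basin
rate_large = switch_rate(3.0, 20_000, rng)             # large sigma: both modes reachable
rate_base = switch_rate(1.0, 20_000, rng, warm=False)  # baseline: flips about half the time
print(rate_small, rate_large, rate_base)
```

In this toy, a small σ makes flips rare, a large σ approaches the baseline's behavior, and intermediate values interpolate between the two.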

Mode switching toy

1D navigation toy. (a) demonstrations; (b) regression collapses to the mean; (c) flow matching recovers both modes but oscillates; (d) history conditioning commits but drifts under inference-time history shift; (e, f) WarmPrior with small / large σ tunes between consistency and multimodality.

Real-Robot Experiments

Four tabletop tasks on a Franka Research 3 using the DROID setup, with the GR00T N1.5 VLA backbone.
30 demos per task, 50 trials per seed, 3 seeds, NFE = 4.

Real robot tasks

Four tabletop tasks: Food Waste Disposal, Cup Stacking, Block Stacking, Cable Insertion.

Real robot results

WarmPrior consistently improves success rate. Gains are largest on the harder tasks (Cable Insertion, Block Stacking) where endpoint ambiguity is highest, mirroring the simulation pattern.

WarmPrior Improves Prior-Space RL Efficiency

DSRL fine-tunes a frozen diffusion policy by acting in the prior space: the RL agent proposes a prior sample that the ODE sampler maps to an action. The exploration space, however, is still the uninformative N(0, I), an unstructured noise space the agent must search from scratch. WarmPrior offers a semantically meaningful prior space: because μ is already a temporally grounded plausible action, searching around it is far more productive than exploring random noise.
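A rough sketch of the idea follows, with several loud assumptions: the frozen policy is replaced by a toy velocity field that only partially contracts toward a target action, the RL agent's proposal w is taken to be a bounded vector, and the WarmPrior variant is parameterized as μ + σw. This parameterization is one plausible reading of "searching around μ", not a confirmed detail of the method.

```python
import numpy as np

def frozen_sampler(a0, velocity_fn, nfe):
    """Map a prior-space sample to an action with `nfe` Euler steps of a fixed field."""
    a, dt = a0.copy(), 1.0 / nfe
    for k in range(nfe):
        a = a + dt * velocity_fn(a, k * dt)
    return a

target = np.array([2.0, -1.5])                  # toy "good" action
velocity_fn = lambda a, t: 0.5 * (target - a)   # toy frozen field, contracts only partially

mu = target + 0.1            # temporally grounded prior mean, assumed near the target
sigma = 0.2                  # hypothetical noise scale
w = np.array([0.8, -0.6])    # bounded proposal from the RL agent (hypothetical)

# Standard prior-space RL: the proposal itself is the source sample.
action_base = frozen_sampler(w, velocity_fn, nfe=4)
# Assumed WarmPrior variant: the proposal is a residual around mu.
action_warm = frozen_sampler(mu + sigma * w, velocity_fn, nfe=4)

err_base = np.linalg.norm(action_base - target)
err_warm = np.linalg.norm(action_warm - target)
print(f"baseline error: {err_base:.3f}   WarmPrior error: {err_warm:.3f}")
```

For the same proposal w, the sample anchored on μ lands closer to the target in this toy, which is the sense in which exploring around a plausible action can be more productive than exploring unstructured noise.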

WP-Past and WP-Preview learn faster, converge more stably, and reach higher asymptotes than DSRL-SAC and DSRL-NA: both clear 99% on Square and reach ~97% on Transport, while DSRL baselines plateau around 90%. To our knowledge this is the first stable result above 95% success on Transport, the hardest Robomimic task, by RL fine-tuning a flow-matching policy.

DSRL vs WarmPrior

Prior-space RL on Robomimic Square and Transport.
Mean ± 1σ over 3 seeds.

BibTeX

@article{kang2026warmprior,
  title     = {WarmPrior: Straightening Flow-Matching Policies with Temporal Priors},
  author    = {Kang, Sinjae and Kim, Chanyoung and Wang, Kaixin and Zhao, Li and Lee, Kimin},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2026}
}