Optimizing Flow Policies
University of California, Los Angeles
May 2026
Flow matching (FM) is an increasingly popular way to parameterize control policies, due to its expressivity in modeling complex, multimodal joint distributions and its ability to leverage iterative computation. However, it also comes with difficulties: common techniques for both optimization and exploration are closely tied to Gaussian policies and do not generalize well. For example, FM policies do not admit exact likelihood computations, lack a natural exploration mechanism (unlike the learnable variance of Gaussians), and are difficult to train with reparameterized policy gradients due to unstable backpropagation through time.
Here, we investigate techniques for finetuning FM policies, including mechanisms to optimize the flow policy against a critic or reward function and to induce exploration during online interaction with the environment.
Optimizing flow matching policies
Unlike Gaussian policies, which can directly parameterize the variance of the distribution for exploration and compute (and optimize) exact likelihoods, flow matching-based policies lack both an inherent exploration mechanism and a tractable likelihood. To make such policies useful for reinforcement learning, several families of methods have been developed to steer a flow-based policy towards desirable actions:
Critic gradient-based methods
These methods use the gradient of the critic either to directly optimize the parameters of the flow model, following standard reparameterized policy gradient methods, or as a score function to guide the action generation process without retraining. They are appealing because they can extrapolate outside the dataset to some degree (controlled with a BC penalty or guidance weight), providing some natural exploration and potential generalization.
- Reparameterized policy gradient: doesn't work well with many flow steps due to unstable gradients through the integration path, and is typically used with distilled shortcut models (FQL); see the sketch after this list.
- Velocity field guidance: classifier-free guidance (CFG) has been thoroughly explored in diffusion models for conditional generation. One recent approach is to add a return prediction head and learn a full base policy with imitation learning, but then simply perform advantage-guided sampling at inference time rather than attempting to directly fit a flow field to the advantage-conditioned distribution. However, this guidance is learning-free and can only find local optima of the critic, although an expressive flow policy may allow gradient ascent from multiple modes (OT-CFM guidance).
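As a concrete illustration of the reparameterized route, here is a minimal sketch that Euler-integrates a flow policy and ascends a learned critic by backpropagating through the sampler. The `velocity_net(s, a, t)` and `q_net(s, a)` signatures, step count, and action dimensionality are illustrative assumptions, not any particular paper's implementation.

```python
import torch

def sample_action(velocity_net, s, action_dim, num_steps=4):
    """Euler-integrate the learned velocity field from Gaussian noise to an action."""
    a = torch.randn(s.shape[0], action_dim, device=s.device)   # a at t=0 is pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((s.shape[0], 1), k * dt, device=s.device)
        a = a + dt * velocity_net(s, a, t)                      # gradients flow through every step
    return a

def critic_gradient_loss(velocity_net, q_net, s, action_dim):
    """Reparameterized policy gradient: backprop dQ/da through the whole integration path."""
    a = sample_action(velocity_net, s, action_dim)
    return -q_net(s, a).mean()                                  # minimize -Q, i.e., ascend the critic
```

With many integration steps, this backward pass is exactly the unstable backpropagation through time noted above, which is why methods like FQL first distill the flow into a one-step model (and add a distillation/BC penalty) before applying the critic gradient.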
In-sample optimization
In-sample optimization optimizes only within the convex hull of dataset actions, or an approximation thereof learned via supervised imitation learning. Such methods can essentially be viewed as weighted behavioral cloning, where the weight is incorporated after sampling (rejection sampling), before sampling (learning a policy over the noise space), or during sampling (guidance).
- Rejection (best-of-$N$) sampling: simply learn a BC policy and perform a round of rejection sampling using a value function as the acceptance criterion (see the sketch after this list).
- This seems to be an increasingly popular paradigm when going OOD is very harmful or naive optimization is difficult, e.g., model-based rollouts (MAC), action chunking policies (QC, DQC), or the high-level policy in hierarchies (SHARSA, hierarchical FBC), where the subgoal space may be discrete and therefore incompatible with gradient-based methods.
- Steering in the noise space: learns a policy over the noise space (DSRL), ensuring that we stay in the convex hull of the learned action space, albeit the approximate action space parameterized by the base flow policy rather than the true, non-parametric dataset (see the sketch after this list).
- Advantage conditioning: these policies (CFGRL, $\pi_{0.6}^*$) seek to address the optimization instabilities and data inefficiency of weighted regression methods such as AWR, which are known to discard or significantly downweight much of the data. To get the same effect while still learning from suboptimal data, they instead learn the full distribution with supervised BC, but add a simple binary optimality indicator $o$: $$ o = \begin{cases} 1 & \text{if } A(s, a) \geq 0 \\ 0 & \text{if } A(s, a) < 0 \end{cases} $$ This is very similar to classifier-free guidance, where a single model is trained that can optionally accept a conditioning variable. CFGRL essentially trains two policies parameterized by the same network with a binary condition: $o=0$ is pure BC, and $o=1$ uses the same CFM loss but only actions with nonnegative advantages contribute (the loss is simply masked out for negative-advantage actions). Then, at test time, the velocity at every step is a weighted combination of the velocities of $\pi(a \mid s, o=0)$ and $\pi(a \mid s, o=1)$, allowing one to sweep across guidance weights to control the degree of exploitation without retraining (see the sketch after this list).
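First, a minimal best-of-$N$ rejection sampling sketch: draw several candidates from a frozen BC policy and keep the one the value function scores highest. The `sample_fn` and `q_net` interfaces are assumptions (any sampler works, e.g., the Euler sampler sketched earlier).

```python
import torch

def best_of_n_action(sample_fn, q_net, s, n=16):
    """Draw N candidate actions per state from a frozen BC policy, keep the highest-Q one.

    sample_fn(s) should return one action per row of s.
    """
    s_rep = s.repeat_interleave(n, dim=0)                 # (B*N, state_dim)
    with torch.no_grad():
        a_rep = sample_fn(s_rep)                          # (B*N, action_dim)
        q = q_net(s_rep, a_rep).view(s.shape[0], n)       # (B, N) values per candidate
    best = q.argmax(dim=1)                                # index of the accepted candidate per state
    idx = torch.arange(s.shape[0], device=s.device)
    return a_rep.view(s.shape[0], n, -1)[idx, best]
```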
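Next, a sketch of noise-space steering in the spirit of DSRL, simplified here to a reparameterized critic-ascent objective over the initial noise $z$ (DSRL itself can instead treat $z$ as the action of a standard off-policy RL algorithm). Network shapes, step count, and clamping ranges are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoisePolicy(nn.Module):
    """Gaussian policy over the flow's initial noise z; the base flow policy stays frozen."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * action_dim))

    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return mu, log_std.clamp(-5.0, 2.0)

def noise_steering_loss(noise_policy, frozen_velocity_net, q_net, s, num_steps=4):
    """Choose z so the frozen flow decodes it into a high-Q (but in-distribution) action."""
    mu, log_std = noise_policy(s)
    z = mu + log_std.exp() * torch.randn_like(mu)          # reparameterized noise sample
    a, dt = z, 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((s.shape[0], 1), k * dt, device=s.device)
        a = a + dt * frozen_velocity_net(s, a, t)           # gradients reach z; flow params are not updated
    return -q_net(s, a).mean()
```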
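Finally, a sketch of CFG-style guided sampling for the advantage-conditioned policy, assuming a hypothetical `velocity_net(s, a, t, o)` trained with the masked CFM loss described above; the guidance weight $w$ and step count are placeholders. $w=0$ recovers pure BC, $w=1$ the optimality-conditioned policy, and $w>1$ extrapolates further.

```python
import torch

def guided_sample(velocity_net, s, action_dim, w=2.0, num_steps=8):
    """CFG-style sampling: blend the BC (o=0) and optimality-conditioned (o=1) velocities."""
    a = torch.randn(s.shape[0], action_dim, device=s.device)
    dt = 1.0 / num_steps
    o0 = torch.zeros(s.shape[0], 1, device=s.device)       # unconditional / pure BC branch
    o1 = torch.ones(s.shape[0], 1, device=s.device)        # advantage-conditioned branch
    for k in range(num_steps):
        t = torch.full((s.shape[0], 1), k * dt, device=s.device)
        v0 = velocity_net(s, a, t, o0)
        v1 = velocity_net(s, a, t, o1)
        a = a + dt * (v0 + w * (v1 - v0))                   # sweep w to trade exploitation vs. BC
    return a
```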
Policy gradient approximations
These approaches can be divided into (1) those which estimate surrogates for the log-likelihood using the base flow field, and (2) those which explicitly inject (Gaussian) noise into the denoising path to turn it into a stochastic Markov chain, where the likelihood of a flow trajectory can be easily computed in closed form.
- Log-likelihood ratio approximations: These approaches are inspired by policy gradient methods such as PPO and seek to incorporate similar ideas into flow policy optimization (FPO, FPO++). To do so, they approximate the likelihood ratio $\rho_\theta = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, i.e., the difference in action log-likelihoods, with the difference in CFM losses (not the difference in predictions): $$ \hat{\rho}_{\mathrm{FPO}}(\theta) = \exp\left(\hat{\mathcal{L}}_{\mathrm{CFM}, \theta_{\mathrm{old}}}(a_t; s_t) - \hat{\mathcal{L}}_{\mathrm{CFM}, \theta}(a_t; s_t)\right) $$ and plug this ratio into a PPO-style clipped surrogate, so that the advantage weight by which the CFM loss is multiplied is clipped when the ratio drifts too far from one (see the sketch after this list).
- Noise injection to "create" likelihoods: Methods in this category (ReinFlow, GRPO) inject Gaussian noise into one or more of the integration steps, essentially converting the deterministic flow into a stochastic Markov chain with a computable log-probability (sketched after this list). It's useful to think of this as turning the problem into a "bi-level" Markov decision process: we now have a Gaussian policy taking "mini-actions" at each step of the integration path, but we only get one reward from the environment per "mini" trajectory. Unlike with Gaussian policies, the likelihood of a flow path does not correspond to the marginal likelihood of the action under the entire policy, but "pushing" the velocity field towards higher-reward paths is still a form of policy improvement!
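A minimal sketch of the FPO-style ratio estimate plugged into a PPO clipped surrogate, assuming a rectified-flow (linear interpolation) CFM loss and sharing the same $(t, \text{noise})$ draw between the old and new networks to keep the estimate low-variance; all names and the clipping range are illustrative.

```python
import torch

def fpo_loss(velocity_net, old_velocity_net, s, a, adv, clip_eps=0.2):
    """PPO-style clipped surrogate with the CFM-loss-difference ratio estimate."""
    # Shared (t, noise) draw so old and new losses are evaluated at the same path point.
    t = torch.rand(s.shape[0], 1, device=s.device)
    z = torch.randn_like(a)
    a_t = (1 - t) * z + t * a                        # linear interpolation between noise and action
    target_v = a - z                                 # rectified-flow / OT target velocity
    loss_new = (velocity_net(s, a_t, t) - target_v).pow(2).mean(dim=-1)
    with torch.no_grad():
        loss_old = (old_velocity_net(s, a_t, t) - target_v).pow(2).mean(dim=-1)
    ratio = torch.exp(loss_old - loss_new)           # rho_hat_FPO, per sample
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```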
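And a sketch of the noise-injection idea in the spirit of ReinFlow: each Euler step becomes a Gaussian "mini-action" whose log-probabilities sum to a path log-likelihood, which a REINFORCE-style update can push towards higher-reward actions. The fixed noise scale and `reward_fn` interface are illustrative assumptions (ReinFlow, for instance, uses a learnable noise scale).

```python
import torch
from torch.distributions import Normal

def stochastic_rollout(velocity_net, s, action_dim=4, num_steps=4, sigma=0.1):
    """Integrate the flow with Gaussian noise at each step; return the action and path log-prob."""
    a = torch.randn(s.shape[0], action_dim, device=s.device)
    dt, log_prob = 1.0 / num_steps, 0.0
    for k in range(num_steps):
        t = torch.full((s.shape[0], 1), k * dt, device=s.device)
        mean = a + dt * velocity_net(s, a, t)            # deterministic Euler step as the mean
        dist = Normal(mean, sigma)
        a = dist.sample()                                # stochastic "mini-action"
        log_prob = log_prob + dist.log_prob(a).sum(-1)   # accumulate path log-likelihood
    return a, log_prob

def path_policy_gradient_loss(velocity_net, s, reward_fn):
    """REINFORCE on the path log-likelihood: push the field towards higher-reward paths."""
    a, log_prob = stochastic_rollout(velocity_net, s)
    with torch.no_grad():
        r = reward_fn(s, a)                              # one reward per sampled trajectory
    return -(r * log_prob).mean()
```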
Residual policy learning
Another optimization paradigm is to fix the base policy and learn a residual policy on top of it that dynamically modifies its output. While residual policies typically make corrective adjustments to a relatively strong base policy (an MPC controller, a behavior-cloned policy, or a large VLA), another interesting direction is to instead use the residual to inject exploration capabilities, effectively altering the variance rather than the mean (ReinFlow with learnable noise is one example of this).
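A minimal sketch of the residual idea, assuming an additive, bounded correction on top of a frozen base policy trained against a critic; the network, scale, and `base_sample_fn` interface are illustrative assumptions rather than any specific method's recipe.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Small network that outputs a bounded, scaled correction to the base policy's action."""
    def __init__(self, state_dim, action_dim, hidden=256, scale=0.1):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, s, base_action):
        delta = torch.tanh(self.net(torch.cat([s, base_action], dim=-1)))
        return base_action + self.scale * delta           # base action is only nudged, never replaced

def residual_loss(residual, base_sample_fn, q_net, s):
    """Train only the residual to maximize the critic; the base flow policy stays frozen."""
    with torch.no_grad():
        base_a = base_sample_fn(s)                         # e.g., the frozen BC flow sampler
    a = residual(s, base_a)
    return -q_net(s, a).mean()
```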
While I haven't worked much with these approaches, I think residual approaches are very promising: they allow quick, on-the-fly continual learning and improvement without compromising the stability of the base policy, and the learned improvements or generated data can later be used to update the base policy itself. From a latency perspective, this can be especially useful for test-time adaptation, where gradient-based optimization through the full policy may be prohibitively slow.