
Flattening Hierarchies with Policy Bootstrapping

John L. Zhou and Jonathan C. Kao

University of California, Los Angeles

Subgoal Advantage-Weighted policy bootstrapping (SAW) is an offline goal-conditioned RL algorithm that scales to complex, long-horizon tasks without needing hierarchical policies or generative subgoal models.

What are the benefits of hierarchies in offline RL?

Goal-conditioned hierarchies achieve state-of-the-art performance on long-horizon tasks, but at the cost of substantial design complexity and a generative model over the subgoal space, which is expensive to train. We take a deep dive into a state-of-the-art hierarchical method for offline goal-conditioned RL and identify a simple yet key reason for its success: it's just easier to train policies for short-horizon goals!

Hierarchies bootstrap on policies

How do hierarchical methods sidestep this difficulty? They exploit the inductive bias that actions which are good for reaching (good) subgoals are also good for reaching the final goal, together with the relative ease of training policies only on nearby goals. We call such policies subpolicies: they are functionally identical to a hierarchical method's low-level actor, but serve a non-hierarchical (flat) policy.

Algorithm

However, requiring an expensive generative model for subgoals would saddle us with all the costs and limitations of hierarchical methods. Instead, we view hierarchical policy optimization through the lens of probabilistic inference, which lets us (1) unify existing methods and (2) replace explicit subgoal generation with an importance weight over future states.

$$\mathcal{J}(\theta) = \mathbb{E}_{p^{\mathcal D}(s,a,w),\, p(g)}\!\left[\, e^{\alpha A(s, a, g)} \log\pi_\theta(a\mid s,g) - e^{\beta A(s,w,g)}\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s, g)\,\|\,\pi^{\mathrm{sub}}(\cdot \mid s, w)\right)\,\right]$$

Our method, Subgoal Advantage-Weighted policy bootstrapping (SAW), combines two learning signals into a single objective: the one-step advantage signal from the goal-conditioned value function, and the subpolicy's estimate of the optimal action distribution for a given subgoal. Rather than explicitly generating subgoals, we sample waypoints directly from the same trajectory as the initial state-action pair and weight the contribution of the KL divergence term by the optimality of that waypoint.
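As a concrete sketch, the per-sample objective can be computed from the policy's log-probability, the two advantage estimates, and the KL divergence to the subpolicy. All names below are illustrative; numpy stands in for an autodiff framework, and the clipping of the exponentiated weights is a common AWR-style stabilization we assume rather than take from the text:

```python
import numpy as np

def saw_loss(log_prob, adv_action, adv_waypoint, kl_to_subpolicy,
             alpha=1.0, beta=1.0, weight_clip=100.0):
    """Negated per-sample SAW objective (sketch, names hypothetical).

    log_prob:         log pi_theta(a | s, g) for a dataset action
    adv_action:       one-step advantage A(s, a, g)
    adv_waypoint:     subgoal advantage A(s, w, g) for a same-trajectory waypoint w
    kl_to_subpolicy:  KL(pi_theta(. | s, g) || pi_sub(. | s, w))
    """
    w_action = np.minimum(np.exp(alpha * adv_action), weight_clip)
    w_subgoal = np.minimum(np.exp(beta * adv_waypoint), weight_clip)
    # Maximize the advantage-weighted log-likelihood while pulling the policy
    # toward the subpolicy in proportion to the waypoint's advantage.
    objective = w_action * log_prob - w_subgoal * kl_to_subpolicy
    return -np.mean(objective)
```

Note that when both advantages are zero, the weights reduce to 1 and the loss is plain behavioral cloning regularized toward the subpolicy.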

Algorithm 1   Subgoal Advantage-Weighted Policy Bootstrapping (SAW)
  1. Input: offline dataset $\mathcal D$, goal distribution $p(g)$.
  2. Initialize value function $V_\phi$, subpolicy $\pi_\omega$, and policy $\pi_\theta$.
  3. while not converged do
  4. Train value function: $\phi \leftarrow \phi - \lambda \nabla_\phi \mathcal{L}_{\mathrm{GCIVL}}(\phi)$ with $(s_t, s_{t+1}) \sim p^{\mathcal D},\ g \sim p(g)$.
  5. end while
  6. while not converged do
  7. Train target subpolicy: $\omega \leftarrow \omega - \lambda \nabla_\omega \mathcal{J}_{\mathrm{AWR}}(\omega)$ with $(s_t, a, w) \sim p^{\mathcal D}$.
  8. end while
  9. while not converged do
  10. Train policy: $\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{J}_{\mathrm{SAW}}(\theta)$ with $(s_t, a, w) \sim p^{\mathcal D},\ g \sim p(g)$.
  11. end while
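Algorithm 1's three sequential phases can be sketched as a plain training loop. The `sample_batch` function and the three step callbacks below are hypothetical stand-ins for the actual $\mathcal{L}_{\mathrm{GCIVL}}$, $\mathcal{J}_{\mathrm{AWR}}$, and $\mathcal{J}_{\mathrm{SAW}}$ gradient updates:

```python
def train_saw(sample_batch, value_step, subpolicy_step, policy_step,
              n_value=1000, n_subpolicy=1000, n_policy=1000):
    """Run the three phases of Algorithm 1 in order (sketch).

    Each *_step callback is assumed to perform one gradient update on its
    respective objective; sample_batch draws (s, a, w, g) tuples from the
    offline dataset.
    """
    for _ in range(n_value):       # Phase 1: goal-conditioned value function
        value_step(sample_batch())
    for _ in range(n_subpolicy):   # Phase 2: subpolicy on nearby goals (waypoints)
        subpolicy_step(sample_batch())
    for _ in range(n_policy):      # Phase 3: flat SAW policy on goals of all horizons
        policy_step(sample_batch())
```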

Our training recipe is simple: we train a goal-conditioned value function and a subpolicy on nearby goals in a fashion identical to the low-level actor in hierarchical methods. Then, we train the full flat goal-conditioned policy on goals of all horizons, by sampling states, actions, waypoints, and goals all from the same trajectory.
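A minimal sketch of that sampling scheme: the state, action, waypoint, and goal all come from one trajectory, with the waypoint a fixed number of steps ahead and the goal drawn from the remaining future. The offset `k` and the uniform future-goal sampling are our assumptions, in the style of hierarchical offline GCRL methods:

```python
import numpy as np

def sample_saw_tuple(observations, actions, rng, k=25):
    """Sample one (s_t, a_t, w, g) tuple from a single trajectory (sketch).

    w = s_{t+k}, clipped to the end of the trajectory;
    g = a uniformly sampled state at or after time t.
    """
    T = len(observations)
    t = int(rng.integers(0, T - 1))
    w_idx = min(t + k, T - 1)        # waypoint: k steps ahead of s_t
    g_idx = int(rng.integers(t, T))  # goal: any future state, same trajectory
    return observations[t], actions[t], observations[w_idx], observations[g_idx]
```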

Experiments

Figure: OGBench overview.

We evaluate SAW on a variety of locomotion and manipulation tasks from OGBench, showing state-of-the-art performance on nearly all tasks and outperforming hierarchical methods on the longest-horizon tasks, antmaze-giant-navigate-v0 and humanoidmaze-giant-navigate-v0. SAW is the first method to achieve non-trivial performance on the latter!

Evaluating SAW on state- and pixel-based offline goal-conditioned RL tasks. Average (binary) success rate (%) compared against the numbers reported in Park et al. (2024), across the five test-time goals for each environment, averaged over 8 seeds (4 seeds for pixel-based visual tasks). Numbers within 5% of the best in the row are in bold. Results with an asterisk (*) use different value learning hyperparameters.

| Environment | Dataset | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL | RISoff | SAW |
|---|---|---|---|---|---|---|---|---|---|
| pointmaze | pointmaze-medium-navigate-v0 | 9 ±6 | 63 ±6 | 53 ±8 | 82 ±5 | 29 ±7 | 79 ±5 | 88 ±6 | 97 ±2 |
| | pointmaze-large-navigate-v0 | 29 ±6 | 45 ±5 | 34 ±3 | 86 ±9 | 39 ±7 | 58 ±5 | 63 ±13 | 85 ±10 |
| | pointmaze-giant-navigate-v0 | 1 ±2 | 0 ±0 | 0 ±0 | 68 ±7 | 27 ±10 | 46 ±9 | 57 ±12 | 68 ±8 |
| antmaze | antmaze-medium-navigate-v0 | 29 ±4 | 72 ±8 | 71 ±4 | 88 ±3 | 95 ±1 | 96 ±1 | 96 ±1 | 97 ±1 |
| | antmaze-large-navigate-v0 | 24 ±2 | 16 ±5 | 34 ±4 | 75 ±6 | 83 ±4 | 91 ±2 | 89 ±3 | 90 ±3 |
| | antmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 14 ±3 | 16 ±3 | 65 ±5 | 65 ±4 | 73 ±4 |
| humanoidmaze | humanoidmaze-medium-navigate-v0 | 8 ±2 | 24 ±2 | 27 ±2 | 21 ±8 | 60 ±4 | 89 ±2 | 73 ±5 | 88 ±3 |
| | humanoidmaze-large-navigate-v0 | 1 ±0 | 2 ±1 | 2 ±1 | 5 ±1 | 24 ±4 | 49 ±4 | 21 ±7 | 46 ±4 |
| | humanoidmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±0 | 3 ±2 | 12 ±4 | 3 ±2 | 35 ±4 |
| cube | cube-single-play-v0 | 6 ±2 | 53 ±4 | 68 ±6 | 5 ±1 | 19 ±2 | 44* ±9 | 81* ±6 | 72* ±5 |
| | cube-double-play-v0 | 1 ±1 | 36 ±3 | 40 ±5 | 1 ±0 | 10 ±2 | 6 ±2 | 36 ±4 | 40 ±7 |
| | cube-triple-play-v0 | 1 ±1 | 1 ±0 | 3 ±1 | 0 ±0 | 4 ±1 | 3 ±1 | 3 ±2 | 4 ±2 |
| scene | scene-play-v0 | 5 ±1 | 42 ±4 | 51 ±4 | 5 ±1 | 19 ±2 | 38 ±3 | 64 ±7 | 63 ±6 |
| visual-antmaze | visual-antmaze-medium-navigate-v0 | 11 ±2 | 22 ±2 | 11 ±1 | 0 ±0 | 94 ±1 | 93 ±4 | 55 ±47 | 95 ±0 |
| | visual-antmaze-large-navigate-v0 | 4 ±0 | 5 ±1 | 4 ±1 | 0 ±0 | 84 ±1 | 53 ±9 | 43 ±44 | 82 ±4 |
| | visual-antmaze-giant-navigate-v0 | 0 ±0 | 1 ±1 | 0 ±0 | 0 ±0 | 47 ±2 | 6 ±4 | 4 ±1 | 10 ±2 |
| visual-cube | visual-cube-single-play-v0 | 5 ±1 | 60 ±5 | 30 ±5 | 41 ±15 | 31 ±15 | 89 ±0 | 63 ±37 | 88 ±3 |
| | visual-cube-double-play-v0 | 1 ±1 | 10 ±2 | 1 ±1 | 5 ±0 | 2 ±1 | 39 ±2 | 28 ±6 | 40 ±3 |
| | visual-cube-triple-play-v0 | 15 ±2 | 14 ±2 | 15 ±1 | 16 ±1 | 17 ±2 | 21 ±0 | 18 ±1 | 20 ±1 |
| visual-scene | visual-scene-play-v0 | 12 ±2 | 25 ±3 | 12 ±2 | 10 ±1 | 11 ±2 | 49 ±4 | 38 ±3 | 47 ±6 |

Methods which generate subgoals often must predict in a compact latent space in order to scale to high-dimensional observation spaces. We find that one popular choice, sharing a representation between the goal-conditioned value function and the policy, significantly harms SAW's performance. This highlights a fundamental tradeoff in hierarchical methods: subgoal representations are essential for making high-level policy prediction tractable, but those same representations can constrain policy expressiveness and limit overall performance.

Figure: subgoal representation comparison.

Citation

@article{zhou_flattening_2025,
  title  = {Flattening Hierarchies with Policy Bootstrapping},
  url    = {http://arxiv.org/abs/2505.14975},
  doi    = {10.48550/arXiv.2505.14975},
  publisher = {arXiv},
  author = {Zhou, John L. and Kao, Jonathan C.},
  year   = {2025},
}