
Flattening Hierarchies with Policy Bootstrapping

John L. Zhou and Jonathan C. Kao

University of California, Los Angeles

Subgoal Advantage-Weighted policy bootstrapping (SAW) is an offline goal-conditioned RL algorithm that scales to complex, long-horizon tasks without needing hierarchical policies or generative subgoal models.

What are the benefits of hierarchies in offline RL?

Goal-conditioned hierarchies achieve state-of-the-art performance on long-horizon tasks, but at the cost of substantial design complexity and a generative model over the subgoal space, which is expensive to train. We take a deep dive into a state-of-the-art hierarchical method for offline goal-conditioned RL and identify a simple yet key reason for its success: it's just easier to train policies for short-horizon goals!

Hierarchies bootstrap on policies

How do hierarchical methods sidestep this difficulty? They exploit the inductive bias that actions which are good for reaching (good) subgoals are also good for reaching the final goal, together with the relative ease of training policies only on nearby goals. We call such policies subpolicies: they are functionally identical to a hierarchical method's low-level actor, but serve a non-hierarchical (flat) policy.

Algorithm

However, requiring an expensive generative model for subgoals would saddle us with all the costs and limitations of hierarchical methods. Instead, we view hierarchical policy optimization through the lens of probabilistic inference, which lets us (1) unify existing methods and (2) replace explicit subgoal generation with an importance weight over future states.

$$\mathcal{J}(\theta) = \mathbb{E}_{p^{\mathcal D}(s,a,w),\, p(g)}\!\left[\, e^{\alpha A(s, a, g)} \log\pi_\theta(a\mid s,g) - e^{\beta A(s,w,g)}\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s, g)\,\|\,\pi^{\mathrm{sub}}(\cdot \mid s, w)\right)\,\right]$$

Our method, Subgoal Advantage-Weighted policy bootstrapping (SAW), combines two learning signals into a single objective: the one-step advantage signal from the goal-conditioned value function, and the subpolicy's estimate of the optimal action distribution for a given subgoal. Rather than explicitly generating subgoals, we sample waypoints directly from the same trajectory as the initial state-action pair and weight the contribution of the KL divergence term by the optimality of that waypoint.
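As a concrete sketch, the per-sample objective can be computed from the policy's log-probability, the two advantage estimates, and the KL divergence to the subpolicy. All names below are illustrative; numpy stands in for an autodiff framework, and the clipping of the exponentiated weights is a common AWR-style stabilization we assume rather than take from the text:

```python
import numpy as np

def saw_loss(log_prob, adv_action, adv_waypoint, kl_to_subpolicy,
             alpha=1.0, beta=1.0, weight_clip=100.0):
    """Negated per-sample SAW objective (sketch, names hypothetical).

    log_prob:         log pi_theta(a | s, g) for a dataset action
    adv_action:       one-step advantage A(s, a, g)
    adv_waypoint:     subgoal advantage A(s, w, g) for a same-trajectory waypoint w
    kl_to_subpolicy:  KL(pi_theta(. | s, g) || pi_sub(. | s, w))
    """
    w_action = np.minimum(np.exp(alpha * adv_action), weight_clip)
    w_subgoal = np.minimum(np.exp(beta * adv_waypoint), weight_clip)
    # Maximize the advantage-weighted log-likelihood while pulling the policy
    # toward the subpolicy in proportion to the waypoint's advantage.
    objective = w_action * log_prob - w_subgoal * kl_to_subpolicy
    return -np.mean(objective)
```

Note that when both advantages are zero, the weights reduce to 1 and the loss is plain behavioral cloning regularized toward the subpolicy.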

Algorithm 1   Subgoal Advantage-Weighted Policy Bootstrapping (SAW)
  1. Input: offline dataset $\mathcal D$, goal distribution $p(g)$.
  2. Initialize value function $V_\phi$, subpolicy $\pi_\omega$, and policy $\pi_\theta$.
  3. while not converged do
  4. Train value function: $\phi \leftarrow \phi - \lambda \nabla_\phi \mathcal{L}_{\mathrm{GCIVL}}(\phi)$ with $(s_t, s_{t+1}) \sim p^{\mathcal D},\ g \sim p(g)$.
  5. end while
  6. while not converged do
  7. Train target subpolicy: $\omega \leftarrow \omega - \lambda \nabla_\omega \mathcal{J}_{\mathrm{AWR}}(\omega)$ with $(s_t, a, w) \sim p^{\mathcal D}$.
  8. end while
  9. while not converged do
  10. Train policy: $\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{J}_{\mathrm{SAW}}(\theta)$ with $(s_t, a, w) \sim p^{\mathcal D},\ g \sim p(g)$.
  11. end while
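Algorithm 1's three sequential phases can be sketched as a plain training loop. The `sample_batch` function and the three step callbacks below are hypothetical stand-ins for the actual $\mathcal{L}_{\mathrm{GCIVL}}$, $\mathcal{J}_{\mathrm{AWR}}$, and $\mathcal{J}_{\mathrm{SAW}}$ gradient updates:

```python
def train_saw(sample_batch, value_step, subpolicy_step, policy_step,
              n_value=1000, n_subpolicy=1000, n_policy=1000):
    """Run the three phases of Algorithm 1 in order (sketch).

    Each *_step callback is assumed to perform one gradient update on its
    respective objective; sample_batch draws (s, a, w, g) tuples from the
    offline dataset.
    """
    for _ in range(n_value):       # Phase 1: goal-conditioned value function
        value_step(sample_batch())
    for _ in range(n_subpolicy):   # Phase 2: subpolicy on nearby goals (waypoints)
        subpolicy_step(sample_batch())
    for _ in range(n_policy):      # Phase 3: flat SAW policy on goals of all horizons
        policy_step(sample_batch())
```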

Our training recipe is simple: we train a goal-conditioned value function and a subpolicy on nearby goals in a fashion identical to the low-level actor in hierarchical methods. Then, we train the full flat goal-conditioned policy on goals of all horizons, by sampling states, actions, waypoints, and goals all from the same trajectory.
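A minimal sketch of that sampling scheme: the state, action, waypoint, and goal all come from one trajectory, with the waypoint a fixed number of steps ahead and the goal drawn from the remaining future. The offset `k` and the uniform future-goal sampling are our assumptions, in the style of hierarchical offline GCRL methods:

```python
import numpy as np

def sample_saw_tuple(observations, actions, rng, k=25):
    """Sample one (s_t, a_t, w, g) tuple from a single trajectory (sketch).

    w = s_{t+k}, clipped to the end of the trajectory;
    g = a uniformly sampled state at or after time t.
    """
    T = len(observations)
    t = int(rng.integers(0, T - 1))
    w_idx = min(t + k, T - 1)        # waypoint: k steps ahead of s_t
    g_idx = int(rng.integers(t, T))  # goal: any future state, same trajectory
    return observations[t], actions[t], observations[w_idx], observations[g_idx]
```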

Experiments

Figure: OGBench overview.

We evaluate SAW on a variety of locomotion and manipulation tasks from OGBench, showing state-of-the-art performance on nearly all tasks and outperforming hierarchical methods on the longest-horizon tasks, antmaze-giant-navigate-v0 and humanoidmaze-giant-navigate-v0. SAW is the first method to achieve non-trivial performance on the latter!

Evaluating SAW on state- and pixel-based offline goal-conditioned RL tasks. Average (binary) success rate (%) compared against the numbers reported in Park et al. (2024), across the five test-time goals for each environment, averaged over 8 seeds (4 seeds for pixel-based visual tasks). Numbers within 5% of the best in the row are in bold. Results with an asterisk (*) use different value learning hyperparameters.

| Environment | Dataset | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL | RISoff | SAW |
|---|---|---|---|---|---|---|---|---|---|
| pointmaze | pointmaze-medium-navigate-v0 | 9 ±6 | 63 ±6 | 53 ±8 | 82 ±5 | 29 ±7 | 79 ±5 | 88 ±6 | 97 ±2 |
| | pointmaze-large-navigate-v0 | 29 ±6 | 45 ±5 | 34 ±3 | 86 ±9 | 39 ±7 | 58 ±5 | 63 ±13 | 85 ±10 |
| | pointmaze-giant-navigate-v0 | 1 ±2 | 0 ±0 | 0 ±0 | 68 ±7 | 27 ±10 | 46 ±9 | 57 ±12 | 68 ±8 |
| antmaze | antmaze-medium-navigate-v0 | 29 ±4 | 72 ±8 | 71 ±4 | 88 ±3 | 95 ±1 | 96 ±1 | 96 ±1 | 97 ±1 |
| | antmaze-large-navigate-v0 | 24 ±2 | 16 ±5 | 34 ±4 | 75 ±6 | 83 ±4 | 91 ±2 | 89 ±3 | 90 ±3 |
| | antmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 14 ±3 | 16 ±3 | 65 ±5 | 65 ±4 | 73 ±4 |
| humanoidmaze | humanoidmaze-medium-navigate-v0 | 8 ±2 | 24 ±2 | 27 ±2 | 21 ±8 | 60 ±4 | 89 ±2 | 73 ±5 | 88 ±3 |
| | humanoidmaze-large-navigate-v0 | 1 ±0 | 2 ±1 | 2 ±1 | 5 ±1 | 24 ±4 | 49 ±4 | 21 ±7 | 46 ±4 |
| | humanoidmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±0 | 3 ±2 | 12 ±4 | 3 ±2 | 35 ±4 |
| cube | cube-single-play-v0 | 6 ±2 | 53 ±4 | 68 ±6 | 5 ±1 | 19 ±2 | 44* ±9 | 81* ±6 | 72* ±5 |
| | cube-double-play-v0 | 1 ±1 | 36 ±3 | 40 ±5 | 1 ±0 | 10 ±2 | 6 ±2 | 36 ±4 | 40 ±7 |
| | cube-triple-play-v0 | 1 ±1 | 1 ±0 | 3 ±1 | 0 ±0 | 4 ±1 | 3 ±1 | 3 ±2 | 4 ±2 |
| scene | scene-play-v0 | 5 ±1 | 42 ±4 | 51 ±4 | 5 ±1 | 19 ±2 | 38 ±3 | 64 ±7 | 63 ±6 |
| visual-antmaze | visual-antmaze-medium-navigate-v0 | 11 ±2 | 22 ±2 | 11 ±1 | 0 ±0 | 94 ±1 | 93 ±4 | 55 ±47 | 95 ±0 |
| | visual-antmaze-large-navigate-v0 | 4 ±0 | 5 ±1 | 4 ±1 | 0 ±0 | 84 ±1 | 53 ±9 | 43 ±44 | 82 ±4 |
| | visual-antmaze-giant-navigate-v0 | 0 ±0 | 1 ±1 | 0 ±0 | 0 ±0 | 47 ±2 | 6 ±4 | 4 ±1 | 10 ±2 |
| visual-cube | visual-cube-single-play-v0 | 5 ±1 | 60 ±5 | 30 ±5 | 41 ±15 | 31 ±15 | 89 ±0 | 63 ±37 | 88 ±3 |
| | visual-cube-double-play-v0 | 1 ±1 | 10 ±2 | 1 ±1 | 5 ±0 | 2 ±1 | 39 ±2 | 28 ±6 | 40 ±3 |
| | visual-cube-triple-play-v0 | 15 ±2 | 14 ±2 | 15 ±1 | 16 ±1 | 17 ±2 | 21 ±0 | 18 ±1 | 20 ±1 |
| visual-scene | visual-scene-play-v0 | 12 ±2 | 25 ±3 | 12 ±2 | 10 ±1 | 11 ±2 | 49 ±4 | 38 ±3 | 47 ±6 |

Methods which generate subgoals often must predict in a compact latent space in order to scale to high-dimensional observation spaces. We find that one popular choice, sharing a representation between the goal-conditioned value function and the policy, significantly harms SAW's performance. This highlights a fundamental tradeoff in hierarchical methods: subgoal representations are essential for making high-level policy prediction tractable, but those same representations can constrain policy expressiveness and limit overall performance.

Figure: subgoal representation comparison.

Citation

@article{zhou_flattening_2025,
  title  = {Flattening Hierarchies with Policy Bootstrapping},
  url    = {http://arxiv.org/abs/2505.14975},
  doi    = {10.48550/arXiv.2505.14975},
  publisher = {arXiv},
  author = {Zhou, John L. and Kao, Jonathan C.},
  year   = {2025},
}