TL;DR:

Research Idea 2: Hierarchical Reward for Long-Horizon Planning and Agent RL

Working title

Hierarchical Rewards for Long-Horizon Language Agents: Learning to Plan, Execute, Revise, and Credit Subgoals

Core research question

Long-horizon agents often fail because they must solve tasks that require many dependent decisions. A single final success/failure reward is too sparse. The idea is to design a hierarchical reward structure where:

A long-term goal is decomposed into high-level plans.
Each high-level plan is decomposed into subgoals.
Each subgoal is decomposed into executable steps.
Rewards are assigned not only for final task success, but also for correct intermediate planning, subgoal completion, and plan revision.
Higher-level goal completion can validate or invalidate lower-level plans.

The central question is:

Can hierarchical reward design improve long-horizon agent learning by giving credit to correct plans, correct subgoals, correct execution, and correct revision?

Motivation

In long-horizon tasks, final rewards are sparse and delayed. For example, in a travel-planning agent:

Goal: Plan a 5-day Cancun trip under budget and time constraints.

High-level plan:
Choose hotel route.
Schedule outdoor tours.
Arrange transport.
Check constraints.

Subgoals:
Find hotels.
Select tours.
Balance rest and activity.
Verify cost and timing.

Low-level actions:
Search hotel.
Search ferry.
Compare tour hours.
Update itinerary.

If the final itinerary fails because one ferry time is impossible, standard RL may give a bad reward to the whole trajectory. But many parts of the plan may still be correct. A hierarchical reward can assign partial credit and identify which level failed.

Relationship to existing work

Hierarchical RL

Traditional hierarchical reinforcement learning already studies temporal abstraction:

Options framework
MAXQ decomposition
Feudal RL
Subgoal discovery
Skill learning

These methods decompose long-horizon tasks into higher-level policies and lower-level actions. However, classical HRL usually assumes environment states/actions, not natural-language plans.

LLM agent planning

Recent LLM-agent work uses explicit planning and decomposition:

Plan-and-Act style frameworks separate a planner from an executor.
Web-agent methods use high-level plans to guide low-level browser actions.
TravelPlanner, WebArena, WebVoyager, DeepPlanning, and related benchmarks test long-horizon constrained planning.
ReAct / Reflexion / Tree-of-Thought / plan-execute-replan methods use reasoning traces and feedback loops.

But many such systems are prompt- or inference-time frameworks, not fully trained with hierarchical reward signals.

Planning benchmarks

Relevant benchmarks include:

PlanBench: formal planning and reasoning about actions/change.
TravelPlanner: real-world travel planning with constraints and tools.
DeepPlanning: long-horizon planning with travel and shopping tasks.
WebArena / WebVoyager: web-navigation tasks requiring many steps.
Robotouille: asynchronous planning benchmark.
Flex-TravelPlanner: flexible planning with changing constraints.
REALM-Bench: real-world planning scenarios and adaptation to disruptions.
HeroBench: long-horizon hierarchical planning in an RPG-like world.

These benchmarks are relevant because they can expose failures in decomposition, execution, constraint satisfaction, and replanning.

Existing design patterns close to your idea

1. Plan-and-execute agents

A model first generates a high-level plan, then another module executes it. The plan provides structure, but the system may still lack explicit reward for whether the plan hierarchy itself is correct.

2. Replan / revise loops

Agents can revise the plan after observing failure. This is close to your idea, but many works treat revision as an inference-time heuristic rather than a trainable ability with explicit revision rewards.

3. Verifier-guided planning

A verifier checks whether a plan or step satisfies constraints. This can provide intermediate reward:

reward = final_success + constraint_satisfaction + verifier_score

4. Process reward models

Instead of only rewarding final answer correctness, a process reward model scores intermediate reasoning steps. This idea can be adapted from math/code reasoning to long-horizon agent planning.

5. Hierarchical goal/subgoal rewards

A reward can be assigned at multiple levels:

R_total = R_final + R_plan + R_subgoal + R_step + R_revision

This is the direct version of your proposal.

Proposed method

Represent an agent trajectory as a hierarchy:

Goal G
 ├── Plan P1
 │    ├── Subgoal S1
 │    │    ├── Action a1
 │    │    └── Action a2
 │    └── Subgoal S2
 │         ├── Action a3
 │         └── Action a4
 └── Plan revision P2
      └── ...

Each level receives a reward.

Reward design

1. Final task reward

Measures whether the whole task succeeds.

R_final = 1 if final task is completed else 0

Examples:

Correct travel itinerary satisfying all constraints
Web task completed
Shopping task completed
Game/crafting goal achieved
Robot task completed

2. High-level plan reward

Measures whether the proposed plan is valid and complete before execution.

Possible criteria:

Covers all required constraints
Has feasible ordering
Does not contain contradictions
Decomposes the goal into meaningful subgoals
Matches environment affordances

Example:

R_plan = verifier(plan, goal, constraints)

3. Subgoal reward

Measures whether each subgoal is achieved.

Example:

R_subgoal_i = 1 if subgoal_i is completed else 0

For travel planning:

Hotel chosen for every night
Transportation selected
Tour schedule feasible
Budget satisfied
Opening hours respected

4. Step/action reward

Measures whether each action contributes to the current subgoal.

Example:

R_action_t = usefulness(action_t, subgoal_i)

This can be learned by a reward model or computed by an environment verifier.

5. Revision reward

Measures whether the agent correctly detects and fixes a plan failure.

Possible criteria:

Detects infeasible step
Identifies the correct failed constraint
Proposes minimal repair
Preserves still-correct parts of the plan
Improves final feasibility

Example:

R_revision = score(new_plan) - score(old_plan)

This is especially important for your idea.

6. Cross-level consistency reward

Higher-level goals should validate lower-level subgoals.

Example:

R_consistency = 1 if all subgoals jointly satisfy the high-level goal else 0

This prevents locally correct but globally inconsistent plans.

A possible total reward

R = λ_final R_final
  + λ_plan R_plan
  + λ_subgoal Σ_i R_subgoal_i
  + λ_action Σ_t R_action_t
  + λ_revision R_revision
  + λ_consistency R_consistency

The key research issue is choosing weights and avoiding reward hacking.

Training setup

Possible training pipeline:

Stage 1: Supervised plan decomposition

Train the model to produce hierarchical plans from successful trajectories.

Data format:

{
  "goal": "...",
  "high_level_plan": [...],
  "subgoals": [...],
  "actions": [...],
  "revision": [...]
}

Stage 2: Verifier / reward model training

Train verifiers for:

Plan feasibility
Constraint satisfaction
Subgoal completion
Action usefulness
Revision correctness

These can be trained from:

Environment outcomes
Synthetic traces
Human annotations
LLM-as-judge labels
Constraint solvers

Stage 3: RL fine-tuning

Use PPO / GRPO / DPO-like preference optimization / actor-critic methods with hierarchical rewards.

The model is rewarded for:

Producing a valid plan
Executing subgoals
Revising when needed
Achieving final success

Datasets for plan revision or flexible planning

There are some relevant datasets/benchmarks, but the exact ability you describe is not fully solved.

Useful existing benchmarks

TravelPlanner

Good for constrained real-world planning. It includes tool use, user constraints, commonsense constraints, and hard constraints. It can be adapted to revision by changing constraints mid-task or hiding information until later.

Flex-TravelPlanner

Closer to plan revision because it evaluates flexible planning under changing constraints.

DeepPlanning

Useful for long-horizon global/local planning, especially travel and shopping. It can test whether an agent performs global optimization and local constraint reasoning.

WebArena / WebVoyager

Useful for long-horizon web execution. Revision can be tested when a page state changes or an earlier action fails.

PlanBench

Useful for formal plan validity and action precondition/effect reasoning. Plan repair can be created by perturbing valid plans.

Robotouille

Useful for asynchronous planning, where actions may happen in parallel or with delays.

REALM-Bench

Useful for real-world planning scenarios, multi-agent planning, and adaptation to disruptions.

HeroBench

Useful for explicit hierarchical planning, multi-level dependencies, and very long action sequences.

Gap

Most benchmarks test whether the final plan succeeds, but fewer provide dense labels for:

Whether each subgoal is correct
Whether each decomposition is valid
Whether each revision is minimal and correct
Whether the agent knows which level of the hierarchy failed

This is a strong research opportunity.

Possible benchmark contribution

Create a benchmark called something like:

HierPlan-Revise

Each task includes:

A goal
A hierarchical gold plan
Subgoal dependencies
Hidden or changing constraints
Execution feedback
Required plan revision
Final success criteria
Labels for which subgoal failed

Example task:

Goal: Plan a 4-day trip under $1200.

Initial plan:
- Day 1: Museum
- Day 2: Island tour
- Day 3: Ruins
- Day 4: Rest

New feedback:
- Ferry unavailable on Day 2.
- Ruins closed on Day 3.

Required behavior:
- Detect affected subgoals.
- Revise only the necessary days.
- Preserve valid hotel and budget choices.
- Produce final feasible plan.

Evaluation:

Plan validity
Subgoal completion
Revision minimality
Constraint satisfaction
Final success
Token/action efficiency
Error localization accuracy

Possible experiments

Experiment 1: Does hierarchical reward improve final success?

Compare:

Sparse final reward only
Final reward + subgoal reward
Final reward + plan reward
Final reward + subgoal + revision reward
Full hierarchical reward

Measure:

Final success rate
Constraint satisfaction
Number of invalid actions
Replanning success
Efficiency

Experiment 2: Does revision reward improve recovery from failure?

Create tasks where the initial plan becomes invalid.

Measure:

Whether the agent detects the failure
Whether it revises the correct subgoal
Whether it avoids changing correct parts
Whether final success improves

Experiment 3: Does cross-level reward prevent local reward hacking?

Test whether agents optimize subgoals independently but violate the global goal.

Example:

Each day of travel plan looks good.
But total budget exceeds limit.
Or hotel location makes all tours impossible.

Cross-level consistency reward should reduce this.

Experiment 4: Generalization to longer horizons

Train on shorter tasks and test on longer tasks.

Measure:

Decomposition depth generalization
Subgoal dependency handling
Failure recovery
Reward hacking

Novelty angle

The novelty is not just “use hierarchical planning.” The stronger idea is:

Use explicit hierarchical reward credit assignment to train agents to decompose, execute, verify, and revise long-horizon plans.

Compared with existing work:

Plan-and-act gives structure but not necessarily learned hierarchical reward.
Process reward models score reasoning steps but are usually not tied to executable subgoal hierarchies.
HRL gives temporal abstraction but not natural-language plan revision.
Planning benchmarks test final plans but often lack dense hierarchical supervision.

Possible paper structure

1. Introduction

Long-horizon agents suffer from sparse rewards and compounding errors.
Planning helps, but plans can be wrong or require revision.
Current systems often lack explicit credit assignment across plan levels.
We propose hierarchical rewards for plan, subgoal, action, and revision quality.

Hierarchical RL
LLM planning and agent benchmarks
Process reward models
Plan verification and plan repair
Long-horizon web/travel/embodied agents

3. Problem Formulation

Define a task as:

T = (G, C, E)

where:

G is the goal
C is a set of constraints
E is the environment

The agent produces a hierarchy:

H = (P, S, A, R)

where:

P is the high-level plan
S are subgoals
A are actions
R are revisions

4. Hierarchical Reward

Define reward at each level:

R_total = R_final + R_plan + R_subgoal + R_action + R_revision + R_consistency

5. Benchmark / Data Construction

Use existing benchmarks or create perturbations:

TravelPlanner with changing constraints
PlanBench with invalidated action preconditions
WebArena tasks with execution failures
Synthetic hierarchical tasks with ground-truth subgoal trees

6. Training

SFT on successful hierarchical traces
Train verifiers / reward models
RL with hierarchical reward
Optional GRPO/PPO optimization

7. Experiments

Compare reward designs
Analyze plan revision
Analyze long-horizon scaling
Analyze reward hacking
Evaluate cross-domain transfer

8. Analysis

Key questions:

Which reward level contributes most?
Does subgoal reward help or hurt?
Does revision reward improve recovery?
Does plan reward improve global coherence?
Does hierarchical reward generalize to longer tasks?

Risks and limitations

Reward hacking: agent may optimize subgoals without solving global task.
Verifier quality: bad verifier gives misleading rewards.
Annotation cost: hierarchical gold plans are expensive.
Credit assignment: hard to know which subgoal caused final failure.
Over-planning: too much planning may reduce efficiency.
Domain specificity: travel/web planning rewards may not transfer to embodied tasks.

Concrete first project

A practical first paper:

Hierarchical reward for TravelPlanner-style revision

Start from TravelPlanner or a similar constrained planning benchmark.
Create perturbed tasks where one or more constraints change after initial planning.
Ask the agent to:
- generate a plan,
- receive feedback,
- localize the failed subgoal,
- revise the plan,
- produce final answer.
Define rewards:
- final pass rate,
- constraint satisfaction,
- subgoal correctness,
- revision minimality,
- failure localization.
Train or tune an agent with hierarchical reward.
Compare against:
- final reward only,
- ReAct,
- Reflexion,
- Plan-and-Act,
- verifier-only reranking.

This is a clean and publishable version of your idea.

Key references to start from

Sutton, Precup, and Singh, Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999.
https://www.sciencedirect.com/science/article/pii/S0004370299000521
Dietterich, Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 2000.
https://www.jair.org/index.php/jair/article/view/10266
Valmeekam et al., PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change, NeurIPS 2023.
https://arxiv.org/abs/2206.10498
Xie et al., TravelPlanner: A Benchmark for Real-World Planning with Language Agents, ICML 2024.
https://arxiv.org/abs/2402.01622
Erdogan et al., Improving Planning of Agents for Long-Horizon Tasks / Plan-and-Act, ICML 2025.
https://arxiv.org/abs/2503.09572
DeepPlanning benchmark, 2026.
https://qwenlm.github.io/Qwen-Agent/en/benchmarks/deepplanning/
Flex-TravelPlanner, 2025.
https://openreview.net/forum?id=a7unQ5jMx7
HeroBench, 2026.
https://arxiv.org/html/2508.12782v2

One-sentence pitch

This project proposes hierarchical reward credit assignment for long-horizon language agents, rewarding not only final task success but also plan validity, subgoal completion, action usefulness, failure localization, and plan revision.

Addendum: Counterfactual World-Model Evaluation for Long-Horizon Agents

New idea

A natural extension of the hierarchical reward idea is to ask whether an agent has a counterfactual world model.

The key question is:

Given the current world state or trajectory, can the model reason about what would happen if one condition, action, observation, or plan step were different?

This is closely related to world modeling, but the focus is more specific:

Not just “can the agent complete the task?”
Not just “can the agent predict the next state?”
Not just “can the agent revise after failure?”
But: can the agent mentally simulate alternative trajectories and compare their downstream consequences?

In other words, the model should answer questions like:

Given the current trajectory τ:
- What would happen if action a_t were replaced by a'_t?
- What if observation o_t had been different?
- What if subgoal S_i were skipped?
- What if the high-level plan order were changed?
- What if a constraint appeared earlier?
- Which future failure would this alternative plan cause?
- Which plan revision would minimally fix the trajectory?

This ability can be called:

Counterfactual Trajectory Reasoning
or
Counterfactual World-Modeling for Agents

Why this sounds like a world model

A world model usually predicts environment dynamics:

current state + action -> next state / future trajectory

A counterfactual world model asks a stronger question:

current trajectory + intervention -> alternative future trajectory

So the agent is not only predicting the next factual state. It is simulating a hypothetical branch of the world.

For long-horizon planning, this is essential because good agents must reason about delayed consequences:

If I choose this hotel today, then tomorrow's tour becomes impossible.
If I skip this information-gathering step, I may not know whether the final answer is valid.
If I change this subgoal, the later transportation plan must also change.

This connects naturally to hierarchical planning because counterfactuals can occur at multiple levels:

Level	Counterfactual question
Goal level	What if the user changed the objective?
Constraint level	What if the budget/time/location constraint changed?
Plan level	What if we used a different high-level plan?
Subgoal level	What if this subgoal were skipped or reordered?
Action level	What if this tool/action were replaced?
Observation level	What if the agent observed a different result?
Revision level	What if the agent repaired the wrong part of the plan?

The idea is related to several existing research lines, but it is not fully covered by any single one.

1. Counterfactual reasoning benchmarks for LLMs

There are benchmarks such as CounterBench, which evaluate whether LLMs can perform counterfactual reasoning over causal graphs and structured questions. CounterBench includes different counterfactual types such as basic, joint, conditional, nested, and backdoor counterfactuals.

However, this is mostly about static causal reasoning rather than interactive agent trajectories.

Gap:

CounterBench tests whether a model can answer counterfactual causal questions.
It does not directly test whether an agent can simulate how an alternative action or observation changes a long-horizon task trajectory.

2. Forward counterfactual generation

Some work studies forward counterfactual reasoning, where the model predicts what future developments would follow from an alternative condition. For example, FIN-FORCE studies forward counterfactual generation for financial news.

This is conceptually close because it asks “what would happen next if something were different?”

Gap:

Forward counterfactual generation focuses on future scenario generation, often in a domain like finance.
It is not primarily an agent benchmark with actions, observations, tool use, subgoal dependencies, and executable plan revision.

3. Counterfactuals for language model agents

Recent work on Abstract Counterfactuals for Language Model Agents proposes counterfactual reasoning at the level of high-level action semantics instead of raw token-level interventions. This is very relevant because it argues that action counterfactuals should be represented at an abstract semantic level.

Gap:

This work studies how to construct meaningful counterfactual actions for LM agents.
It does not provide a broad benchmark for hierarchical long-horizon plan counterfactuals or evaluate whether agents can predict downstream trajectory consequences across plan/subgoal/action levels.

Older embodied AI work such as Counterfactual Vision-and-Language Navigation uses counterfactual observations and trajectories to improve robustness in navigation. It explicitly asks questions like what would happen if a different object were observed.

Gap:

This is close in spirit, but it is domain-specific to vision-and-language navigation and mainly used as a training strategy for generalization.
It is not a general agent benchmark for counterfactual world models across web, travel, tool-use, planning, or hierarchical task domains.

5. World-model benchmarks

Recent work on world-model evaluation argues that world models should be evaluated not only by visual fidelity or next-state prediction, but also by prediction, planning, and counterfactual reasoning. For example:

AutumnBench / WorldTest evaluates world-model learning using interactive grid-world environments and includes prediction, planning, and counterfactual reasoning tasks.
World Reasoning Arena (WR-Arena) evaluates world models on action simulation fidelity, long-horizon forecast, and simulative reasoning/planning.
Recent surveys on agentic world modeling define world models as learning state-transition dynamics and helping agents anticipate consequences of candidate actions.

Gap:

These works evaluate world-model capabilities, but often focus on world-model systems or simulated environments.
The proposed idea targets LLM agents specifically and asks whether their internal or external reasoning can support counterfactual trajectory evaluation over natural-language hierarchical plans.

6. Planning analysis for LLM agents

Recent planning-centric analyses argue that LLM agents often behave like step-wise greedy policies and fail because early actions do not account for delayed consequences. This directly supports the need for future-aware and counterfactual evaluation.

Gap:

These works show long-horizon planning failure and propose lookahead/value-estimation methods.
But they do not necessarily isolate counterfactual world-model competence as a benchmarked capability:
given trajectory τ and intervention do(a_t = a'_t), can the model predict the alternative outcome?

7. Counterfactual trajectory pairs for agent training

CaRT uses counterfactual trajectory pairs to teach LLM agents when to stop gathering information. It creates paired trajectories where termination is appropriate vs. minimally modified trajectories where termination is not.

Gap:

CaRT is highly relevant, but it focuses on the specific skill of termination / information-gathering.
A broader benchmark could evaluate counterfactual reasoning over arbitrary actions, observations, plan steps, subgoals, constraints, and revisions.

Novelty assessment

The broad phrase “counterfactual world model” is not completely new. There is existing work on:

Counterfactual reasoning in LLMs
Counterfactual planning
Counterfactual trajectories in navigation
World-model benchmarks with counterfactual tasks
LLM-as-world-model planning methods
Counterfactual trajectory pairs for specific agent skills

So the paper should not claim:

No one has studied counterfactual reasoning or world models before.

A safer and stronger novelty claim is:

Existing agent benchmarks mostly evaluate final task success or factual trajectory execution. Existing counterfactual benchmarks often focus on static causal reasoning, domain-specific future generation, or simulated world-model evaluation. What remains underexplored is a benchmark and training framework for counterfactual trajectory reasoning in long-horizon LLM agents, especially across hierarchical plan levels: goal, constraint, plan, subgoal, action, observation, and revision.

This is a more defensible research gap.

Proposed research direction

Main task

Given a factual agent trajectory:

τ = (s_0, a_0, o_1, a_1, o_2, ..., a_T, o_T)

and an intervention:

do(x_i = x'_i)

where x_i may be an action, observation, constraint, subgoal, or plan step, ask the model to predict:

τ' = alternative future trajectory

and/or answer:

Will the final task still succeed?
Which future step changes?
Which constraint will fail?
Which subgoal becomes invalid?
What minimal revision restores success?

Example: travel planning

Factual trajectory:

Goal: Plan a 4-day Cancun trip.
Plan:
Day 1: Isla Mujeres
Day 2: Chichen Itza + cenote
Day 3: Xplor
Day 4: Cozumel

Counterfactual intervention:

What if the ferry to Cozumel is unavailable on Day 4?

Expected model behavior:

- Recognize that only the Cozumel subgoal is directly affected.
- Predict downstream effects on transport, hotel location, and timing.
- Preserve unaffected days if still valid.
- Propose a minimal repair, e.g., swap Cozumel with Isla Mujeres or replace Cozumel with a local activity.

Example: web agent

Factual trajectory:

Goal: Buy the cheapest compatible laptop charger.
The agent searches product page A, filters by price, and purchases item X.

Counterfactual intervention:

What if the agent had clicked the compatibility tab before purchasing?

Expected behavior:

- Predict that the agent would discover item X is incompatible.
- Predict that the original purchase should be avoided.
- Identify which earlier decision changes.
- Continue with an alternative search path.

Example: tool-use agent

Factual trajectory:

The agent answers a question after one web search.

Counterfactual intervention:

What if the first search result were outdated?

Expected behavior:

- Predict that the confidence of the answer should decrease.
- Continue searching or verify with a more recent source.
- Avoid finalizing too early.

Benchmark design: Counterfactual Agent World Model Benchmark

A possible benchmark name:

CAWM-Bench: Counterfactual Agent World-Model Benchmark

CoTrajBench: Counterfactual Trajectory Reasoning Benchmark

Hier-CFBench: Hierarchical Counterfactual Planning Benchmark

Data format

Each example contains:

{
  "goal": "...",
  "constraints": [...],
  "factual_plan": [...],
  "factual_trajectory": [...],
  "intervention": {
    "type": "action | observation | constraint | subgoal | plan_step | revision",
    "target": "...",
    "replacement": "..."
  },
  "expected_counterfactual_effects": [...],
  "expected_final_outcome": "success | failure | changed_success",
  "minimal_revision": [...],
  "affected_subgoals": [...],
  "unaffected_subgoals": [...]
}

Intervention types

Action counterfactual
- What if the agent used a different tool/action?
Observation counterfactual
- What if the tool returned different information?
Constraint counterfactual
- What if a new constraint appeared?
Subgoal counterfactual
- What if a subgoal were skipped, reordered, or replaced?
Plan-level counterfactual
- What if the high-level strategy were different?
Revision counterfactual
- What if the agent repaired the wrong part of the plan?
Information-gathering counterfactual
- What if the agent stopped earlier or searched one more time?

Evaluation metrics

1. Outcome prediction accuracy

Can the model correctly predict whether the counterfactual trajectory succeeds or fails?

Acc(success/failure)

2. Affected-subgoal localization

Can it identify which subgoals are affected by the intervention?

F1(affected_subgoals)

3. Unaffected-subgoal preservation

Can it avoid unnecessarily changing valid parts of the plan?

Preservation score

4. Counterfactual consistency

Does the predicted alternative trajectory logically follow from the intervention?

consistency(intervention, predicted_trajectory)

5. Minimal repair score

If the counterfactual causes failure, can the model propose the smallest valid revision?

minimal_repair_score = validity - unnecessary_changes

6. Long-horizon dependency accuracy

Can the model track delayed effects several steps later?

accuracy_by_distance_from_intervention

7. Causal contrast score

Can the model explicitly contrast factual and counterfactual outcomes?

Δ = predicted_outcome(τ') - predicted_outcome(τ)

Training with hierarchical counterfactual reward

This addendum also strengthens the hierarchical reward idea.

Instead of only rewarding factual task completion:

R_final

we can add counterfactual rewards:

R_counterfactual =
  R_outcome_prediction
+ R_affected_subgoal
+ R_minimal_revision
+ R_counterfactual_consistency
+ R_unaffected_preservation

Then the total hierarchical reward becomes:

R_total =
  R_final
+ R_plan
+ R_subgoal
+ R_action
+ R_revision
+ R_consistency
+ R_counterfactual

This encourages the agent not only to execute the current plan, but also to understand why the plan works and what would break if the world changed.

Why this is useful

A counterfactual world-model benchmark would test abilities that normal success-rate benchmarks miss:

Normal agent benchmark	Counterfactual world-model benchmark
Did the agent complete the task?	Does the agent know why the task succeeded?
Did the agent recover after failure?	Could the agent predict the failure before executing?
Did the final answer satisfy constraints?	Does the agent know which constraint would fail under an alternative?
Did the agent use tools correctly?	Does the agent know when a different tool result would change the plan?
Did the plan work once?	Does the agent understand the space of nearby possible plans?

Strong paper framing

The paper can be framed as:

Current agent benchmarks evaluate realized behavior, but not counterfactual competence. We propose evaluating whether LLM agents possess a counterfactual world model: the ability to predict how alternative actions, observations, constraints, or plan steps would change future trajectory outcomes.

This is highly aligned with your original hierarchical reward idea because:

hierarchical plans define the structure of possible interventions;
counterfactual reasoning tests whether the model understands dependencies between levels;
counterfactual reward provides dense supervision for planning and revision;
minimal repair tests whether the model can revise plans without destroying valid subgoals.

Most promising concrete version

A first publishable version could focus on constrained planning:

Dataset

Use or extend:

TravelPlanner
Flex-TravelPlanner
WebArena-style tasks
PlanBench-style symbolic plans
Synthetic hierarchical planning tasks

Perturbation generation

For each factual successful trajectory, generate counterfactual interventions:

change one constraint
change one observation
replace one action
remove one subgoal
swap two plan steps
invalidate one tool result

Labels

Automatically or semi-automatically label:

success/failure under counterfactual
affected subgoals
required repair
minimal valid revised plan

Baselines

Compare:

Direct prompting
Chain-of-thought
ReAct
Reflexion
Plan-and-Act
Tree-of-thought / MCTS-style planning
Agent with explicit world model
Agent trained with hierarchical counterfactual reward

Main claim

Agents with stronger counterfactual trajectory reasoning should perform better at long-horizon planning, plan revision, and robustness under distribution shift.

Updated one-sentence pitch

This project proposes hierarchical reward learning for long-horizon agents, extended with counterfactual world-model evaluation: testing whether agents can predict how alternative actions, observations, constraints, or plan steps would change future trajectory outcomes and use that knowledge to revise plans minimally and correctly.

References for this addendum

Chen et al., CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models, AAAI 2026.
https://arxiv.org/abs/2502.11008
Ong et al., A Benchmark for Forward Counterfactual Generation, EMNLP 2025.
https://aclanthology.org/2025.emnlp-main.575/
Warrier et al., Benchmarking World-Model Learning / AutumnBench, 2025.
https://arxiv.org/abs/2510.19788
World Reasoning Arena: A Benchmark for Next-Generation World Models, 2026.
https://arxiv.org/abs/2603.25887
Anonymous / arXiv, Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond, 2026.
https://arxiv.org/abs/2604.22748
Parvaneh et al., Counterfactual Vision-and-Language Navigation, NeurIPS 2020.
https://proceedings.neurips.cc/paper/2020/hash/39016cfe079db1bfb359ca72fcba3fd8-Abstract.html
Qiao et al., Agent Planning with World Knowledge Model, NeurIPS 2024.
https://proceedings.neurips.cc/paper_files/paper/2024/hash/d032263772946dd5026e7f3cd22bce5b-Abstract-Conference.html
Abstract Counterfactuals for Language Model Agents, 2025.
https://arxiv.org/abs/2506.02946
CaRT: Teaching LLM Agents to Know When They Know Enough, 2025.
https://arxiv.org/abs/2510.08517
Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents, 2026.
https://arxiv.org/abs/2601.22311

Research Idea 2: Hierarchical Reward for Long-Horizon Planning and Agent RL

Working title

Core research question

Motivation

Relationship to existing work

Hierarchical RL

LLM agent planning

Planning benchmarks

Existing design patterns close to your idea

1. Plan-and-execute agents

2. Replan / revise loops

3. Verifier-guided planning

4. Process reward models

5. Hierarchical goal/subgoal rewards

Proposed method

Reward design

1. Final task reward

2. High-level plan reward

3. Subgoal reward

4. Step/action reward

5. Revision reward

6. Cross-level consistency reward

A possible total reward

Training setup

Stage 1: Supervised plan decomposition

Stage 2: Verifier / reward model training

Stage 3: RL fine-tuning

Datasets for plan revision or flexible planning

Useful existing benchmarks

TravelPlanner

Flex-TravelPlanner

DeepPlanning

WebArena / WebVoyager

PlanBench

Robotouille

REALM-Bench

HeroBench

Gap

Possible benchmark contribution

Possible experiments

Experiment 1: Does hierarchical reward improve final success?

Experiment 2: Does revision reward improve recovery from failure?

Experiment 3: Does cross-level reward prevent local reward hacking?

Experiment 4: Generalization to longer horizons

Novelty angle

Possible paper structure

1. Introduction

2. Related Work

3. Problem Formulation

4. Hierarchical Reward

5. Benchmark / Data Construction

6. Training

7. Experiments

8. Analysis

Risks and limitations

Concrete first project

Hierarchical reward for TravelPlanner-style revision

Key references to start from

One-sentence pitch

Addendum: Counterfactual World-Model Evaluation for Long-Horizon Agents

New idea

Why this sounds like a world model

Deeper related work search

1. Counterfactual reasoning benchmarks for LLMs

2. Forward counterfactual generation

3. Counterfactuals for language model agents

4. Counterfactual trajectory training in navigation

5. World-model benchmarks

6. Planning analysis for LLM agents

7. Counterfactual trajectory pairs for agent training

Novelty assessment

Proposed research direction

Main task

Example: travel planning

Example: web agent

Example: tool-use agent

Benchmark design: Counterfactual Agent World Model Benchmark

Data format

Intervention types

Evaluation metrics