# OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang*, Xuyang Chen*, Xiaolong Jin*, Mengdi Wang†, Ling Yang†

## https://github.com/Gen-Verse/OpenClaw-RL

Every agent interaction generates a next-state signal, namely the user reply, tool output, or terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Thanks to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards.

Figure 1 | OpenClaw-RL infrastructure overview. Interaction streams come from two agent types: Personal Agents (conversational, single-user), hosted on personal devices, and General Agents (terminal, GUI, SWE, and tool-call agents), hosted on cloud services. The collected samples flow into our RL server built on the asynchronous slime framework, which consists of four decoupled components: (1) the environment server, (2) PRM / Judge for reward computation, (3) Megatron for policy training, and (4) SGLang for policy serving. These components support graceful weight updates and enable training with any agentic framework. The environment for personal agents is simply the users’ personal devices, which connect to the RL server over HTTP with confidential API keys. The environments for general agents are hosted on cloud services to enable scalable parallelization.

∗Equal contribution; †Corresponding authors; Main contact: yangling0818@163.com


Contents

1 Introduction
2 Problem Setting
3 OpenClaw-RL Infrastructure: Unified System for Personal and General Agents
  3.1 Asynchronous Pipeline with Four Decoupled Components
  3.2 Session-Aware Environment Server for Personal Agents
  3.3 Scalability: From Single-User Personalization to Large-Scale Agent Deployment
  3.4 Support for Multiple Real World Scenarios
  3.5 Non-Blocking Record and Observability
4 Learning from Next-State Signals: Unified RL Across Interaction Types
  4.1 Binary RL for Personal Agent
  4.1.1 PRM Judge Construction via Majority Vote
  4.1.2 RL Training Objective
  4.2 Hindsight-Guided On-Policy Distillation (OPD) for Personal Agent
  4.2.1 Why Token-Level Supervision from Next-State Signals?
  4.2.2 Token-Level OPD
  4.3 Combine Binary and OPD Methods
  4.4 Step-wise Reward for General Agentic RL
  4.4.1 Why Process Rewards Are Vital for Agentic Tasks
  4.4.2 Integrate Outcome and Process Rewards
5 Experiments
  5.1 Personal Agent Setup
  5.1.1 Student Who Uses OpenClaw to Do Homework
  5.1.2 Teacher Who Uses OpenClaw to Grade Homework
  5.2 General Agent Setup
  5.2.1 Models
  5.2.2 Datasets
  5.2.3 Hyperparameters
  5.3 Personal Agent Track: Learning from Conversational Signals
  5.4 General Agents: Unified RL Across Terminal, GUI, SWE, and Tool-Call
6 Related Work
7 Conclusion
A Algorithm Pseudocode
B More Optimization Examples
  B.1 Student Setting
  B.2 Teacher Setting
C Prompt Templates
  C.1 Personal Agent: PRM Judge Prompt
  C.2 Personal Agent: OPD Hindsight Hint Prompt
  C.3 Personal Agent: Evaluative Prompt from Simulator
  C.4 General Agent: PRM Judge Prompt
D Hyperparameters


## 1. Introduction

Every deployed AI agent is already collecting the data it needs to improve, and discarding it. After each action 𝑎𝑡, the agent receives a next-state signal 𝑠𝑡+1: a user reply, a tool execution result, a GUI state transition, or a test verdict. Existing systems treat this purely as context for the next action (Fu et al., 2025; Mei et al., 2025; Sheng et al., 2025; Wang et al., 2025b; Zhu et al., 2025). We argue that next-state signals encode something more valuable: an implicit evaluation of 𝑎𝑡, including how well it performed and, often, how it should have been different. Critically, this signal arises for free across every interaction type, including personal conversations, terminal environments, GUI environments, SWE tasks, and tool-call environments, yet no existing agentic RL system recovers it as a live, online learning source. We identify two distinct and recoverable forms of waste.

Waste 1 — Evaluative signals. The next-state signal implicitly scores the preceding action: a user re-query signals dissatisfaction, a passing test signals success, and an error trace signals failure. This forms a natural process reward and requires no separate annotation pipeline, yet PRMs have been studied almost exclusively in mathematical reasoning with verifiable ground truth (Cui et al., 2025b; Lightman et al., 2023; Wang et al., 2024). In personal agents, it captures user satisfaction turn by turn. In general agents, it provides the dense per-step credit assignment that long-horizon tasks require (Wang et al., 2026). Existing systems either ignore this signal or exploit it only in offline, pre-collected form, relying on fixed datasets or terminal outcome rewards.

Waste 2 — Directive signals. Beyond scoring, next-state signals often carry directive information: a user who says “you should have checked the file first” specifies not only that the response was wrong, but also how it should change at the token level. Likewise, a detailed SWE error trace often implies a concrete correction direction. Current RLVR methods use scalar rewards and thus cannot convert such information into a directional policy gradient (Guo et al., 2025; Hu et al., 2025; Shao et al., 2024; Yu et al., 2025a), while distillation methods (Hübotter et al., 2026; Shenfeld et al., 2026) rely on pre-curated feedback-response pairs rather than live signals. Hindsight relabeling (Hübotter et al., 2026; Zhang et al., 2023) and context-enriched distillation (Yang et al., 2024b, 2025c) show that adding structured correction information to the context can substantially improve outputs, but these methods all operate on fixed datasets. In concurrent work, Buening et al. (2026) improve the online policy by directly prompting with next-state information, but the corrective hints remain implicit.

OpenClaw-RL. We present OpenClaw-RL, a unified framework that recovers both forms of next-state signal waste for personal agents and general-purpose agents across diverse settings, including personal conversations with OpenClaw (OpenClaw, 2026) and terminal, GUI, SWE, and tool-call environments. OpenClaw-RL is a fully decoupled asynchronous architecture built on slime (Zhu et al., 2025), where policy serving, rollout collection, PRM judging, and policy training run as four independent loops with no blocking dependencies. In the personal-agent setting, the model can be optimized automatically through normal usage. This extends existing RL infrastructure, which typically assumes batch data collection rather than continuous learning from live deployment. We provide two optimization options. First, binary RL uses a PRM to recover conversations as scalar process rewards. Second, our Hindsight-Guided On-Policy Distillation (OPD) extracts textual hints from the next state, constructs an enhanced teacher context, and distills token-level directional supervision back into the student, providing training signals unavailable from scalar rewards alone. In simulation experiments, we find that combining the two methods with a weighted loss yields significant gains. Our framework also extends to RL training for general agents, including terminal, GUI, SWE, and tool-call settings. We integrate PRM judging with verifiable outcomes to provide supervision that is both dense and reliable (Wang et al., 2026; Zou et al., 2025). We further enhance the scalability of this framework by allowing environments to be hosted at scale on cloud services.

Figure 2 | Optimize your OpenClaw simply by using it. We provide a simulation result here. The two simulated users are a teacher who uses OpenClaw to grade homework and wants comments to be specific and friendly, and a student who uses OpenClaw to do homework and does not want to be found using AI.

Contributions.

• Next-state signal as a live, online learning source. We identify that next-state signals, whether user replies, execution results, test verdicts, or GUI transitions, encode both evaluative and directive information about the preceding action. We recover these signals as a live, online training source across heterogeneous interaction types.

• OpenClaw-RL Infrastructure. The first system to unify multiple concurrent interaction streams, including personal conversations, terminal, GUI, SWE, and tool-call agentic settings. It is designed for zero interruption to serving, with session-aware multi-turn tracking, graceful weight updates, flexible PRM support, and large-scale environment parallelization.

• Two complementary next-state signal recovery methods. Binary RL via PRM converts evaluative next-state signals into dense scalar process rewards, while our Hindsight-Guided OPD converts directive signals into token-level advantage supervision by extracting textual hints from the next state and constructing an enhanced teacher context, where rich textual feedback provides directional guidance for improvement.

• Empirical validation across personal and general agents. We validate OpenClaw-RL in experiments on both personal agent personalization and agentic RL across terminal, GUI, SWE, and tool-call settings. We provide evidence that binary RL and Hindsight-Guided OPD are complementary, and that their combination yields significant gains for personal agents. We also validate the effectiveness of integrating process and outcome rewards in the general-agent RL setting.

## 2. Problem Setting

OpenClaw-RL operates on a policy 𝜋𝜃 that simultaneously receives multiple interaction streams, decouples them from the inference pipeline, and is therefore flexible enough for a wide range of agentic settings, including personal agent conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces. We formalize each interaction stream as an MDP (S, A, T, 𝑟):

• State 𝑠𝑡 ∈ S: the full conversational or environmental context up to turn 𝑡.
• Action 𝑎𝑡 ∈ A: the agent’s response, a sequence of tokens generated by 𝜋𝜃.
• Transition T (𝑠𝑡+1 | 𝑠𝑡, 𝑎𝑡): deterministic given the environment; 𝑠𝑡+1 is the user reply, execution result, or tool output that follows 𝑎𝑡.
• Reward 𝑟(𝑎𝑡, 𝑠𝑡+1): inferred from the next-state signal via a PRM judge.

In standard RLVR, the outcome 𝑜 serves as the reward for the entire trajectory. However, the process reward 𝑟(𝑎𝑡, 𝑠𝑡+1), which depends on the next state 𝑠𝑡+1, contains much richer signals. In particular, when the next state contains explicit directive information about how the action should have been different, on-policy distillation enables directional improvement (Agarwal et al., 2024; Hübotter et al., 2026) by converting such directional next-state signals into token-level teacher supervision.

## 3. OpenClaw-RL Infrastructure: Unified System for Personal and General Agents

We unify automatic optimization of personal OpenClaw agents and large-scale agentic RL for general agents, including terminal, GUI, SWE, and tool-call settings, within a single framework.

## 3.1. Asynchronous Pipeline with Four Decoupled Components

The core architectural principle of OpenClaw-RL is full decoupling: policy serving, environment hosting, PRM judging, and policy training run as four completely independent asynchronous loops with no blocking dependencies between them (Figure 1).

Policy Serving (SGLang) → Environment (HTTP / API) → Reward Judging (SGLang / API) → Policy Training (Megatron)

The model serves the next user request while the PRM judges the previous response and the trainer applies gradient updates; none waits for the others. This is what makes continuous training from live, heterogeneous interaction streams practical: no stream needs to be paused or batched to accommodate another component’s schedule.

For personal agents, the model is connected through a confidential API for private and secure deployment, requires no modification to the personal-agent framework, and is updated gracefully without interrupting inference. For large-scale training of general agents, this asynchronous design allows each component to proceed without being blocked, thereby mitigating the long-tail problem caused by long-horizon rollout durations.

## 3.2. Session-Aware Environment Server for Personal Agents

The environment for a personal agent is the user’s device, which connects to our RL server through a
confidential API. Each API request is classified into one of two types:

• Main-line turn: the agent’s primary response and tool execution results, which form trainable samples.

• Side turn: auxiliary queries, memory organization, and environment transitions, which are forwarded but do not produce training data.

This classification allows the RL framework to precisely identify which turns belong to which sessions, enabling targeted training. Currently, we train only on main-line turns. The message of each new main-line request contains the reaction to the previous turn, whether a user’s reply or an environment’s execution result. This becomes the next-state signal 𝑠𝑡+1 for the previous turn’s reward computation.


## 3.3. Scalability: From Single-User Personalization to Large-Scale Agent Deployment

OpenClaw-RL is designed to operate across the full spectrum from single-user personal agent to large-scale multi-environment general agent deployment. For personal agents, the environment is a single user’s device and the interaction stream is sparse, session-based, and highly personalized. Built on slime (Zhu et al., 2025), OpenClaw-RL inherits a scalable training infrastructure for general agents, and we further support cloud-hosted environments across diverse agent settings (Section 3.4). Hundreds of parallel environments hosted on cloud services produce a dense stream of structured execution signals, enabling scalable RL training.

## 3.4. Support for Multiple Real World Scenarios

OpenClaw-RL supports a broad set of general-agent scenarios that cover the most common real-world deployment settings in our open-source implementation (Table 1). Terminal agents are a core component of computer-use systems: they are efficient, cheap to scale, and naturally aligned with the text-based interface of LLMs (Anthropic, 2026; OpenAI, 2026; Shen et al., 2026). GUI agents cover capabilities that terminal agents cannot access directly, such as visual interfaces and pointer-based interactions, making them necessary for more general computer-use tasks (Qin et al., 2025; Wang et al., 2025a,c; Xue et al., 2026). SWE agents represent a particularly important class of coding agents, where the environment provides rich executable feedback through tests, diffs, and static analysis (Cao et al., 2026). Tool-call agents are also critical, since external tools improve both reasoning capability and factual accuracy (Feng et al., 2025a).

Table 1 | Supported agent settings and their environment characteristics.

| Setting | Environment | Next-state signal | Horizon |
|---|---|---|---|
| OpenClaw | Personal devices | User response / tool-call results | Long |
| Terminal | Shell execution sandbox | stdout/stderr, exit code | Long |
| GUI | Screen state + accessibility tree | Visual state diff, task progress | Long |
| SWE | Code repository + test suite | Test verdicts, diff, lint output | Long |
| Tool-call | API/function execution | Return values, error traces | Medium |

## 3.5. Non-Blocking Record and Observability

All interactions and reward evaluations are logged to JSONL in real time: full message history, prompt/response text, tool calls, next-state content, per-vote PRM scores, selected hints (OPD), and accept/reject decisions. Logging is non-blocking: writes are fire-and-forget on a background thread, adding no latency to the serving or PRM paths. Record files are purged at each weight update boundary, ensuring logs always correspond to a single policy version.
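The fire-and-forget logging path can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the class name, record fields, and `flush` helper are inventions for the sketch, not the actual OpenClaw-RL implementation.

```python
import json
import queue
import threading

class NonBlockingJsonlLogger:
    """Illustrative fire-and-forget JSONL logger: callers enqueue records
    and return immediately; a daemon thread drains the queue to disk."""

    def __init__(self, path):
        self.path = path
        self._queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, record: dict) -> None:
        # Non-blocking from the caller's perspective: just enqueue.
        self._queue.put(record)

    def flush(self) -> None:
        # Block until the background thread has written everything.
        self._queue.join()

    def _drain(self) -> None:
        with open(self.path, "a") as f:
            while True:
                record = self._queue.get()
                f.write(json.dumps(record) + "\n")
                f.flush()
                self._queue.task_done()
```

Purging at weight-update boundaries would then amount to truncating or rotating the file between policy versions.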

## 4. Learning from Next-State Signals: Unified RL Across Interaction Types

We convert next-state signals from heterogeneous interaction streams, including personal conversations,
terminal interactions, GUI interactions, SWE tasks, and tool-call traces, into policy gradients.

## 4.1. Binary RL for Personal Agent

Converting evaluative next-state signals into scalar process rewards.


Figure 3 | Method Overview. For personal agents, we support both binary-reward optimization and on-policy distillation training. In our experiments, we find that their combination yields significant performance gains. For general agentic RL, in addition to standard RLVR, we provide integrated step-wise rewards and a simple but effective standardization approach (Wang et al., 2026).

## 4.1.1. PRM Judge Construction via Majority Vote

Given response 𝑎𝑡 and next state 𝑠𝑡+1, a judge model evaluates the quality of 𝑎𝑡:

PRM(𝑎𝑡, 𝑠𝑡+1) → 𝑟 ∈ {+1, −1, 0}.

Specifically, the PRM judges each action based on the user’s next response or the tool-call results. Tool-call results usually lead to a clear conclusion. The user’s next response may contain signals of satisfaction or dissatisfaction. If there is no clear sign of the user’s reaction, the model also makes an estimate based on the scenario, although users are encouraged to provide more explicit feedback. For general agents, the judge reasons about whether the environment’s feedback indicates progress toward the task goal. We run 𝑚 independent queries and take the majority vote 𝑟final = MajorityVote(𝑟1, . . . , 𝑟𝑚).
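The vote aggregation itself is simple; a sketch follows, where `judge_fn` stands in for the actual PRM call and the tie-breaking rule (fall back to a neutral 0 when no verdict wins a strict plurality) is our assumption rather than a detail stated above.

```python
from collections import Counter

def majority_vote_reward(judge_fn, action, next_state, m=5):
    """Query the judge m times independently and majority-vote the
    {+1, -1, 0} verdicts. `judge_fn(action, next_state)` is a
    placeholder for one PRM judge call."""
    votes = [judge_fn(action, next_state) for _ in range(m)]
    tally = Counter(votes)
    reward, count = tally.most_common(1)[0]
    # Assumed tie-breaking: return neutral 0 if no strict plurality.
    if list(tally.values()).count(count) > 1:
        return 0
    return reward
```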

## 4.1.2. RL Training Objective

By directly using the advantage 𝐴𝑡 = 𝑟final, the training objective is a standard PPO-style clipped
surrogate with asymmetric bounds (Schulman et al., 2017):

𝜌𝑡 = 𝜋𝜃(𝑎𝑡 | 𝑠𝑡) / 𝜋old(𝑎𝑡 | 𝑠𝑡),  Lpg = −𝔼𝑡 [ min( 𝜌𝑡 𝐴𝑡, clip(𝜌𝑡, 1 − 𝜀, 1 + 𝜀high) · 𝐴𝑡 ) ],  L = Lpg + 𝛽KL · LKL,  (1)

where 𝜀 = 0.2, 𝜀high = 0.28, 𝛽KL = 0.02. Note that this is a real-time conversational setting, so there is
no group structure available for standardization as in GRPO (Shao et al., 2024).
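A minimal NumPy sketch of the clipped surrogate in equation (1), written over per-token log-probability arrays; the function name and array interface are illustrative, and the KL term is omitted for brevity.

```python
import numpy as np

def clipped_pg_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with asymmetric clip bounds, as in
    equation (1). Inputs are per-token arrays; returns the scalar loss
    (KL penalty not included)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

The asymmetric bound (𝜀high > 𝜀) loosens only the upper clip, letting positive-advantage tokens be upweighted further before clipping takes effect.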

## 4.2. Hindsight-Guided On-Policy Distillation (OPD) for Personal Agent

Converting directional next-state signals into token-level teacher supervision.

## 4.2.1. Why Token-Level Supervision from Next-State Signals?

Binary RL reduces the entire information content of 𝑠𝑡+1 to a single scalar 𝑟 ∈ {+1, −1, 0}. Yet a user who writes “you should have checked the file before editing it” communicates far more: not just that the response was wrong, but which tokens should have been different and how. This directive information is lost entirely by scalar rewards.


OPD recovers this information by converting the next-state signal into a token-level training signal. The key insight is that if we augment the original prompt with a textual hint extracted from 𝑠𝑡+1, the same model produces a different token distribution, one that “knows” what the response should have been. The per-token gap between this hint-enhanced distribution and the student distribution provides a directional advantage: positive at tokens the model should upweight, negative at tokens it should downweight. This is fundamentally different from RLHF (Christiano et al., 2017; Ziegler et al., 2019) (which uses scalar preference signals), DPO (Rafailov et al., 2023) (which requires paired preferences), and standard distillation (which requires a separate, stronger teacher model).

## 4.2.2. Token-Level OPD

Step 1. Hindsight hint extraction.

Judge(𝑎𝑡, 𝑠𝑡+1) → score ∈ {+1, −1}, hint ∈ T ∗ .

If score = +1, the judge produces a concise hint in [HINT_START]...[HINT_END]. We run 𝑚
parallel judge calls. A critical design choice: we do not use 𝑠𝑡+1 directly as the hint. Raw next-state
signals are often noisy, verbose, or contain irrelevant information (e.g., a user reply may include both
a correction and an unrelated new question). The judge model distills 𝑠𝑡+1 into a concise, actionable
instruction that isolates the directive content, typically 1–3 sentences focusing on what the response
should have done differently.

Step 2. Hint selection and quality filtering. Among positive votes with hints >10 characters, select the longest (most informative). If no valid hint exists, drop the sample entirely; this is deliberate. OPD trades sample quantity for signal quality: only turns where the next-state signal carries a clear, extractable correction direction enter training. This strict filtering is complementary to Binary RL, which accepts all scored turns: Binary RL provides broad coverage with coarse signal, while OPD provides targeted, high-resolution supervision on fewer samples.
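Step 2 reduces to a short filter; a sketch follows, where `judge_outputs` is an assumed list of (score, hint) pairs from the 𝑚 parallel judge calls.

```python
def select_hint(judge_outputs, min_len=10):
    """Keep hints from positive votes longer than min_len characters and
    select the longest; return None to drop the sample when no valid
    hint exists. `judge_outputs` is an assumed list of (score, hint)."""
    hints = [hint for score, hint in judge_outputs
             if score == +1 and hint and len(hint) > min_len]
    return max(hints, key=len) if hints else None
```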

Step 3. Enhanced teacher construction. The hint is appended to the last user message as [user’s hint / instruction]\n{hint}, creating an enhanced prompt 𝑠enhanced = 𝑠𝑡 ⊕ hint that the model “would have seen” if the user had provided the correction upfront.

Step 4. Token-level advantage. The policy model is queried under 𝑠enhanced with the original response 𝑎𝑡 as forced input, computing log-probabilities for each response token. Then we have the token-level advantage in on-policy distillation:

𝐴𝑡 = log 𝜋teacher(𝑎𝑡 | 𝑠enhanced) − log 𝜋𝜃(𝑎𝑡 | 𝑠𝑡).

𝐴𝑡 > 0: the teacher (knowing the hint) assigns higher probability to this token—the student should increase it. 𝐴𝑡 < 0: the teacher considers this token less appropriate given the hint—the student should decrease it. Unlike scalar advantages that push all tokens in the same direction, this provides per-token directional guidance: within a single response, some tokens may be reinforced while others are suppressed. Training follows the same clipped surrogate as equation (1), but now the advantage carries far richer information per sample.
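Steps 3 and 4 can be sketched as follows, assuming chat-style message dicts and per-token log-probability arrays already scored under the two contexts; all names here are illustrative, not the actual implementation.

```python
import numpy as np

def build_enhanced_prompt(messages, hint):
    """Step 3: append the extracted hint to the last user message,
    forming the enhanced teacher context s_enhanced = s_t ⊕ hint."""
    enhanced = [dict(m) for m in messages]
    for m in reversed(enhanced):
        if m["role"] == "user":
            m["content"] += f"\n[user's hint / instruction]\n{hint}"
            break
    return enhanced

def opd_token_advantages(teacher_logprobs, student_logprobs):
    """Step 4: per-token directional advantage, the log-probability gap
    between the hint-enhanced teacher context and the student context,
    both scored on the same forced response tokens."""
    gap = np.asarray(teacher_logprobs) - np.asarray(student_logprobs)
    return gap  # >0: upweight the token; <0: downweight it
```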

## 4.3. Combine Binary and OPD Methods

Let the two methods build on each other’s strengths and offset each other’s weaknesses.


Table 2 | Comparison of different learning methods.

| Dimension | Binary RL | OPD | Combined |
|---|---|---|---|
| Signal type | Evaluative (good/bad) | Directional | Evaluative + directional |
| Advantage | Sequence-level scalar | Token-level directional | Mixed sequence- and token-level |
| Density | All scored turns | Hint-accepted turns only | All scored turns |
| Feedback type | User / environment | Explicit corrections | Both implicit and explicit feedback |
| Signal richness | 1 scalar per sample | 1 value per token | 1 value per token |

The binary and OPD methods are complementary, not competing. Binary RL accepts every scored turn, requires no hint extraction, and works with any next-state signal, including terse, implicit reactions (a user simply re-asking a question) or structured environment outputs (exit codes, test verdicts). OPD should be enabled additionally when the interaction stream is likely to carry rich directive content: users who give explicit corrections (“don’t use that library”, “check the file first”), or environments that produce detailed error traces (SWE diffs, compiler diagnostics). In practice, we recommend running both simultaneously: Binary RL provides broad gradient coverage across all turns, while OPD provides high-resolution, per-token corrections on the subset of turns where directive signals are available.

Therefore, we propose combining these two complementary methods with a weighted loss function. Note that they share the same PPO loss, and only the advantage computation differs. Thus, we can directly use the following advantage:

𝐴𝑡 = 𝑤binary 𝑟final + 𝑤opd (log 𝜋teacher(𝑎𝑡 | 𝑠enhanced) − log 𝜋𝜃(𝑎𝑡 | 𝑠𝑡)) ,

where 𝑤binary = 𝑤opd = 1 by default. In our experiments, we show that this approach achieves
significant performance gains.
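Since both methods share the same clipped surrogate, the combination is just an advantage sum; a sketch with the default weights, broadcasting the scalar 𝑟final over the response tokens (the interface is our assumption).

```python
import numpy as np

def combined_advantage(r_final, teacher_logprobs, student_logprobs,
                       w_binary=1.0, w_opd=1.0):
    """Mix the sequence-level binary reward with the token-level OPD
    log-probability gap; the scalar r_final broadcasts over tokens."""
    t = np.asarray(teacher_logprobs, dtype=float)
    s = np.asarray(student_logprobs, dtype=float)
    return w_binary * r_final + w_opd * (t - s)
```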

## 4.4. Step-wise Reward for General Agentic RL

How to combine the outcome and process rewards?

## 4.4.1. Why Process Rewards Are Vital for Agentic Tasks

In long-horizon agentic tasks, outcome-only rewards provide gradient signal only at the terminal step, leaving the vast majority of turns unsupervised. A PRM assigns a reward to each turn based on the next-state signal, providing dense credit assignment throughout the trajectory. Recent work has provided strong empirical evidence for this. RLAnything (Wang et al., 2026) demonstrates that integrating step-wise PRM signals with outcome rewards consistently outperforms outcome-only training across GUI agents, text-game agents, and coding tasks. We build directly on this insight in OpenClaw-RL: our PRM judges each turn using the live next-state signal as evidence, and we demonstrate empirically (§5.4) that this dense signal is helpful for long-horizon RL settings.

## 4.4.2. Integrate Outcome and Process Rewards

Verifiable outcomes are standard supervision signals in RLVR settings. Following RLAnything (Wang et al., 2026), we integrate outcome and process rewards by simply adding them together, using 𝑜 + (1/𝑚) ∑𝑖 𝑟𝑖 as the reward for step 𝑡, where the 𝑚 votes 𝑟𝑖 are independently assigned by PRM(𝑎𝑡, 𝑠𝑡+1). Unlike GRPO, the presence of step-wise rewards makes it less straightforward to compute advantages. Feng et al. (2025b) group similar states and perform standardization within each group. However, in real-world settings such as terminal agents, states are not easily clustered. Therefore, we directly group actions with the same step index, which we find effective in our empirical studies.
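The step-index grouping can be sketched as follows: per-step rewards from the trajectories in a rollout group are standardized across trajectories at each step index. The ragged-length handling and the small epsilon are our assumptions for the sketch.

```python
import numpy as np

def standardize_by_step_index(step_rewards):
    """step_rewards: list of trajectories, each a list of per-step
    rewards (outcome + averaged PRM votes). Standardize across
    trajectories within each step index; trajectories shorter than a
    given index simply do not contribute to that group."""
    max_len = max(len(traj) for traj in step_rewards)
    advantages = [list(traj) for traj in step_rewards]
    for t in range(max_len):
        group = [traj[t] for traj in step_rewards if len(traj) > t]
        mu, sigma = np.mean(group), np.std(group)
        for traj in advantages:
            if len(traj) > t:
                traj[t] = (traj[t] - mu) / (sigma + 1e-8)
    return advantages
```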

## 5. Experiments

We evaluate OpenClaw-RL along two complementary tracks that share the same infrastructure and training loop. §5.3 evaluates the personal agent track, demonstrating that conversational next-state signals enable continuous personalization to individual user preferences. §5.4 evaluates the general agent track across terminal, GUI, SWE, and tool-call settings, demonstrating that the same infrastructure supports scalable RL across diverse agentic scenarios and that step-wise rewards are vital for long-horizon tasks.

## 5.1. Personal Agent Setup

Simulation Results Demonstrate Effectiveness of Our Optimization.

## 5.1.1. Student Who Uses OpenClaw to Do Homework

— does not want to be found using AI

In this setting, we use an LLM to simulate a student using OpenClaw to complete homework on a personal computer, while trying to avoid being perceived as relying on AI. Whether a response appears AI-generated depends entirely on the student’s personal preferences and writing style. The student continuously interacts with OpenClaw and asks it for help in completing the homework. The homework tasks are drawn from GSM8K (Cobbe et al., 2021). The OpenClaw policy model used in this setting is Qwen3-4B (Yang et al., 2025a). We set the learning rate to 1 × 10⁻⁵, the KL coefficient to 0, and trigger training after every 16 collected training samples.

## 5.1.2. Teacher Who Uses OpenClaw to Grade Homework

— wants the comments to be specific and friendly

After the student finishes the homework in the files, the teacher also uses OpenClaw to grade the AI-written assignments. The teacher wants the comments for the student to be specific and friendly. The OpenClaw policy model is again Qwen3-4B and uses the same optimization settings.

## 5.2. General Agent Setup

## 5.2.1. Models

We use Qwen3-8B (Team, 2025), Qwen3VL-8B-Thinking (Bai et al., 2025), Qwen3-32B (Team, 2025), and Qwen3-4B-SFT in the terminal, GUI, SWE, and tool-call settings, respectively. Here, Qwen3-4B-SFT refers to the model provided by Zhu et al. (2025), which is fine-tuned on the dataset of Feng et al. (2025a). The PRMs for the GUI and tool-call agents are Qwen3VL-8B-Thinking and Qwen3-4B, respectively.

## 5.2.2. Datasets

We use SETA RL data (Shen et al., 2026), OSWorld-Verified (Xie et al., 2024), SWE-Bench-Verified (Jimenez et al., 2023), and DAPO RL data (Yu et al., 2025a) to train the terminal, GUI, SWE, and tool-call agents, respectively. The GUI agent is evaluated on the training set (excluding the chrome and multi-app tasks). The tool-call agent is evaluated on AIME 2024 (Mathematical Association of America, American Mathematics Competitions, 2024). For the terminal and SWE agents, we report the average rollout-task accuracy over a window of RL steps.

## 5.2.3. Hyperparameters

We set the learning rate to 10⁻⁶, the KL coefficient to 0.01, the lower clip ratio to 0.2, and the upper clip ratio to 0.28. We sample 8 tasks per step for the GUI and SWE settings, 16 for the terminal setting, and 32 for the tool-call setting. For each task, we independently draw 8 samples. The maximum numbers of interaction steps for GUI, SWE, and terminal are 30, 20, and 10, respectively. See more details in Appendix D.

## 5.3. Personal Agent Track: Learning from Conversational Signals

To compare different methods, we use the same LLM as in the user simulation (for both the student and teacher settings) to assign quantitative personalization scores to OpenClaw’s first generated solution for each problem (see Appendix C.3). We report the average score over the first 36 problems in GSM8K. As shown in Table 3, the combined method achieves the strongest optimization performance. On-policy distillation shows delayed gains due to sparse training samples, while binary RL alone provides only marginal improvement.

Table 3 | Performance of different methods in optimizing OpenClaw. The base score is 0.17.

| Method | Updated 8 steps | Updated 16 steps |
|---|---|---|
| Binary RL | 0.25 | 0.23 |
| OPD | 0.25 | 0.72 |
| Combined | 0.76 | 0.81 |

We also include concrete examples to illustrate how effective the optimization is and how quickly it takes effect. After 36 problem-solving interactions in the student setting, the agent learns to avoid obviously AI-like phrasing, such as using words like “bold” or producing overly structured, step-by-step responses (Figure 2). Instead, it shifts toward a more natural and casual style. In the teacher setting, after 24 grading interactions, the agent learns to write feedback that is friendlier and more detailed. Additional examples are provided in Appendix B.

Figure 4 | Our framework supports scalable RL for general agents across terminal, GUI, SWE, and tool-call settings.

## 5.4. General Agents: Unified RL Across Terminal, GUI, SWE, and Tool-Call

We conduct experiments across widely used, real-world agent settings, including terminal, GUI, SWE, and tool-call scenarios (Figure 4). Large-scale environment parallelization further improves the scalability of our RL training. Specifically, we use 128 parallel environments for terminal agents, 64 for GUI and SWE agents, and 32 for tool-call agents during our RL training.

We conduct RL training with integrated rewards in the tool-call (250 steps) and GUI (120 steps) settings, and find that combining outcome and process rewards further improves performance (Table 4). One trade-off is that hosting a PRM requires additional resources compared with outcome-only optimization.

Table 4 | Performance of integrating outcome and process rewards across different settings.

| Setting | Integrated | Outcome only |
|---|---|---|
| Tool-call | 0.30 | 0.17 |
| GUI | 0.33 | 0.31 |
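As a rough sketch of how a trajectory-level outcome reward and step-level PRM scores might be folded into one training signal (the mean-pooling and the mixing weight `w_process` are illustrative assumptions, not the exact rule used in these experiments):

```python
def integrated_return(outcome_reward, process_scores, w_process=0.5):
    """Combine a trajectory-level outcome reward with step-level PRM scores.

    Mean-pooling the step scores and the weight w_process are illustrative
    assumptions; the point is that process rewards densify the signal.
    """
    if not process_scores:
        return outcome_reward          # no PRM votes: outcome-only fallback
    mean_step = sum(process_scores) / len(process_scores)
    return outcome_reward + w_process * mean_step
```

For example, a successful trajectory (`outcome_reward = 1.0`) whose two steps were judged 0.0 and 1.0 would receive an integrated return of 1.25 under this weighting.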

## 6. Related Work

RL for LLMs. RLHF (Christiano et al., 2017; Ziegler et al., 2019) established the PPO-based alignment pipeline. DPO (Rafailov et al., 2023) bypasses explicit reward modeling via closed-form preference optimization; GRPO (Shao et al., 2024) eliminates the critic network through group-relative advantage estimation, and was further scaled by DeepSeek-R1 (Guo et al., 2025) and DAPO (Yu et al., 2025a). ReasonFlux (Yang et al., 2025b) takes an orthogonal approach, applying hierarchical RL to optimize sequences of thought templates rather than raw token-level CoTs, achieving significant gains through structured reasoning. These systems operate in batch-offline mode, where data collection and training happen in separate phases over fixed datasets. OpenClaw-RL instead trains continuously from live interaction signals, without any data pre-collection phase.
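For reference, GRPO's group-relative advantage estimation normalizes each rollout's reward against the other rollouts drawn for the same task, replacing a learned critic. A minimal sketch:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each rollout's reward against the
    group of rollouts for the same task (no learned critic)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n               # all rollouts tied: no learning signal
    return [(r - mean) / std for r in rewards]
```

A group where half the rollouts succeed yields advantages of ±1; a group where all rollouts tie contributes no gradient.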

Agentic RL and tool-use. Foundational agent paradigms such as ReAct (Yao et al., 2023), Toolformer (Schick et al., 2023), and FireAct (Chen et al., 2023) enable multi-step interaction with external tools, but rely on demonstrations rather than online RL. Recent work applies RL to specific agent settings: SWE-agent (Yang et al., 2024a) and ReTool (Feng et al., 2025a) for code and tool use, DigiRL (Bai et al., 2024) and WebRL (Qi et al., 2024) for GUI agents, and ArCHer (Zhou et al., 2024) and LOOP (Chen et al., 2025) for multi-turn credit assignment; however, each targets a single environment with a dedicated training pipeline. DemyAgent (Yu et al., 2025b), RLAnything (Wang et al., 2026), and CURE (Wang et al., 2025d) advance agentic RL further by investigating data quality and closed-loop reward-model co-optimization.

Process reward models. PRMs demonstrate that step-level supervision outperforms outcome-only supervision for math reasoning. Math-Shepherd (Wang et al., 2024) automates step-wise supervision via Monte Carlo estimation without human annotations; GenPRM (Zhao et al., 2025) scales PRMs with generative chain-of-thought verification. ReasonFlux-PRM (Zou et al., 2025) extends PRMs to trajectory-aware evaluation for long-CoT reasoning, providing both offline data selection and online dense process-level rewards. PRIME (Cui et al., 2025a) learns implicit process rewards from outcome labels. RLAnything (Wang et al., 2026) provides large-scale evidence that step-wise PRM signals are essential for long-horizon agentic tasks, with jointly optimized reward-model signals surpassing human-labeled supervision. We extend PRM-style judging to the online setting, across heterogeneous long-horizon agentic settings, where process rewards are inferred from live next-state signals rather than pre-collected ground truth.

On-policy distillation and hindsight methods. Context-enrichment approaches demonstrate that augmenting prompts with structured information yields fundamentally better token distributions: Buffer of Thoughts (Yang et al., 2024b) retrieves high-level thought templates, while SuperCorrect (Yang et al., 2025c) extracts hierarchical templates from a teacher for cross-model DPO-based error correction. Hindsight-based methods relabel past experience with retrospective information: HER relabels goals in classical RL; STaR (Zelikman et al., 2022) rationalizes failures with answer hints; HIR (Zhang et al., 2023) converts feedback into relabeled instructions; Self-Rewarding (Yuan et al., 2024) uses the LLM as its own judge for iterative improvement. On-policy distillation methods (Agarwal et al., 2024; Hübotter et al., 2026; Shenfeld et al., 2026) train LLMs on their own generations conditioned on execution feedback, achieving acceleration over GRPO but requiring pre-collected feedback-response pairs. OpenClaw-RL’s Hindsight-Guided OPD unifies these threads in the online setting: textual hints are extracted from live next-state signals (hindsight relabeling), the model serves as its own teacher under hint-enhanced context (self-distillation via context enrichment), and the resulting token-level log-probability gap provides directional advantage supervision, requiring no pre-collected data, no external teacher, and no paired preferences.

RL training infrastructure. OpenRLHF (Hu et al., 2024), AReal (Fu et al., 2025), veRL (Sheng et al., 2025), and slime (Zhu et al., 2025) decouple rollout and training engines for scalable RL training. Built on slime, OpenClaw-RL enables four fully decoupled asynchronous loops (serving, rollout, PRM judging, and training), allowing continuous training from live multi-stream interactions with zero interruption to serving. This capability is absent from prior RL infrastructure, which assumes batch data collection rather than live deployment.
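A toy asyncio pipeline illustrates the decoupling: serving, PRM judging, and training run as concurrent loops that communicate only through queues, so a slow judge or trainer never blocks serving. Everything here (the mock PRM verdict, the timeout-based shutdown, the queue and function names) is illustrative, not the slime-based implementation:

```python
import asyncio

async def serve(req_q, rollout_q):
    # Serving loop: answer live requests, emit finished turns for judging.
    while True:
        turn = await req_q.get()
        rollout_q.put_nowait({"turn": turn, "action": f"reply-to-{turn}"})
        if req_q.empty():
            return

async def judge(rollout_q, train_q):
    # PRM loop: score each (action, next-state) pair asynchronously.
    while True:
        try:
            sample = await asyncio.wait_for(rollout_q.get(), timeout=0.1)
        except asyncio.TimeoutError:
            return                      # idle: shut down this toy loop
        sample["reward"] = 1.0          # placeholder PRM verdict
        train_q.put_nowait(sample)

async def train(train_q, updates):
    # Trainer loop: consume judged samples; updates never block serving.
    while True:
        try:
            sample = await asyncio.wait_for(train_q.get(), timeout=0.2)
        except asyncio.TimeoutError:
            return
        updates.append(sample["turn"])

async def main(turns):
    req_q, rollout_q, train_q = (asyncio.Queue() for _ in range(3))
    for t in turns:
        req_q.put_nowait(t)
    updates = []
    await asyncio.gather(serve(req_q, rollout_q),
                         judge(rollout_q, train_q),
                         train(train_q, updates))
    return updates
```

Because the loops share only queues, swapping the placeholder judge for a real PRM server, or the trainer for a Megatron step, changes one coroutine without touching the others.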

## 7. Conclusion

Every agent interaction produces a next-state signal that encodes how the agent performed and, often, how it should have acted differently. OpenClaw-RL is built on a single insight: these signals are stream-agnostic, and one policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces all flow into the same training loop. Binary RL converts evaluative signals into scalar process rewards, while OPD converts directive signals into token-level advantage supervision. Combining the two yields significant optimization gains. The result is a system where a model simultaneously personalizes to individual users and improves at long-horizon agentic tasks, trained entirely from the interactions it is already having.

## References

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The twelfth international conference on learning representations, 2024.

Anthropic. Claude code overview. https://code.claude.com/docs/en/overview, 2026.
Official documentation, accessed 2026-03-10.

H. Bai, Y. Zhou, J. Pan, M. Cemri, A. Suhr, S. Levine, and A. Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495, 2024.

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl
technical report. arXiv preprint arXiv:2511.21631, 2025.

T. K. Buening, J. Hübotter, B. Pásztor, I. Shenfeld, G. Ramponi, and A. Krause. Aligning language
models from user interactions. https://thomasklbg.github.io/files/Aligning_Lang
uage_Models_from_User_Interactions.pdf, 2026. Preprint.

R. Cao, M. Chen, J. Chen, Z. Cui, Y. Feng, B. Hui, Y. Jing, K. Li, M. Li, J. Lin, et al. Qwen3-coder-next
technical report. arXiv preprint arXiv:2603.00729, 2026.

B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao. FireAct: Toward language agent
fine-tuning. arXiv preprint arXiv:2310.05915, 2023.

K. Chen, M. Cusumano-Towner, B. Huval, A. Petrenko, J. Hamburger, V. Koltun, and P. Krähenbühl. Reinforcement learning for long-horizon interactive LLM agents. arXiv preprint arXiv:2502.01600, 2025.

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning
from human preferences. Advances in neural information processing systems, 30, 2017.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y. Fan, T. Yu, Q. Xu, W. Chen, et al. Process reward
models via implicit process rewards. arXiv preprint arXiv:2502.01456, 2025a.


G. Cui, L. Yuan, Z. Wang, H. Wang, Y. Zhang, J. Chen, W. Li, B. He, Y. Fan, T. Yu, et al. Process
reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025b.

J. Feng, S. Huang, X. Qu, G. Zhang, Y. Qin, B. Zhong, C. Jiang, J. Chi, and W. Zhong. Retool:
Reinforcement learning for strategic tool use in LLMs. arXiv preprint arXiv:2504.11536, 2025a.

L. Feng, Z. Xue, T. Liu, and B. An. Group-in-group policy optimization for llm agent training. arXiv
preprint arXiv:2505.10978, 2025b.

W. Fu, J. Gao, X. Shen, C. Zhu, Z. Mei, C. He, S. Xu, G. Wei, J. Mei, J. Wang, T. Yang, B. Yuan, and
Y. Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning,
2025. URL https://arxiv.org/abs/2505.24298.

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-R1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

J. Hu, X. Wu, Z. Zhu, W. Wang, D. Zhang, Y. Cao, et al. OpenRLHF: An easy-to-use, scalable and
high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 6, 2024.

J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y. Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.

J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can
language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.

Mathematical Association of America, American Mathematics Competitions. American Invitational Mathematics Examination (AIME) 2024: AIME I and AIME II. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2024. Competition problems used as an evaluation dataset; original problems by MAA AMC.

Z. Mei, W. Fu, K. Li, G. Wang, H. Zhang, and Y. Wu. Real: Efficient rlhf training of large language models with parameter reallocation. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. mlsys.org, 2025.

OpenAI. Codex CLI. https://developers.openai.com/codex/cli/, 2026. Official documentation, accessed 2026-03-10.

OpenClaw. Openclaw. https://github.com/openclaw/openclaw, 2026. Open-source personal
AI assistant, version 2026.3.8, accessed 2026-03-09.

Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, W. Zhao, Y. Yang, X. Yang, J. Sun, S. Yao, et al. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024.


Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al. Ui-tars:
Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Q. Shen, J. Rainton, A. Aliev, A. Awelkair, B. Ma, Z. J. Huang, Y. Mao, W. Fan, P. Torr, B. Ghanem, C. Hu, U. Thakker, and G. Li. SETA: Scaling environments for terminal agents, Jan. 2026. URL https://github.com/camel-ai/seta. Blog: https://eigent-ai.notion.site/SETA-Scaling-Environments-for-Terminal-Agents-2d2511c70ba280a9b7c0fe3e7f1b6ab8.

I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal. Self-distillation enables continual learning. arXiv
preprint arXiv:2601.19897, 2026.

G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.

Q. Team. Qwen3 technical report. arXiv preprint, 2025.

H. Wang, H. Zou, H. Song, J. Feng, J. Fang, J. Lu, L. Liu, Q. Luo, S. Liang, S. Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025a.

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024.

W. Wang, S. Xiong, G. Chen, W. Gao, S. Guo, Y. He, J. Huang, J. Liu, Z. Li, X. Li, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122, 2025b.

X. Wang, B. Wang, D. Lu, J. Yang, T. Xie, J. Wang, J. Deng, X. Guo, Y. Xu, C. H. Wu, et al. OpenCUA:
Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025c.

Y. Wang, L. Yang, Y. Tian, K. Shen, and M. Wang. Co-evolving LLM coder and unit tester via
reinforcement learning. arXiv preprint arXiv:2506.03136, 2025d. NeurIPS 2025 Spotlight.

Y. Wang, T. Xie, K. Shen, M. Wang, and L. Yang. RLAnything: Forge environment, policy, and reward
model in completely dynamic rl system. arXiv preprint arXiv:2602.02488, 2026.

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.


T. Xue, C. Peng, M. Huang, L. Guo, T. Han, H. Wang, J. Wang, X. Zhang, X. Yang, D. Zhao, et al. EvoCUA: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026.

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3
technical report. arXiv preprint arXiv:2505.09388, 2025a.

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024a.

L. Yang, Z. Yu, T. Zhang, S. Cao, M. Xu, W. Zhang, J. E. Gonzalez, and B. Cui. Buffer of thoughts: Thought-augmented reasoning with large language models. Advances in Neural Information Processing Systems, 37:113519–113544, 2024b.

L. Yang, Z. Yu, B. Cui, and M. Wang. ReasonFlux: Hierarchical LLM reasoning via scaling thought
templates. arXiv preprint arXiv:2502.06772, 2025b.

L. Yang, Z. Yu, T. Zhang, M. Xu, J. E. Gonzalez, B. Cui, and S. Yan. Supercorrect: Advancing small llm
reasoning with thought template distillation and self-correction. In ICLR, 2025c.

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning
and acting in language models. In The Eleventh International Conference on Learning Representations,
2023. URL https://openreview.net/forum?id=WE_vluYUL-X.

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. DAPO: An
open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025a.

Z. Yu, L. Yang, J. Zou, S. Yan, and M. Wang. Demystifying reinforcement learning in agentic reasoning.
arXiv preprint arXiv:2510.11701, 2025b.

W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, and J. E. Weston. Self-rewarding language
models. In Forty-first International Conference on Machine Learning, 2024.

E. Zelikman, Y. Wu, J. Mu, and N. Goodman. STaR: Bootstrapping reasoning with reasoning. In A. H.
Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems,
2022. URL https://openreview.net/forum?id=_3ELRdg2sgI.

T. Zhang, F. Liu, J. Wong, P. Abbeel, and J. E. Gonzalez. The wisdom of hindsight makes language models better instruction followers. In International Conference on Machine Learning, pages 41414– 41428. PMLR, 2023.

J. Zhao, R. Liu, K. Zhang, Z. Zhou, J. Gao, D. Li, J. Lyu, Z. Qian, B. Qi, X. Li, et al. GenPRM: Scaling test-time compute of process reward models via generative reasoning. arXiv preprint arXiv:2504.00891, 2025.

Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar. ArCHer: Training language model agents via
hierarchical multi-turn RL. In ICML, 2024.

Z. Zhu, C. Xie, X. Lv, and slime Contributors. slime: An LLM post-training framework for RL scaling. https://github.com/THUDM/slime, 2025. GitHub repository. Corresponding author: Xin Lv.

D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving.
Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

J. Zou, L. Yang, J. Gu, J. Qiu, K. Shen, J. He, and M. Wang. ReasonFlux-PRM: Trajectory-aware PRMs
for long chain-of-thought reasoning in LLMs. arXiv preprint arXiv:2506.18896, 2025.


## A. Algorithm Pseudocode

Algorithm 1 Binary RL Pipeline (per main-line turn, both tracks)

Require: session/trajectory T, turn t, messages M_t
1: a_t, logp_old ← SGLang(M_t)  // serve and collect log-probs
2: Buffer {prompt_ids, response_ids, logp_old} for (T, t)
3: // On the next turn: extract s_{t+1} and fire the PRM
4: s_{t+1} ← first message of M_{t+1}  // user reply or env feedback
5: {r_i}_{i=1..m} ← PRM(a_t, s_{t+1})  // m parallel votes, async
6: r ← MajorityVote({r_i})
7: A_t ← r, broadcast over all response tokens
8: Apply the at-least-one guarantee if T has zero effective samples
9: Submit Sample to the trainer queue
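Lines 5–7 of Algorithm 1 reduce m parallel PRM votes to a single broadcast reward. A minimal Python sketch, where `prm` is an assumed callable returning +1/−1 and the vote count follows Table 5:

```python
from collections import Counter

def majority_vote(votes):
    """Reduce m parallel PRM votes (each +1 or -1) to one reward (line 6)."""
    return Counter(votes).most_common(1)[0][0]

def binary_rl_reward(action, next_state, prm, m=3):
    """Lines 5-7 of Algorithm 1: fire m PRM votes on (a_t, s_{t+1}) and take
    the majority as the scalar advantage, broadcast over all response tokens.
    `prm` is an assumed callable returning +1/-1; m = 3 matches the GUI
    setting in Table 5 (the other settings use a single vote)."""
    votes = [prm(action, next_state) for _ in range(m)]
    return majority_vote(votes)
```

With a single vote (m = 1) this degenerates to using the PRM verdict directly, which is the configuration for the non-GUI settings.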

Algorithm 2 OPD Pipeline (personal agent track)

Require: turn data (a_t, M_t, logp_old), next state s_{t+1}
1: {(score_i, hint_i)}_{i=1..m} ← Judge(a_t, s_{t+1})
2: valid ← {hint_i : score_i = +1 ∧ |hint_i| > 10}
3: if valid = ∅ then
4:     Drop sample
5:     return
6: end if
7: hint ← argmax_{h ∈ valid} |h|
8: s_enhanced ← M_t ⊕ "[user's hint]\n{hint}"
9: log π_teacher ← Teacher(a_t | s_enhanced)
10: A_t[k] ← log π_teacher[k] − logp_old[k]
11: Submit Sample(teacher_log_probs = A_t) to the trainer queue
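The hint filtering and token-level advantage computation in Algorithm 2 can be sketched as follows; `teacher_logps_fn(hint)` stands in for scoring a_t under the hint-enhanced context and is an assumed interface, not the real API:

```python
def opd_advantages(judge_outputs, teacher_logps_fn, logp_old, min_hint_len=10):
    """Sketch of Algorithm 2: keep hints from positive judge votes, pick the
    longest one, and return per-token advantages (teacher minus old log-probs).
    judge_outputs: list of (score, hint) pairs from the m judge votes."""
    valid = [h for score, h in judge_outputs
             if score == 1 and len(h) > min_hint_len]
    if not valid:
        return None                     # drop sample (lines 3-6)
    hint = max(valid, key=len)          # line 7: longest valid hint
    teacher = teacher_logps_fn(hint)    # line 9: hint-enhanced teacher pass
    return [t - o for t, o in zip(teacher, logp_old)]   # line 10
```

A positive entry means the hint-conditioned teacher assigns that token more probability than the original policy did, so the token is pushed up; negative entries are pushed down.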

## B. More Optimization Examples

B.1. Student Setting


B.2. Teacher Setting


## C. Prompt Templates

C.1. Personal Agent: PRM Judge Prompt

C.2. Personal Agent: OPD Hindsight Hint Prompt

C.3. Personal Agent: Evaluative Prompt from Simulator


C.4. General Agent: PRM Judge Prompt


## D. Hyperparameters

Table 5 | Complete hyperparameter table across different settings.

| Group | Parameter | Value | Note |
|---|---|---|---|
| Optimizer | Learning rate | 1 × 10⁻⁶ | constant decay |
| Optimizer | Weight decay | 0.1 | |
| Optimizer | Adam β₁, β₂ | 0.9, 0.98 | |
| Policy gradient | KL coefficient β_KL | 0.01 | k3 / low-var KL |
| Policy gradient | Clip ε / ε_high | 0.2 / 0.28 | asymmetric PPO |
| Policy gradient | Entropy coefficient | 0.0 | disabled |
| Rollout | Batch size | 8 (GUI, SWE), 16 (terminal), 32 (tool-call) | |
| Rollout | Samples per task | 8 | |
| Rollout | Max response length | 8192 tokens | |
| Rollout | Max context length | 16384 tokens | |
| Rollout | Max interactive steps | 30 (GUI), 20 (SWE), 10 (terminal) | |
| Rollout | Temperature | 1.0 | |
| PRM / judge | Votes m | 3 (GUI), 1 (others) | majority vote |
| PRM / judge | Temperature | 0.6 | |
| PRM / judge | Max new tokens | 4096 (RL) / 8192 (OPD) | |
| PRM / judge | Min hint length | 10 chars | OPD quality filter |

