
rnow-rewards

SKILL.md
---
name: rnow-rewards
description: Write reward functions for ReinforceNow RL training. Use when creating @reward decorated functions, writing rewards.py, using precondition rewards, sandbox rewards, llm_judge, or math-verify. Triggers on "reward function", "@reward", "RewardArgs", "precondition", "llm_judge", "math-verify", "math reward", "latex".
allowed-tools: Read, Edit, Write, Bash, Grep, Glob
---

Writing Reward Functions for ReinforceNow

Reward functions compute the training signal for reinforcement learning. They evaluate model responses and return a score between 0.0 and 1.0.

Basic Structure

Every reward function must:

  1. Be decorated with @reward
  2. Accept (args: RewardArgs, messages: list) as parameters
  3. Return a float between 0.0 and 1.0
python
from rnow.core import reward, RewardArgs

@reward
async def my_reward(args: RewardArgs, messages: list) -> float:
    """Evaluate the model's response."""
    response = messages[-1]["content"]
    # Your evaluation logic here
    return 1.0 if condition else 0.0

RewardArgs Object

args provides access to data from train.jsonl:

| Field | Description | Example |
|---|---|---|
| args.metadata | Dict from metadata field | args.metadata["answer"] |
| args.variables | Dict from variables field | args.variables["topic"] |
| args.secrets | User secrets from .env | args.secrets["OPENAI_API_KEY"] |
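
For concreteness, here is a sketch of how one train.jsonl entry maps onto those accessors (the field values are illustrative, not from a real dataset):

```python
# A sketch of one train.jsonl entry (values invented for illustration)
entry = {
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "metadata": {"answer": "4"},
    "variables": {"topic": "arithmetic"},
}

# At reward time, args.metadata mirrors entry["metadata"]
expected = entry["metadata"]["answer"]
```
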

Messages Format

messages is a list of conversation turns:

python
[
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4"}
]

Get the last assistant response:

python
response = messages[-1]["content"]
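
messages[-1] works for single-turn tasks, but in multi-turn or tool-use rollouts the final entry is not guaranteed to be an assistant turn. A defensive sketch (a hypothetical helper; rnow.core's get_response, used in later examples, fills the same role):

```python
def last_assistant_content(messages: list) -> str:
    """Return the content of the most recent assistant turn, or "" if none.

    Hypothetical helper for illustration; prefer rnow.core.get_response.
    """
    for msg in reversed(messages):
        if msg.get("role") == "assistant":
            return msg.get("content", "")
    return ""
```
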

Reward Patterns

1. Exact Match

python
@reward
async def exact_match(args: RewardArgs, messages: list) -> float:
    """Check if response exactly matches expected answer."""
    response = messages[-1]["content"].strip().lower()
    expected = args.metadata["answer"].strip().lower()
    return 1.0 if response == expected else 0.0

2. Contains Answer

python
@reward
async def contains_answer(args: RewardArgs, messages: list) -> float:
    """Check if response contains the expected answer."""
    response = messages[-1]["content"]
    expected = args.metadata["answer"]
    return 1.0 if expected in response else 0.0

3. Numerical Comparison (with tolerance)

python
import re

@reward
async def numerical_accuracy(args: RewardArgs, messages: list) -> float:
    """Check if extracted number is within 1% of expected."""
    response = messages[-1]["content"]
    expected = float(args.metadata["answer"])

    # Extract numbers from response
    numbers = re.findall(r'-?\d+\.?\d*', response)
    if not numbers:
        return 0.0

    predicted = float(numbers[-1])  # Take last number
    tolerance = abs(expected) * 0.01  # 1% tolerance
    return 1.0 if abs(predicted - expected) <= tolerance else 0.0
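
The regex above pulls every signed decimal out of the text, and the reward keys off the last one. A quick trace (the sentence is invented for illustration):

```python
import re

text = "First 3 * 4 = 12, so the area is roughly 12.5 square meters."
numbers = re.findall(r'-?\d+\.?\d*', text)
# numbers == ['3', '4', '12', '12.5']; the last match becomes the prediction
predicted = float(numbers[-1])
```

Note the pattern treats comma-grouped values like 1,000 as two separate numbers, so normalize such formats in your data if they can occur.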

4. Math Verification

For math datasets, there are three approaches. Choose based on your needs.

Option A: math-verify (Fast, Deterministic)

Best for datasets with LaTeX expressions like \boxed{42} or \frac{1}{2}.

requirements.txt:

code
math-verify==0.5.0

rewards.py:

python
from math_verify import LatexExtractionConfig, parse, verify
from rnow.core import RewardArgs, get_response, reward

@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Verify mathematical equivalence using math-verify."""
    gold = parse(args.metadata["expected_answer"])
    pred = parse(
        get_response(messages),
        extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
    )
    if not pred:
        return 0.0
    return 1.0 if verify(gold, pred) else 0.0

Note: expected_answer MUST have math delimiters ($...$ or \(...\)). Raw LaTeX like \sqrt{2} won't parse - use $\sqrt{2}$. Plain numbers like 42 work as-is.
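
If your dataset stores raw LaTeX without delimiters, a small preprocessing step can wrap it before calling parse. A hypothetical helper (not part of math-verify) sketching that idea:

```python
def ensure_delimited(expr: str) -> str:
    """Wrap bare LaTeX in $...$ so it can be parsed as math.

    Hypothetical preprocessing helper; plain numbers pass through untouched.
    """
    expr = expr.strip()
    if expr.startswith("$") or expr.startswith("\\("):
        return expr  # already delimited
    if "\\" in expr or "^" in expr or "_" in expr:
        return f"${expr}$"  # bare LaTeX commands need delimiters
    return expr  # plain numbers like "42" parse as-is
```
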

Option B: llm_judge (Semantic Understanding)

Best for answers needing semantic understanding or complex text comparisons.

Requires OPENAI_API_KEY - see Secrets section below.

python
from rnow.core import RewardArgs, get_response, llm_judge, reward

@reward(timeout=120)
async def accuracy(args: RewardArgs, messages: list) -> float:
    """Judge if model's answer matches expected using LLM."""
    expected = args.metadata["expected_answer"]
    model_answer = get_response(messages)

    prompt = (
        f"Expected: {expected}\n"
        f"Model: {model_answer}\n\n"
        "Is the model's final answer mathematically equal to expected? "
        "Ignore formatting (\\boxed, LaTeX). Equivalent forms count (1/2=0.5=50%). "
        "Answer 1 if correct, 0 if incorrect."
    )

    return await llm_judge(prompt)

Option C: Combined (Recommended for Math)

Use math-verify first (fast, reliable), fall back to llm_judge for edge cases.

requirements.txt:

code
math-verify==0.5.0

rewards.py:

python
from math_verify import LatexExtractionConfig, parse, verify
from rnow.core import RewardArgs, get_response, llm_judge, reward

@reward(timeout=120)
async def accuracy(args: RewardArgs, messages: list) -> float:
    """Verify math with math-verify, fallback to LLM judge."""
    expected = args.metadata["expected_answer"]
    response = get_response(messages)

    # Try math-verify first (faster, more reliable for pure math)
    gold = parse(expected)
    pred = parse(
        response,
        extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
    )

    if gold and pred:
        return 1.0 if verify(gold, pred) else 0.0

    # Fallback to LLM judge for complex cases
    prompt = (
        f"Expected: {expected}\n"
        f"Model: {response}\n\n"
        "Is the model's final answer mathematically equal to expected? "
        "Ignore formatting. Answer 1 if correct, 0 if incorrect."
    )
    return await llm_judge(prompt)

5. JSON Structure Validation

python
import json

@reward
async def valid_json(args: RewardArgs, messages: list) -> float:
    """Check if response is valid JSON with required fields."""
    response = messages[-1]["content"]
    required_fields = args.metadata.get("required_fields", [])

    try:
        data = json.loads(response)
        for field in required_fields:
            if field not in data:
                return 0.0
        return 1.0
    except json.JSONDecodeError:
        return 0.0

6. Length-Based Reward

python
@reward
async def appropriate_length(args: RewardArgs, messages: list) -> float:
    """Reward responses within target length range."""
    response = messages[-1]["content"]
    min_len = args.metadata.get("min_length", 50)
    max_len = args.metadata.get("max_length", 500)

    length = len(response)
    if length < min_len or length > max_len:
        return 0.0
    return 1.0
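
The hard cutoff above gives no learning signal near the boundary. A partial-credit variant can be factored into a plain helper (the linear decay is a design choice sketched here, not an rnow convention):

```python
def length_score(length: int, min_len: int, max_len: int) -> float:
    """1.0 inside [min_len, max_len], decaying linearly to 0.0 outside."""
    if min_len <= length <= max_len:
        return 1.0
    if length < min_len:
        return max(0.0, length / min_len)  # 0 chars -> 0.0, min_len -> 1.0
    overshoot = length - max_len
    return max(0.0, 1.0 - overshoot / max_len)  # reaches 0.0 at 2 * max_len
```
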

7. Regex Pattern Match

python
import re

@reward
async def pattern_match(args: RewardArgs, messages: list) -> float:
    """Check if response matches required pattern."""
    response = messages[-1]["content"]
    pattern = args.metadata["pattern"]
    return 1.0 if re.search(pattern, response) else 0.0

Precondition Rewards

Preconditions act as gates. If ANY precondition returns 0, the total reward is 0.

python
@reward(precondition=True)
async def has_answer_tag(args: RewardArgs, messages: list) -> float:
    """GATE: Response must contain Answer: tag."""
    response = messages[-1]["content"]
    return 1.0 if "Answer:" in response else 0.0

@reward(precondition=True)
async def no_refusal(args: RewardArgs, messages: list) -> float:
    """GATE: Response must not be a refusal."""
    response = messages[-1]["content"].lower()
    refusals = ["i cannot", "i can't", "i'm unable", "i am unable"]
    return 0.0 if any(r in response for r in refusals) else 1.0

@reward
async def accuracy(args: RewardArgs, messages: list) -> float:
    """Main accuracy reward (only applied if preconditions pass)."""
    # This only runs if has_answer_tag AND no_refusal both return 1.0
    response = messages[-1]["content"]
    expected = args.metadata["answer"]
    return 1.0 if expected in response else 0.0

Sandbox Rewards

Use sandbox=True when rewards need to:

  • Execute code
  • Check files created by tools
  • Access isolated environment state

IMPORTANT: Entries using sandbox rewards MUST have docker field in train.jsonl (see rnow-train-jsonl skill).

python
@reward(sandbox=True, timeout=120)
async def code_runs(args: RewardArgs, messages: list) -> float:
    """Check if the generated code executes without errors."""
    import subprocess
    result = subprocess.run(
        ["python", "solution.py"],
        capture_output=True,
        timeout=60
    )
    return 1.0 if result.returncode == 0 else 0.0

@reward(sandbox=True)
async def file_created(args: RewardArgs, messages: list) -> float:
    """Check if expected file was created by tools."""
    import os
    expected_file = args.metadata["expected_file"]
    return 1.0 if os.path.exists(expected_file) else 0.0

@reward(sandbox=True)
async def test_passes(args: RewardArgs, messages: list) -> float:
    """Run pytest and check if tests pass."""
    import subprocess
    result = subprocess.run(
        ["pytest", "-q", "test_solution.py"],
        capture_output=True
    )
    return 1.0 if result.returncode == 0 else 0.0

LLM Judge

Use another LLM to evaluate responses:

python
from rnow.core import llm_judge, reward, RewardArgs

@reward
async def quality_score(args: RewardArgs, messages: list) -> float:
    """Use GPT to evaluate response quality."""
    response = messages[-1]["content"]
    question = args.metadata["question"]

    prompt = f"""Rate this response on a scale of 0-1.

Question: {question}
Response: {response}

Return 1 if good, 0 if bad."""

    return await llm_judge(prompt, secrets=args.secrets)

LLM Judge with Custom Schema

python
@reward
async def detailed_evaluation(args: RewardArgs, messages: list) -> float:
    """Detailed evaluation with custom schema."""
    response = messages[-1]["content"]

    custom_schema = {
        "type": "object",
        "properties": {
            "accuracy": {"type": "integer", "minimum": 0, "maximum": 10},
            "clarity": {"type": "integer", "minimum": 0, "maximum": 10},
            "completeness": {"type": "integer", "minimum": 0, "maximum": 10}
        },
        "required": ["accuracy", "clarity", "completeness"]
    }

    prompt = f"""Evaluate this response:
{response}

Rate accuracy, clarity, and completeness from 0-10."""

    # Returns the average of the three 0-10 scores
    result = await llm_judge(
        prompt,
        secrets=args.secrets,
        schema=custom_schema,
        model="gpt-4o-mini"
    )
    return result / 10.0  # Normalize to 0-1

LLM Judge Parameters

| Parameter | Default | Description |
|---|---|---|
| prompt | required | Evaluation prompt |
| secrets | None | Dict with API keys (checks for OPENAI_API_KEY) |
| model | "gpt-5-nano" | Model to use |
| schema | binary 0/1 | Custom JSON schema |
| score_key | "score" | Field to extract from response |
| temperature | 0.0 | Sampling temperature |
| max_tokens | 1024 | Max response tokens |
| timeout | 60 | Request timeout in seconds |

Combining Multiple Rewards

List multiple rewards in train.jsonl's rewards field (see rnow-train-jsonl skill).

The total reward is calculated based on preconditions:

  • If any precondition=True reward returns 0 → total = 0
  • Otherwise → weighted average of all rewards
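
Under those rules, the aggregation can be sketched as follows (the (score, weight, is_precondition) tuple shape is illustrative, not rnow's internal representation, and whether gates contribute to the average is a platform detail; here they do):

```python
def combine_rewards(results: list[tuple[float, float, bool]]) -> float:
    """Gate on preconditions, then take the weighted average of all scores."""
    if any(score == 0.0 for score, _, is_pre in results if is_pre):
        return 0.0  # any failed gate zeroes the total
    total_weight = sum(weight for _, weight, _ in results)
    return sum(score * weight for score, weight, _ in results) / total_weight
```
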

Async vs Sync

Both work:

python
# Async (recommended for I/O operations)
@reward
async def my_async_reward(args: RewardArgs, messages: list) -> float:
    result = await some_async_operation()
    return result

# Sync (simpler for pure computation)
@reward
def my_sync_reward(args: RewardArgs, messages: list) -> float:
    return 1.0 if condition else 0.0

Common Mistakes

Wrong: Return value outside 0-1

python
@reward
async def bad(args: RewardArgs, messages: list) -> float:
    return 10  # ERROR: Must be 0.0-1.0

Wrong: Missing type hints

python
@reward
async def bad(args, messages):  # ERROR: Missing types
    return 1.0

Wrong: Using sandbox=True without docker field

python
@reward(sandbox=True)
async def check_file(args: RewardArgs, messages: list) -> float:
    import os
    # ERROR if train.jsonl entry lacks "docker" field
    return 1.0 if os.path.exists("output.txt") else 0.0

Right: Clamp values to valid range

python
@reward
async def safe_score(args: RewardArgs, messages: list) -> float:
    score = calculate_score()  # Might return any float
    return max(0.0, min(1.0, score))  # Clamp to 0-1

Testing Rewards Locally

bash
rnow test -n 3 --verbose

This runs rollouts and shows reward breakdowns for debugging.


Secrets and Environment Variables

For llm_judge or any reward function needing API keys:

Setup

  1. Create .env in your project directory:
code
OPENAI_API_KEY=sk-your-api-key-here
ANTHROPIC_API_KEY=sk-ant-...
CUSTOM_API_KEY=...
  2. Access in rewards.py via args.secrets:
python
@reward
def my_reward(args: RewardArgs, messages: list) -> float:
    api_key = args.secrets["OPENAI_API_KEY"]
    # Use the key...

Security

  • Secrets are encrypted and stored securely on the ReinforceNow platform
  • They are never logged or exposed in traces
  • Each run gets its own isolated copy of secrets

Using with llm_judge

The llm_judge function automatically uses OPENAI_API_KEY from secrets:

python
from rnow.core import llm_judge, reward, RewardArgs

@reward
async def quality(args: RewardArgs, messages: list) -> float:
    # Automatically uses args.secrets["OPENAI_API_KEY"]
    return await llm_judge("Is this response helpful? Return 1 if yes, 0 if no.")

To explicitly pass secrets:

python
return await llm_judge(prompt, secrets=args.secrets)