Skip to the content.

PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases

Dataset License: MIT Reproducibility


Official Implementation of “PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases”


1. Overview

PALADIN is a framework and dataset for building robust, self-recovering tool-using language model agents. It trains on 50K+ failure-injected, recovery-annotated trajectories to equip LLMs with fault tolerance, recovery reasoning, and resilience against real-world API and tool failures — outperforming CRITIC, ToolReflect, and ToolBench baselines by over +13% in Recovery Rate and +10% in Task Success Rate across Gemma, Qwen, and LLaMA backbones.

Key Idea: PALADIN treats failures as data.
By injecting and labeling diverse tool-level errors, it learns not only to perform tasks but to self-repair.


2. Repository Structure

├── data/
│ ├── API_prompts/
│ │ ├── annotate_clean.txt
│ │ ├── annotate_recovery.txt
│ │ ├── grader_prompt.txt
│ │ ├── simulator_prompt.txt
│ │ └── GPT-5_API_Config.md
│ ├── recovery_dictionary.json
│ ├── toolscan_taxonomy_map.json
│ └── docs/
│ ├── Eval_Protocol_PALADIN.md
│ └── appendix_links.md
│
├── examples/
│ ├── annotation_sample/
│ ├── evaluation_sample/
│ └── ...
│
├── src/paladin/
│ ├── data_creation/
│ ├── error_matching/
│ ├── simulation/
│ ├── evaluation/
│ └── training/
│
├── results/
│ ├── paladin_eval_sample.jsonl
│ └── reference_metrics.md
│
└── figures/

3. Core Components

Module Purpose
annotate_clean.py Repairs truncated or invalid ToolBench traces into clean, complete rollouts.
annotate_recovery.py Injects controlled tool errors and synthesizes labeled recovery trajectories.
simulation.py Executes multi-turn deterministic simulations between the agent and tools.
simulation_with_paladin_error_match.py Adds error-matching logic that maps unseen failures to known recovery exemplars.
eval.py Grades a full conversation using GPT-5 to compute TSR, RR, CSR, and ES.
train.py Simplified LoRA fine-tuning pipeline for recovery-aware SFT.

4. Metrics

Metric Definition Description
TSR #Successful / #Total Overall task completion rate.
RR #Recovered / #Failures Ability to repair after an error.
CSR 1 - (#Hallucinated_Success / #Failures) Penalizes silent failures.
ES 1 / Avg_Steps Efficiency of reasoning and recovery.

All metrics are produced by a deterministic GPT-5 grader (temperature = 0.0, seed = 42) defined in GPT-5_API_Config.md.


5. Quick Start

Environment

git clone https://github.com/HexaA2/paladin.git
cd paladin
pip install -r requirements.txt
export OPENAI_API_KEY="your_api_key_here"

6. Dataset Construction

Huggingface: https://huggingface.co/datasets/E-k-O/PaladinDataSet —

7. Training Configuration

| Parameter | Setting | |:–|:–| | Backbone Models | Gemma-27B, Qwen-14B-Instruct, AM-Thinking-V1, LLaMA-3.1-8B-Instruct | | Fine-tuning | LoRA rank 16, α = 32, bf16 | | Dataset Size | 50,000 | | Objective | L = L_SFT + λ L_REC | | Hard Drive | NVIDIA H200 SXM |


8. Evaluation Protocol

Detailed in: Eval_Protocol_PALADIN.md.*

All grading used the GPT-5 deterministic setup: model = gpt-5-chat-latest temperature = 0.0 seed = 42 max_tokens = 1000 response_format = json


9. Error Taxonomy (ToolScan Aligned)

Category Description
Tool Hallucination Tool not found or invalid
Argument Hallucination Incorrect or missing parameters
Invalid Tool Invocation Failed tool call
Partial Execution Incomplete tool output
Output Hallucination Fabricated or nonsensical responses
Invalid Intermediate Reasoning Faulty internal planning
Re-entrant Failures Infinite or repeated retries

These are mapped in toolscan_taxonomy_map.json.


10. Results Summary

Model TSR RR CSR↓ ES↑
ToolBench (base) 0.62 0.34 0.25 0.71
ToolReflect 0.68 0.52 0.20 0.79
PALADIN (ours) 0.91 0.86 0.03 0.94

11. Reproducibility Checklist


12. Citations

@inproceedings{paladin2026, title = {PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases}, author = {Anonymous Authors}, year = {2026} }


13. License

This repository is released under the MIT License for research and academic use.


14. Acknowledgements

PALADIN builds upon the foundations of ToolBench, ToolScan, and ToolReflect, unifying them under a framework for runtime robustness and recovery learning.


PALADIN demonstrates that execution-level resilience is learnable —
by turning every failure into a training signal.