    Beyond math and coding: New RL framework helps train LLM agents for complex, real-world tasks

By The Tech Guy | November 29, 2025

    Researchers at the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agentic tasks beyond well-defined problems such as math and coding. 


    Their framework, Agent-R1, is compatible with popular RL algorithms and shows considerable improvement on reasoning tasks that require multiple retrieval stages and multi-turn interactions with tools. 

The framework is built on a redefinition of the RL paradigm that accounts for the dynamic nature of agentic applications, which must interact with evolving environments under imperfect information. This framing more closely mirrors real-world conditions and could prove useful for agentic tasks in enterprise settings.

    Rethinking reinforcement learning for agents

    RL has become a cornerstone of training LLMs for well-defined reasoning tasks. In areas like mathematics and coding, the model receives a clear signal: The answer is either right or wrong. This makes it relatively straightforward to reward or penalize its behavior. 

    But this approach struggles with agentic tasks that require models to work in interactive environments, develop dynamic memories across conversations, perform multi-step reasoning and respond to unpredictable feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.

To address these challenges, the University of Science and Technology of China researchers revisited the fundamental framework of RL, known as the Markov Decision Process (MDP). An MDP models decision-making using four key components: a state space (the set of possible states an agent can be in); an action space (what the agent can do); a state transition probability (the state to which an action will likely lead); and a reward function (whether the outcome is good or bad). The paper proposes extending this framework to better suit LLM agents.
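
The four components above can be sketched as a small data structure. This is an illustrative toy example, not the paper's formalism; all names and the two-state environment are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class MDP:
    states: set            # state space S: where the agent can be
    actions: set           # action space A: what the agent can do
    # transition(s, a) -> probability distribution over next states
    transition: Callable[[str, str], Dict[str, float]]
    # reward(s, a, s_next) -> scalar feedback on the outcome
    reward: Callable[[str, str, str], float]

# A toy two-state example: an agent either "works" or "rests".
def toy_transition(s, a):
    if a == "work":
        return {"done": 0.7, "start": 0.3}  # working usually finishes the task
    return {"start": 1.0}                   # resting leaves the state unchanged

def toy_reward(s, a, s_next):
    return 1.0 if s_next == "done" else 0.0  # only reaching "done" is rewarded

mdp = MDP(states={"start", "done"}, actions={"work", "rest"},
          transition=toy_transition, reward=toy_reward)
```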

    In the new formulation, the state space is expanded to include not just the current state (the current sequence of tokens generated by the model) but the entire history of interactions and environmental feedback. Actions are still fundamentally about generating text, but specific sequences of text can now trigger external tools, like an API call. State transitions become unpredictable, or "stochastic," because the outcome depends not just on the tokens the model predicts but also on the environment's response, which depends on external factors. Finally, the reward system becomes more granular, incorporating intermediate "process rewards" for successfully completing steps along the way, rather than just a single reward at the very end. This provides more frequent and precise guidance to the agent during training.

    This last bit is especially important and addresses the “sparse reward” problem that most RL frameworks face. When the agent receives a single reward signal based on the final outcome, it does not learn from the right and wrong intermediate steps it has taken along the way. Process rewards solve this problem by providing feedback signals on these intermediate steps, making the learning process much more efficient.
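Concretely, the contrast can be illustrated with a toy reward schedule (the specific values here are hypothetical; the paper does not prescribe them):

```python
def sparse_return(steps, final_correct):
    # Only the final outcome is rewarded; intermediate steps get nothing.
    return [0.0] * (len(steps) - 1) + [1.0 if final_correct else 0.0]

def process_return(steps, final_correct, step_bonus=0.2):
    # Each successfully completed intermediate step earns a small reward,
    # giving the agent denser, more precise feedback during training.
    rewards = [step_bonus if ok else 0.0 for ok in steps[:-1]]
    rewards.append(1.0 if final_correct else 0.0)
    return rewards

trajectory = [True, True, False, True]  # which intermediate steps succeeded
print(sparse_return(trajectory, final_correct=True))   # [0.0, 0.0, 0.0, 1.0]
print(process_return(trajectory, final_correct=True))  # [0.2, 0.2, 0.0, 1.0]
```

With the sparse scheme, the failed third step is indistinguishable from the successful ones; with process rewards, it receives zero while the good steps are reinforced.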

    “These extensions are crucial for enabling reinforcement learning algorithms to train sophisticated Agents capable of complex, multi-step reasoning and interaction within dynamic environments,” the researchers write in their paper.

    The Agent-R1 framework

    Based on the extended MDP definition, the researchers developed Agent-R1, a flexible and user-friendly training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, allowing for seamless integration with diverse environments. 

    The most significant difference lies in the "rollout phase," where the agent generates responses. In single-turn RL, the model generates a response once. In multi-turn RL, the process involves a series of complex back-and-forth interactions.
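
The multi-turn rollout described above can be sketched as a generate-act-observe loop. This is a hypothetical simplification (the tool-call marker, function names, and stopping rule are all illustrative, not Agent-R1's actual interface):

```python
def multi_turn_rollout(generate, env_step, max_turns=5):
    transcript = []
    for _ in range(max_turns):
        text = generate(transcript)        # model produces the next response
        transcript.append(("assistant", text))
        if "<tool>" not in text:           # no tool call: the rollout ends
            break
        observation = env_step(text)       # environment executes the call
        transcript.append(("tool", observation))
    return transcript

# Toy usage: a scripted "model" calls a tool once, then answers.
replies = iter(["<tool>search: capital of France</tool>", "It is Paris."])
out = multi_turn_rollout(lambda t: next(replies), lambda a: "Paris")
```

A single-turn framework would stop after the first `generate` call; the loop is what lets environment feedback flow back into the next generation step.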

    Agent-R1 achieves this flexible multi-turn rollout with two core modules: Tool and ToolEnv. The Tool module acts as an executor for specific actions such as calling an API or accessing a database. When invoked, a Tool performs its action and returns the direct, raw outcome. In contrast, the ToolEnv module is the orchestrator and interpreter. It takes the output from the Tool and determines how that outcome affects the agent's state and the overall task progress. ToolEnv manages state transitions, calculates reward signals based on tool outcomes and packages the new state information for the agent. 

    In short, when an action is complete, the Tool reports "what happened," while ToolEnv dictates "what this outcome means for the agent and the task."
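
The division of labor between the two modules might look roughly like this. The interfaces below are hypothetical sketches under the description above, not Agent-R1's actual APIs:

```python
class Tool:
    """Executes a concrete action and returns the raw outcome ("what happened")."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def execute(self, query):
        return self.fn(query)

class ToolEnv:
    """Interprets tool outcomes: updates state, assigns process rewards
    ("what this outcome means for the agent and the task")."""
    def __init__(self, tools):
        self.tools = {t.name: t for t in tools}
        self.history = []  # the full interaction history is part of the state

    def step(self, tool_name, query):
        raw = self.tools[tool_name].execute(query)
        self.history.append((tool_name, query, raw))
        reward = 0.1 if raw else 0.0  # toy process reward for a useful result
        return {"observation": raw, "history": list(self.history)}, reward

# Toy usage: a fake search tool backed by a dictionary.
docs = {"capital of France": "Paris"}
env = ToolEnv([Tool("search", lambda q: docs.get(q, ""))])
state, r = env.step("search", "capital of France")
```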

    Agent-R1 in action

    The researchers tested Agent-R1 on the challenging task of multi-hop question answering, which requires complex reasoning, information retrieval across multiple documents and multi-step decision-making. They trained Qwen2.5-3B-Instruct on QA datasets and evaluated its performance on the HotpotQA and 2WikiMultihopQA datasets. They also tested it on the Musique dataset, which was out of the domain of tasks the agent was trained on. 

    They compared various RL algorithms trained with Agent-R1 against two baselines: Naive RAG, a single-pass retrieval method where an LLM answers based on one set of retrieved documents, and Base Tool Call, which uses the model's native function-calling ability without specialized RL training.

    The results demonstrated that all RL-trained agents substantially outperformed the baselines. GRPO, an RL algorithm used in advanced reasoning models like DeepSeek-R1, delivered the best overall performance. 

    “These results robustly validate Agent-R1’s efficacy in training powerful LLM agents via end-to-end RL, showing consistent, substantial gains over baselines across diverse datasets and RL algorithms,” the researchers write.

    These findings can be significant for the enterprise, where there is a strong push to apply RL and reasoning beyond well-defined domains. A framework designed to handle messy, multi-turn interactions with users and dynamic environments can pave the way for new agents capable of solving complex problems in real-world settings.

    “We hope Agent-R1 provides a foundation for future work on scalable and unified RL training for agentic LLMs,” the researchers conclude.
