Build reliable scientific AI agents with NeMo Gym and NeMo RL
Research work often stalls on repetitive tasks: literature review, dataset wrangling, experiment orchestration, and report writing. Scientific AI agents can handle much of this, so you can spend time on ideas, not admin. The challenge: agents must plan over long horizons, use domain tools correctly, and verify outcomes across hours or days without losing context.
This is where NeMo Gym and NeMo RL help. They provide a unified, modular stack for training, evaluating, and scaling agentic AI, especially for science, using verifiable reinforcement learning. Both are open source and were key to post-training the latest Nemotron-3-Nano model for accurate, low-cost inference.
How RL extends LLMs for scientific work
Pre-training makes models knowledgeable, not skilled. Supervised fine-tuning (SFT) improves instruction following but depends on reference answers and limited datasets. Real scientific workflows require planning, tool use, and verification that SFT alone won't cover. Reinforcement learning (RL) closes that gap, and it comes in several variants:
- RLHF (reinforcement learning from human feedback): Trains policies using human preference rankings.
- RLAIF (reinforcement learning from AI feedback): Replaces human rankings with AI judges.
- RLVR (reinforcement learning with verifiable rewards): Uses objective checks (for example, executing code or validating results) to score outputs. This fits science because agents can run experiments, verify results, and optimize to concrete metrics.
Running RL in multi-step environments, where an agent takes actions, observes outcomes, and learns from rewards at the step or trajectory level, composes pre-trained knowledge and SFT skills into end-to-end workflows.
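To make the reward side concrete, here is a minimal sketch of an RLVR-style check: execute the code an agent produced and score it against a known answer. The function and the sandboxing shortcuts are illustrative, not the NeMo Gym API; real environments run tool calls behind isolated resources servers with proper limits.

```python
import subprocess
import tempfile

def verifiable_reward(generated_code: str, expected_output: str) -> float:
    """Execute agent-written code and score it 1.0 if it prints the expected value.

    Illustrative only: a production environment would sandbox execution,
    enforce resource limits, and apply richer domain-specific checks.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=10
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.stdout.strip() == expected_output.strip() else 0.0

# Example: the agent was asked to compute 12 * 7.
print(verifiable_reward("print(12 * 7)", "84"))  # -> 1.0
```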
NeMo Gym + NeMo RL: the training pipeline
RL for agents needs two parts: a training framework and realistic environments. NeMo RL provides the training algorithms and infrastructure, including Group Relative Policy Optimization (GRPO)-style methods, async RL, on-policy distillation, and end-to-end FP8 RL. NeMo Gym provides scalable, isolated environments with clear APIs for tools, observations, and rewards.
NeMo Gym exposes three core server abstractions you can mix and match:
- Model: OpenAI-compatible endpoints with reasoning and tool calling. Works with backends like OpenAI, Azure, and vLLM, locally or in the cloud (see the client sketch after this list).
- Resources: Tools and verification logic. These servers offload heavy computation and let agents call tools asynchronously.
- Agents: Orchestrate conversations, route tool calls, and keep state consistent.
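As a minimal sketch of what talking to a Model server looks like, the snippet below uses the standard openai Python client against an OpenAI-compatible endpoint and defines one tool. The base URL, API key, and tool schema are placeholders, not values from NeMo Gym; the model name is the one referenced later in this post.

```python
from openai import OpenAI

# Placeholder endpoint and credentials; point these at your model server
# (for example, a local vLLM deployment with tool calling enabled).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A hypothetical tool; in NeMo Gym, real tools live behind a resources server.
tools = [{
    "type": "function",
    "function": {
        "name": "run_blast_search",
        "description": "Search a sequence database for similar protein sequences.",
        "parameters": {
            "type": "object",
            "properties": {"sequence": {"type": "string"}},
            "required": ["sequence"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Find homologs of this sequence: MKTAYIAKQR"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```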
Environments are isolated and exposed via REST, so you can run many in parallel without dependency conflicts. NeMo Gym produces high-quality rollout data and rewards, which NeMo RL then uses to update model weights at scale.
Case study: Edison Scientific and Aviary
Edison Scientific uses NeMo Gym and NeMo RL to automate scientific discovery with Aviary, a suite of RL environments for biology, chemistry, math, literature research, data analysis, and more. Aviary manages state, tool execution, rewards, and observation formatting.
Example: a Jupyter-based bioinformatics agent that edits notebook cells step by step. Because notebooks can exceed context windows, the team dropped past interaction text and trained GRPO at the step level rather than over full trajectories. That shortens context, supports transition-level rewards, and keeps training stable. They also introduced BixBench, a set of verifiable bioinformatics questions.
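The core of GRPO-style training is simple to state: sample several attempts for the same task (or, in the step-level variant, from the same state), then score each attempt relative to its group. The sketch below computes group-relative advantages with NumPy; it illustrates the idea only and is not Edison Scientific's or NeMo RL's implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within a group of rollouts for the same task or state.

    GRPO replaces a learned value baseline with the group mean: attempts that
    beat their siblings get positive advantages, the rest get negative ones.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four attempts at the same step, each scored by a verifiable reward in [0, 1].
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # approximately [ 1, -1, -1,  1 ]
```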
Practical workflow: from install to training
1) Install NeMo Gym
Clone the repo, create a Python 3.12 virtual environment, and install dependencies. Use the provided scripts to bring up resource, agent, and model servers locally.
2) Configure a model backend
Use a hosted endpoint or deploy locally via vLLM. Many teams start with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 on Hugging Face and enable tool-calling in vLLM. Set your policy base URL, API key, and model name in env.yaml so NeMo Gym can talk to the model.
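Once env.yaml points at your endpoint, a quick connectivity check saves debugging later. The snippet below is a sketch that assumes env.yaml holds keys such as policy_base_url, policy_api_key, and policy_model_name; the actual key names come from the NeMo Gym documentation.

```python
import requests
import yaml

# Hypothetical key names; match them to your actual env.yaml.
with open("env.yaml") as f:
    cfg = yaml.safe_load(f)

# /models is part of the standard OpenAI-compatible API served by vLLM and
# hosted endpoints alike, so it makes a cheap smoke test.
resp = requests.get(
    f"{cfg['policy_base_url'].rstrip('/')}/models",
    headers={"Authorization": f"Bearer {cfg['policy_api_key']}"},
    timeout=10,
)
resp.raise_for_status()
served = [m["id"] for m in resp.json()["data"]]
assert cfg["policy_model_name"] in served, f"configured model not served: {served}"
print("Model endpoint is reachable and serving the configured model.")
```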
3) Run a ready-made environment
Spin up the GSM8K math environment from Aviary through NeMo Gym. Launch the resources server, agent server, and model server with ng_run, then collect rollouts using ng_collect_rollouts. Use ng_viewer to inspect trajectories and average rewards.
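To sanity-check rollouts programmatically rather than through ng_viewer, something like the following works on any JSONL dump. The file path and the reward field name are assumptions about the output format, so confirm them against one record written by ng_collect_rollouts.

```python
import json

# Hypothetical path and field name; inspect one record first to confirm the schema.
rewards = []
with open("rollouts.jsonl") as f:
    for line in f:
        record = json.loads(line)
        rewards.append(float(record["reward"]))

print(f"{len(rewards)} rollouts, mean reward {sum(rewards) / len(rewards):.3f}")
print(f"fraction fully correct: {sum(r == 1.0 for r in rewards) / len(rewards):.2%}")
```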
4) Add a new environment (example: HotPotQA)
- Create a resources server by extending the Aviary base class for HotPotQA.
- Add a YAML config that wires the new resources server to an agent and the policy model.
- Provide a small example JSONL dataset for quick testing (a minimal sketch follows below).
- Update requirements to include the proper Aviary extras (for example, hotpotqa).
With those pieces in place, you can launch the HotPotQA environment via NeMo Gym and start collecting verifiable rollouts for RL training.
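For that example dataset, a handful of question/answer records is enough to smoke-test the wiring. The field names below are an assumption about what the HotPotQA resources server expects; check an existing Aviary dataset or the environment config for the real schema.

```python
import json

# Illustrative multi-hop QA records; field names are assumed, not a documented schema.
examples = [
    {
        "question": "In which country was the author of 'The Old Man and the Sea' born?",
        "answer": "United States",
    },
    {
        "question": "What is the capital of the country where the Eiffel Tower stands?",
        "answer": "Paris",
    },
]

with open("hotpotqa_example.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```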
Best practices for scientific agents
- Start simple: One agent, a small toolset, and outcome-based rewards. Add complexity only after the basics work.
- Profile rewards: For GRPO-style training, measure the mean and standard deviation of rewards per task over multiple attempts. Tasks whose attempts all receive the same reward carry no group-relative signal, so profiling them improves sampling and training efficiency (see the sketch after this list).
- Monitor training: Track stability and behavior (for example, sampling issues, collapse, truncated trajectories) with metrics logged to Weights & Biases.
- Train longer: RL with verifiable rewards can show slow starts and then a sharp improvement once the policy finds a working strategy.
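As a small sketch of that reward profiling (hypothetical data layout, not a NeMo Gym API), the following flags tasks whose rewards never vary across attempts and therefore contribute nothing to a group-relative update:

```python
import statistics

# attempts_per_task: task_id -> rewards from several rollouts of the same task.
attempts_per_task = {
    "task_001": [1.0, 1.0, 1.0, 1.0],  # always solved: no GRPO signal
    "task_002": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes: informative
    "task_003": [0.0, 0.0, 0.0, 0.0],  # never solved: no signal, maybe too hard
}

for task_id, rewards in attempts_per_task.items():
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    print(f"{task_id}: mean={mean:.2f} std={std:.2f} informative={std > 0}")
```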
Why this matters for your lab
Scientific agents that can plan, use tools, and verify outcomes move routine work off your plate. NeMo Gym and NeMo RL give you the infrastructure to build those agents, generate reliable training data, and iterate on performance at scale. The result: more time for hypotheses, experiments, and insights, and less time spent on mechanical tasks.
Resources to get started