Most multi-agent LLM frameworks promise more than they deliver. TradingAgents is one of the rare exceptions: open-sourced by Tauric Research alongside an arXiv paper, now at version 0.2.4, and shipping the kind of clean role decomposition other frameworks describe but rarely implement. The system mirrors a real research desk: fundamentals, sentiment, news, and technical analysts feeding a Bull/Bear research debate, then a Trader, then a Risk Management committee, ending in a structured decision logged for audit.
This review walks through what TradingAgents actually does, what shipped in v0.2.4, how it stacks up against LangGraph and CrewAI, and how to test the LLM and market-data layers underneath with Apidog. If you have already gone deep on the agent contract layer, our agents.md guide for API teams pairs naturally with this post.
TL;DR
- TradingAgents is a multi-agent LLM trading framework from Tauric Research, arXiv 2412.20138, open-sourced in 2025 and now at version 0.2.4.
- It splits trading into specialist agents: Fundamentals Analyst, Sentiment Analyst, News Analyst, Technical Analyst, Bull/Bear Researchers, Trader, and a Risk Management committee.
- v0.2.4 added structured-output agents, LangGraph checkpoint resume, persistent decision logs, and provider support for DeepSeek, Qwen, GLM, and Azure OpenAI.
- The framework runs on any OpenAI-compatible LLM endpoint, which makes hosted, local, and self-hosted models interchangeable.
- Use Apidog to mock the underlying market-data APIs, replay LLM provider traffic, and benchmark thinking-mode cost across DeepSeek, OpenAI, and Anthropic.
- Download Apidog to wire all of this into your CI before you trust an agent with real money.
What TradingAgents actually is
The framework is a Python package and CLI that decomposes the trading workflow into specialist roles. Each role is an LLM agent prompted with a job description, given access to a focused toolset, and orchestrated by LangGraph. Decisions flow through stages: gather data, debate, decide, log.
The README describes it as research code, not investment advice. That framing matters. The point is to study how multi-agent collaboration changes outcomes versus single-prompt setups, not to ship a production trading bot off your laptop.
What is interesting from an engineering standpoint is how clean the role separation is. The Fundamentals Analyst evaluates company financials. The Sentiment Analyst scores social media. The News Analyst monitors macroeconomic indicators. The Technical Analyst computes MACD and RSI. The Bull and Bear Researchers debate. The Trader reads everyone’s reports and decides. Risk Management checks the decision against constraints. Every agent has one job and one toolset.
This is the same pattern you would design for any complex agentic workflow: specialist roles, a debate phase, a decision phase, and a verification step. TradingAgents is a working reference implementation you can read in an afternoon.
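To make the pattern concrete, here is a minimal sketch in plain Python. The ask function stands in for any LLM call and the role prompts are invented for illustration; this is the shape of the pipeline, not the framework's actual API.

def ask(role: str, prompt: str) -> str:
    # Stand-in for an LLM call; swap in your provider client here.
    return f"[{role}] response to: {prompt[:40]}"

def run_pipeline(ticker: str) -> dict:
    # 1. Fan out: each specialist writes an independent report.
    analysts = ["fundamentals", "sentiment", "news", "technicals"]
    reports = {a: ask(a, f"Analyze {ticker} from the {a} angle") for a in analysts}
    # 2. Debate: bull and bear argue from the same evidence.
    bull = ask("bull", f"Argue the long case for {ticker} given {reports}")
    bear = ask("bear", f"Argue the short case for {ticker} given {reports}")
    # 3. Decide: one agent synthesizes the debate into a plan.
    plan = ask("trader", f"Weigh bull={bull} against bear={bear} and decide")
    # 4. Verify and log: a risk check, then persist for audit.
    verdict = ask("risk", f"Check this plan against constraints: {plan}")
    return {"reports": reports, "plan": plan, "risk": verdict}

print(run_pipeline("AAPL"))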
What v0.2.4 shipped
The April 2026 release is meaningful for production-curious users.
Structured-output agents. The Research Manager, Trader, and Portfolio Manager now emit structured output through the OpenAI Responses API or Anthropic’s tool-use channel. This replaces the old free-text parsing with typed JSON, which makes downstream automation reliable.
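Here is what the typed channel looks like on the OpenAI side, as a sketch assuming a recent openai Python SDK (the responses.parse helper). The TradeDecision fields below are illustrative guesses, not the framework's actual output schema.

from openai import OpenAI
from pydantic import BaseModel

class TradeDecision(BaseModel):
    # Illustrative fields -- the framework's real schema may differ.
    action: str        # e.g. "BUY" | "SELL" | "HOLD"
    ticker: str
    confidence: float
    rationale: str

client = OpenAI()
resp = client.responses.parse(
    model="gpt-4o",  # any model with structured-output support
    input="Given the bull and bear reports, decide on AAPL.",
    text_format=TradeDecision,
)
decision = resp.output_parsed  # a validated TradeDecision instance
print(decision.action, decision.confidence)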
LangGraph checkpoint resume. Long-running runs can pause and restart from a saved checkpoint. If a market-data API throttles or an LLM provider returns 429, the run does not start over from scratch.
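The mechanism is LangGraph's standard checkpointer, so the resume behavior is easy to verify in isolation. A minimal sketch with a placeholder node and state, independent of TradingAgents' own graph:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class RunState(TypedDict):
    ticker: str
    report: str

def analyst(state: RunState) -> dict:
    # Placeholder node; imagine a flaky data fetch plus an LLM call here.
    return {"report": f"report for {state['ticker']}"}

builder = StateGraph(RunState)
builder.add_node("analyst", analyst)
builder.add_edge(START, "analyst")
builder.add_edge("analyst", END)

graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "AAPL-2026-04-30"}}
graph.invoke({"ticker": "AAPL", "report": ""}, config)
# Invoking again with the same thread_id resumes from the last saved
# checkpoint instead of replaying completed nodes.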
Persistent decision log. Every decision the Trader makes lands in a SQLite log with reasoning, inputs, and timestamps. You get an audit trail you can review or feed back into evaluation.
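Reading the log back is plain sqlite3. The table and column names below are assumptions for illustration; check the real schema in your local database before copying:

import sqlite3

conn = sqlite3.connect("decisions.db")  # path is hypothetical
rows = conn.execute(
    "SELECT ticker, action, reasoning, created_at "  # hypothetical columns
    "FROM decisions ORDER BY created_at DESC LIMIT 10"
).fetchall()
for ticker, action, reasoning, ts in rows:
    print(ts, ticker, action, reasoning[:80])
conn.close()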
Multi-provider support. v0.2.4 added DeepSeek, Qwen, GLM, and Azure OpenAI to the existing OpenAI, Anthropic, Gemini, and Grok matrix. If you want the cheapest reasoning per token, you can swap to DeepSeek V4 through its OpenAI-compatible endpoint. If you need long-context or vision, swap to Gemini.
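In practice, "OpenAI-compatible" means the swap is a one-line base_url change at the client level. A sketch against DeepSeek's documented endpoint; the model name is whatever your account exposes:

from openai import OpenAI

deepseek = OpenAI(
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
    api_key="sk-...",                     # your DeepSeek key
)
resp = deepseek.chat.completions.create(
    model="deepseek-chat",  # substitute the model your account exposes
    messages=[{"role": "user", "content": "One-line read on AAPL fundamentals."}],
)
print(resp.choices[0].message.content)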
Docker support and Windows UTF-8 fix. Boring but important: the framework now ships a Dockerfile, and the Windows path encoding bug from v0.2.3 is gone.
The agent architecture in detail
A complete TradingAgents run looks like this.
1. The CLI accepts a ticker symbol and date range.
2. The Analyst Team fans out: each of the four analysts independently fetches data for the ticker and writes a report.
3. The Research Team picks up the four reports. The Bull Researcher writes a long thesis. The Bear Researcher writes a short thesis. They debate.
4. The Research Manager synthesizes the debate into a recommendation.
5. The Trader takes the recommendation, checks against the persistent decision log, and produces a trade plan.
6. The Risk Management team reviews. Three risk agents (Aggressive, Conservative, Neutral) push back on the plan from different angles.
7. The Portfolio Manager either approves or sends the plan back for revision.
8. The final decision lands in the SQLite log.
Most of the LLM cost is in steps 3 and 6, where multiple agents debate. This is also where small models get exposed: a 7B model running the Bull/Bear debate produces noisy, repetitive arguments. A reasoning model (DeepSeek V4 thinking mode, GPT-5.5, Claude 4.5) produces structured back-and-forth that resembles a real research meeting.
Why test the LLM layer with an API tool
When you run TradingAgents, two surfaces fail in production: the market-data APIs (Yahoo Finance, FinnHub, Polygon, OpenBB) and the LLM provider APIs.
The market-data side is dirty. Free tiers have inconsistent rate limits, undocumented fields drop in and out, and trading-day boundaries differ across vendors. A run that worked on Tuesday silently breaks Wednesday because a vendor renamed regularMarketTime to regular_market_time.
The LLM side is also dirty, in a different way. DeepSeek V4 thinking mode doubles your cost; OpenAI Responses API has its own quirks; Anthropic’s tool use returns content blocks that some downstream parsers gag on.
Both surfaces want the same thing from you: a saved, replayable canonical request collection with assertions. That is exactly what Apidog is for. We covered the same testing pattern at the protocol level in our MCP server testing playbook.
Mocking the market-data APIs in Apidog
Three steps to remove vendor flakiness from your TradingAgents test runs.
Step 1: define the upstream endpoints. In an Apidog project, add the Yahoo Finance, FinnHub, Polygon, or OpenBB endpoints TradingAgents calls. The README for each tool spec lists the exact URLs. Save each as a request with example response bodies pulled from real responses.
Step 2: turn on the mock server. Apidog’s mock server returns the example responses on the same URL paths the real vendor uses. Point TradingAgents’ tool config at the mock URL. The Fundamentals Analyst now runs against deterministic data; your tests are no longer at the mercy of Yahoo’s rate limit.
Step 3: capture vendor drift. Once a week, replay the live endpoints and diff the response shape against your saved fixtures. Apidog highlights any added, removed, or renamed fields. This is how you catch the regularMarketTime rename before it kills a run.
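If you want the same check in code, here is a minimal sketch that diffs top-level response keys against a saved fixture; the URL and fixture path are placeholders, and Apidog's diff does the nested version of this in the UI:

import json
import urllib.request

with open("fixtures/quote_AAPL.json") as f:  # your saved fixture
    fixture = json.load(f)

url = "https://example.com/v1/quote?symbol=AAPL"  # the real vendor endpoint
with urllib.request.urlopen(url) as r:
    live = json.load(r)

added, removed = set(live) - set(fixture), set(fixture) - set(live)
if added or removed:
    print(f"Shape drift: added={sorted(added)} removed={sorted(removed)}")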
The same fixture-and-diff pattern appears in our contract-first API development guide, which describes the broader workflow.
Testing the LLM provider layer
The provider layer needs three things tested before you scale up runs.
Cost per role. Run a single ticker through all four analysts and the debate. Capture token counts per agent in Apidog’s request log. The Bull/Bear debate is usually 3-5x more expensive than the analysts; if not, the model is short-circuiting.
Output shape. v0.2.4’s structured-output agents (Research Manager, Trader, Portfolio Manager) should always return well-formed JSON. Add JSONPath assertions in Apidog to verify. A regression here is silent and devastating; you find out only when downstream code crashes.
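The same assertion is easy to mirror in a local test with the jsonpath-ng library. A sketch with a hypothetical decision shape; align the path and allowed values with whatever schema your agents actually emit:

import json
from jsonpath_ng import parse

decision = json.loads('{"action": "HOLD", "ticker": "AAPL", "confidence": 0.62}')
matches = parse("$.action").find(decision)
# Fail loudly if the field is missing or carries an unexpected value.
assert matches and matches[0].value in {"BUY", "SELL", "HOLD"}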
Provider parity. When you swap from OpenAI to DeepSeek V4 to test cost, the Trader’s decisions should differ on individual runs but converge on similar conclusions across many runs. Run 50 tickers through both providers, compare the persistent decision log, and quantify the drift. Our DeepSeek V4 API guide covers the request shape; our GPT-5.5 API guide covers the OpenAI side. Apidog’s response diff makes the comparison visual.
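To quantify the drift, compare the two decision logs directly. A sketch reusing the hypothetical decisions table from earlier, with one SQLite file per provider run:

import sqlite3

def decisions(path: str) -> dict:
    # Hypothetical schema: adjust table/column names to your actual log.
    conn = sqlite3.connect(path)
    rows = conn.execute("SELECT ticker, action FROM decisions").fetchall()
    conn.close()
    return dict(rows)

openai_run = decisions("runs/openai.db")      # placeholder paths
deepseek_run = decisions("runs/deepseek.db")

shared = set(openai_run) & set(deepseek_run)
if shared:
    agree = sum(openai_run[t] == deepseek_run[t] for t in shared)
    print(f"Agreement on {len(shared)} tickers: {100 * agree / len(shared):.0f}%")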
A minimal TradingAgents run
The README quickstart looks roughly like this.
git clone https://github.com/TauricResearch/TradingAgents
cd TradingAgents
pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
export FINNHUB_API_KEY="..."
python -m tradingagents.cli \
--ticker AAPL \
--date 2026-04-30 \
--models gpt-5.5 \
--rounds 2
Two rounds of debate is the smallest meaningful run. The output lands in tradingagents/results/ as JSON plus a markdown decision summary.
To swap to DeepSeek V4 Pro for the reasoning-heavy roles, set the --models flag and point the OpenAI client at DeepSeek’s base URL through the framework’s provider config:
export DEEPSEEK_API_KEY="sk-..."
python -m tradingagents.cli \
--ticker AAPL \
--date 2026-04-30 \
--models deepseek-v4-pro \
--provider deepseek \
--rounds 2
The same pattern works for Qwen 3.6, GLM 5, or any local model served by Ollama or vLLM. Our best local LLMs of 2026 post covers the local serving side.
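The local swap is the same base_url trick. Ollama, for example, serves an OpenAI-compatible API on port 11434; the api_key is ignored but the client requires a value, and the model tag is whatever you have pulled:

from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = local.chat.completions.create(
    model="qwen2.5:32b",  # substitute the model tag you pulled
    messages=[{"role": "user", "content": "Ping"}],
)
print(resp.choices[0].message.content)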
Common pitfalls
These show up repeatedly in the project's GitHub issues.
Running with a small model. A 7B local model produces a Bull/Bear debate that loops without resolving. The framework needs at least mid-tier reasoning quality. DeepSeek V4 Flash, Qwen 3.6 32B, GPT-5.5, and Claude 4.5 are the realistic floor.
Skipping market-data caching. Every analyst calls the data layer separately. Without caching, you fan out 4-8 vendor requests per run and burn rate-limit budget fast. The framework supports caching; turn it on.
Treating it as a trading bot. It is research code. Backtest performance is sensitive to model choice, prompt seed, debate length, and data quality. Treat any number it produces as a hypothesis, not a strategy.
Forgetting to log token spend. A single ticker run can cost $0.10 to $5 depending on model and rounds. Log per-run cost in Apidog’s replay history; a runaway loop in the debate phase can rack up real money in minutes.
Hardcoding one provider. v0.2.0 added multi-provider support precisely so you can swap. Use it. Run a small batch through three providers and compare the decision log before committing.
Where Apidog fits in the dev loop
Three concrete places Apidog earns its keep across a TradingAgents project.
The first is the design surface. Before you wire the framework to live vendors, sketch each market-data endpoint in Apidog as a request with example bodies. The schema view forces you to be honest about which fields the framework actually uses. Many teams discover they were paying for a Polygon plan they barely consumed.
The second is local CI. Apidog’s mock server stands in for every vendor while unit tests run, so the test suite stays under five seconds and stops depending on weekend market hours. We covered this exact pattern in API testing without Postman.
The third is regression diffing. Every weekly run, replay the live endpoints against your saved fixtures. Apidog highlights field renames and shape drift. This is the cheapest possible alarm for “the data layer broke and the agents started hallucinating numbers.”
Why this matters beyond trading
TradingAgents is the clearest open-source example of agentic decomposition we have right now. The pattern transfers directly to:
- Customer support triage (analyst agents per ticket type, debate, decision)
- Code review (security, performance, style agents, then a synthesizer)
- Compliance review (data analysts, risk reviewers, decision committee)
- Research summarization (multiple specialist readers, debate, synthesis)
If you are designing any multi-step agent workflow, read the TradingAgents code first. The role separation, the debate stage, the structured-output decisions, and the persistent log are reusable patterns. They are also testable patterns, which is the point of pairing the framework with Apidog.
Real-world use cases
A quant research student uses TradingAgents to compare DeepSeek V4 vs GPT-5.5 vs Claude 4.5 on the same 30-ticker basket. Apidog captures every request and response so the comparison is reproducible.
A fintech engineer uses the multi-agent pattern (not the trading code) to run code reviews on internal services. Specialist agents check security, performance, naming. A synthesizer writes the PR comment. Total review cost per PR: about $0.04.
A solo developer running TradingAgents nightly on a watchlist of 10 tickers logs every decision into Postgres for later inspection. The Apidog mock server stands in for the live market-data vendors during weekend test runs.
Conclusion
TradingAgents is a working, well-architected example of how to build a multi-agent LLM system that produces structured decisions instead of chat. v0.2.4 makes it production-curious: structured outputs, checkpoint resume, audit trail, multi-provider. None of that matters if you cannot test the LLM and market-data layers underneath. That is where pairing it with Apidog earns its keep.
Five takeaways:
- TradingAgents decomposes trading into specialist agents with clear roles and a debate phase.
- v0.2.4 adds structured outputs, LangGraph checkpoints, and DeepSeek/Qwen/GLM/Azure providers.
- Mock the market-data vendors in Apidog so test runs are deterministic.
- Test LLM provider parity before swapping models in production.
- The pattern (specialists, debate, decision, log) transfers to every non-trading agent workflow you build.
Next step: clone the repo, run a single ticker against your preferred LLM, and pipe the upstream calls through an Apidog mock server. You will know within an hour whether the framework fits your workflow.
FAQ
Is TradingAgents safe to use with real money?
The repo is explicit that it is research code and not financial advice. Treat its output as a hypothesis. Anyone shipping it against a live brokerage takes on the risk personally; the maintainers do not endorse that.
Which LLM provider gives the best cost-quality tradeoff?
For most workloads in early 2026, DeepSeek V4 Flash with thinking mode beats GPT-5.5 on cost by a wide margin and matches it on Bull/Bear debate quality. See our DeepSeek V4 API guide for the request shape.
Can I run TradingAgents on local models?
Yes. v0.2.0 added multi-provider support; Ollama, vLLM, and LM Studio all serve OpenAI-compatible endpoints the framework consumes. See our best local LLMs of 2026 post for model picks.
How do I mock the market-data APIs?
Define each vendor endpoint in Apidog, turn on the mock server, and point the framework’s tool config at the mock URL. The same pattern is documented in API testing tools for QA engineers.
What’s the minimum hardware to run this?
If you are calling hosted LLMs (OpenAI, Anthropic, DeepSeek), any laptop with Python 3.10+ runs it. If you serve local models, the minimum hardware tracks the model: a 24 GB GPU runs DeepSeek V4 Flash or Qwen 3.6 32B; an 8 GB GPU runs Llama 5.1 8B. Quality drops with smaller models.
Does it support after-hours and weekend simulation?
The market-data vendors return historical data; the framework can run any date you pick. Live trading is a different problem the framework explicitly does not solve.
How does it compare to other multi-agent frameworks?
TradingAgents is opinionated for the trading domain. CrewAI, AutoGen, and LangGraph itself are general-purpose. If you want to learn the pattern and apply it elsewhere, read TradingAgents; if you want to build a generic agent system, start with the underlying LangGraph code.