Qwen3.7: The Agent Frontier | Qwen3.7-Max AI Model

What is Qwen 3.7?

Alibaba's most advanced agent model to date - and a deliberate bet that the next frontier is sustained autonomy, not single-shot answers.

Qwen 3.7 is the newest generation of large language models from the Qwen team at Alibaba Cloud, the AI division of Alibaba Group. It was formally announced at the 2026 Alibaba Cloud Summit in Hangzhou on May 20, 2026, alongside a self-developed AI accelerator chip and a broad slate of agent products. The generation is led by two preview models released together: Qwen3.7-Max-Preview, the text-only reasoning flagship, and Qwen3.7-Plus-Preview, a balanced multimodal variant that accepts vision input. Both arrived first as quiet entries on a public preference leaderboard before any press release - a rollout pattern Alibaba has now used across multiple Qwen generations to validate real-world quality before marketing it.

The framing throughout Alibaba's messaging is unmistakably agentic. Where earlier chat models were optimized to give a strong answer in a single pass, Qwen 3.7 is pitched less as a chatbot and more as an engine for long-running, multi-step work: running hundreds of iterative code edits, chaining tool calls across hours, automating office workflows, and carrying a problem forward without a human steering every step. That is a different design target, and it shows up in nearly every choice the team made - from the enormous context window to the extended-thinking reasoning layer to the headline internal demo of the model writing software for Alibaba's own chip across a 35-hour run.

Qwen by Alibaba Cloud: Qwen (Tongyi Qianwen) is Alibaba Cloud's family of large language models, first released in 2023.

It is worth being precise about what Qwen 3.7 is and is not right now, because the early coverage has been noisy. As of late May 2026, Qwen 3.7 exists only as a preview. The flagship is proprietary and closed-weight - there is no open-weight, downloadable Qwen 3.7 release, and no Apache-2.0 license attached to it the way there has been for many smaller Qwen models. Access is through Alibaba Cloud's Model Studio API and the Qwen Chat web interface. The "-Preview" suffix is not cosmetic: Alibaba treats these as early builds, and benchmark scores, behavior, and pricing can all shift before a stable release. Anyone planning to build on it should treat today's numbers as provisional and validate against their own workload.

For context, Qwen 3.7 sits directly atop the Qwen 3.6 line, which is still the family's shipping, generally available tier. Qwen 3.6 introduced the closed-weight "Max" preview approach and a 256K-token context window; Qwen 3.7 quadruples that context to a full million tokens and pushes hard on reasoning depth and agentic reliability rather than simply scaling parameters. The result is a model that is meaningfully smarter on hard reasoning and coding tasks, while making some deliberate tradeoffs - discussed below - that buyers should understand before integrating.

The Qwen 3.7 lineup

Two previews, two jobs: a text reasoning flagship and a multimodal generalist.

Across recent generations, Alibaba has consistently shipped a top-tier "Max" model alongside more accessible variants. The 3.7 preview follows that shape. The two models target genuinely different needs, and choosing between them comes down to one question first: do you need image and video understanding, or pure text reasoning?

Qwen3.7-Max-Preview

Reasoning flagship · text-only

Modality: Text in, text out
Context: 1,000,000 tokens
Mode: Extended thinking
Weights: Closed / proprietary
Best for: Code, agents, math
API string: qwen3.7-max

Qwen3.7-Plus-Preview

Balanced · multimodal

Modality: Text + vision
Context: Large (preview)
Focus: Reasoning + logic
Toolchain: Rolling out
Best for: Vision + multimodal
Arena (vision): #16 overall

The clean dividing line is modality. Qwen3.7-Max is the model Alibaba formally announced with API access, and it is strictly text - there is no image input. If you need the model to read diagrams, screenshots, charts, or video frames, you reach for Qwen3.7-Plus instead, which handles vision and multimodal inputs and is described as a high-performance balanced preview focused on reasoning and logical expression, with its toolchain opening up gradually. In the blind preference rankings, Plus placed Alibaba as the #5 lab in vision capability - a strong showing for a preview that had not yet been formally documented.

Picking one: Default to Max for coding agents, long-horizon automation, and step-heavy math or logic where you only feed text. Switch to Plus the moment images or video enter the input. They are siblings, not interchangeable tiers.

How it thinks: extended reasoning & a million tokens

Two architectural choices define the model - a reasoning layer that plans before it answers, and a context window large enough to hold a whole repository.

Extended-thinking mode

Qwen3.7-Max is a reasoning model in the now-standard frontier sense: before it commits to a final answer, it generates an internal chain of thought - a sequence of steps where it plans, checks its own work, and corrects course. On the Qwen Chat interface this surfaces as a "Thinking" toggle that lets you watch the reasoning trace unfold; through the API it is enabled with an enable_thinking flag. The tradeoff is concrete and worth internalizing: reasoning models emit dramatically more tokens. In one independent evaluation, Qwen3.7-Max produced roughly 97 million tokens where the average model on that same benchmark generated about 24 million. For a short rewrite or a simple classification, that overhead is pure latency and cost with no quality payoff. For multi-step planning, large refactors, or long agent chains, it is exactly where the model earns its keep. The practical rule: turn thinking on for hard, multi-step problems and off for quick, shallow ones.
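
The practical rule can be made explicit in request-building code. The sketch below is illustrative only: build_request and its multi_step argument are hypothetical names, and the only Qwen-specific detail is the enable_thinking flag mentioned above, passed through extra_body because it is not a standard OpenAI parameter. The overhead ratio uses the rounded token counts quoted above.

```python
def build_request(prompt: str, multi_step: bool, model: str = "qwen3.7-max") -> dict:
    """Return keyword arguments for an OpenAI-compatible chat.completions.create call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Thinking on for hard, multi-step work; off for quick, shallow tasks.
        "extra_body": {"enable_thinking": multi_step},
    }


quick = build_request("Classify this ticket as bug or feature.", multi_step=False)
hard = build_request("Plan a refactor of the billing module.", multi_step=True)

# Rough token overhead from the independent evaluation (97M vs. 24M tokens).
overhead = 97 / 24
print(f"thinking overhead vs. the average model: about {overhead:.1f}x")
```

With a configured client, either dictionary could be passed as client.chat.completions.create(**quick). How you decide multi_step (a manual flag, a task type, a classifier) is left open here.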

The 1M-token context window

The headline capacity figure is a one-million-token context window, up from 256K on the prior Qwen3.6-Max preview - a fourfold jump. A million tokens is enough to hold a full mid-sized code repository, a large stack of legal or technical documents, or an entire long agent history in a single request, without the brittle chunking and retrieval gymnastics smaller windows force. For agentic work specifically, this matters: you can keep full task history, prior tool outputs, and current code state all resident in context, so the model reasons over the complete picture instead of a lossy summary.

A ceiling, not a guarantee. A 1M-token window is the maximum the model accepts, not a promise that it reasons equally well across all of it. Models frequently degrade as the window fills, and independent long-context reliability testing for Qwen3.7-Max is not yet available. If your use case depends on retrieving a needle from deep in a huge context, test that specifically - and remember every token in context is billed, so trim aggressively when you do not need the full history.
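
One simple way to "trim aggressively" is to keep the system message plus only the newest turns that fit a token budget. This is a minimal sketch under stated assumptions: trim_history and estimate_tokens are hypothetical helpers, and the four-characters-per-token estimate is a rough heuristic, not the Qwen tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); NOT the Qwen tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the first (system) message plus the newest messages that fit the budget."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    used = estimate_tokens(head["content"])
    kept = []
    for msg in reversed(tail):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    return [head] + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "z" * 40},
]
trimmed = trim_history(history, budget=120)
print([m["role"] for m in trimmed])
```

Dropping the oldest turns first is only one policy; for long agent runs you may prefer to keep tool outputs that later steps still depend on.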

Benchmarks: where it lands

Fifth overall on a leading aggregate index - strong on reasoning and code, with one honest caveat on factual recall.

On the Artificial Analysis Intelligence Index (v4.0, which aggregates ten evaluations including GDPval-AA, Terminal-Bench Hard, SciCode, Humanity's Last Exam, and GPQA Diamond), Qwen3.7-Max scored 56.6, placing it fifth overall. That is a 4.8-point gain over its predecessor Qwen3.6-Max-Preview (51.8) and edges out Google's Gemini 3.5 Flash (55.3). Models ahead of it include GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2). For a preview model from a lab that was, not long ago, considered a step behind the Western frontier, landing within a few points of the leaders is the real story.

Artificial Analysis Intelligence Index v4.0 (bar chart; bars scaled relative to the top score, 60.2):

GPT-5.5: 60.2
Claude Opus 4.7: 57.3
Gemini 3.1 Pro: 57.2
Qwen3.7-Max: 56.6
Gemini 3.5 Flash: 55.3
Qwen3.6-Max: 51.8

Where the gains concentrated

The improvement over Qwen 3.6 was not spread evenly - it clustered in scientific reasoning, agentic capability, and coding. On CritPt, the model jumped 9.7 points (3.7% → 13.4%). On Humanity's Last Exam it rose 9.2 points (28.9% → 38.1%). Terminal-Bench Hard, an agentic coding benchmark, climbed 6.9 points (43.9% → 50.8%), and GDPval-AA added 42 Elo. Several other benchmarks stayed roughly flat, which tells you where the team spent its effort: making the model a better reasoner and agent, not a broadly better trivia engine.

On LM Arena

In blind, crowd-sourced preference comparisons on LM Arena's text leaderboard, Qwen3.7-Max-Preview reached #13 overall with an Elo of about 1,475, placing Alibaba as the #6 lab in text. Its category rankings were stronger than the overall: #7 in math, #9 in expert prompts, #9 in software and IT, and #10 in coding. Arena rankings reflect what real users actually prefer in head-to-head matchups rather than what a press release claims, which is precisely why Alibaba deploys there first.

The factual-recall tradeoff - read this carefully. On the AA-Omniscience benchmark, Qwen3.7-Max's raw accuracy fell 7.6 points (37.7% → 30.1%), while its hallucination rate dropped sharply by 21.3 points (44.2% → 22.9%). What happened: the model now chooses to say "I don't know" far more often instead of guessing. Its attempt rate fell from 67.3% to 48.0% - the lowest among the frontier models compared. It posts the lowest hallucination rate in the frontier tier, but partly by answering fewer questions. For agent and reasoning work this caution is often a virtue; for broad knowledge-recall use cases, test it carefully against your needs.
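
A quick back-of-the-envelope check makes the tradeoff concrete. Assuming the accuracy and attempt-rate percentages above are both measured over the full question set (an assumption; the benchmark's own scoring may treat partial answers differently), dividing one by the other gives the share of attempted questions answered correctly.

```python
def accuracy_when_attempted(accuracy_pct: float, attempt_rate_pct: float) -> float:
    """Share of attempted questions answered correctly, in percent."""
    return 100.0 * accuracy_pct / attempt_rate_pct


qwen36 = accuracy_when_attempted(37.7, 67.3)  # Qwen3.6-Max-Preview figures
qwen37 = accuracy_when_attempted(30.1, 48.0)  # Qwen3.7-Max figures

print(f"Qwen3.6-Max: {qwen36:.1f}% correct when it answers")
print(f"Qwen3.7-Max: {qwen37:.1f}% correct when it answers")
```

Under that assumption the newer model is right about 63% of the time when it chooses to answer, versus about 56% for its predecessor: fewer answers, but more reliable ones.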

The agentic headline: 35 hours, 1,000+ tool calls

The demo that defined the launch - and the caveat that should travel with it.

The most striking claim of the launch came from Alibaba's own internal testing. On a new chip platform, Qwen3.7-Max reportedly ran autonomously for up to 35 hours, performing more than 1,000 tool calls and iterative code modifications to optimize a key compute kernel. Alibaba says the process improved that kernel's inference speed by roughly 10x compared with the previous version - the model, in effect, writing meaningful parts of the software stack for the company's own hardware. It is a vivid illustration of what "long-horizon autonomous execution" is meant to mean in practice: not one clever answer, but thousands of small, verified steps sustained over a day and a half.

Qwen 3.7 was unveiled at the 2026 Alibaba Cloud Summit in Hangzhou, alongside Alibaba's self-developed AI accelerator chip.

Take the numbers with appropriate salt. The 35-hour and 1,000+ tool-call figures come from Alibaba's internal testing only - no independent verification exists for these specific claims. They are a credible signal of design intent and a strong marketing artifact, but not yet a reproducible third-party result. Treat them as a hypothesis to validate, not a guarantee.

What is more defensible is the direction. The model natively supports function calling and iterative tool invocation, the 1M context lets it carry full state across a long loop, and the extended-thinking layer gives it room to plan and self-correct between actions. Those are the right ingredients for agentic reliability, and the benchmark gains in agentic coding (Terminal-Bench Hard) line up with the story. Whether your particular 35-hour job survives intact is, as always, an empirical question for your own harness.
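
For readers who have not built one, the loop behind "iterative tool invocation" looks roughly like this. It is a generic sketch, not Alibaba's harness: the message shapes follow the OpenAI-style tool-calling format, run_agent and run_tests are hypothetical names, and scripted_model is an offline stand-in for a real API call so the example runs without a key.

```python
import json
from typing import Callable


def run_agent(call_model: Callable[[list], dict], tools: dict, task: str,
              max_steps: int = 10) -> str:
    """Ask the model, run any tool it requests, feed the result back, repeat."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"]  # no tool requested: this is the final answer
        for call in calls:
            fn = tools[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            result = fn(**args)
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    raise RuntimeError("step budget exhausted before a final answer")


def scripted_model(messages: list) -> dict:
    # Stand-in for a real model call: request one tool, then summarize its result.
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "Tests report: " + messages[-1]["content"]}
    return {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "run_tests", "arguments": json.dumps({"path": "kernel/"})}}]}


def run_tests(path: str) -> dict:
    # Hypothetical tool; a real one would invoke a test runner.
    return {"path": path, "passed": True}


answer = run_agent(scripted_model, {"run_tests": run_tests}, "Make the kernel tests pass.")
print(answer)
```

The max_steps cap matters in practice: a long-horizon run needs an explicit budget and a clear failure mode, so a stuck loop stops instead of billing tokens indefinitely.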

Use cases that fit

Where Qwen 3.7's strengths - depth, context, persistence - actually pay off.

Repository-scale coding & debugging

With a million tokens of context and strong agentic-coding benchmarks, the natural fit is large-codebase work: reading a whole repo into context, planning a multi-file refactor, running an edit-test-fix loop, and chasing a bug across many modules. The extended-thinking mode shines on exactly this kind of step-heavy, verify-as-you-go task.

Long-horizon automation & agents

Office workflow automation, multi-step data pipelines with iterative verification, and kernel- or systems-level optimization loops are the model's pitched home turf. The combination of persistent context, function calling, and self-correction is built for chains that would overwhelm a single-pass model.

Hard reasoning: math, science, expert prompts

The biggest benchmark gains landed in scientific reasoning and math, and the Arena category rankings (#7 math) back that up. For proof-style problems, complex analysis, and expert-level questions, the reasoning layer is the differentiator.

Multimodal understanding (via Plus)

When the input includes images or video - reading charts, interpreting screenshots, reasoning over diagrams - Qwen3.7-Plus is the model to reach for, given Max's text-only constraint.

Where to think twice: short, latency-sensitive, high-volume tasks (where reasoning overhead is wasted) and broad factual-recall applications (given the AA-Omniscience abstention behavior). Match the tool to the job.

How to access & build

Two front doors: the free chat interface, and an OpenAI-compatible API.

1 · The fastest path - Qwen Chat

The quickest way to test the model needs no API key and no setup. Go to chat.qwen.ai, create a free account, and choose Qwen3.7-Max in the model selector (it may appear as Qwen3.7-Max-Preview during the preview period). Toggle on Thinking Mode to activate chain-of-thought reasoning and watch the trace. The advice that pays off: bring your hardest real prompts - multi-step math, gnarly refactors, ambiguous expert questions - because trivial prompts reveal almost nothing about a frontier model's edge.

2 · The API - OpenAI & Anthropic compatible

Qwen3.7-Max is exposed through Alibaba Cloud's Model Studio (DashScope) and is compatible with both the OpenAI and Anthropic API specifications, so it usually drops into existing pipelines with minimal changes. Grab a key from Model Studio; the international base URL is dashscope-intl.aliyuncs.com.

# OpenAI-compatible call
import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[
        {"role": "user", "content": "Explain chain-of-thought reasoning."}
    ],
    # enable_thinking is not a standard OpenAI parameter, so it goes in extra_body.
    extra_body={"enable_thinking": True},
)
print(response.choices[0].message.content)

3 · Pricing

Official pricing for Qwen3.7-Max had not been published at announcement. For a reference point, the prior Qwen3.6-Max-Preview was priced at roughly $1.30 per million input tokens and $7.80 per million output tokens on Alibaba Cloud - and because reasoning models generate far more output tokens, output cost tends to dominate the bill. Budget accordingly, and use thinking mode selectively. Confirm current rates in Model Studio before committing.
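
To see why output cost dominates, it helps to run the arithmetic. The sketch below uses the Qwen3.6-Max-Preview reference rates quoted above as placeholders, since Qwen3.7-Max pricing was unpublished; the token counts are invented purely for illustration, and estimate_cost is a hypothetical helper.

```python
INPUT_USD_PER_M = 1.30   # reference rate for Qwen3.6-Max-Preview, per million input tokens
OUTPUT_USD_PER_M = 7.80  # reference rate for Qwen3.6-Max-Preview, per million output tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD at the placeholder rates above."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Same 200K-token prompt, with a short answer vs. a long reasoning-heavy answer.
without_thinking = estimate_cost(200_000, 2_000)
with_thinking = estimate_cost(200_000, 40_000)

print(f"short answer:   ${without_thinking:.4f}")
print(f"long reasoning: ${with_thinking:.4f}")
```

In this made-up case the request roughly doubles in cost once the output grows to 40K tokens, and the output side ($0.312) overtakes the input side ($0.26) even though the prompt is five times larger.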

Build-time checklist: define tools clearly via the standard tools parameter; pass full task state into the 1M window but trim what you don't need; and when writing tests, assert on the final answer, not on the exact wording of the reasoning trace, which is longer and varies from run to run.
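
Two items on that checklist are easy to show in code. The tool definition below uses the OpenAI-style tools schema; run_tests is a hypothetical tool, and final_number is a hypothetical test helper that checks only the final numeric answer, so differently worded responses pass the same assertion.

```python
import re

# A tool described in the OpenAI-style `tools` format (hypothetical tool).
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the unit tests under a directory and report pass/fail.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory to test"},
            },
            "required": ["path"],
        },
    },
}]


def final_number(answer: str) -> float:
    """Pull the last number out of a model answer, ignoring the surrounding wording."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer)
    if not matches:
        raise ValueError("no number found in answer")
    return float(matches[-1])


# Two differently worded answers satisfy the same check.
assert final_number("After checking both cases, the total is 42.") == 42.0
assert final_number("Answer: 42") == 42.0
```

The tools list would be passed as the tools argument of a chat completion request. For non-numeric tasks the same idea applies: normalize the final answer, then compare, rather than matching the full response text.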

Where Qwen 3.7 fits in the bigger picture

A flagship is easier to read once you see the family - and the strategy - it belongs to.

Qwen began life in 2023 as Alibaba Cloud's answer to the first wave of large language models, released under the Chinese name Tongyi Qianwen. For its first couple of years the series built a reputation primarily on open-weight releases: capable models, in a range of sizes, that developers could download and run themselves, often under permissive licenses. That openness is a large part of why Qwen became one of the most widely deployed model families in the world, particularly across Asia, and why so many fine-tunes and derivatives are built on Qwen bases. Understanding that history matters, because it explains the friction in how Qwen 3.7 has been received: a community accustomed to downloading Qwen weights met a flagship that, this time, is closed.

Through early 2026, Alibaba's release cadence accelerated dramatically - to roughly bi-weekly at its peak, matching the tempo of the fastest Western labs for the first time. Qwen 3.5 arrived, then Qwen 3.6, then a stream of specialized models: translation systems, image generators, audio-video reasoning variants, and dense models tuned for coding. Each generation tightened the gap with the frontier. Qwen 3.6 introduced the "Max" preview concept - a top-tier, closed-weight model validated quietly on public leaderboards before any announcement - and a 256K-token context window. Qwen 3.7 is the direct continuation of that line: the same Max-preview strategy, but with the context window quadrupled and the training effort pointed squarely at reasoning depth and agentic reliability.

The strategic context unveiled at the Summit is just as important as the model itself. Alibaba did not announce Qwen 3.7 in isolation; it announced a self-developed AI accelerator chip and more than fifty agent-oriented products as part of a single, coherent "full-stack agentic AI" message. The 35-hour demo - the model writing optimization software for Alibaba's own silicon - is the thesis statement of that strategy made literal: a flywheel where the model improves the hardware, the hardware runs the model, and the whole stack is owned end to end. This is also unfolding against the backdrop of tightening export controls on advanced chips, which gives Alibaba a strong incentive to reduce its dependence on foreign semiconductors. Qwen 3.7 is, in that sense, as much a geopolitical and industrial artifact as a technical one.

For a practitioner deciding what to actually use, the takeaways are concrete. If you need stable, generally available, and in many cases open-weight models today, the Qwen 3.6 line remains the pragmatic choice and is unaffected by the 3.7 preview. If you want to evaluate the frontier of Alibaba's reasoning and agentic capability - and you are comfortable with a closed, preview-stage, API-only model whose numbers may shift - Qwen 3.7 is where to look. And if the eventual stable release follows the pattern of prior generations, expect the preview window to be measured in weeks, not months, with refinements to behavior and the first official pricing arriving alongside it. Bookmark the official Qwen blog and Model Studio docs, because in a release cadence this fast, the authoritative answer changes often.

How the launch unfolded

A now-familiar Alibaba pattern: deploy quietly, validate publicly, announce after.

May 14, 2026: Quiet Arena debut. Two preview models - Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview - appear on the public preference leaderboard with no press release, no blog post, and no API announcement. Developers spot them and start comparing.
~May 18, 2026: The single tweet. Alibaba teases the launch on X: a one-line "can't wait to release the 3.7 series, stay tuned" - confirming what the community had already inferred.
May 19, 2026: API access lands. The model arrives on Alibaba Cloud's Model Studio platform for developers.
May 20, 2026: Formal reveal. At the Alibaba Cloud Summit in Hangzhou, Qwen3.7-Max is announced alongside Alibaba's self-developed AI chip and 50+ agent products, framed around a full-stack "agentic AI" strategy.

This is the same playbook Alibaba ran for Qwen3.6-Max-Preview in April 2026: validate first on blind human preference, market later. The community has learned to read an Arena appearance as a one-to-two-week preview ahead of a formal launch.

Frequently asked questions

Is Qwen 3.7 open-weight or open-source?

No. The Qwen3.7-Max flagship is proprietary and closed-weight. There is no downloadable, open-weight Qwen 3.7 release and no Apache-2.0 license attached to it as of late May 2026. Access is via Alibaba Cloud's Model Studio API and Qwen Chat. Many smaller Qwen models in earlier generations were open-weight, but that does not extend to the 3.7 Max preview.

Does Qwen3.7-Max support images or video?

No - Qwen3.7-Max is text-in, text-out only. For vision and multimodal inputs, use Qwen3.7-Plus-Preview, the balanced variant that accepts image input.

How big is the context window?

One million tokens, up from 256K on the previous Qwen3.6-Max preview. Note that a maximum window size is not a guarantee of uniform reasoning quality across the whole window; independent long-context testing isn't yet available.

How does it compare to GPT-5.5, Claude Opus 4.7, and Gemini?

On the Artificial Analysis Intelligence Index it scored 56.6 (#5 overall), behind models including GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2), and ahead of Gemini 3.5 Flash (55.3). It is competitive at the frontier, especially on reasoning and agentic coding, while remaining a preview.

Why did its factual accuracy drop versus Qwen 3.6?

On AA-Omniscience, raw accuracy fell from 37.7% to 30.1%, but its hallucination rate dropped from 44.2% to 22.9%. The model abstains far more often ("I don't know") rather than guessing - lowering hallucinations at the cost of raw recall. Good for agentic reliability, worth testing for knowledge-recall tasks.

What's the difference between Qwen 3.7 and Qwen 3.6?

Qwen 3.6 is the shipping, generally available generation. Qwen 3.7 is the newer preview that quadruples context to 1M tokens and concentrates its gains on reasoning, science, and agentic coding rather than across-the-board scaling. If you need stable, available models today, 3.6 is the practical choice; 3.7 is the frontier preview.

Can I trust the 35-hour autonomous-run claim?

It comes from Alibaba's internal testing and has no independent verification yet. It's a credible statement of design intent and the agentic-coding benchmark gains are consistent with it, but treat the specific figure as a vendor claim to validate, not an established result.