Preview · Announced May 20, 2026

Qwen3.7-Plus: the balanced multimodal flagship

Vision-capable, reasoning-first, and built to read the world as readily as it reads text. The multimodal half of Alibaba's Qwen 3.7 generation.

Qwen3.7-Plus-Preview - an early build. Specs, behavior, and toolchain may change before stable release.


Input modalities: Text + Vision

Lab rank, vision: #5

Arena, vision overall: #16

Context (preview): Large

Overview

What is Qwen3.7-Plus?

The model you reach for the moment an image, a chart, a screenshot, or a video frame enters the conversation.

Qwen3.7-Plus is the multimodal member of Alibaba's Qwen 3.7 generation, released as a preview alongside its text-only sibling, Qwen3.7-Max. Both models come from the Qwen team at Alibaba Cloud - the AI division of Alibaba Group - and both were unveiled at the 2026 Alibaba Cloud Summit in Hangzhou on May 20, 2026, the same event where Alibaba announced a self-developed AI accelerator chip and a sweeping slate of more than fifty agent-oriented products. Where Qwen3.7-Max is the pure reasoning flagship that takes text in and gives text out, Qwen3.7-Plus is the balanced generalist: a high-performance preview that accepts both text and vision input and is described by the team as focused on reasoning and logical expression across modalities.

That single difference - vision - is the whole reason Plus exists as a distinct model. Modern work is rarely text-only. People paste screenshots of error messages, photograph whiteboards, drop in financial charts, share UI mockups, and ask a model to reason over what it sees. A text-only flagship, however capable, simply cannot participate in that workflow. Qwen3.7-Plus is the Qwen team's answer: keep the reasoning depth that defines the 3.7 generation, but wrap it in a model that can natively perceive images and video frames rather than relying on a separate captioning step or an external vision encoder bolted on afterward.

Qwen by Alibaba Cloud — Qwen (Tongyi Qianwen) is Alibaba Cloud's family of large language models, first released in 2023.

It is worth being precise about the model's status, because the early coverage has been noisy. As of late May 2026, Qwen3.7-Plus exists only as a preview. Like the Max flagship, it is proprietary and closed-weight - there is no open-weight, downloadable Qwen3.7-Plus release, and no Apache-2.0 license of the kind attached to many smaller Qwen models in earlier generations. Access runs through Alibaba Cloud's Model Studio API and the Qwen Chat web interface. The "-Preview" suffix is not decorative: Alibaba treats these as early builds, and benchmark scores, behavior, pricing, and even the toolchain can shift before a stable release lands. For Plus in particular, tool and function-calling support is described as "rolling out" gradually rather than complete at launch, so anyone building on it today should expect the supported feature set to expand over the coming weeks.

Plus first surfaced the way recent Qwen models tend to: quietly. Before any press release, both 3.7 preview models appeared as unannounced entries on a public preference leaderboard, where developers could compare them blind against the rest of the frontier. That rollout pattern - deploy first, validate publicly, announce after - has become an Alibaba signature, and it is precisely why an Arena appearance from the Qwen team now reads as the opening move of a launch rather than a casual experiment.

For positioning, Qwen3.7-Plus sits one tier below the Max flagship on raw reasoning but one tier above it on flexibility: it can do things Max structurally cannot. It builds directly on the Qwen 3.6 line - still the family's shipping, generally available generation - and inherits the 3.7 generation's central bet that the next frontier is sustained, multi-step reasoning and agentic reliability rather than simply piling on parameters. The difference is that Plus carries that bet into a world of pixels as well as tokens.

The lineup

Plus vs. Max: two previews, two jobs

Choosing between them comes down to one question, asked first: do you need the model to see, or only to read?

Across recent generations, Alibaba has shipped a top-tier "Max" model alongside more accessible variants, and the 3.7 preview follows that shape. The two models are siblings, not interchangeable tiers - they target genuinely different needs, and the clean dividing line between them is modality.

▣ MULTIMODAL · THIS PAGE

Qwen3.7-Plus-Preview

Balanced · text + vision

Modality: Text + vision input

Focus: Reasoning + logic

Context: Large (preview)

Toolchain: Rolling out

Weights: Closed / proprietary

Best for: Vision + multimodal

Arena (vision): #16 overall

◰ TEXT REASONING FLAGSHIP

Qwen3.7-Max-Preview

Reasoning flagship · text-only

Modality: Text in, text out

Mode: Extended thinking

Context: 1,000,000 tokens

Weights: Closed / proprietary

Best for: Code, agents, math

API string: qwen3.7-max

Index score: 56.6 (#5)

How to pick. Reach for Qwen3.7-Max when your inputs are pure text and you want the deepest possible reasoning - coding agents, long-horizon automation, step-heavy math and logic. Reach for Qwen3.7-Plus the instant images or video enter the input: reading diagrams, interpreting screenshots, reasoning over charts, or analyzing video frames. The decision is rarely about which model is "better" in the abstract; it is about which modality your task actually requires. If your pipeline mixes both - say, an agent that mostly handles text but occasionally needs to read a screenshot - Plus is the safer default, because it accepts everything Max accepts plus images, even if Max edges it on the hardest text-only reasoning.
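The selection rule above can be sketched as a small router. This is only an illustration under stated assumptions: `pick_model` and `has_visual_input` are hypothetical helper names, messages are assumed to follow the OpenAI-style content-array format, and the string `qwen3.7-plus` is unconfirmed and should be checked in Model Studio.

```python
from typing import Any

# "qwen3.7-max" is the API string listed in the lineup card.
# "qwen3.7-plus" is an assumption; confirm the exact string in Model Studio.
TEXT_MODEL = "qwen3.7-max"
VISION_MODEL = "qwen3.7-plus"


def has_visual_input(messages: list[dict[str, Any]]) -> bool:
    """Return True if any message carries an image part (OpenAI-style content array)."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_model(messages: list[dict[str, Any]], mixed_pipeline: bool = False) -> str:
    """Route on modality: Plus when images are present or the pipeline is mixed, else Max."""
    if mixed_pipeline or has_visual_input(messages):
        return VISION_MODEL
    return TEXT_MODEL
```

Setting `mixed_pipeline=True` reflects the "safer default" advice: a pipeline that sometimes sees screenshots routes everything to Plus rather than switching models per request.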

One practical note. Plus is positioned as the balanced model, not the maximal one. If a task is entirely text and entirely reasoning-bound - a gnarly multi-file refactor, a long proof - Max's extended-thinking mode and confirmed 1M-token window give it the edge. Plus trades a sliver of that peak text performance for the ability to perceive, which for a huge share of real-world work is the trade that actually matters.

How it sees & thinks

Native vision, reasoning-first

Plus does not bolt a vision module onto a chatbot; it reasons over what it sees as part of one model.

The defining capability of Qwen3.7-Plus is that vision is a first-class input, not an afterthought. You can hand it an image - a chart, a UI screenshot, a photo of a handwritten page, a frame from a video - and ask it to reason about the content, extract structure, answer questions, or feed what it sees into a longer chain of logic. The Qwen team frames Plus explicitly around reasoning and logical expression, which is the throughline that connects it to the rest of the 3.7 generation: this is a model designed to think carefully about its inputs, whether those inputs are sentences or pixels.

That framing matters because multimodal models historically fell into two camps. Some were strong at perception - describing an image accurately - but weak at the downstream reasoning that makes perception useful. Others reasoned well over text but treated images as a thin captioning layer. The 3.7 generation's emphasis on extended, deliberate reasoning is meant to close that gap on the Plus side: perceive the image, then reason about it with the same care the family brings to text. In the blind preference rankings on the vision leaderboard, that ambition showed up as a strong result - Plus helped place Alibaba as the #5 lab in vision capability, with the model itself landing around #16 overall on the vision board. For a preview that had not even been formally documented when developers first started comparing it, that is a genuinely competitive showing.

Why "balanced" is the right word

Alibaba describes Plus as a high-performance balanced preview, and the word is doing real work. Plus is not trying to be the single best model at any one thing. It is trying to be broadly excellent across a wide range of inputs and tasks - strong reasoning, solid coding, capable vision, sensible logic - so that a developer can route a varied workload to one model rather than juggling a fleet of specialists. That generalist posture is exactly what you want when you don't know in advance whether the next request will be a paragraph of text, a screenshot, or a diagram. The cost of that breadth is that a dedicated text reasoner like Max will out-reason it on the very hardest text-only problems, and a dedicated coding model will out-code it on pure code. The benefit is flexibility, and for many production systems flexibility beats peak specialism.

A preview-stage caveat worth internalizing. Because Plus is an early build, several of its specifics are deliberately fuzzy. Its exact context window is described as "large" rather than pinned to a confirmed number the way Max's million-token window is, and its function-calling toolchain is still rolling out. Independent, third-party evaluation of Plus's vision reliability - especially on hard, adversarial images - is not yet available. Treat the vision leaderboard placement as an encouraging signal of relative quality, not as a guarantee for your specific images, and validate against your own data before you depend on it.

One more architectural point connects Plus to the broader 3.7 story. The generation as a whole is built around the idea that good answers often require deliberate, multi-step thinking rather than a single reflexive pass. On the Max flagship that surfaces as an explicit extended-thinking mode. Plus inherits the same reasoning-first philosophy, applied across modalities - the goal being a model that looks at a complex chart and works through what it implies, rather than merely naming the chart type. How fully that deliberate-reasoning behavior is exposed and controllable on Plus is one of the things likely to firm up as the preview matures.

Benchmarks

Where the generation lands

Read Plus through two lenses: its own vision-leaderboard standing, and the reasoning ceiling its sibling Max sets for the generation.

Plus has not yet been run through the full aggregate intelligence indices that the text-only Max flagship has, which is normal for a multimodal preview whose evaluation surface is still being built out. What we do have is its standing in blind, crowd-sourced preference comparisons on the vision leaderboard, where it reached roughly #16 overall and helped place Alibaba as the #5 lab in vision. Arena-style rankings reflect what real users actually prefer in head-to-head matchups rather than what a press release claims - which is exactly why Alibaba deploys there first, and why a strong placement for an undocumented preview is meaningful.

For the reasoning ceiling of the generation, the clearest reference point is the Max flagship. On the Artificial Analysis Intelligence Index (v4.0, which aggregates ten evaluations), Qwen3.7-Max scored 56.6, placing it fifth overall - a 4.8-point gain over its predecessor and within striking distance of the very top of the frontier. That score sets the upper bound for what the 3.7 reasoning approach can do on pure text; Plus brings a meaningful share of that reasoning quality into a multimodal package, while trading some peak text performance for the ability to see.

Artificial Analysis Intelligence Index v4.0 scores, with Qwen3.7-Max shown as the generation's text-reasoning reference:

GPT-5.5: 60.2

Claude Opus 4.7: 57.3

Gemini 3.1 Pro: 57.2

Qwen3.7-Max: 56.6

Gemini 3.5 Flash: 55.3

Qwen3.6-Max: 51.8
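The arithmetic behind these figures is easy to reproduce. The sketch below uses only the six scores listed here; the variable names are illustrative, and the bar widths simply express each score as a percentage of the top score.

```python
# Scores as listed for the Artificial Analysis Intelligence Index v4.0.
scores = {
    "GPT-5.5": 60.2,
    "Claude Opus 4.7": 57.3,
    "Gemini 3.1 Pro": 57.2,
    "Qwen3.7-Max": 56.6,
    "Gemini 3.5 Flash": 55.3,
    "Qwen3.6-Max": 51.8,
}

top = max(scores.values())

# Generation-over-generation gain for the Max flagship (rounded to one decimal
# to avoid floating-point noise).
gain = round(scores["Qwen3.7-Max"] - scores["Qwen3.6-Max"], 1)

# Distance from the highest listed score.
gap_to_top = round(top - scores["Qwen3.7-Max"], 1)

# Each score as a percentage of the top score, i.e. a relative bar width.
bar_widths = {name: round(100 * score / top, 1) for name, score in scores.items()}

print(f"gain over Qwen3.6-Max: {gain}")
print(f"gap to top score: {gap_to_top}")
for name, width in bar_widths.items():
    print(f"{name}: {width}%")
```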

Reading the numbers honestly

Two cautions carry over from the generation as a whole. First, the gains in the 3.7 line concentrated in reasoning, science, and agentic coding rather than spreading evenly - the team clearly spent its effort making a better reasoner, not a broader trivia engine. Second, on a knowledge-recall benchmark the Max flagship's raw accuracy actually fell while its hallucination rate dropped sharply: the model now abstains - says "I don't know" - far more often instead of guessing. For agentic and reasoning work that caution is frequently a virtue; for broad factual-recall use cases it is something to test deliberately. Whether and how strongly that same abstention behavior appears in Plus is a sensible thing to check on your own prompts, since Plus shares the generation's reasoning-first lineage.
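One rough way to check abstention on your own prompts is to tally how often answers decline to commit. The sketch below is a heuristic, not an established metric: the phrase list is an assumption you should tune to your domain, and `is_abstention` and `abstention_rate` are hypothetical helper names.

```python
import re

# Assumed phrasings of "I don't know"; extend for your own prompts and languages.
ABSTAIN_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bi do not know\b",
    r"\bi'?m not sure\b",
    r"\bi am not sure\b",
    r"\bcan(?:not|'?t) determine\b",
]
_ABSTAIN_RE = re.compile("|".join(ABSTAIN_PATTERNS), re.IGNORECASE)


def is_abstention(answer: str) -> bool:
    """Return True if the answer contains one of the assumed abstention phrases."""
    normalized = answer.replace("\u2019", "'")  # treat curly apostrophes as straight
    return bool(_ABSTAIN_RE.search(normalized))


def abstention_rate(answers: list[str]) -> float:
    """Fraction of answers flagged as abstentions (0.0 for an empty list)."""
    if not answers:
        return 0.0
    return sum(is_abstention(a) for a in answers) / len(answers)
```

Running the same prompt set through Plus and through whatever model you use today, then comparing the two rates alongside accuracy, shows whether the caution helps or hurts your use case.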

Bottom line on benchmarks. Plus has a credible, independently observed vision standing (#16 overall, with Alibaba as the #5 lab) and inherits a reasoning lineage that scored at the frontier on text. But it is a preview without a full published benchmark suite of its own. Use the numbers as direction, not as a contract - and run your own evaluation on the modality mix you actually care about.

Use cases

Where Qwen3.7-Plus fits

The model earns its place wherever perception and reasoning have to happen together.

📊

Document & chart understanding

Reading financial charts, scientific figures, dashboards, and infographics - then reasoning over what they imply rather than merely describing them. The reasoning-first framing is exactly what turns "this is a bar chart" into "revenue stalled in Q3 because of X."

🖥️

Screenshot & UI analysis

Interpreting application screenshots, error dialogs, and interface mockups. A developer can paste a broken UI and ask what's wrong; a designer can hand over a mockup and ask for critique - work that a text-only model structurally cannot touch.

🎞️

Video-frame reasoning

Reasoning over frames extracted from short video clips - following a sequence of events, spotting changes between frames, or answering questions about visual content over time, with the logical care the 3.7 generation is built around.

🔀

Mixed-modality workflows

The generalist sweet spot: pipelines where a single request might be text, an image, or both. Routing a varied workload to one balanced model is simpler and cheaper to operate than orchestrating a fleet of single-purpose specialists.

🧮

Visual problem-solving

Geometry diagrams, annotated equations, handwritten notes, and worked examples - combining perception of the image with step-by-step reasoning to reach and explain an answer rather than guessing from a caption.

🌐

Accessibility & translation

Describing images for accessibility, extracting and reasoning over text embedded in pictures, and bridging visual content into language - useful across the wide multilingual footprint the Qwen family is known for, especially across Asia.
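For the video-frame case above, one plausible request shape is a single user message holding the question followed by the frames in order. This is a sketch under assumptions: `build_frame_message` is a hypothetical helper, the payload follows the OpenAI-style content array, the "Frame N:" labels are just a convention to help the model refer to frames, and whether the preview accepts several images per message should be confirmed in the Model Studio docs.

```python
def build_frame_message(question: str, frame_urls: list[str]) -> dict:
    """Build one OpenAI-style user message: the question, then labeled frames in order."""
    if not frame_urls:
        raise ValueError("at least one frame URL is required")

    content: list[dict] = [{"type": "text", "text": question}]
    for index, url in enumerate(frame_urls, start=1):
        # A short text label before each image keeps the temporal order explicit.
        content.append({"type": "text", "text": f"Frame {index}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


message = build_frame_message(
    "What changes between these frames?",
    ["https://example.com/frame-001.png", "https://example.com/frame-002.png"],
)
```

The resulting dictionary can be passed as an element of the `messages` list in a chat-completions call.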

Where to think twice. If your task is entirely text and entirely reasoning-bound - a long proof, a repository-scale refactor, a multi-hour agent loop - the Max flagship's confirmed million-token window and explicit extended-thinking mode make it the better tool. And for short, latency-sensitive, high-volume requests, a reasoning-heavy model of either kind can be overkill: the deliberate-thinking overhead is wasted on shallow prompts. Match the model to the job; Plus is the answer for "perceive and reason," not for "answer instantly at scale."

Access & build

How to try and integrate Plus

Two front doors: a free chat interface for hands-on testing, and an OpenAI-compatible API for production.

The fastest path - Qwen Chat

The quickest way to test Plus needs no API key and no setup. Go to chat.qwen.ai, create a free account, and choose the Plus model in the selector (it may appear as Qwen3.7-Plus-Preview during the preview period). Because Plus is the multimodal model, the test that actually reveals its character is to upload an image - a chart, a screenshot, a diagram - and ask it to reason about what it sees, not merely describe it. Trivial text prompts tell you almost nothing about a frontier multimodal model's real edge.

The API - OpenAI & Anthropic compatible

The Qwen 3.7 models are exposed through Alibaba Cloud's Model Studio (DashScope) and are compatible with both the OpenAI and Anthropic API specifications, so they usually drop into existing pipelines with minimal changes. Grab a key from Model Studio; the international endpoint host is dashscope-intl.aliyuncs.com, and the OpenAI-compatible base URL adds the /compatible-mode/v1 path. For Plus specifically, you send image content alongside text in the standard multimodal message format - and remember the toolchain is still rolling out, so check the current docs for the exact model string and supported features.

# OpenAI-compatible call with an image (multimodal)
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

response = client.chat.completions.create(
    model="qwen3.7-plus",   # confirm exact string in Model Studio
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What trend does this chart show, and why?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}}
        ]
    }]
)
print(response.choices[0].message.content)
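Screenshots and photos usually live on disk rather than at a public URL. In the OpenAI message format a local image can be sent inline as a base64 data URL; the sketch below builds such a content part. The helper names are hypothetical, and whether the Plus preview endpoint accepts data URLs (and up to what size) is an assumption to verify in the Model Studio docs.

```python
import base64
import mimetypes
from pathlib import Path


def image_part_from_bytes(data: bytes, mime_type: str) -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url part using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def image_part_from_file(path: str) -> dict:
    """Read a local image and wrap it; the MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    return image_part_from_bytes(Path(path).read_bytes(), mime_type)
```

The returned part goes in the same `content` list as the text part in the call above, in place of the `https://example.com/chart.png` entry.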

Pricing

Official pricing for the Qwen 3.7 previews had not been fully published at announcement. As a reference point, the prior-generation Qwen3.6-Max-Preview was priced at roughly $1.30 per million input tokens and $7.80 per million output tokens on Alibaba Cloud. Multimodal input adds its own token accounting - images consume tokens based on resolution - so budget for that on top of text. Confirm current Plus rates in Model Studio before committing, since preview pricing can change.
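For rough budgeting, the reference rates above can be turned into a per-request estimate. The sketch below is illustrative only: the rates are the prior-generation Qwen3.6-Max-Preview figures, not Plus prices; the image token count is supplied by the caller because the resolution-to-token rule is not stated here; and billing image tokens at the text input rate is an assumption.

```python
# Reference rates from Qwen3.6-Max-Preview, in USD per million tokens.
# These are NOT confirmed Qwen3.7-Plus prices.
INPUT_USD_PER_MILLION = 1.30
OUTPUT_USD_PER_MILLION = 7.80


def estimate_cost_usd(
    text_input_tokens: int,
    image_input_tokens: int,
    output_tokens: int,
    input_rate: float = INPUT_USD_PER_MILLION,
    output_rate: float = OUTPUT_USD_PER_MILLION,
) -> float:
    """Estimate one request's cost; image tokens are assumed to bill at the input rate."""
    input_tokens = text_input_tokens + image_input_tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Example: 2,000 text tokens, 1,000 image tokens, 500 output tokens.
example = estimate_cost_usd(2_000, 1_000, 500)
print(f"estimated cost: ${example:.4f}")
```

Under those reference rates the example request comes to about $0.0078; swap in the published Plus rates once they are available.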

Build-time checklist. Send images in the standard multimodal content array; keep image resolution sensible, since higher resolution means more tokens billed; treat the tool/function-calling surface as still expanding and re-check the docs; and when writing tests, assert on the conclusion the model draws from an image, not on the exact wording of its description.
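The last checklist item, asserting on the conclusion rather than the wording, can be approximated with a keyword check. This is a deliberately simple sketch: `states_conclusion` is a hypothetical helper, substring matching is an assumption that will miss paraphrases, and a stricter test suite might use a grader model instead.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def states_conclusion(answer: str, required_all=(), required_any=()) -> bool:
    """True if the answer mentions every term in required_all and, when given,
    at least one term in required_any."""
    normalized = normalize(answer)
    has_all = all(term.lower() in normalized for term in required_all)
    has_any = not required_any or any(term.lower() in normalized for term in required_any)
    return has_all and has_any


# The test passes for any phrasing that places a stall in Q3.
answer = "Revenue flattened  in Q3,\nthen recovered the following quarter."
assert states_conclusion(answer, required_all=["q3"], required_any=["flat", "stall", "plateau"])
```

A description-only answer such as "This is a bar chart with blue bars" fails the same check, which is the point: the test rewards the inference, not the caption.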

The bigger picture

Where Plus fits in Alibaba's strategy

A multimodal flagship is easier to read once you see the family - and the full-stack ambition - it belongs to.

Qwen began life in 2023 as Alibaba Cloud's answer to the first wave of large language models, released under the Chinese name Tongyi Qianwen. For its first couple of years the series built its reputation primarily on open-weight releases - capable models, in a range of sizes, that developers could download and run themselves, often under permissive licenses. That openness is a large part of why Qwen became one of the most widely deployed model families in the world, particularly across Asia, and why so many fine-tunes and derivatives are built on Qwen bases. It also explains a point of friction in how the 3.7 generation has been received: a community accustomed to downloading Qwen weights met flagships that, this time, are closed - and Plus is no exception.

Through early 2026, Alibaba's release cadence accelerated dramatically - to roughly bi-weekly at its peak, matching the tempo of the fastest Western labs for the first time. Qwen 3.5 arrived, then Qwen 3.6, then a stream of specialized models: translation systems, image generators, audio-video reasoning variants, and dense models tuned for coding. Each generation tightened the gap with the frontier. Qwen 3.6 introduced the "Max" preview concept - a top-tier, closed-weight model validated quietly on public leaderboards before any announcement - and a 256K-token context window. The 3.7 generation is the direct continuation of that line, pushing context and reasoning depth further. Plus extends the strategy into multimodality: the same preview-first, validate-publicly approach, applied to a model that can see.

Alibaba Cloud Summit 2026 — Qwen 3.7 was unveiled at the 2026 Alibaba Cloud Summit in Hangzhou, alongside Alibaba's self-developed AI accelerator chip.

The strategic context unveiled at the Summit is as important as any single model. Alibaba did not announce the 3.7 generation in isolation; it announced a self-developed AI accelerator chip and more than fifty agent-oriented products as part of a single, coherent "full-stack agentic AI" message. The thesis is a flywheel: the models improve the software stack, the hardware runs the models, and the whole thing is owned end to end. This is also unfolding against the backdrop of tightening export controls on advanced chips, which gives Alibaba a strong incentive to reduce its dependence on foreign semiconductors. In that frame, Plus is one piece of a much larger industrial bet - the multimodal interface through which a great deal of real-world, perception-heavy work will flow into that stack.

For a practitioner deciding what to actually use, the takeaways are concrete. If you need stable, generally available, and in many cases open-weight models today, the Qwen 3.6 line remains the pragmatic choice and is unaffected by the 3.7 preview. If you want to evaluate the frontier of Alibaba's multimodal reasoning - and you are comfortable with a closed, preview-stage, API-only model whose numbers and toolchain may shift - Qwen3.7-Plus is where to look for anything involving images or video. And if the eventual stable release follows the pattern of prior generations, expect the preview window to be measured in weeks rather than months, with refinements to behavior, an expanded toolchain, and the first official pricing arriving alongside it. Bookmark the official Qwen blog and Model Studio docs, because in a release cadence this fast, the authoritative answer changes often.

How the launch unfolded

A now-familiar Alibaba pattern

Deploy quietly, validate publicly, announce after - Plus rode in on exactly this playbook.

May 14, 2026

Quiet Arena debut

Two preview models - Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview - appear on public preference leaderboards with no press release, no blog post, and no API announcement. Developers spot them and start comparing them blind against the rest of the frontier.

~May 18, 2026

The single tweet

Alibaba teases the launch on X with a one-line "can't wait to release the 3.7 series, stay tuned" - confirming what the community had already inferred from the leaderboard sightings.

May 19, 2026

API access lands

The models arrive on Alibaba Cloud's Model Studio platform for developers, opening up programmatic access ahead of the formal reveal.

May 20, 2026

Formal reveal at the Summit

At the Alibaba Cloud Summit in Hangzhou, the 3.7 generation is announced alongside Alibaba's self-developed AI chip and 50+ agent products, framed around a full-stack "agentic AI" strategy.

This is the same playbook Alibaba ran for Qwen3.6-Max-Preview in April 2026: validate first on blind human preference, market later. The community has learned to read an Arena appearance from the Qwen team as a one-to-two-week preview ahead of a formal launch - which is exactly how the 3.7 generation, Plus included, played out.

FAQ

Frequently asked questions

What is the difference between Qwen3.7-Plus and Qwen3.7-Max?

Modality is the dividing line. Qwen3.7-Max is text-in, text-out only - the pure reasoning flagship with a confirmed 1M-token context window and an explicit extended-thinking mode. Qwen3.7-Plus accepts both text and vision input and is positioned as the balanced, multimodal generalist. Use Max for the hardest text-only reasoning, coding, and long agent loops; use Plus the moment images or video are part of the input.

Is Qwen3.7-Plus open-weight or open-source?

No. Like the Max flagship, Qwen3.7-Plus is proprietary and closed-weight. There is no downloadable, open-weight Qwen3.7-Plus release and no Apache-2.0 license attached to it as of late May 2026. Access is via Alibaba Cloud's Model Studio API and Qwen Chat. Many smaller Qwen models in earlier generations were open-weight, but that does not extend to the 3.7 previews.

Does Qwen3.7-Plus support images and video?

Yes - that is its whole reason for existing. Plus accepts text and vision input, including images and video frames, and is built to reason over what it sees rather than just caption it. This is precisely the capability Qwen3.7-Max lacks, since Max is text-only.

How big is the Plus context window?

It is described as "large" for the preview rather than pinned to a confirmed number, unlike Max's stated 1,000,000-token window. Because Plus is an early build, treat its exact context capacity as provisional and check the current Model Studio documentation before relying on a specific figure.

How does Plus rank against other vision models?

In blind preference comparisons on the vision leaderboard, Qwen3.7-Plus-Preview reached roughly #16 overall and helped place Alibaba as the #5 lab in vision capability - a strong showing for a preview that had not yet been formally documented. It has not yet been run through the full aggregate intelligence indices, so treat the Arena standing as the most reliable current signal and validate on your own images.

Does Plus support tool use and function calling?

It is rolling out. Alibaba describes the Plus toolchain as opening up gradually rather than being fully finished at launch, which is typical for a preview model. If your build depends on function calling, check the current Model Studio docs for exactly which features are live before committing.

What does Qwen3.7-Plus cost?

Official pricing for the 3.7 previews was not fully published at announcement. For reference, the prior Qwen3.6-Max-Preview ran roughly $1.30 per million input tokens and $7.80 per million output tokens. Multimodal input adds its own token accounting - images cost tokens based on resolution - so budget for that on top. Confirm current Plus rates in Model Studio.

Should I use Plus or wait for a stable release?

If you need stable, generally available models today, the Qwen 3.6 line remains the pragmatic choice and is unaffected by the 3.7 preview. Use Plus now if you want to evaluate Alibaba's frontier multimodal reasoning and are comfortable with a closed, preview-stage, API-only model whose specs and toolchain may shift. If prior generations are a guide, the stable release - with firmer specs and official pricing - is likely weeks, not months, away.