Is OpenClaw Worth It? We Tested the “AI Worker” Against 6 Top Models—Here’s What Actually Worked

Key Points

  • OpenClaw is a command center, not an LLM itself; its performance depends entirely on the connected model.
  • Testing with a multi-step journalism workflow revealed a wide performance gap: Qwen3-Max and Kimi-K2.5 struggled significantly, while MiniMax-M2.5 and Zhipu GLM-4.7 performed well after initial hurdles.
  • OpenAI’s GPT-4-mini consistently outperformed all Chinese models, handling the entire workflow smoothly with “almost zero human intervention.”
  • Current barriers to mainstream adoption include technical setup difficulties for non-developers, high operational costs (e.g., ¥200 RMB for 20 interactions), and severe security risks due to required system permissions.
  • Experts largely consider OpenClaw an “advanced prototype” for developers, not yet suitable for general productivity due to immaturity and vulnerabilities.

The hype around OpenClaw is real.

It’s pitched as your personal “AI worker” that can take over your computer, write articles, send emails, and handle tasks hands-free.

Sounds incredible, right?

But does it actually deliver?

Or is it just another shiny tech toy that looks good in demos but falls apart in real-world use?

We decided to find out.

Working with technical developers, we ran an extensive hands-on test of OpenClaw connected to five major Chinese large language models plus OpenAI’s GPT-4-mini.

The results?

Messy. Inconsistent. And honestly pretty revealing.

What We Actually Tested

Here’s what matters to understand upfront: OpenClaw itself isn’t a large language model (LLM).

Think of it more like a command center.

It receives your instructions, decides which tools to use, and coordinates the workflow.

The actual brainpower comes from whatever model you plug into it.
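The division of labor described above can be sketched conceptually. This is not OpenClaw's actual code or API — just an illustrative loop (all names hypothetical) showing how a "command center" routes an instruction to whatever model is plugged in and dispatches the tool it picks:

```python
# Illustrative sketch only -- not OpenClaw's real API. All names are hypothetical.
from typing import Callable, Dict

class Orchestrator:
    """A 'command center': it owns the tools; the plugged-in model owns the reasoning."""

    def __init__(self, model: Callable[[str], str], tools: Dict[str, Callable[[str], str]]):
        self.model = model   # any LLM callable: prompt in, text out
        self.tools = tools   # e.g. file search, web search, email

    def run(self, instruction: str) -> str:
        # Ask the model which tool to use and with what argument ("tool: arg").
        decision = self.model(f"Pick a tool for: {instruction}")
        name, _, arg = decision.partition(":")
        tool = self.tools.get(name.strip())
        if tool is None:
            return f"unknown tool: {name.strip()}"
        return tool(arg.strip())

# A stub "model" that always picks web search, to show the wiring.
stub_model = lambda prompt: "search: OpenClaw review"
agent = Orchestrator(stub_model, {"search": lambda q: f"results for {q!r}"})
print(agent.run("add context to the article"))  # results for 'OpenClaw review'
```

Swapping `stub_model` for a real model client is the only change needed to "upgrade the brain" — which is exactly why the test results below vary so much by model.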

We connected it to:

  • Qwen3-Max (Qianwen 千问)
  • Kimi-K2.5 (Yuezhi Anmian 月之暗面) from Moonshot AI
  • MiniMax-M2.1
  • MiniMax-M2.5
  • Zhipu GLM-4.7 (Zhipu 智谱)
  • OpenAI’s GPT-4-mini

We gave each combo a real-world challenge.

The Challenge: A Multi-Step Workflow

To keep things practical, we simulated an actual journalism scenario:

  • Find a specific interview transcript on the local computer
  • Summarize that content
  • Perform web research to add context
  • Draft a complete feature article
  • Send the finished piece via email

This wasn’t a test of one skill—it required instruction comprehension, file retrieval, browser control, web search, content writing, and email functionality.

In other words, real work.

The Results: A Wide Performance Gap

Performance Summary of Different LLM Models on OpenClaw

| Model Name | Performance Category | Key Issues / Strengths |
|---|---|---|
| GPT-4-mini | 🟢 Winner | Smooth and stable from start to finish; almost zero human intervention required. |
| MiniMax-M2.5 | 🟡 Middle Ground | Handled the entire workflow with minimal friction; passed the retest. |
| Zhipu GLM-4.7 | 🟡 Middle Ground | Fast processing overall; required one small correction to the email URL. |
| MiniMax-M2.1 | 🟡 Middle Ground | Suggested workarounds for email issues; only included a quote in the email text. |
| Kimi-K2.5 | 🔴 Struggles | Hit a "429 Error" during web search; total failure on browser control for emails. |
| Qwen3-Max | 🔴 Struggles | Failed on file retrieval and email sending; repeated commands without executing them. |

The difference between models was stark.

🔴 The Struggles

Qwen3-Max (Qianwen 千问):

Failed immediately on file retrieval.

Even with hints about where the file was located, it spent roughly 5 minutes searching and couldn’t find it.

Then it couldn’t send emails—it just repeated the command without executing it.

Kimi-K2.5 (Yuezhi Anmian 月之暗面):

Better than Qwen, but still problematic.

It found and summarized the file in about 5 minutes (good).

But then it hit a “429 Error” (too many requests) during web search.

Browser control for email sending?

Total failure.
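A 429 is the standard HTTP "Too Many Requests" status, meaning the client exceeded the provider's rate limit. The usual client-side remedy is to retry with exponential backoff rather than fail outright — a minimal sketch, where `do_search` is a hypothetical stand-in for any rate-limited call:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 ('Too Many Requests') response."""

def with_backoff(call, max_tries=5, base_delay=1.0):
    """Retry `call` on rate limits, doubling the wait each attempt."""
    for attempt in range(max_tries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo: a fake search that rejects the first two calls, then succeeds.
calls = {"n": 0}
def do_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError
    return "search results"

print(with_backoff(do_search, base_delay=0.01))  # search results
```

Whether an agent recovers gracefully like this, or simply gives up as Kimi-K2.5 did here, is part of what separated the models in our test.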

🟡 The Middle Ground

MiniMax-M2.1:

No major issues with file retrieval, searching, or writing.

Hit browser control problems during email sending but was smart enough to suggest a workaround.

With manual intervention based on its suggestion, it sent the email—though it only included a key quote instead of the full text.

MiniMax-M2.5 (released February 12):

This is where things got better.

The newer model handled the entire workflow with minimal friction.

Retrieval → search → writing → emailing.

No human intervention needed.

Zhipu GLM-4.7 (Zhipu 智谱):

Fast processing overall.

Tried to use an incorrect email URL at first, so we needed to correct it.

Otherwise solid performance once that was fixed.

🟢 The Winner

GPT-4-mini:

Smooth and stable from start to finish.

The entire workflow ran with almost zero human intervention.

A few network hiccups, but nothing that broke the process.

What Changed in Round 2 & 3?

We retested everything to check consistency.

Qwen3-Max and Kimi-K2.5 continued failing at email tasks.

MiniMax-M2.1/2.5, Zhipu GLM-4.7, and GPT-4-mini all completed the full workflow successfully.

Reliability tracked model quality: the stronger the model, the more consistent the results.

What Experts Are Saying

Industry voices echo what we found.

On Model Quality:

One e-commerce operator using OpenClaw told us they stick with OpenAI’s GPT-4-mini or Google’s Gemini 3 Pro.

“Much more effective than domestic Chinese models,” they noted.

Huan Jiachen, head of research at Extraordinary Research (Feifan Chanyan 非凡产研), added nuance:

“The model’s impact on OpenClaw depends on task complexity. Leading international models have a higher ceiling, but for common tasks, domestic models like Zhipu GLM-4.7 and Kimi-K2.5 are good options—especially since Claude is so expensive that ‘wallets can’t take it.’”

On Maturity:

Zhang He, founder of ExcelMaster.ai and former AI product manager at Xiaomi (Xiaomi 小米), was candid:

“OpenClaw is largely a wrapper around Anthropic’s Claude Code.”

It lowers the barrier to entry with a chat interface, but the underlying capabilities don’t surpass what inspired it.

Dr. Zhang Lu, Cloud and AI Product Manager at Akamai, added:

“For OpenClaw to be production-ready, it requires secondary development and fine-tuning. The current version is still somewhat immature and frequently freezes.”

The Real Barriers to Mainstream Adoption

Even if OpenClaw worked perfectly with every model, three factors would still hold it back.

1. Technical Barriers = Limited Audience

There’s no one-click install.

Setup requires command-line configuration and permission settings.

You need a development background.

Non-technical users are immediately locked out.

Cloud providers like Alibaba Cloud (Aliyun 阿里云), Tencent Cloud (Tengxun Yun 腾讯云), and Amazon (Yamaxun 亚马逊) offer cloud deployment, but here’s the catch: cloud instances can’t control your local computer.

So you’re back to local setup or accepting limited functionality.

2. Costs: Token Bleeding at Scale

OpenClaw is expensive because it constantly calls the underlying model.

Real examples:

  • One user spent ¥200 RMB (about $28 USD) on just over 20 interactions using Zhipu GLM-4.7
  • Dr. Zhang Lu reported spending tens of RMB daily with DeepSeek, noting stronger models could easily cost hundreds of RMB ($40–$70 USD) per day

At those burn rates, OpenClaw becomes an expensive experiment rather than a practical tool.
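The reported figures are easy to sanity-check with back-of-the-envelope arithmetic. The exchange rate and the assumed daily usage below are our assumptions, not measured values:

```python
# Rough cost math from the figures reported above; ~7.1 RMB per USD assumed.
RMB_PER_USD = 7.1

spend_rmb = 200      # reported spend with Zhipu GLM-4.7
interactions = 20    # "just over 20 interactions"
per_interaction = spend_rmb / interactions
print(f"~¥{per_interaction:.0f} (${per_interaction / RMB_PER_USD:.2f}) per interaction")

# Hypothetical heavy use: 50 interactions a day, 22 workdays a month.
monthly = per_interaction * 50 * 22
print(f"~¥{monthly:,.0f} (${monthly / RMB_PER_USD:,.0f}) per month")
```

At roughly ¥10 per interaction, even modest daily use compounds into a four-figure monthly bill in RMB — which is what "token bleeding" means in practice.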

3. Security: A Nightmare Waiting to Happen

This one should concern you.

To function, OpenClaw needs high-level system permissions.

Amy Chang, lead of AI Threat Research at Cisco (Sike 思科), was direct:

“OpenClaw is a nightmare from a security perspective because it can run shell commands and read/write files arbitrarily.”

Jamieson O’Reilly, founder of cybersecurity firm Dvuln, discovered actual vulnerabilities where attackers could steal API keys or account credentials stored locally.

Peter Steinberger, OpenClaw’s developer, was honest about it:

“It is not suitable for non-technical users.”

He’s right.

Any misconfiguration or malicious instruction could be catastrophic.
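One standard mitigation for "can run shell commands arbitrarily" is to check an explicit allowlist before executing anything the model proposes. A minimal sketch — the allowlist contents are illustrative only, not a security recommendation:

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "grep"}  # illustrative allowlist only

def is_permitted(command_line: str) -> bool:
    """Allow a command only if its executable is on the allowlist."""
    try:
        parts = shlex.split(command_line)
    except ValueError:  # unbalanced quotes and similar malformed input
        return False
    return bool(parts) and parts[0] in ALLOWED_COMMANDS

print(is_permitted("ls -la"))    # True
print(is_permitted("rm -rf /"))  # False
```

A real deployment would need far more than this (argument validation, sandboxing, audit logging), but even a crude gate like this is absent from tools that simply hand the model a shell.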

Bottom Line: Interesting, But Not Ready for Prime Time

OpenClaw shows promise.

When connected to strong models like GPT-4-mini or MiniMax-M2.5, it can handle complex multi-step workflows.

But it’s held back by three hard truths:

  • Performance is entirely dependent on which model you plug into it
  • Setup barriers exclude most non-technical users
  • Costs and security risks make it risky for production use

For founders and engineers experimenting with AI automation?

Sure, worth exploring.

For everyday productivity?

Not yet.

OpenClaw is still very much an advanced prototype for developers—not a finished productivity platform.
