DeepSeek Released a Multimodal Research Paper Then Deleted It Overnight – Here’s What It Revealed About the Future of AI Vision

Key Points

  • DeepSeek (Shenyan 深言) launched new multimodal capabilities allowing its models to “see” and understand images, with an “Image Recognition Mode” appearing on its homepage.
  • A research paper titled “Thinking with Visual Primitives” was released on April 30, detailing a framework that helps models “point” to specific locations in an image, anchoring linguistic logic to concrete spatial coordinates.
  • The paper and associated repository were deleted overnight by DeepSeek, leading to speculation that it revealed too much information about their advancements.
  • The multimodal model, built on DeepSeek-V4-Flash (284 billion parameters), reportedly achieved reasoning accuracy that matched or exceeded GPT, Claude, and Gemini on challenging visual tasks.
  • The sudden release and deletion signal DeepSeek’s rapid progress in multimodal AI, despite its previously perceived delay, and suggest that new financing has boosted its capabilities.
DeepSeek Multimodal Breakthrough Timeline
  • April 24: DeepSeek-V4 release (text-only)
  • April 29: Team lead announces “Now we can see you”
  • April 30: “Thinking with Visual Primitives” paper published
  • May 1: GitHub repository and paper deleted (404 Error)

DeepSeek (Shenyan 深言) just made a major move in the AI race.

The Chinese AI powerhouse quietly launched multimodal capabilities, meaning its models can now “see” and understand images, through a “grayscale” test: a gradual rollout to a subset of users.

Users who visited the DeepSeek homepage spotted a fresh “Image Recognition Mode” entry point.

Upload an image, and DeepSeek can now comprehend visual content with human-like understanding.

Chen Xiaokang (陈小康), head of the DeepSeek multimodal team, announced the launch on April 29 with a simple but powerful message: “Now, we can see you.”

This marked the first time DeepSeek’s chat product integrated multimodal features directly into its platform.

On April 30, DeepSeek dropped a technical research paper titled “Thinking with Visual Primitives” that laid out the engineering details behind this breakthrough.

The timing was quintessentially DeepSeek—a major paper release right before the May Day holiday.

But here’s where it gets interesting.

Users quickly discovered that the official team deleted the entire multimodal repository and original paper overnight.

By May 1, the GitHub page displayed a “404” error.

No official explanation.

No press release.

Just gone.

Industry speculation suggests the deletion wasn’t about problematic content—it was about revealing too much information.

What the Deleted Paper Actually Revealed About Multimodal AI

Before disappearing, the research made waves in the tech community.

Feedback from industry insiders suggested the paper aligned perfectly with DeepSeek’s pragmatic, engineering-focused approach: reducing costs while adopting innovative new paradigms.

Here’s the core insight from the paper:

Current multimodal models don’t fail because they can’t “see”—they fail because they can’t “point” accurately.

Think about it this way.

Natural language is inherently ambiguous.

When you ask an AI model to handle complex spatial layouts or intricate visual tasks, relying solely on text descriptions creates confusion.

It’s like trying to count a pile of scattered coins without using your finger to mark each one as you go.

Without that pointing mechanism, you miscount or double-count.

Same problem for AI vision models—they lacked the spatial precision to reliably reference specific parts of an image during reasoning.
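
The coin analogy can be made concrete with a small sketch. This is only an illustration of the idea, not code from the paper: the function names, the sample coordinates, and the 5-pixel radius are all assumptions chosen for the example. It compares tallying every glance at a coin with marking each coin's location and skipping any glance that lands on an existing mark.

```python
from math import dist

def count_without_pointing(glances):
    """Tally every glance; a coin that is revisited gets counted again."""
    return len(glances)

def count_with_pointing(glances, radius=5.0):
    """Mark each counted coin's location and skip glances that hit a mark."""
    marked = []  # the "finger marks": one (x, y) per coin already counted
    for p in glances:
        # Count this glance only if it is far from every existing mark.
        if all(dist(p, m) > radius for m in marked):
            marked.append(p)
    return len(marked), marked

# Hypothetical glances over three coins; the first coin is revisited twice.
glances = [(10, 10), (40, 12), (11, 9), (70, 50), (10, 11)]

naive_total = count_without_pointing(glances)
pointed_total, marks = count_with_pointing(glances)
print(naive_total, pointed_total)
```

With these sample glances, the naive tally counts the revisited coin three times, while the marked version keeps one location per coin.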

The “Visual Primitives” Framework: DeepSeek’s Solution

DeepSeek’s answer?

Give the model a virtual finger.

The research introduced the “Visual Primitives” framework—a new approach that elevates spatial markers (like points and bounding boxes) into the fundamental units of thought.

Here’s how it works:

  • During reasoning, the model can literally “point” to specific locations in an image while it “thinks”
  • This anchors abstract linguistic logic to concrete spatial coordinates
  • The model thinks more like humans do—combining language with spatial awareness
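
As a rough sketch of what "anchoring language to coordinates" could look like in data terms, the snippet below pairs each reasoning step with the points or boxes it refers to. Everything here is an assumption made for illustration: the article does not describe the paper's actual representation, so the `Point`, `Box`, and `Step` types and the `<point>`/`<box>` tag format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass(frozen=True)
class Point:
    x: int
    y: int

@dataclass(frozen=True)
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, p: Point) -> bool:
        """True if the point lies inside the box (edges included)."""
        return self.x1 <= p.x <= self.x2 and self.y1 <= p.y <= self.y2

Primitive = Union[Point, Box]

@dataclass
class Step:
    """One reasoning step: a sentence plus the image locations it refers to."""
    text: str
    refs: List[Primitive] = field(default_factory=list)

def render(step: Step) -> str:
    """Serialize a step as its text followed by hypothetical coordinate tags."""
    tags = []
    for r in step.refs:
        if isinstance(r, Point):
            tags.append(f"<point>{r.x},{r.y}</point>")
        else:
            tags.append(f"<box>{r.x1},{r.y1},{r.x2},{r.y2}</box>")
    return " ".join([step.text] + tags)

# Hypothetical scene: a tray region and three coin locations, one off the tray.
tray = Box(0, 0, 100, 60)
coins = [Point(10, 10), Point(40, 12), Point(120, 30)]
on_tray = [c for c in coins if tray.contains(c)]

trace = [
    Step("The tray occupies this region.", [tray]),
    Step("These coins lie on the tray.", on_tray),
    Step(f"So the tray holds {len(on_tray)} coins."),
]
for step in trace:
    print(render(step))
```

The point of the design is that each claim in the trace carries the coordinates it depends on, so the final count can be checked against explicit locations rather than against a prose description.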

The inspiration came directly from human cognition research.

When humans navigate mazes or count dense objects, they use indicative pointing (literally pointing with a finger) to reduce cognitive load and maintain logical consistency.

By embedding visual primitives into the thinking process, DeepSeek’s model mimics this human “pointing-reasoning” synergy.

The Technical Foundation: Built on DeepSeek-V4-Flash

Multimodal Reasoning Performance Comparison

Model | Core Architecture | Reasoning Method | Relative Accuracy
DeepSeek-V4-Flash | 284B MoE | Visual Primitives (Pointing) | High / Leading
GPT-4o | Proprietary | Native Multimodal | Competitive
Claude 3.5 Sonnet | Proprietary | Vision-Text Alignment | Competitive

The multimodal model is constructed on top of DeepSeek-V4-Flash, which contains a massive 284 billion parameters.

DeepSeek’s internal testing showed something remarkable:

The visual primitives approach achieved significant breakthroughs in reasoning accuracy.

On challenging tasks like:

  • Spatial reasoning (understanding complex layouts and relationships)
  • Visual Question Answering (VQA) (answering questions about images)
  • Complex visual understanding tasks

The performance matched or exceeded the latest versions of GPT, Claude, and Gemini.

Let that sink in.

A Chinese AI company quietly developed a multimodal approach that competes with or beats OpenAI, Anthropic, and Google’s best offerings.

The research proved something important about the future of AI:

The future of multimodal intelligence isn’t about “seeing more pixels”—it’s about building a precise, unambiguous bridge between language and vision.

The Bigger Picture: Why the Multimodal Delay Matters

On April 24, DeepSeek released its V4 series flagship models.

But conspicuously missing?

Multimodal capabilities.

At the time, the official V4 definition focused on:

  • Supporting a million-word ultra-long context
  • Leading in Agent capabilities (both domestic and open-source)
  • Superior world knowledge and reasoning performance

But multimodal capability has become the critical direction for current large model updates.

DeepSeek’s delay in this area was previously viewed as a major competitive weakness.

Industry rumors pointed to two main constraints:

  • Computing power limitations
  • Cash flow constraints

Following new rounds of financing (potentially involving billions of RMB and hundreds of millions of US dollars), training in the multimodal direction is now expected to proceed more smoothly.

Translation: DeepSeek just got the financial runway to compete seriously in multimodal AI.

The deleted paper wasn’t a setback—it was a signal that DeepSeek is moving faster than most observers realized in the multimodal AI space.

