DeepSeek AI: Real Innovation or Just Propaganda?

DeepSeek's models have reportedly outperformed GPT-4-class systems on coding and multimodal benchmarks. But is this real innovation, or a carefully engineered illusion?


Shocking Performance in Programming Benchmarks

The DeepSeek-Coder 33B-Instruct model has drawn attention for allegedly outperforming GPT-4 in several key programming benchmarks. On HumanEval it outscored CodeLlama-34B by 7.9%, and on MBPP the margin was 10.8%. The original DeepSeek-Coder models were trained from scratch on 2 trillion tokens of code-heavy data, and DeepSeek-Coder-V2 built on that base with substantial further pretraining, demonstrating significant additional gains.
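For context on what these numbers actually measure: HumanEval and MBPP score a model by running its generated code against unit tests, and the headline figure is a pass@k estimate. Below is a minimal sketch of the standard unbiased pass@k estimator from the paper that introduced HumanEval (Chen et al., 2021); the sample counts in the usage line are illustrative, not DeepSeek's actual numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: completions that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k sample must contain at least one passing completion.
        return 1.0
    # Probability that a random size-k sample contains no passing completion,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 31 passed the tests.
print(pass_at_k(n=200, c=31, k=1))  # 0.155 -> reported as 15.5% pass@1
```

The benchmark's final score is this estimate averaged over all problems, which is why sampling temperature and the number of completions per problem can move the headline number.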

Vision-Language AI Rivaling GPT-4o

Beyond coding, DeepSeek-VL2 impressed with a reported DocVQA score of 93.3%, edging past GPT-4o's 92.8%. Its OCRBench score reached 834 points, indicating strong combined visual and textual comprehension. Despite being a newcomer, DeepSeek achieved these scores with fewer active parameters than its major competitors, thanks to its mixture-of-experts design.
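It helps to know what a DocVQA percentage means before weighing a 93.3-vs-92.8 margin: the benchmark's headline number is ANLS (Average Normalized Levenshtein Similarity), a fuzzy string match, not exact-match accuracy. Here is a minimal sketch of the metric, assuming the standard 0.5 threshold and simple lowercase/strip normalization:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, gold_answers, tau: float = 0.5) -> float:
    """ANLS: per question, take the best-matching gold answer; matches with
    normalized edit distance >= tau score zero; average over questions."""
    scores = []
    for pred, answers in zip(predictions, gold_answers):
        p = pred.lower().strip()
        best = 0.0
        for ans in answers:
            a = ans.lower().strip()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)

# Hypothetical example: one question with two acceptable gold answers.
print(anls(["42 USD"], [["42 usd", "$42"]]))  # 1.0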

Reality Check: Are These Claims Valid?

Despite these impressive metrics, most of the results come from internal reports or community presentations rather than independent evaluations. Without third-party verification, there is a real possibility that the models were tuned to score well on specific benchmarks, whether through benchmark overfitting or outright test-set contamination, a common industry pattern for generating hype and attracting attention.
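One concrete way independent evaluators probe for this is a training-data contamination check: flag benchmark items whose n-grams also appear in the training corpus (GPT-3's decontamination, for example, used 13-gram overlap). The sketch below is a toy version under that assumption; real pipelines hash n-grams and stream corpora far too large to hold in memory, and all names here are hypothetical.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Whitespace-token n-grams of a string."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_items, training_chunks, n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the
    training data. A sketch only: no hashing, no streaming, no dedup."""
    train_grams = set()
    for chunk in training_chunks:
        train_grams |= ngrams(chunk, n)
    flagged = sum(1 for item in benchmark_items
                  if ngrams(item, n) & train_grams)
    return flagged / max(len(benchmark_items), 1)
```

A nonzero rate does not prove cheating (benchmark text leaks into web crawls constantly), but undisclosed contamination inflates scores in exactly the way the skeptics worry about.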

Final Verdict: Innovation or Illusion?

DeepSeek has shown that Chinese AI labs are not to be underestimated, especially in programming and multimodal performance. Caution is still warranted, however: its strengths have yet to be demonstrated as convincingly in general reasoning, factual QA, or the kind of ecosystem integration that GPT-4 Turbo and Claude 3 already enjoy. Broader independent validation and real-world adoption are still needed.