Comparing Reasoning LLMs: DeepSeek and OpenAI o3-mini for Text Evaluation

Large language models (LLMs) have made impressive progress in various areas of artificial intelligence in recent years. So-called "Reasoning LLMs," specifically trained for logical thinking and deduction, are particularly promising. A recent study investigates how effectively these Reasoning LLMs can evaluate natural language texts, especially in machine translation (MT) and text summarization (TS).

The study compares the performance of Reasoning LLMs such as DeepSeek-R1 and OpenAI o3-mini with their conventional, non-reasoning counterparts. A total of eight models from three architectural categories were examined: state-of-the-art Reasoning models, their distilled variants (8B to 70B parameters), and comparable conventional LLMs.
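
The paper's exact prompting protocol is not reproduced here, but evaluation setups of this kind typically use the LLM as a judge that assigns a quality score to each segment. The following Python sketch shows one plausible way to do this with o3-mini via the OpenAI API; the prompt wording, the 0-100 scale, and the score parsing are illustrative assumptions, not the study's protocol.

```python
# A minimal LLM-as-judge sketch for MT evaluation (illustrative, not the
# study's exact setup). Requires OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Rate the following translation from {src_lang} to {tgt_lang} on a scale "
    "from 0 (no meaning preserved) to 100 (perfect translation). "
    "Reply with the score only.\n\n"
    "Source: {source}\nTranslation: {hypothesis}"
)

def judge_translation(source: str, hypothesis: str,
                      src_lang: str = "German", tgt_lang: str = "English",
                      effort: str = "medium") -> float:
    """Ask o3-mini for a direct quality assessment of one segment."""
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,  # "low" | "medium" | "high" reasoning intensity
        messages=[{"role": "user", "content": PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang,
            source=source, hypothesis=hypothesis)}],
    )
    # Pull the first number out of the reply; production pipelines would
    # want stricter output constraints and parsing.
    match = re.search(r"\d+(?:\.\d+)?", response.choices[0].message.content)
    return float(match.group()) if match else float("nan")
```

The `reasoning_effort` parameter is what makes a "reasoning intensity" comparison possible: the same judge can be run at low, medium, and high effort and its scores compared across settings.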

The results of the experiments, conducted on the WMT23 benchmark for machine translation and SummEval for text summarization, paint a mixed picture. The benefit of reasoning capabilities depends strongly on the model and the task: while the OpenAI o3-mini models showed consistent performance improvements with increasing reasoning intensity, DeepSeek-R1 performed worse than its non-reasoning variant in most tests. An exception was certain aspects of text summarization evaluation, where DeepSeek-R1 achieved comparable results.
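
Both benchmarks measure the quality of an automatic evaluator by how strongly its scores correlate with human judgments, typically via rank correlations at the segment level. A minimal sketch of this meta-evaluation step, with placeholder numbers rather than the study's data:

```python
# Meta-evaluation: correlate an LLM judge's scores with human ratings.
# The score lists below are illustrative placeholders.
from scipy.stats import kendalltau, spearmanr

human_scores = [78, 92, 55, 63, 88, 71]   # e.g., human ratings per segment
model_scores = [70, 95, 50, 60, 90, 65]   # scores produced by the LLM judge

tau, _ = kendalltau(human_scores, model_scores)
rho, _ = spearmanr(human_scores, model_scores)
print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}")
```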

A correlation analysis revealed that a higher number of reasoning tokens in the o3-mini models correlates positively with evaluation quality, suggesting that the reasoning capabilities do contribute to evaluation performance. Interestingly, however, distilling reasoning capabilities into smaller models, i.e., transferring the knowledge to them, preserved acceptable results at medium size (32B parameters) but led to a significant drop in performance for smaller variants (8B parameters).
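
Such a token-level analysis can be reproduced in outline: record how many hidden reasoning tokens each judgment consumed (the OpenAI SDK exposes this in the response usage details) and correlate that count with an agreement metric against human ratings. The values below are illustrative placeholders, not the study's data.

```python
# Sketch of the reasoning-token correlation analysis with placeholder values.
# With the OpenAI SDK, the hidden reasoning-token count of a response is
# available as response.usage.completion_tokens_details.reasoning_tokens.
from scipy.stats import pearsonr

reasoning_tokens = [150, 280, 420, 610, 900, 1250]       # avg tokens per judgment
human_agreement = [0.42, 0.46, 0.49, 0.53, 0.56, 0.60]   # e.g., Kendall tau vs. humans

r, p = pearsonr(reasoning_tokens, human_agreement)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # a positive r mirrors the o3-mini finding
```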

The study thus provides a first comprehensive assessment of Reasoning LLMs as evaluators of natural language text and offers valuable insights into their practical application. The results highlight that while integrating reasoning capabilities into LLMs is promising, it also comes with challenges: the choice of model and the tuning of the reasoning parameters play a crucial role in the achievable performance.

For companies like Mindverse, which specialize in the development of AI-based content tools, these findings are of great importance. The development of customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems requires a deep understanding of the strengths and weaknesses of different LLM architectures. The results of this study can contribute to advancing the development and optimization of such systems and further improving the performance of AI in text processing.

The research underlines the need for further investigation to fully exploit the potential of Reasoning LLMs in text evaluation. Future studies could, for example, focus on the development of new training methods and architectures that enable more efficient use of reasoning capabilities. The investigation of further application areas, such as the automated evaluation of essays or the quality control of translations, is also of great interest.

Bibliography:
https://arxiv.org/abs/2504.08120
https://arxiv.org/html/2504.08120v1
https://www.analyticsvidhya.com/blog/2025/02/openai-o3-mini-vs-deepseek-r1/
https://www.reddit.com/r/LocalLLaMA/comments/1iks9cl/notes_on_openai_o3mini_how_good_is_it_compared_to/
https://www.flowhunt.io/blog/openai-o3-mini-vs-deepseek-agentic-use/
https://medium.com/@LakshmiNarayana_U/o3-mini-vs-deepseek-the-inevitable-comparison-b00e0573640a
https://www.linkedin.com/posts/reuvencohen_the-difference-between-o3mini-and-deepseek-activity-7291465649745211392--say