InfiGUIAgent: A New Multimodal GUI Agent with Native Reasoning Abilities


Automating tasks on digital devices such as computers and smartphones through their graphical user interfaces (GUIs) has made significant progress in recent years, and GUI agents based on multimodal large language models (MLLMs) show great potential. However, previous approaches struggle with multi-step reasoning and rely heavily on textual annotations of the interface. InfiGUIAgent, a novel MLLM-based GUI agent, aims to remedy these shortcomings.

The Training: Two Stages to Success

InfiGUIAgent is trained with a two-stage supervised fine-tuning pipeline. The first stage builds fundamental skills such as GUI understanding and grounding, i.e., linking visual elements to the corresponding actions. The second stage uses synthesized data to integrate hierarchical reasoning and a reflection capability. This enables the agent to reason natively over what it sees and make decisions without relying solely on textual descriptions of the interface.
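The two-stage schedule can be pictured as two sequential fine-tuning passes over different data. The following is a minimal toy sketch of that curriculum structure, not the paper's training code: the stage names, skill labels, and the trivial "update step" are illustrative assumptions.

```python
# Toy sketch of a two-stage supervised fine-tuning (SFT) schedule.
# Stage names and skill labels below are illustrative, not from the paper.

STAGES = [
    # Stage 1: fundamental abilities (GUI understanding, grounding).
    {"name": "stage1_fundamentals",
     "data": ["gui_understanding", "visual_grounding"]},
    # Stage 2: native reasoning abilities, trained on synthesized data.
    {"name": "stage2_reasoning",
     "data": ["hierarchical_planning", "reflection"]},
]

def supervised_fine_tune(model, stage):
    """Toy SFT pass: each stage adds the capabilities its data targets."""
    model["capabilities"].update(stage["data"])
    model["curriculum"].append(stage["name"])
    return model

model = {"capabilities": set(), "curriculum": []}
for stage in STAGES:
    model = supervised_fine_tune(model, stage)
```

The point of the ordering is that stage 2 builds on the grounding skills from stage 1; reversing the stages would train reasoning over screens the model cannot yet parse.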

Native Reasoning Ability: The Key to Automation

InfiGUIAgent's native reasoning ability allows it to handle complex tasks that go beyond simple clicks and text entry: the agent can understand context, plan intermediate steps, and react to unexpected events. These capabilities are crucial for automating complex workflows across platforms.

Hierarchical Thinking and Reflection

Hierarchical thinking allows InfiGUIAgent to break down tasks into smaller sub-goals and process them strategically. The reflection capability enables the agent to evaluate its own actions and make adjustments in case of errors or unexpected results. Through this iterative process, InfiGUIAgent continuously improves its performance.
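The interplay of sub-goal decomposition and reflection described above can be sketched as a simple control loop. Everything here is a hypothetical stand-in: in the real agent, an MLLM would propose the sub-goals and judge success from screenshots, whereas this toy uses a fixed task mapping and a simulated action that fails once.

```python
# Toy sketch: hierarchical planning with a reflection/retry loop.
# decompose() and execute() are hypothetical stand-ins for MLLM calls.

def decompose(task):
    """Break a task into sub-goals (fixed mapping for illustration)."""
    return {"log in": ["open app", "enter credentials", "press submit"]}.get(task, [task])

def execute(subgoal, attempt):
    """Simulate a GUI action; one sub-goal fails on its first attempt."""
    return not (subgoal == "enter credentials" and attempt == 0)

def run(task, max_retries=2):
    history = []
    for subgoal in decompose(task):
        for attempt in range(max_retries + 1):
            ok = execute(subgoal, attempt)
            history.append((subgoal, attempt, ok))
            if ok:
                break  # reflection: outcome matches expectation, move on
            # reflection: failure observed, adjust and retry the sub-goal
        else:
            return history, False  # retries exhausted, abort the task
    return history, True

history, success = run("log in")
```

The outer loop is the hierarchical part (one sub-goal at a time); the inner loop is the reflective part (compare outcome to expectation, retry on mismatch).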

Synthetic Data: Efficient Training for Complex Scenarios

Using synthesized data in the second training stage offers clear advantages: complex scenarios can be simulated and trained without time-consuming manual data collection. This accelerates the training process and allows the agent to be adapted to specific use cases.
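One way such reasoning data can be synthesized is to generate task trajectories with an injected failure followed by a reflection-and-retry step, so the model sees error recovery during training. The trace format, field names, and generation template below are assumptions for illustration, not the paper's actual data pipeline.

```python
# Toy sketch: synthesizing reasoning traces that include a reflection step.
# The JSON field names and the task/sub-goal template are illustrative.
import json
import random

def make_sample(task, subgoals, rng):
    """Build one trajectory: inject a failure at a random step,
    immediately followed by a reflection-and-retry step."""
    fail_at = rng.randrange(len(subgoals))
    steps = []
    for i, sg in enumerate(subgoals):
        if i == fail_at:
            steps.append({"thought": f"Try sub-goal: {sg}", "action": sg, "ok": False})
            steps.append({"thought": f"'{sg}' failed; reflect and retry",
                          "action": sg, "ok": True, "reflection": True})
        else:
            steps.append({"thought": f"Try sub-goal: {sg}", "action": sg, "ok": True})
    return {"task": task, "trace": steps}

rng = random.Random(0)  # seeded for reproducible data generation
samples = [make_sample("book a flight",
                       ["open site", "search flights", "select flight", "pay"],
                       rng)
           for _ in range(3)]
print(json.dumps(samples[0]["trace"][0]))
```

Because the generator controls where the failure occurs, it can produce recovery examples at any depth of the task, which would be expensive to collect from human demonstrations.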

Success in Benchmarks

InfiGUIAgent has achieved promising results in various GUI benchmarks. This underscores the potential of the approach to significantly improve the automation of GUI interactions. The agent's native reasoning ability contributes significantly to this performance increase.

Mindverse: AI Partner for Customized Solutions

Mindverse, a German company, offers an all-in-one platform for AI-powered text and image generation, research, and more. In addition to providing AI tools, Mindverse also develops customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems. The development of innovative GUI agents like InfiGUIAgent highlights the importance of AI for the automation and optimization of workflows.

Outlook: The Future of GUI Automation

InfiGUIAgent represents an important step towards more efficient and effective GUI automation. The combination of multimodal language models, native reasoning, and the use of synthetic data opens up new possibilities for the development of intelligent agents. Future research could focus on expanding the agent's capabilities, for example, by integrating learning capabilities and adapting to dynamic GUI environments.

Bibliography

Liu, Y. et al. (2025). InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection. arXiv preprint arXiv:2501.04575.

Wang, S. et al. (2024). GUI Agents with Foundation Models: A Comprehensive Survey. arXiv preprint arXiv:2411.04890v1.

Xu, Y. et al. (2024). Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. arXiv preprint arXiv:2412.04454.

Nguyen, D. et al. (2024). GUI Agents: A Survey. arXiv preprint arXiv:2412.13501v1.

https://github.com/showlab/Awesome-GUI-Agent

https://www.chatpaper.com/chatpaper/zh-CN/paper/96797
