AI Models Advance Person Identification in Images Using Natural Language

Person Referring in Image Processing: Challenges and New Solutions

Human-computer interaction has made considerable progress in recent years, and image processing is one area that is steadily gaining importance. A central capability is letting computers identify people within images from natural language descriptions. This task, referred to as "person referring," has potential applications ranging from surveillance technology to assistance systems for the visually impaired.

However, current models often reach their limits in practice. Previous benchmarks mostly focus on "one-to-one references," where a single description refers to exactly one person. This simplification does not reflect the complexity of real-world scenarios, where one description often has to identify several people at once. Furthermore, descriptions can be ambiguous and can refer to different characteristics of a person, such as clothing, position, or relationship to other people in the image.

HumanRef: A New Dataset for Realistic Scenarios

"HumanRef" is a new dataset developed to address these challenges by better representing the complexity of person referring in real-world situations. HumanRef considers various aspects of referable entities and focuses on ambiguous and complex descriptions. The dataset includes a variety of images with different scenarios and numbers of people, providing a challenging test environment for existing and future models.

RexSeek: A Robust Model for Person Referring

Alongside HumanRef, a new model called "RexSeek" was developed, specifically tailored to the requirements of person referring. RexSeek combines a multimodal large language model with an object detection framework. This architecture allows RexSeek to efficiently process both the semantic meaning of the natural language description and the visual information in the image. The integration of the language model enables RexSeek to handle even ambiguous and complex descriptions.
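
The article does not say how the two components are wired together, so the following is only a minimal sketch of one plausible arrangement: a detector proposes person boxes and a language model returns the indices of the boxes that match the description. The names `refer`, `detect`, and `select`, as well as the stub functions, are hypothetical and stand in for real models; none of them come from RexSeek itself.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Image = Dict[str, float]                 # toy stand-in for real image data


def refer(
    image: Image,
    description: str,
    detect: Callable[[Image], List[Box]],
    select: Callable[[Image, str, List[Box]], List[int]],
) -> List[Box]:
    """Return every box that matches the description (zero, one, or many)."""
    candidates = detect(image)                        # stage 1: propose person boxes
    chosen = select(image, description, candidates)   # stage 2: pick matching indices
    # Ignore indices that do not point at a real candidate.
    return [candidates[i] for i in chosen if 0 <= i < len(candidates)]


# --- Stubs for illustration only (assumptions, not real models) ---

def stub_detect(image: Image) -> List[Box]:
    # A real detector would look at the pixels; this returns fixed boxes.
    return [(5, 10, 25, 90), (30, 12, 48, 88), (70, 15, 95, 92)]


def stub_select(image: Image, description: str, boxes: List[Box]) -> List[int]:
    # A real language model would interpret the description; this toy rule
    # keeps boxes whose horizontal center lies in the left half of the image.
    half = image["width"] / 2
    return [i for i, b in enumerate(boxes) if (b[0] + b[2]) / 2 < half]


result = refer({"width": 100.0}, "the people on the left", stub_detect, stub_select)
print(result)
```

Returning a list rather than a single box is the design choice that matters here: it lets one description resolve to several people, or to none.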

Evaluation and Results

Tests with HumanRef have shown that established models, which achieve good results on conventional benchmarks, struggle with the complexity of HumanRef. In particular, identifying multiple people based on a single description poses a significant challenge for these models. RexSeek, on the other hand, achieves significantly better results on HumanRef and proves robust against ambiguous and complex descriptions. Furthermore, RexSeek can also be successfully applied to general object referring, which underscores its broad applicability in various perception tasks.
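
The article does not state which metrics were used. As an illustration of why multi-person referring is harder to score than one-to-one referring, here is a minimal sketch of one common approach: match predicted boxes to ground-truth boxes by intersection over union (IoU) and report precision and recall. The function names and the 0.5 threshold are assumptions made for this example, not details taken from HumanRef.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to at most one unmatched truth box."""
    unmatched = list(truth)
    hits = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)  # each person can be matched only once
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# One description refers to two people, but the model finds only one of them.
truth = [(0, 0, 10, 10), (20, 0, 30, 10)]
precision, recall = precision_recall([(0, 0, 10, 10)], truth)
print(precision, recall)
```

In this example the single prediction is correct, so precision is perfect, yet recall is only one half because the second person was missed. A one-to-one benchmark would not expose that kind of failure.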

Outlook

The development of HumanRef and RexSeek represents an important step towards more robust and realistic person referring systems. Future research could focus on further improving the model architecture and developing even more comprehensive datasets. The ability to reliably and accurately identify people in images opens the door to a variety of applications in the field of human-computer interaction and contributes to making communication between humans and machines more intuitive and efficient.

Bibliography: Jiang, Q., Wu, L., Zeng, Z., Ren, T., Xiong, Y., Chen, Y., Liu, Q., & Zhang, L. (2025). Referring to Any Person. *arXiv preprint arXiv:2503.08507*.