ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

CVPR 2025

Yuejiao Su   Yi Wang   Qiongyang Hu   Chuang Yang   Lap-Pui Chau

The Hong Kong Polytechnic University

Abstract

Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot simultaneously yield coherent textual and pixel-level responses to user queries, which limits their flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image and a query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that no existing dataset meets the requirements of the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. The experiments on Ego-IRGBench demonstrate the effectiveness of our ANNEXE model compared with other works.


Ego-IRG Task

Illustration of the Ego-IRG task. This task allows for unified analyzing, answering, and pixel grounding of interactions within egocentric images based on various user queries.

Ego-IRGBench dataset

Illustration of the Ego-IRGBench dataset. This extensive dataset is built upon a collection of 20,504 RGB-D egocentric image pairs extracted from the HOI4D dataset, covering various interactions and environments from a first-person view. Specifically, each egocentric RGB image is paired with a depth map (a.ii) and an interaction description (a), providing spatial information about the scene and outlining the specific interaction taking place. Furthermore, multiple queries are labeled for each image, allowing for various inquiries related to the interactions depicted. Each query is paired with an answer and a corresponding pixel-level grounding mask, forming systematic and complete feedback to the query. It is worth emphasizing that the queries are not limited to inferring a single interaction target (b). Multi-target (c) and no-target (d) queries are also provided in the Ego-IRGBench dataset, which demonstrates the diversity of the dataset.
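To make this annotation structure concrete, the sketch below shows one way a single annotated sample could be organized in code. The field names, class names, and file paths are our own illustrative assumptions and do not describe the official release format.

# Minimal sketch of one hypothetical Ego-IRGBench sample record.
# All field names and file paths are illustrative assumptions, not the released format.
from dataclasses import dataclass
from typing import List

@dataclass
class EgoIRGQuery:
    query: str              # free-form question about the interaction
    answer: str             # fluent textual response
    mask_paths: List[str]   # zero (no-target), one (single-target), or several (multi-target) masks

@dataclass
class EgoIRGSample:
    image_path: str                  # egocentric RGB image (extracted from HOI4D)
    depth_path: str                  # paired depth map
    interaction_description: str     # which interaction takes place in the scene
    queries: List[EgoIRGQuery]       # multiple queries per image

# Example record (paths are placeholders):
sample = EgoIRGSample(
    image_path="images/000001.jpg",
    depth_path="depth/000001.png",
    interaction_description="A hand is picking up a mug from the table.",
    queries=[
        EgoIRGQuery(
            query="Which object is being grasped?",
            answer="The right hand is grasping the mug.",
            mask_paths=["masks/000001_mug.png"],   # single-target query
        ),
        EgoIRGQuery(
            query="Is a laptop involved in the interaction?",
            answer="No, the interaction does not involve a laptop.",
            mask_paths=[],                          # no-target query
        ),
    ],
)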

ANNEXE Model

Overview of the proposed ANNEXE framework. By incorporating the mask generation module, the proposed ANNEXE model can predict precise and query-oriented pixel-level masks regarding egocentric interactions, making it readily applicable to downstream tasks with different requirements.
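As a rough schematic of the three-step analyze, answer, and pixel-ground flow described above, the sketch below wires a multimodal LLM to a mask generation head. Every interface here (MLLM.generate, MaskHead.decode, ego_irg_inference) is a placeholder assumption for illustration only, not the actual ANNEXE implementation.

# Schematic sketch of the analyze -> answer -> pixel-ground flow.
# The MLLM and MaskHead interfaces below are assumed placeholders.
from typing import Protocol, Tuple, Optional
import numpy as np

class MLLM(Protocol):
    def generate(self, image: np.ndarray, prompt: str) -> Tuple[str, np.ndarray]:
        """Return generated text and a grounding embedding (assumed interface)."""

class MaskHead(Protocol):
    def decode(self, image: np.ndarray, embedding: np.ndarray) -> Optional[np.ndarray]:
        """Return a pixel-level mask, or None for no-target queries (assumed interface)."""

def ego_irg_inference(mllm: MLLM, mask_head: MaskHead,
                      image: np.ndarray, query: str):
    # Step 1: analyzing -- summarize the interaction in the egocentric image.
    analysis, _ = mllm.generate(image, "Describe the interaction in this image.")
    # Step 2: answering -- answer the user query, conditioned on the analysis.
    answer, grounding = mllm.generate(image, f"{analysis}\n{query}")
    # Step 3: pixel grounding -- decode a fine-grained mask from the grounding embedding.
    mask = mask_head.decode(image, grounding)
    return analysis, answer, mask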

Qualitative Results

Qualitative results of our ANNEXE model (b) and Ground Truth (c).

Citation

@inproceedings{su:annexe:2025,
    title     = {ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction},
    author    = {Yuejiao Su and Yi Wang and Qiongyang Hu and Chuang Yang and Lap-Pui Chau},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year      = {2025}
}
  

Data

News: The Ego-IRGBench dataset will be released soon.

Ego-IRGBench_Egocentric_Image: egocentric RGB images.

Ego-IRGBench_depth: depth maps.

Ego-IRGBench_Query_Answer: queries and corresponding answers.

Ego-IRGBench_Mask: pixel-level grounding masks.

Baidu Cloud download link.