arxiv:2504.01476

Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction

Published on Apr 2

Authors:

Abstract

A tri-modal approach using multi-view images, point clouds, and text improves cross-modal 3D retrieval through reconstruction and fine-grained 2D-3D fusion, achieving better performance than existing methods.

AI-generated summary

Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a certain 3D representation (e.g., point cloud), with few exploiting the 2D-3D consistency and complementary relationships, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of encoders. Given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them as multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets where many 3D shapes and texts share similar semantics, we employ hard negative contrastive training to emphasize harder negatives with greater significance, leading to robust discriminative embeddings. Extensive experiments on the Text2Shape dataset demonstrate that our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks by a substantial margin.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.01476 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.01476 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.01476 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.