
We explore the use of Vision-Language Models (VLMs), particularly CLIP, for predicting visual object relationships, simplifying the complex graphical models that traditionally combine visual and language cues. Our approach, termed CREPE (CLIP Representation Enhanced Predicate Estimation), builds on the UVTransE framework to produce translational embeddings for the subject, object, and union boxes in a scene, leveraging CLIP's language capabilities. CREPE introduces a novel contrastive training method to refine union-box prompts and significantly improves performance on the Visual Genome benchmark, achieving an mR@20 score of 31.95, a 15.3% improvement over the previous state of the art. These results demonstrate the potential of CLIP for object relation prediction and suggest further avenues for research in VLM applications.
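To make the pipeline concrete, below is a minimal sketch of a UVTransE-style translational predicate embedding computed on CLIP features and scored against CLIP text prompts. This is not the authors' released implementation: the crop file names, the prompt template, and the short predicate list are illustrative assumptions, and CREPE's contrastive refinement of the union-box prompts is omitted here.

```python
# Minimal sketch (assumptions noted above): UVTransE-style translational
# predicate embedding built from CLIP visual features, scored against
# CLIP text embeddings of candidate predicate prompts.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_crop(pil_crop):
    """CLIP image embedding for a single box crop (subject, object, or union)."""
    with torch.no_grad():
        x = preprocess(pil_crop).unsqueeze(0).to(device)
        return model.encode_image(x).float()

# Hypothetical crops of the subject box, object box, and their union box.
subj_emb = encode_crop(Image.open("subject_crop.jpg"))
obj_emb = encode_crop(Image.open("object_crop.jpg"))
union_emb = encode_crop(Image.open("union_crop.jpg"))

# UVTransE: the predicate embedding is the union feature minus the
# subject and object features.
pred_emb = union_emb - subj_emb - obj_emb
pred_emb = pred_emb / pred_emb.norm(dim=-1, keepdim=True)

# Score candidate predicates with CLIP text embeddings of simple prompts
# (the prompt template and predicate list are placeholders).
predicates = ["on", "hanging from", "attached to", "near", "under"]
tokens = clip.tokenize([f"a photo of something {p} something" for p in predicates]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens).float()
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (pred_emb @ text_emb.T).squeeze(0)
print("predicted predicate:", predicates[scores.argmax().item()])
```

In the full method, the union-box text prompts are additionally refined with a contrastive training objective rather than taken from a fixed template as in this sketch.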
Predicate Estimation Performance: This chart compares our proposed CREPE method with other state-of-the-art methods on the Visual Genome (VG) dataset, using mean Recall@K (mR@K). The best-performing method is highlighted in red and the second best in blue. To our knowledge, we are the first to report mR@{5, 10, 15}, so those scores are not available for the other methods.
R@50 performance of CREPE and UVTransE (vision only), together with the frequency of each predicate. Predicates are color-coded by category: Head (purple), Mid (olive), and Tail (orange). Recall values are shown as dotted lines, and predicate frequencies as blue bars.
Each sub-figure illustrates the relationship between the subject (yellow box) and the object (green box), along with the top five predictions made by CREPE. The correct prediction is highlighted in red. Notably, in the first column of the third row, although the ground-truth label is <flag, on, pole>, CREPE makes the more suitable prediction <flag, hanging from, pole>, indicating that the evaluation metrics can be conservative.
Using CREPE to estimate predicates on the UnRel dataset, which contains unseen entities and relationships.
# Citation
@INPROCEEDINGS{ICML_SISTA,
  author={Subramanyam, Rakshith and Jayram, T.S. and Anirudh, Rushil and Thiagarajan, Jayaraman J.},
  booktitle={International Conference on Machine Learning},
  title={Single-Shot Domain Adaptation via Target-Aware Generative Augmentations},
  year={2024}
}
# Contact
If you have any questions, please feel free to contact us via email: rakshith.subramanyam@asu.edu; jjayaram@llnl.gov