Panels (a) and (b) illustrate the difficulty of consistently identifying original and attribute-modified objects: LLaVA-NeXT-Qwen-32B and GPT-4o both correctly recognize the original "mango", but both fail on a blue variant. Panels (c) and (d) show the average accuracy of LLaVA-NeXT-Qwen-32B and GPT-4o across objects in NEMO, comparing original and attribute-modified versions. Panel (e) gives a comparative overview of average scores for representative MLLMs on original (upper) and attribute-modified (lower) objects.
@misc{li2024nemo,
title={NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?},
author={Jiaxuan Li and Junwen Mo and MinhDuc Vo and Akihiro Sugimoto and Hideki Nakayama},
year={2024},
eprint={2411.17794},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.17794},
}