Abstract: Despite achieving impressive results in general-purpose semantic segmentation with strong generalization on natural images, the Segment Anything Model (SAM) has shown less precision and ...
Abstract: Videos contain multimodal content, and exploring multi-branch cross-modal interactions with natural language queries can be of benefit to the text-video retrieval task (TVR). However, recent ...