Click-based interactive segmentation aims to segment target objects conditioned on user-provided clicks. Existing methods typically interpret user intention by learning multiple click prompts to generate corresponding prompt-activated masks, and selecting one from these masks. However, directly matching each prompt to the same visual feature often leads to homogeneous prompt-activated masks, as it pushes the click prompts to converge to one point.
To address this problem, we propose Click Prompt Learning with Optimal Transport (CPlot), which leverages optimal transport theory to capture diverse user intentions with multiple click prompts. Specifically, we first introduce a prompt-pixel alignment module (PPAM), which aligns each click prompts with the visual features in the same feature space by plain transformer blocks. In such way, PPAM enables all click prompts to encode more general knowledge about regions of interest, indicating a consistent user intention.
To capture diverse user intentions, we further propose the click prompt optimal transport module (CPOT) to match click prompts and visual features. CPOT is designed to learn an optimal mapping between click prompts and visual features. Such unique mapping facilities click prompts to effectively focus on distinct visual regions, which reflect underlying diverse user intentions. Furthermore, CPlot learns click prompts with a two-stage optimization strategy: the inner loop optimizes the optimal transport distance to align visual features with click prompts through the Sinkhorn algorithm, while the outer loop adjusts the click prompts from the supervised data.
Extensive experiments on eight interactive segmentation benchmarks demonstrate the superiority of our method for interactive segmentation
@article{jie2024click,
author = {Liu, Jie and Wang, Haochen and Yin, Wenzhe and Sonke, Jan-Jakob and Gavves, Efstratios},
title = {Click Prompt Learning with Optimal Transport for Interactive Segmentation},
journal = {ECCV},
year = {2024},
}