Click Prompt Learning with Optimal Transport for Interactive Segmentation

¹University of Amsterdam, ²Netherlands Cancer Institute

In this work, we propose Click Prompt Learning with Optimal Transport (CPlot) for interactive segmentation. With its key component, Click Prompt Optimal Transport (CPOT), our model captures diverse user intentions, leading to more accurate mask prediction.

Abstract

Click-based interactive segmentation aims to segment target objects conditioned on user-provided clicks. Existing methods typically interpret user intention by learning multiple click prompts, generating a prompt-activated mask for each, and then selecting one of these masks. However, directly matching every prompt to the same visual feature often leads to homogeneous prompt-activated masks, because it pushes the click prompts to converge to a single point.

To address this problem, we propose Click Prompt Learning with Optimal Transport (CPlot), which leverages optimal transport theory to capture diverse user intentions with multiple click prompts. Specifically, we first introduce a prompt-pixel alignment module (PPAM), which uses plain transformer blocks to align each click prompt with the visual features in the same feature space. In this way, PPAM enables all click prompts to encode more general knowledge about regions of interest, indicating a consistent user intention.

To capture diverse user intentions, we further propose the click prompt optimal transport module (CPOT) to match click prompts and visual features. CPOT is designed to learn an optimal mapping between click prompts and visual features. This unique mapping helps click prompts focus on distinct visual regions, which reflect the underlying diverse user intentions. Furthermore, CPlot learns click prompts with a two-stage optimization strategy: the inner loop minimizes the optimal transport distance between visual features and click prompts using the Sinkhorn algorithm, while the outer loop updates the click prompts using the supervised data.
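The inner loop described above can be illustrated with a minimal NumPy sketch of the Sinkhorn algorithm. This is not the CPlot implementation: the cosine-distance cost, the uniform marginals, the regularization strength `eps`, and the names `cosine_cost` and `sinkhorn` are all assumptions made for illustration. The outer loop, which would update the click prompts by backpropagating a supervised loss, is not shown.

```python
import numpy as np


def cosine_cost(prompts, pixels):
    """Cost matrix C[i, j] = 1 - cosine similarity (an assumed cost choice)."""
    p = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    x = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
    return 1.0 - p @ x.T


def sinkhorn(cost, a, b, eps=0.1, n_iters=500):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    Returns a plan T whose rows sum (approximately) to `a` and whose
    columns sum to `b`.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


# Hypothetical sizes: 4 click prompts, 64 pixel features, 16 channels.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 16))
pixels = rng.normal(size=(64, 16))

cost = cosine_cost(prompts, pixels)
a = np.full(4, 1.0 / 4)              # uniform mass over prompts (assumption)
b = np.full(64, 1.0 / 64)            # uniform mass over pixels (assumption)

plan = sinkhorn(cost, a, b)
ot_distance = float((plan * cost).sum())
```

Because every prompt must carry the same total mass under the uniform row marginal, the plan cannot assign all pixels to a single prompt, which is one way to see why an optimal transport matching discourages prompts from collapsing onto the same region.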

Extensive experiments on eight interactive segmentation benchmarks demonstrate the superiority of our method.

Method

Framework of Click Prompt Learning with Optimal Transport (CPlot). Given an input image, click disk maps, and the previous mask, the Image Encoder extracts visual features. The Click Encoder initializes click prompts from the click coordinates. (a) The Prompt-Pixel Alignment Module aligns click prompts with the visual features in the feature space. (b) Click Prompt Optimal Transport uses the optimal transport plan to generate an optimized mask from the vanilla prompt-activated mask. A lightweight mask decoder then implicitly analyzes the optimized prompt-activated mask together with the visual features and makes the mask prediction.

Interactive Segmentation Examples

We show (a) a challenging case on a natural image, (b) a challenging case on a medical image, and (c) five normal cases. The segmentation probability maps are shown in gold; the segmentation maps are overlaid in red on the original images. Positive and negative clicks are marked with green and blue dots on the image, respectively.

BibTeX

@article{jie2024click,
  author    = {Liu, Jie and Wang, Haochen and Yin, Wenzhe and Sonke, Jan-Jakob and Gavves, Efstratios},
  title     = {Click Prompt Learning with Optimal Transport for Interactive Segmentation},
  journal   = {ECCV},
  year      = {2024},
}