Given the list of user-defined target categories \(\mathcal{C}\), we use Stable Diffusion to generate a synthetic single-object image dataset. Each image is encoded by SAM's image encoder, and a uniformly spaced grid of \(d\) points is generated across the image to prompt SAM. The image and point embeddings are passed to a transformer decoder, and the resulting mask embeddings \(\mathbf{m}_i \in \mathbb{R}^{d \times 1024}\) (one 1024-dimensional embedding for each of the \(d\) masks SAM predicts for image \(i\)) are used to train a classifier head \(\theta\) with a Multiple Instance Learning (MIL) objective and uncertainty losses.
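The MIL setup can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: `make_point_grid` and `MILClassifierHead` are hypothetical names, random tensors stand in for SAM's decoder output, and the uncertainty losses are omitted. Each synthetic image is treated as a bag of \(d\) mask embeddings whose single-object label is explained by its most confident instance (max-pooling over masks).

```python
# Minimal sketch of the grid prompting + MIL classification step.
# Assumes mask embeddings m in R^{d x 1024} were already extracted from
# SAM's transformer decoder for the d grid-point prompts; all names here
# are illustrative, not part of the released U-SAM code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_point_grid(n_per_side: int, img_size: int = 1024) -> torch.Tensor:
    """Uniformly spaced (x, y) prompt grid; d = n_per_side ** 2 points."""
    step = img_size / (n_per_side + 1)
    coords = torch.arange(1, n_per_side + 1) * step
    ys, xs = torch.meshgrid(coords, coords, indexing="ij")
    return torch.stack([xs.flatten(), ys.flatten()], dim=-1)  # (d, 2)

class MILClassifierHead(nn.Module):
    """Classifier head theta applied to per-mask embeddings of one image."""
    def __init__(self, embed_dim: int = 1024, num_classes: int = 5):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, mask_embeddings: torch.Tensor) -> torch.Tensor:
        # mask_embeddings: (d, embed_dim), one row per predicted mask.
        logits = self.fc(mask_embeddings)        # (d, num_classes)
        # MIL bag pooling: the synthetic image contains a single object,
        # so the image-level label is attributed to the most confident mask.
        bag_logits, _ = logits.max(dim=0)        # (num_classes,)
        return bag_logits

# Toy usage with random embeddings standing in for SAM's decoder output.
d = make_point_grid(8).shape[0]                  # d = 64 point prompts
m = torch.randn(d, 1024)                         # stand-in mask embeddings
head = MILClassifierHead(num_classes=5)
loss = F.cross_entropy(head(m).unsqueeze(0), torch.tensor([2]))
```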
Qualitative Results: The four columns show, respectively, the original image, the ground truth (GT) mask, all SAM-generated masks overlaid on top of one another, and the U-SAM predicted masks. For the GT and U-SAM masks, the colors indicate class labels; random colors are used for the SAM masks, since SAM does not provide class labels.
Qualitative results with changed granularity: The third column shows the U-SAM predictions on PASCAL classes, and the fourth column shows the predictions when the granularity level is changed to group the "dog", "cat", and "sheep" classes into a single "animals" class. The colors represent the class labels (brown: "sheep"; violet: "dog"; dark brown: "cat"; red: "animals"; pink: "person"). GC: Granularity Changed.
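Since supervision is driven entirely by the user-defined category list \(\mathcal{C}\), changing granularity amounts to supplying a coarser list. The snippet below is purely illustrative of the label grouping shown in the figure; the source does not specify whether U-SAM retrains on the coarser \(\mathcal{C}\) or remaps existing predictions, and all names here are hypothetical.

```python
# Illustrative only: collapsing fine-grained labels into the coarser
# user-defined "animals" category from the figure above.
GRANULARITY_MAP = {"dog": "animals", "cat": "animals", "sheep": "animals"}

def remap_labels(predicted: list[str]) -> list[str]:
    # Labels outside the map (e.g., "person") keep their original class.
    return [GRANULARITY_MAP.get(label, label) for label in predicted]

print(remap_labels(["dog", "sheep", "person"]))  # ['animals', 'animals', 'person']
```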
Quantitative Results obtained by U-SAM.
@inproceedings{kundu2025repurposing,
  title={Repurposing SAM for User-Defined Semantics Aware Segmentation},
  author={Kundu, Rohit and Paul, Sudipta and Dutta, Arindam and Roy-Chowdhury, Amit K.},
  booktitle={CVPR Workshops},
  year={2025}
}
Copyright: CC BY-NC-SA 4.0 © Rohit Kundu | Last updated: 05 April 2025 | Website credits to Nerfies