Given the list of user-defined target categories \(\mathcal{C}\), we use Stable Diffusion to generate a synthetic single-object image dataset. Each image is encoded by SAM's image encoder, and a uniformly spaced grid of \(d\) points is generated across the image to prompt SAM. The image and point embeddings are passed to a transformer decoder, and the resulting mask embeddings \(\mathbf{m}_i \in \mathbb{R}^{d \times 1024}\) (one 1024-dimensional embedding for each of the \(d\) masks SAM predicts for image \(i\)) are used to train a classifier head \(\theta\) with a Multiple Instance Learning (MIL) objective and uncertainty losses.
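The MIL setup can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: `make_point_grid` and `MILClassifierHead` are hypothetical names, random tensors stand in for SAM's decoder output, and the uncertainty losses are omitted. Each synthetic image is treated as a bag of \(d\) mask embeddings whose single-object label is explained by its most confident instance (max-pooling over masks).

```python
# Minimal sketch of the grid prompting + MIL classification step.
# Assumes mask embeddings m in R^{d x 1024} were already extracted from
# SAM's transformer decoder for the d grid-point prompts; all names here
# are illustrative, not part of the released U-SAM code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_point_grid(n_per_side: int, img_size: int = 1024) -> torch.Tensor:
    """Uniformly spaced (x, y) prompt grid; d = n_per_side ** 2 points."""
    step = img_size / (n_per_side + 1)
    coords = torch.arange(1, n_per_side + 1) * step
    ys, xs = torch.meshgrid(coords, coords, indexing="ij")
    return torch.stack([xs.flatten(), ys.flatten()], dim=-1)  # (d, 2)

class MILClassifierHead(nn.Module):
    """Classifier head theta applied to per-mask embeddings of one image."""
    def __init__(self, embed_dim: int = 1024, num_classes: int = 5):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, mask_embeddings: torch.Tensor) -> torch.Tensor:
        # mask_embeddings: (d, embed_dim), one row per predicted mask.
        logits = self.fc(mask_embeddings)        # (d, num_classes)
        # MIL bag pooling: the synthetic image contains a single object,
        # so the image-level label is attributed to the most confident mask.
        bag_logits, _ = logits.max(dim=0)        # (num_classes,)
        return bag_logits

# Toy usage with random embeddings standing in for SAM's decoder output.
d = make_point_grid(8).shape[0]                  # d = 64 point prompts
m = torch.randn(d, 1024)                         # stand-in mask embeddings
head = MILClassifierHead(num_classes=5)
loss = F.cross_entropy(head(m).unsqueeze(0), torch.tensor([2]))
```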
Qualitative Results: The four columns show, respectively, the original image, the ground truth (GT) mask, all SAM-generated masks overlaid on top of one another, and the U-SAM predicted masks. For the GT and U-SAM masks, the colors indicate class labels; random colors are used for the SAM masks, since SAM does not provide class labels.
Qualitative results with changed granularity: The third column shows the U-SAM predictions on PASCAL classes, and the fourth column shows the predictions when the granularity level is changed to group the "dog", "cat", and "sheep" classes into a single "animals" class. The colors represent the class labels (brown: "sheep"; violet: "dog"; dark brown: "cat"; red: "animals"; pink: "person"). GC: Granularity Changed.
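Since supervision is driven entirely by the user-defined category list \(\mathcal{C}\), changing granularity amounts to supplying a coarser list. The snippet below is purely illustrative of the label grouping shown in the figure; the source does not specify whether U-SAM retrains on the coarser \(\mathcal{C}\) or remaps existing predictions, and all names here are hypothetical.

```python
# Illustrative only: collapsing fine-grained labels into the coarser
# user-defined "animals" category from the figure above.
GRANULARITY_MAP = {"dog": "animals", "cat": "animals", "sheep": "animals"}

def remap_labels(predicted: list[str]) -> list[str]:
    # Labels outside the map (e.g., "person") keep their original class.
    return [GRANULARITY_MAP.get(label, label) for label in predicted]

print(remap_labels(["dog", "sheep", "person"]))  # ['animals', 'animals', 'person']
```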
Quantitative Results obtained by U-SAM.
@inproceedings{kundu2025repurposing,
  title={Repurposing SAM for User-Defined Semantics Aware Segmentation},
  author={Kundu, Rohit and Paul, Sudipta and Dutta, Arindam and Roy-Chowdhury, Amit K.},
  booktitle={CVPR Workshops},
  year={2025}
}
Copyright: CC BY-NC-SA 4.0 © Rohit Kundu | Last updated: 05 April 2025 | Website credits to Nerfies