The Impact of Mask-Text Alignment and Multi-Scale Ensemble on Uni-OVSeg’s Segmentation Accuracy

Mask-text alignment. Compared to the straightforward baseline, as shown in Tab. 3, our proposed Uni-OVSeg achieves significant gains of 4.8% PQ and 9.5% mIoU on the COCO dataset, and 11.2% mIoU on the PASCAL Context-59 dataset. This demonstrates that our method effectively aligns objects in images with entities in text descriptions, generalising the CLIP embedding space from the image level to the pixel level. Refining the text descriptions makes the new texts more correlated with the corresponding images, improving the mIoU from 34.5% to 37.3% on the COCO dataset. Compared to the traditional NLP toolkit (NLTK) [3], the ChatGPT-based parser extracts more reliable entities from text descriptions, yielding improvements of 3.1% and 3.7% mIoU on the COCO and PASCAL Context-59 datasets, respectively. Finally, the proposed multi-scale ensemble strategy, which leverages the multi-scale information of objects within the images, stabilises the mask-text matching and yields a gain of 1.8% PQ on the COCO dataset.
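Since the ablation above is reported largely in mIoU, a small worked sketch of that metric may help when reading the numbers. It follows the standard definition (per-class intersection over union, averaged over the classes that occur); the function name `mean_iou` and the `ignore_index=255` convention are assumptions for illustration, not details taken from the paper, and benchmark implementations typically accumulate the counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in the prediction or ground truth.

    pred and target are flat sequences of integer class ids, one per pixel.
    Pixels whose ground-truth label equals ignore_index are skipped
    (an assumed convention, not something stated in the source).
    """
    inter = [0] * num_classes  # pixels where prediction and ground truth agree
    union = [0] * num_classes  # pixels labelled c in prediction or ground truth
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
# Class 0: intersection 1, union 2. Class 1: intersection 2, union 3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy case the two per-class IoUs are 1/2 and 2/3, so the mean is 7/12.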

Figure 4. Visualisation of point-promptable automatic mask generation. We adopt a 20 × 20 point grid as a visual prompt and select the output masks with the highest IoU against the ground-truth masks.

Table 3. Ablation study on mask-text alignment. “Refine.” denotes the text refinement by the LVLM. “Parser.” denotes the text parser, which extracts entities from text descriptions. “NLTK” and “GPT” denote the Natural Language Toolkit and the ChatGPT-based parser, respectively. “M.S.” denotes the multi-scale ensemble strategy.

Multi-scale ensemble in mask-text matching. The quality of correspondence between masks and entities is an essential part of mask-text matching. To investigate the impact of multi-scale information on this correspondence, as illustrated in Tab. 4, we use masks and semantic classes from the ADE20K and COCO datasets, reporting the top-1 accuracy and forward time per sample. We first resize input images to multiple resolutions and extract visual features via the CLIP visual encoder. Given ground-truth masks, regional features are pooled from the CLIP visual features and projected into the CLIP embedding space. Each regional embedding is then classified against the text embeddings. Taking into account the trade-off between performance and latency, we adopt the sizes of 896 × 896 and 1024 × 1024 as default.
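The evaluation procedure described above (resize, encode, pool inside each ground-truth mask, classify against text embeddings) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for the CLIP visual encoder, the projection into the CLIP embedding space is omitted, and averaging the normalised regional embeddings across scales is one plausible way to ensemble, which the text does not specify.

```python
import numpy as np


def resize_nearest(arr, size):
    """Nearest-neighbour resize of an (H, W, ...) array to (size, size, ...)."""
    h, w = arr.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return arr[rows][:, cols]


def toy_encoder(image, size, patch=4):
    """Hypothetical stand-in for the CLIP visual encoder (not the real model):
    resize the image, then average each patch x patch block into one vector."""
    x = resize_nearest(image, size)                      # (size, size, D)
    g = size // patch
    x = x[: g * patch, : g * patch]
    return x.reshape(g, patch, g, patch, -1).mean(axis=(1, 3))  # (g, g, D)


def mask_pool(features, mask):
    """Average the feature vectors that fall inside a binary mask."""
    m = resize_nearest(mask.astype(np.float32), features.shape[0])  # (g, g)
    weights = m / max(m.sum(), 1e-6)
    return (features * weights[..., None]).sum(axis=(0, 1))


def l2_normalise(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-6)


def classify_masks(image, masks, text_emb, sizes):
    """Top-1 class index per mask, ensembling regional embeddings over sizes."""
    per_scale = []
    for size in sizes:
        feats = toy_encoder(image, size)
        emb = np.stack([mask_pool(feats, m) for m in masks])
        per_scale.append(l2_normalise(emb))
    # Assumed ensemble rule: mean of the per-scale normalised embeddings.
    region_emb = l2_normalise(np.mean(per_scale, axis=0))  # (N, D)
    logits = region_emb @ l2_normalise(text_emb).T          # cosine similarity
    return logits.argmax(axis=1)


# Toy data: left half of the image is "class 0", right half is "class 1".
image = np.zeros((8, 8, 3))
image[:, :4, 0] = 1.0
image[:, 4:, 2] = 1.0
left = np.zeros((8, 8), dtype=bool)
left[:, :4] = True
masks = [left, ~left]
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

pred = classify_masks(image, masks, text_emb, sizes=(16, 32))
top1 = float((pred == np.array([0, 1])).mean())
```

Adding more entries to `sizes` adds one encoder pass per entry, which is the latency side of the trade-off the paragraph mentions.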

:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::

This content originally appeared on HackerNoon and was authored by Segmentation


APA

Segmentation | Sciencx (2024-11-12T22:27:06+00:00) The Impact of Mask-Text Alignment and Multi-Scale Ensemble on Uni-OVSeg’s Segmentation Accuracy. Retrieved from https://www.scien.cx/2024/11/12/the-impact-of-mask-text-alignment-and-multi-scale-ensemble-on-uni-ovsegs-segmentation-accuracy/


