Open-Vocabulary Panoptic Segmentation via Multi-modal Meta-knowledge Transferring

Abstract

Conventional panoptic segmentation methods can only recognize classes from a predefined closed set of categories and require a large number of annotated samples to learn each semantic concept. However, it is prohibitively expensive to enumerate, collect, and annotate all object categories in real open-world scenarios. To eliminate such limitations, we propose a realistic yet challenging task, namely Open-Vocabulary Panoptic Segmentation (OVPS), which segments novel categories by exploiting only natural-language text prompts, without requiring any mask annotations. We propose a framework called Multi-Modal Meta-Knowledge Transferring (M3KT) to panoptically segment novel categories with zero training samples for them. By decoupling grouping from recognition, M3KT effectively generates visual meta-weights for each class-agnostic object. By enforcing similarity between the visual meta-weights and textual meta-weights, M3KT gradually transfers semantic prior knowledge from the language domain to the vision domain. In this way, M3KT learns localization and recognition abilities from the seen base categories and generalizes to unseen novel categories described by open-vocabulary text prompts. Extensive experiments on the COCO-Panoptic benchmark demonstrate the effectiveness of the proposed M3KT network. M3KT also shows strong generalization to novel categories in the cross-domain setting from the COCO-Panoptic dataset to the Cityscapes-Panoptic dataset.
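
The abstract does not specify how the similarity between visual and textual meta-weights is enforced. The sketch below is a minimal, hypothetical PyTorch illustration of one plausible alignment objective (an InfoNCE-style contrastive loss); the function name, the temperature value, and the assumption that each class-agnostic mask is already matched to the text prompt of its base category are all illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def meta_weight_alignment_loss(visual_meta_weights: torch.Tensor,
                               textual_meta_weights: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical alignment loss between visual and textual meta-weights.

    visual_meta_weights:  (N, D) meta-weights predicted for N class-agnostic masks.
    textual_meta_weights: (N, D) meta-weights derived from the text prompts of the
                          base categories matched to those masks (assumed given).
    """
    v = F.normalize(visual_meta_weights, dim=-1)
    t = F.normalize(textual_meta_weights, dim=-1)
    logits = v @ t.T / temperature                      # (N, N) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)  # i-th mask pairs with i-th text
    # Symmetric cross-entropy: pull each visual meta-weight toward its own textual
    # meta-weight and push it away from the others, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

At inference time, an unseen category described only by a text prompt could then be recognized by scoring each class-agnostic mask's visual meta-weight against the prompt's textual meta-weight (e.g., by cosine similarity), which is consistent with, but not confirmed by, the high-level description in the abstract.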

Publication
Preprint