With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation from scratch. However, finer-grained control over multi-track generation remains an open challenge. …
Large models for text-to-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user …
Music generation has attracted growing interest with the advancement of deep generative models. However, generating music conditioned on textual descriptions, known as text-to-music, remains challenging due to the complexity of musical structures …
Existing audio-visual segmentation datasets typically focus on short-trimmed videos with only one pixel-map annotation for a per-second video clip. In contrast, for untrimmed videos, the sound duration, start- and end-sounding time positions, and …
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to obtain sounding object masks. However, we observe that prior works are …
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos. Unlike previous methods, which are mainly based on the contrastive learning paradigm, we propose …
The fashion industry has diverse applications in multi-modal image generation and editing. It aims to create a desired high-fidelity image with the multi-modal conditional signal as guidance. Most existing methods learn different condition guidance …
Conventional face super-resolution methods usually assume that testing low-resolution (LR) images lie in the same domain as the training ones. Due to different lighting conditions and imaging hardware, domain gaps between training and testing images …
Zero-shot semantic segmentation aims to recognize the semantics of pixels from unseen categories with zero training samples. Previous practice [1] proposed to train the classifiers for unseen categories using the visual features generated from …
Recent work on few-shot segmentation usually aims to segment novel objects using a few annotated examples as guidance. In this work, we advance this few-shot segmentation paradigm towards a more challenging yet general scenario, …