Peike Li

JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation

With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation from scratch. However, finer-grained control over multi-track generation remains an open challenge. …

JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning

Large models for text-to-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user …

JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models

Music generation has attracted growing interest with the advancement of deep generative models. However, generating music conditioned on textual descriptions, known as text-to-music, remains challenging due to the complexity of musical structures …

Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos

Existing audio-visual segmentation datasets typically focus on short-trimmed videos with only one pixel-map annotation for a per-second video clip. In contrast, for untrimmed videos, the sound duration, start- and end-sounding time positions, and …

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior works are …

🍔In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task which learns visual correspondence from unlabeled videos. Different from the previous methods which are mainly based on the contrastive learning paradigm, we propose …

👗M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing

The fashion industry has diverse applications in multi-modal image generation and editing. It aims to create a desired high-fidelity image with the multi-modal conditional signal as guidance. Most existing methods learn different condition guidance …

Super-Resolving Cross-Domain Face Miniatures by Peeking at One-Shot Exemplar

Conventional face super-resolution methods usually assume testing low-resolution (LR) images lie on the same domain as the training ones. Due to different lighting conditions and imaging hardware, domain gaps between training and testing images …

Consistent Structural Relation Learning for Zero-Shot Segmentation

Zero-shot semantic segmentation aims to recognize the semantics of pixels from unseen categories with zero training samples. Previous practice [1] proposed to train the classifiers for unseen categories using the visual features generated from …

Meta Parsing Networks: Towards Generalized Few-shot Scene Parsing with Adaptive Metric Learning

Recent progress in few-shot segmentation usually aims at performing novel object segmentation using a few annotated examples as guidance. In this work, we advance this few-shot segmentation paradigm towards a more challenging yet general scenario, …