Supplementary Material (Anonymous)
Modern audio content is often created by mixing stems from multiple sources. This work studies separation-first multi-stream audio watermarking, where each stem carries an independent watermark. We demonstrate that naive combinations of watermarking and source separation lead to poor post-separation decoding accuracy. By jointly training the watermarking system with the separator, watermark robustness after separation can be significantly improved while maintaining perceptual audio quality.
The following diagram shows the separation-first watermarking pipeline.
The examples below demonstrate the separation-first watermarking pipeline shown above. In each case, the left column corresponds to the Accompaniment stem and the right column corresponds to the Vocal stem. Each stem is independently watermarked with its own key and bit sequence. The watermarked stems are mixed, the mixture is separated using the Demucs separator, and the watermark bits are decoded from each separated stem. For each method (Baseline, Joint, Finetuned), we show the original stems, the watermarked stems, the embedded watermark bits, the separated stems, and the decoded bits with their bit error rate (BER). Incorrectly decoded bits are highlighted in red.
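The evaluation loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for these examples: `embed`, `separate`, and `decode` are hypothetical placeholder callables standing in for the watermark embedder, the separator, and the watermark decoder, and stems are represented as plain lists of samples. The bit error rate (BER) is computed as the fraction of decoded bits that differ from the embedded bits.

```python
def bit_error_rate(embedded, decoded):
    """Return (BER, indices of incorrect bits) for two equal-length bit sequences."""
    if len(embedded) != len(decoded):
        raise ValueError("bit sequences must have the same length")
    if not embedded:
        raise ValueError("bit sequences must not be empty")
    wrong = [i for i, (e, d) in enumerate(zip(embedded, decoded)) if e != d]
    return len(wrong) / len(embedded), wrong


def evaluate_separation_first(stems, bits, keys, embed, separate, decode):
    """Watermark each stem, mix, separate, decode, and report per-stem BER.

    stems, bits, keys: dicts indexed by stem name (e.g. "vocal").
    embed(stem, bits, key) -> watermarked stem      (hypothetical placeholder)
    separate(mixture) -> dict of estimated stems    (hypothetical placeholder)
    decode(stem, key) -> decoded bit sequence       (hypothetical placeholder)
    """
    # Each stem is watermarked independently with its own key and bits.
    watermarked = {name: embed(stems[name], bits[name], keys[name]) for name in stems}

    # Mixing is assumed here to be a sample-wise sum of the watermarked stems.
    mixture = [sum(samples) for samples in zip(*watermarked.values())]

    # The mixture is separated, then each estimated stem is decoded with its key.
    separated = separate(mixture)
    report = {}
    for name in stems:
        ber, wrong = bit_error_rate(bits[name], decode(separated[name], keys[name]))
        report[name] = {"ber": ber, "wrong_bits": wrong}
    return report
```

The `wrong_bits` indices correspond to the positions that would be highlighted in red in the examples.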