realtime_solo-to-tutti_audio_alignment_separation-by-humming_for_realtime_karaoke_generation

Automatic accompaniment (e.g. a karaoke that follows your playing) kicks ass. It further kicks ass when the karaoke track is generated from that favorite recording of yours, with the soloist separated out.

One thing that bugs me, though, is that existing methods require digital score data (e.g. a standard MIDI file, MusicXML, etc.). Preparing an SMF is annoying, so I want an accompaniment system that works without one. The SMF is used for two purposes: (1) making a karaoke track from your favorite recording (informed source separation), and (2) synchronization, i.e. (2a) tracking where you are playing in the music (score following) and (2b) knowing which part of the karaoke track the system should be playing back (offline alignment).

So, my goal is to circumvent the use of SMF for (1) generating the karaoke track, and (2) synchronizing your playing to the karaoke track.

I basically want to (1) load a favorite violin concerto, (2) play the violin concerto on my violin, then (3) the track from (1) plays in sync with me, with the violin solo part separated out.

- Input
  - **U** (realtime input): amplitude spectrogram of the live audio signal
  - **X** (wav file): amplitude spectrogram of a reference recording

  **X** contains an accompaniment track mixed with an acoustic rendition of the underlying music score behind **U**. E.g., **X** is a recording of a violin concerto, and **U** is an end-user playing the violin part into the microphone.

- Output
  - Realtime alignment between **X** and **U**
  - Realtime audio output (a karaoke version of **X** aligned to **U**), i.e. **X** - **U**

I split the method into two stages: online alignment and separation.

I use an HMM, where each frame of **X** is treated as a single state.
Then, I decode the state sequence of **U** in an online manner.
I filter the posterior distribution with the forward algorithm to infer the MAP state (= position in **X**) at the current time.

The HMM is left-to-right, allowing the current state to (1) stay in the same state, or (2) advance to the next state.
The key here is that **X** is computed at a lower frame rate (i.e., with less window overlap) than **U**.
For example, **X** is computed at 10 frames per second, whereas **U** is computed at 50 frames per second.
This way, the left-to-right architecture permits the user to play faster than **X**, and the number of states stays manageable for a moderately long piece of music.

Aside 1: Elaborate schemes using a semi-Markov HMM weren't worth the effort, at least for a simple duration pdf.

Aside 2: I first tried modeling the state dynamics with a particle filter, but it didn't quite work. With finitely many particles, once the filter gets “stuck,” a simple proposal distribution is insufficient to recover the right position.

For the observation pdf, I assume that at state s, **U**(t) follows a Poisson distribution with mean **X**(s).
I found normalizing **U** and **X** to be effective; it makes the model behave more like a multinomial observation.
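Under this model, the per-state log-likelihood needed by the forward filter is just the Poisson log-density summed over frequency bins (the `log(u!)` term is constant across states and can be dropped). A minimal sketch with the normalization described above:

```python
import numpy as np

def poisson_loglik(u, X, eps=1e-10):
    """Log-likelihood of an observed frame u under Poisson(X[s]), for every state s.

    u : (F,)   amplitude spectrum of the current U frame
    X : (S, F) amplitude spectrogram of the reference, one row per state
    Both are normalized to sum to 1, making the model close to a
    multinomial over frequency bins, as suggested in the text.
    """
    u = u / (u.sum() + eps)
    Xn = X / (X.sum(axis=1, keepdims=True) + eps)
    # sum_f [ u_f * log(x_sf) - x_sf ]  for each state s
    return u @ np.log(Xn + eps).T - Xn.sum(axis=1)

# a frame identical to state 2's spectrum should score highest there
X = np.abs(np.random.default_rng(1).normal(size=(5, 8)))
u = X[2].copy()
print(int(np.argmax(poisson_loglik(u, X))))
```

With both sides normalized, the score reduces to a negative cross-entropy, which is maximized exactly when the observed spectrum matches the state's template.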

I use a Poisson-Gamma NMF to separate the accompaniment.
The basic idea is to concatenate the solo **U**(t) and the aligned tutti **X**(s), treat the result as a single magnitude spectrogram, and decompose it:

$\mathbf{M}(t,f) \overset{\mathrm{def}}{=} \begin{bmatrix}\mathbf{X}(s)\\ \mathbf{U}(t)\end{bmatrix} \approx \sum_i \mathbf{H}(t,i)\,\mathbf{W}(i,f)$

Then, if I were to perform NMF on this matrix, components shared by **U** and **X** are likely to be captured by a common basis, and components unique to **U** or **X** by other bases.
Because the NMF is overdetermined in this case, this reasoning won't hold unless we add constraints.
Namely, we want a parsimonious representation in the spectral bases, so it pays to “overexplain” the observation using shared components (with activations near zero for the non-shared parts) and non-shared components (with activations near zero for the shared parts).

Thus, we use a Poisson-Gamma (Bayesian) NMF, which allows us to incorporate a sparse prior distribution on the decomposed spectral bases. Basically, we assume **M**(t,f) ~ Poisson($\Sigma_i$ **H**(t,i)**W**(i,f)), with H ~ Gamma(a, b) and W ~ Gamma(c, d). Setting $c<1$ imposes the sparsity constraint.
This kind of NMF can be inferred with a variational Bayesian method in conjunction with minorization-maximization.
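For intuition, here is the maximum-likelihood core of this model, i.e. plain multiplicative-update NMF under the Poisson/KL objective. The full method above additionally places the Gamma priors on **H** and **W** and infers them variationally; this sketch omits the priors:

```python
import numpy as np

def kl_nmf(M, K, n_iter=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF for M ~ Poisson(H W) (KL divergence).

    Sketch of the ML core only: the article's Gamma priors (shape c < 1
    on W for sparsity) and the VB/MM inference are omitted here.
    """
    rng = np.random.default_rng(seed)
    T, F = M.shape
    H = rng.random((T, K)) + eps
    W = rng.random((K, F)) + eps
    for _ in range(n_iter):
        R = M / (H @ W + eps)          # ratio term of the KL gradient
        H *= (R @ W.T) / (W.sum(axis=1) + eps)
        R = M / (H @ W + eps)
        W *= (H.T @ R) / (H.sum(axis=0)[:, None] + eps)
    return H, W

# sanity check on an exactly low-rank nonnegative matrix
rng = np.random.default_rng(2)
M = rng.random((6, 3)) @ rng.random((3, 10))
H, W = kl_nmf(M, K=3)
err = np.abs(M - H @ W).mean() / M.mean()
print(round(err, 4))
```

The multiplicative updates keep **H** and **W** nonnegative by construction and monotonically decrease the KL divergence, which is why they pair naturally with the minorization-maximization view mentioned above.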

Once we infer the posterior distributions of **H** and **W** and find their MAP estimates $H_{MAP}$ and $W_{MAP}$, we generate a frequency mask $\sum_i H'(i)W_{MAP}(i,f)$. Here, $H'(i)$ is the MAP activation of the tutti part minus the MAP activation of the solo part scaled by a gain coefficient $\alpha$, i.e. $H'(i) = H_{MAP}(0,i)-\alpha H_{MAP}(1,i)$.
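A minimal sketch of the mask construction. Two choices here are my assumptions, not from the text: $H'$ is clipped at zero so the mask stays nonnegative, and the mask is normalized by the tutti model spectrum so it lies in $[0,1]$:

```python
import numpy as np

def accompaniment_mask(H_map, W_map, alpha=1.0, eps=1e-9):
    """Build the frequency mask from the MAP NMF factors.

    H_map : (2, K) activations; row 0 = tutti (X) frame, row 1 = solo (U) frame
    W_map : (K, F) spectral bases
    H'(i) = H_map[0, i] - alpha * H_map[1, i], clipped at zero
    (clipping and the [0, 1] normalization are assumptions, not from the text).
    """
    h_prime = np.clip(H_map[0] - alpha * H_map[1], 0.0, None)
    masked = h_prime @ W_map   # accompaniment-only model spectrum
    full = H_map[0] @ W_map    # full tutti model spectrum
    return masked / (full + eps)

# a component active in the solo row gets suppressed in the mask
H = np.array([[1.0, 1.0], [0.0, 1.0]])  # component 1 is shared with the solo
W = np.array([[1.0, 0.0], [0.0, 1.0]])  # basis 0 -> bin 0, basis 1 -> bin 1
mask = accompaniment_mask(H, W)
print(mask.round(3))
```

In the toy example, the frequency bin driven by the solo-active component is masked out, while the accompaniment-only bin passes through unchanged.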

Finally, we apply the frequency mask to the audio playback, which is time-stretched using a phase vocoder.

In the implementation, I also prepared a few “detuned” versions of **X**(s), so as to compensate for small tuning variations.
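One plausible way to generate such detuned templates is to resample each reference frame along the frequency axis; the exact detuning scheme is not specified in the text, so the cent values and the log-linear interpolation below are assumptions:

```python
import numpy as np

def detuned_templates(X, cents=(-20, 0, 20)):
    """Generate slightly pitch-shifted copies of the reference spectrogram X.

    X : (S, F) amplitude spectrogram. A shift of c cents scales every
    frequency by 2**(c/1200); here this is approximated by linearly
    resampling each frame's spectrum at the scaled bin positions.
    """
    S, F = X.shape
    bins = np.arange(F)
    out = []
    for c in cents:
        ratio = 2.0 ** (c / 1200.0)
        src = bins * ratio  # where each output bin reads from
        out.append(np.stack([np.interp(src, bins, row) for row in X]))
    return out

X = np.random.default_rng(3).random((4, 16))
templates = detuned_templates(X)
print(len(templates), templates[0].shape)
```

The 0-cent copy reproduces **X** exactly, and the detuned copies can simply be added as extra observation templates per state, with the forward filter free to favor whichever tuning matches best.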

realtime_solo-to-tutti_audio_alignment_separation-by-humming_for_realtime_karaoke_generation.txt · Last modified: 2014/10/26 05:21 by maezawa