HAMR 2013 Proceedings


====== Remixavier ======

| Authors | Colin Raffel, Dan Ellis |
| Affiliation | Columbia University |
| Code | [[https://github.com/craffel/remixavier|Github Link]] |

We propose a technique for removing certain sources from an audio mixture when a separate recording of those sources is available.

===== - Introduction =====

A common task for source separation is the isolation of a single instrument from a digital recording of a piece of music. Because of the inherent difficulty of blind source separation, various attempts have been made to exploit additional information. In the present work, we consider the case where we can obtain a recording of a subset of the instruments present in a given piece. Our task is then to isolate those instruments which are not included in this recording.

This task is motivated by the fact that many pieces of popular music are released in instrumental (without vocals) and a cappella (vocals only) forms. A search on the comprehensive music catalog [[http://discogs.com|Discogs]] reveals [[http://www.discogs.com/search?type=all&title=&credit=&artist=&genre=&label=&style=&track=acapella+OR+acappella+OR+%22a+cappella%22+OR+%22a+capella%22&country=&catno=&year=&barcode=&submitter=&anv=&contributor=&format=|37,127 releases]] which include an a cappella version and [[http://www.discogs.com/search?type=all&title=&credit=&artist=&genre=&label=&style=&track=instrumental&country=&catno=&year=&barcode=&submitter=&anv=&contributor=&format=|192,688 releases]] which include an instrumental version. However, it is often the case that a song is not released in both instrumental and a cappella form (illustrated by the disparity between the number of releases with instrumentals and the number with a cappella recordings). Therefore, given a recording of a subset of the instruments present in a piece, we seek to isolate the other instruments.
If we have a recording of all instruments and a recording of a subset of the instruments, and the two recordings are perfectly aligned in time, released on the same media, and mixed identically, separating the remaining instruments is a simple matter of subtracting the time-domain waveforms. However, these conditions rarely hold. It is more often the case that the recordings were released on different media (CD and vinyl, for example), mastered((Mastering is the process of applying equalization, compression, reverb, and other nonlinearities to an audio recording to improve its sound.)) differently, and are not aligned to the same timebase. As a result, we propose an algorithm which can cope with these discrepancies and achieve high-quality separation.

==== - Algorithm ====

Denote the time-domain digitized signal of a piece of music as $m$. We assume we can obtain a signal $s$ which represents a recording of the same piece of music but only including a subset of the instruments included in the original recording. We then seek the signal $r$ which represents a recording of the instruments in $m$ which are not included in $s$. If $m$ and $s$ are perfectly aligned in time and have no channel distortion, we can retrieve $r$ by computing $r = m - s$. However, this is rarely the case. Our algorithm therefore carries out the following steps:

  * Alignment of $m$ and $s$
  * Estimation of channel distortion present in $s$
  * Generating and enhancing an estimate of $r$

We outline each of these steps in the following sections.

=== - Alignment ===

To align the signals $m$ and $s$, we compute a cross-correlation of short segments of each signal. For each segment, we find the lag of the peak correlation value, and we then perform a linear fit to the peak locations across the song. The intercept of this line represents the relative offset of the two recordings and its slope represents their "skew", or the extent to which they have been recorded on different timescales.
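The segment-wise cross-correlation and linear fit can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name ''estimate_offset_and_skew'', the parameters ''seg_len'', ''hop'', and ''max_lag'', and the use of an ordinary least-squares fit are all choices made here for illustration, and both signals are assumed to be mono NumPy arrays at the same nominal sample rate.

```python
import numpy as np

def estimate_offset_and_skew(m, s, seg_len=4096, hop=8192, max_lag=512):
    """Fit lag(t) ~= offset + skew * t from per-segment correlation peaks.

    A lag of k samples means that s[t] best matches m[t + k].
    All parameter names and defaults are hypothetical.
    """
    starts, lags = [], []
    last_start = min(len(m), len(s)) - seg_len - max_lag
    for start in range(max_lag, last_start, hop):
        seg_s = s[start:start + seg_len]
        # Window of m wide enough to cover lags in [-max_lag, +max_lag]
        win_m = m[start - max_lag:start + seg_len + max_lag]
        # "valid" mode yields one correlation value per candidate lag
        xc = np.correlate(win_m, seg_s, mode="valid")
        lags.append(np.argmax(xc) - max_lag)
        starts.append(start)
    # Least-squares line through the peak lags: slope = skew, intercept = offset
    skew, offset = np.polyfit(starts, lags, 1)
    return offset, skew

# Usage: s is white noise and m is s delayed by 37 samples, with no skew
rng = np.random.default_rng(0)
s = rng.standard_normal(50000)
m = np.roll(s, 37)
offset, skew = estimate_offset_and_skew(m, s)
```

In this sketch the search is limited to lags within ''max_lag'' samples, so a large initial offset would need a coarse alignment first. A robust fit could replace the least-squares line if some segments (for example, silent ones) produce unreliable peaks.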
Once we have estimated the offset and skew, we remove samples to compensate for the offset and resample to compensate for the skew, so that the signals are aligned in time.

=== - Channel Distortion ===

To estimate the channel distortion, we assume that there is some filter $h$ such that

$$m = h\ast s + r$$

Note that this model does not always hold exactly (for example, when different nonlinearities have been applied to $m$ and $s$). However, we find that the approximation works well in practice. To estimate $h$, we first compute the magnitude of the short-time Fourier transform (STFT) of each signal, which gives

$$M = H \cdot S + R$$

or

$$R = M - H \cdot S$$

where $\cdot$ represents element-wise multiplication (the Hadamard product). We then assume that the optimal $H$ is the one for which the frequency-domain representation of $r$ is maximally sparse. This gives rise to the optimization problem

$$\min_H \|M - H \cdot S\|_1$$

Note that each element of $H$ is independent and depends only on a single row of $M$ and $S$. We can therefore instead solve

$$H_i = \arg\min_{H_i} \|M_i - H_i S_i\|_1$$

where $H_i$ is the $i$th element of $H$ (a scalar) and $M_i$ and $S_i$ are the magnitudes of the $i$th frequency bin of each spectrogram across time. This problem can be efficiently solved via a scalar optimization, giving us the magnitude of our filter $H$.

=== - Enhancement ===

Once we have estimated the channel filter $H$, we can apply it to the short-time Fourier transform of $s$ to normalize the channel distortion. We can then compute

$$\hat{r} = m - h\ast s$$

which gives us an approximation to $r$, denoted $\hat{r}$. If we have successfully aligned the signals and estimated the channel distortion, $\hat{r}$ will be a good perceptual estimate of $r$. However, we can attempt to further enhance our estimate $\hat{r}$ using Wiener filtering.
This process examines the log-magnitude short-time Fourier transforms of $\hat{r}$ and $s$ (denoted $L[\hat{R}]$ and $L[S]$ respectively) and only retains the energy in those bins in $L[\hat{R}]$ which are substantially larger than those in $L[S]$.

==== - Results ====

We find that we achieve good perceptual separation when the channel distortion is small, and that Wiener filtering does not help in this case. When the channel distortion is large (e.g., different masterings on different media), the Wiener filter can improve the perceptual quality dramatically.
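The channel-distortion and enhancement steps from the Algorithm section can be sketched on magnitude spectrograms as follows. This is an illustration under stated assumptions rather than the authors' code. For the per-bin problem $\min_{H_i} \|M_i - H_i S_i\|_1$, the sketch uses the fact that $\sum_t |M_{i,t} - H_i S_{i,t}| = \sum_t S_{i,t} |M_{i,t}/S_{i,t} - H_i|$, so a minimizer is the weighted median of the ratios $M_{i,t}/S_{i,t}$ with weights $S_{i,t}$; this stands in for the scalar optimization mentioned in the text. The enhancement function is a simple hard-threshold stand-in, and the names ''l1_channel_gain'', ''log_magnitude_mask'', ''threshold_db'', and ''eps'' are hypothetical.

```python
import numpy as np

def l1_channel_gain(M, S, eps=1e-12):
    """Per-bin gain H[i] minimizing sum_t |M[i, t] - H[i] * S[i, t]|.

    M and S are magnitude spectrograms of shape (n_bins, n_frames).
    """
    M = np.asarray(M, dtype=float)
    S = np.asarray(S, dtype=float)
    H = np.zeros(M.shape[0])
    for i in range(M.shape[0]):
        keep = S[i] > eps  # frames where S is silent do not constrain H[i]
        if not np.any(keep):
            continue
        ratios = M[i, keep] / S[i, keep]
        weights = S[i, keep]
        order = np.argsort(ratios)
        cum = np.cumsum(weights[order])
        # Weighted median: first ratio whose cumulative weight reaches half
        k = np.searchsorted(cum, 0.5 * cum[-1])
        H[i] = ratios[order][k]
    return H

def log_magnitude_mask(R_hat, S, threshold_db=6.0, eps=1e-10):
    """Keep only bins of R_hat whose log-magnitude exceeds that of S
    by more than threshold_db (a hypothetical threshold)."""
    L_R = 20.0 * np.log10(np.abs(R_hat) + eps)
    L_S = 20.0 * np.log10(np.abs(S) + eps)
    return np.where(L_R - L_S > threshold_db, R_hat, 0.0)

# Usage on synthetic data: a sparse residual R on top of a filtered S
rng = np.random.default_rng(0)
S = rng.uniform(0.5, 2.0, size=(4, 200))
H_true = np.array([0.5, 1.0, 2.0, 0.8])
active = rng.random((4, 200)) < 0.2           # residual present in ~20% of bins
R = active * rng.uniform(1.0, 3.0, size=(4, 200))
M = H_true[:, None] * S + R
H_est = l1_channel_gain(M, S)
R_hat = np.maximum(M - H_est[:, None] * S, 0.0)  # magnitude-domain residual
R_enh = log_magnitude_mask(R_hat, S)
```

Because the synthetic residual is sparse, the weighted median recovers the true gains in this example. A soft mask with a smooth transition around the threshold may sound better than the hard threshold used here.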
