====== Remixavier ======

| Authors | Colin Raffel, Dan Ellis |
| Affiliation | Columbia University |
| Code | [[https://github.com/craffel/remixavier|Github Link]] |
| Matlab | [[http://labrosa.ee.columbia.edu/~dpwe/resources/matlab/remixavier/|Output of 'Publish']] |

We propose a technique for removing certain sources from an audio mixture when a separate recording of those sources is available.

===== - Introduction =====

A common task for source separation is the isolation of a single instrument from a digital recording of a piece of music. Because of the inherent difficulty of blind source separation, various attempts have been made to exploit additional information. In the present work, we consider the case where we can obtain a recording of a subset of the instruments present in a given piece. Our task is then to isolate those instruments which are not included in this recording.

This task is motivated by the fact that many pieces of popular music are released in instrumental (without vocals) and a cappella (vocals only) forms. A search on the comprehensive music catalog [[http://discogs.com|Discogs]] reveals [[http://www.discogs.com/search?type=all&title=&credit=&artist=&genre=&label=&style=&track=acapella+OR+acappella+OR+%22a+cappella%22+OR+%22a+capella%22&country=&catno=&year=&barcode=&submitter=&anv=&contributor=&format=|37,127 releases]] which include an a cappella version and [[http://www.discogs.com/search?type=all&title=&credit=&artist=&genre=&label=&style=&track=instrumental&country=&catno=&year=&barcode=&submitter=&anv=&contributor=&format=|192,688 releases]] which include an instrumental version. However, it is often the case that a song is not released in both instrumental and a cappella form (as illustrated by the disparity between the number of releases with instrumentals and the number with a cappella recordings). Therefore, given a recording of a subset of the instruments present in a recording, we seek to isolate the remaining instruments.

If we have recordings of both all instruments and a subset of instruments, and the recordings are perfectly aligned in time, released on the same media, and mixed identically, separating the remaining instruments is a simple matter of subtracting the time-domain waveforms. However, these conditions rarely hold. It is more often the case that the recordings were released on different media (CD and vinyl, for example), mastered((Mastering is the process of applying equalization, compression, reverb, and other nonlinearities to an audio recording to improve its sound.)) differently, and are not aligned to the same timebase. We therefore propose an algorithm which can cope with these discrepancies and achieve high-quality separation.

===== - Algorithm =====

Denote the time-domain digitized signal of a piece of music as $m$. We assume we can obtain a signal $s$ which represents a recording of the same piece of music but including only a subset of the instruments present in the original recording. We then seek the signal $r$ which represents a recording of the instruments in $m$ which are not included in $s$. If $m$ and $s$ are perfectly aligned in time and have no channel differences, we can retrieve $r$ by computing $r = m - s$. However, this is rarely the case. Our algorithm therefore carries out the following steps (sketched in code after this list):

  * Identify the temporal alignment of $m$ and $s$, and resample $s$ to match $m$'s timebase.
  * Estimate the channel differences present in $s$, and create an equalized version.
  * Generate and enhance an estimate of $r$.
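To make these steps concrete, here is a minimal sketch of the pipeline in Python. It is not the released remixavier code: ''librosa'' is an assumed choice for audio I/O, and ''align'', ''equalize'', and ''wiener_enhance'' are illustrative helpers sketched in the corresponding sections below.

<code python>
import librosa  # assumed dependency for audio I/O (resamples on load)

def remix(mix_path, source_path, sr=44100):
    """Sketch of the pipeline: estimate r, the part of m missing from s."""
    m, _ = librosa.load(mix_path, sr=sr)     # full mix, m
    s, _ = librosa.load(source_path, sr=sr)  # partial recording, s
    # Step 1: put s on m's timebase (offset and clock-drift correction).
    s = align(m, s, sr)
    # Step 2: equalize s by the estimated channel filter h.
    s = equalize(m, s)
    # Step 3: subtract to form the raw estimate r_hat = m - h * s,
    # then optionally enhance it by suppressing residual bleed-through.
    n = min(len(m), len(s))
    r_hat = m[:n] - s[:n]
    return wiener_enhance(r_hat, s[:n])
</code>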
We outline each of these steps in the following sections.

==== - Alignment ====

We use cross-correlation of the unequalized signals to find the temporal alignment. First we calculate a cross-correlation between the entire durations of $m$ and $s$, possibly downsampled (e.g. to as low as 1 kHz) to reduce the total computation. The global peak of this correlation is taken as the average time alignment, but small differences in the sampling rates (clock drift) will lead to changes in the effective time difference throughout the track. Although many digitally-mastered signals may have perfect time alignment, in our experience it is not unusual to see clock rate differences of 0.5% or more where analog processing (such as magnetic tape playback) is involved. For a 200 sec track, a 0.5% time skew will cause the relative timing to drift by a full second over the duration of the track.

To detect such drift, we compute a cross-correlation of short segments of each signal, typically 8 sec segments every 4 sec, with correlation performed out to $\pm 2$ sec. For each segment, we find the peak correlation value and perform a linear fit to the relative timing implied by these peaks' locations across the song. This line represents the relative offset of the two recordings and their "drift", or the extent to which they have been recorded on different timescales. Once we have estimated the offset and drift, we remove samples and resample so that the signals are aligned in time. We find that repeating this operation can further correct residual timing errors, bringing the timebases to within 10 parts per million (or 2 ms drift over a 200 sec track).
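As an illustration, the sketch below estimates and corrects the offset and drift using the parameters above (8 sec segments every 4 sec, lags out to $\pm 2$ sec). It is a simplified stand-in for the released code: it assumes any gross offset has already been removed by the whole-track correlation, and it applies a single pass of correction, which, as noted, would be repeated in practice.

<code python>
import numpy as np
import scipy.signal

def align(m, s, sr, win=8.0, hop=4.0, max_lag=2.0):
    """Estimate offset/drift of s relative to m and correct them once."""
    win_n, hop_n, lag_n = int(win * sr), int(hop * sr), int(max_lag * sr)
    times, lags = [], []
    for start in range(lag_n, min(len(m), len(s)) - win_n - lag_n, hop_n):
        seg_m = m[start:start + win_n]
        # Window of s padded by max_lag on both sides of the segment.
        seg_s = s[start - lag_n:start + win_n + lag_n]
        xc = scipy.signal.correlate(seg_s, seg_m, mode='valid')
        lag = np.argmax(xc) - lag_n  # samples by which s lags m here
        times.append(start / sr)
        lags.append(lag / sr)
    # Linear fit of lag(t) = offset + drift * t across the track.
    drift, offset = np.polyfit(times, lags, 1)
    # Remove the constant offset by trimming (or zero-padding) s ...
    skip = int(round(offset * sr))
    s = s[skip:] if skip >= 0 else np.concatenate([np.zeros(-skip), s])
    # ... and cancel the drift by resampling: s[t * (1 + drift)] ~ m[t].
    return scipy.signal.resample(s, int(round(len(s) / (1.0 + drift))))
</code>

For long tracks, ''scipy.signal.resample_poly'' with a rational approximation to the drift factor is cheaper than the FFT-based ''resample'' used here.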
==== - Channel Differences ====

To estimate the channel differences -- i.e., a difference in the stationary linear filtering between the two tracks -- we assume that there is some filter $h$ such that $$m = h\ast s + r$$ This does not always hold exactly (for example, when nonlinearities have been applied to $m$ and/or $s$), but we find the approximation works well in practice. To estimate $h$, we first compute the magnitude of the short-time Fourier transform (STFT) of each signal, which gives $$M = H \cdot S + R$$ or $$R = M - H \cdot S$$ where $\cdot$ represents element-wise multiplication (the Hadamard product). We then assume that the optimal $H$ is the one for which the frequency-domain representation of $r$ is maximally sparse. This gives rise to the optimization problem $$\min_H |M - H \cdot S|_1$$ Note that each element of $H$ is independent and depends only on a single row of $M$ and $S$. We can therefore instead solve $$H_i = \arg\min_{H_i} |M_i - H_i S_i|_1$$ where $H_i$ is the $i$th element of $H$ (a scalar) and $M_i$ and $S_i$ are the magnitudes of the $i$th frequency bin of each spectrogram across time. This problem can be solved efficiently via a scalar optimization, giving us the magnitude of our filter $H$.
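One convenient way to perform this scalar optimization: since $\sum_t |M_{i,t} - H_i S_{i,t}| = \sum_t S_{i,t} |M_{i,t}/S_{i,t} - H_i|$, the minimizing $H_i$ is a weighted median of the ratios $M_{i,t}/S_{i,t}$ with weights $S_{i,t}$. The sketch below uses this identity; the STFT parameters are arbitrary illustrative choices, and it equalizes the magnitude only, leaving the phase of $s$ untouched.

<code python>
import numpy as np
import librosa

def estimate_channel_filter(M, S):
    """H[i] = argmin_h sum_t |M[i, t] - h * S[i, t]| for each bin i,
    computed as the weighted median of M/S ratios with weights S."""
    H = np.zeros(M.shape[0])
    for i in range(M.shape[0]):
        weights = S[i]
        ratios = M[i] / np.maximum(S[i], 1e-10)
        order = np.argsort(ratios)
        cumw = np.cumsum(weights[order])
        # First ratio whose cumulative weight reaches half the total.
        H[i] = ratios[order][np.searchsorted(cumw, 0.5 * cumw[-1])]
    return H

def equalize(m, s, n_fft=2048, hop=512):
    """Equalize s by the estimated per-bin magnitude response H."""
    M = np.abs(librosa.stft(m, n_fft=n_fft, hop_length=hop))
    S = librosa.stft(s, n_fft=n_fft, hop_length=hop)
    T = min(M.shape[1], S.shape[1])  # guard against length mismatch
    H = estimate_channel_filter(M[:, :T], np.abs(S[:, :T]))
    return librosa.istft(H[:, None] * S, hop_length=hop, length=len(s))
</code>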
==== - Enhancement ====

Once we have estimated the channel filter $H$, we can apply it to the short-time Fourier transform of $s$ to normalize the channel distortion. We can then compute $$\hat{r} = m - h\ast s$$ which gives us an approximation $\hat{r}$ to $r$. If we have successfully aligned the signals and estimated the channel distortion, $\hat{r}$ will be a good perceptual estimate of $r$. However, we can attempt to further enhance our estimate $\hat{r}$ using Wiener filtering. This process examines the log-magnitude short-time Fourier transforms of $\hat{r}$ and $s$ (denoted $L[\hat{R}]$ and $L[S]$ respectively) and retains only the energy in those bins of $L[\hat{R}]$ which are substantially larger than the corresponding bins of $L[S]$.
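A hard time-frequency mask along these lines is sketched below; the 6 dB threshold and the STFT parameters are illustrative assumptions, not tuned values from our implementation.

<code python>
import numpy as np
import librosa

def wiener_enhance(r_hat, s_eq, thresh_db=6.0, n_fft=2048, hop=512):
    """Keep only the bins of r_hat's STFT whose log-magnitude exceeds
    the (equalized) source's by thresh_db; zero everything else."""
    R = librosa.stft(r_hat, n_fft=n_fft, hop_length=hop)
    S = librosa.stft(s_eq, n_fft=n_fft, hop_length=hop)
    T = min(R.shape[1], S.shape[1])
    R, S = R[:, :T], S[:, :T]
    eps = 1e-10
    # Difference of log-magnitudes, expressed in dB.
    diff_db = 20.0 * (np.log10(np.abs(R) + eps) - np.log10(np.abs(S) + eps))
    mask = diff_db > thresh_db  # bins where r_hat dominates
    return librosa.istft(R * mask, hop_length=hop, length=len(r_hat))
</code>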
===== - Results =====

We find that we achieve good perceptual separation when the channel distortion is small, and that Wiener filtering does not help in this case. When the channel distortion is large (e.g., different masterings on different media), the Wiener filter can improve the perceptual quality dramatically.