====== Remixavier ======

| Affiliation | Columbia University |
| Code | [[https://github.com/craffel/remixavier|Github Link]] |
| Matlab | [[http://labrosa.ee.columbia.edu/~dpwe/resources/matlab/remixavier/|Output of 'Publish']] |
  
We propose a technique for removing certain sources from an audio mixture when a separate recording of those sources is available.
If we have recordings both of all of the instruments and of a subset of the instruments, and the recordings are perfectly aligned in time, released on the same media, and mixed identically, then separating the remaining instruments is a simple matter of subtracting the time-domain waveforms.  However, these conditions rarely hold.  It is more often the case that the recordings were released on different media (CD and vinyl, for example), mastered((Mastering is the process of applying equalization, compression, reverb, and other nonlinearities to an audio recording to improve its sound.)) differently, and are not aligned to the same timebase.  As a result, we propose an algorithm which can cope with these discrepancies and achieve high-quality separation.
  
===== - Algorithm =====
  
Denote the time-domain digitized signal of a piece of music as $m$.  We assume we can obtain a signal $s$ which represents a recording of the same piece of music but only including a subset of the instruments included in the original recording.  We then seek the signal $r$ which represents a recording of the instruments in $m$ which are not included in $s$.
  
If $m$ and $s$ are perfectly aligned in time and have no channel differences, we can retrieve $r$ by computing $r = m - s$.  However, this is rarely the case.  Our algorithm therefore carries out the following steps:
  
  * Identifying the temporal alignment of $m$ and $s$, and resampling $s$ to match $m$'s timebase.
  * Estimating the channel differences present in $s$, and creating an equalized version.
  * Generating and enhancing an estimate of $r$.
  
We outline each of these steps in the following sections.
  
==== - Alignment ====
  
We use cross-correlation of the unequalized signals to find the temporal alignment.  First we calculate a cross-correlation between the entire durations of $m$ and $s$, possibly downsampled (e.g. to as low as 1 kHz) to reduce the total computation.  The global peak of this correlation is taken as the average time alignment, but small differences in the sampling rates (clock drift) will lead to changes in the effective time difference throughout the track.  Although many digitally-mastered signals may have perfect time alignment, in our experience it is not unusual to see clock rate differences of 0.5% or more where analog processing (such as magnetic tape playback) is involved.  For a 200 sec track, a 0.5% time skew will cause the relative timing to drift by a full second over the duration of the track.
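
As a concrete illustration, here is a minimal sketch of this global alignment step in Python, assuming mono ''numpy'' arrays sampled at a common integer rate ''fs''; the function name and defaults are ours, not taken from the released code:

<code python>
import numpy as np
import scipy.signal

def global_offset(m, s, fs, fs_ds=1000):
    """Estimate the average lag of s relative to m, in seconds."""
    # Downsample both signals (e.g. to 1 kHz) to reduce the total computation
    m_ds = scipy.signal.resample_poly(m, fs_ds, fs)
    s_ds = scipy.signal.resample_poly(s, fs_ds, fs)
    # FFT-based cross-correlation over the entire durations
    xc = scipy.signal.fftconvolve(s_ds, m_ds[::-1], mode='full')
    # The global peak of the correlation is the average time alignment
    lag = np.argmax(np.abs(xc)) - (len(m_ds) - 1)
    return lag / fs_ds
</code>

With this sign convention, a positive return value means the corresponding content occurs later in $s$ than in $m$.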
  
-=== Channel Distortion ===+To detect such drift, we compute a cross-correlation of short segments of each signal, typically 8 sec segments every 4 sec, with correlation performed out to $\pm 2$ sec.  For each segment, we find the peak correlation value and perform a linear fit to the relative timing implied by these peaks' locations across the song.  This line represents the relative offset of the two recordings and their "​drift",​ or the extent to which they have been recorded on different timescales. ​ Once we have estimated the offset and drift, we remove samples and resample so that the signals are aligned in time.  We find that repeating this operation can further correct residual timing errors, bringing the timebases to within 10 parts per million (or 2 ms drift over a 200 sec track).
  
-To estimate the channel ​distortion, we assume that there is some filter $r$ such that+==== - Channel Differences ==== 
 + 
 +To estimate the channel ​differences -- i.e.a difference in the stationary linear filtering between the two tracks -- we assume that there is some filter $r$ such that
  
$$m = h\ast s + r$$
  
Note that this does not always hold exactly (for example, when nonlinearities have been applied to $m$ and/or $s$).  However, we find this approximation works well in practice.  To estimate $h$, we first compute the magnitude of the short-time Fourier transform (STFT) of each signal, which gives
  
$$M = H \cdot S + R$$
Assuming that the contribution of $R$ is small, we can then estimate $H$ one frequency bin at a time by solving

$$\min_{H_i} \left\| M_i - H_i S_i \right\|^2$$
where $H_i$ is the $i$th element of $H$ (a scalar) and $M_i$ and $S_i$ are the magnitudes of the $i$th frequency bin of each spectrogram across time.  This problem can be efficiently solved via a scalar optimization, giving us the magnitude of our filter $H$.
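
Since each bin is a scalar least-squares problem, the solution has a simple closed form; a minimal sketch under the same assumptions (the FFT length is an arbitrary choice of ours, not a value from this page):

<code python>
import numpy as np
import scipy.signal

def estimate_filter_mag(m, s, fs, n_fft=2048):
    """Per-bin least-squares estimate of the filter magnitude |H|."""
    _, _, M = scipy.signal.stft(m, fs, nperseg=n_fft)
    _, _, S = scipy.signal.stft(s, fs, nperseg=n_fft)
    n = min(M.shape[1], S.shape[1])           # match the frame counts
    M, S = np.abs(M[:, :n]), np.abs(S[:, :n])
    # argmin_Hi ||M_i - Hi S_i||^2 has the closed form <M_i,S_i> / <S_i,S_i>
    return np.sum(M * S, axis=1) / (np.sum(S * S, axis=1) + 1e-12)
</code>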
  
==== - Enhancement ====
  
Once we have estimated the channel filter $H$, we can apply it to the short-time Fourier transform of $s$ to normalize the channel differences.  We can then compute
$$\hat{R} = M - H \cdot S$$
which, after inverting the STFT, gives us an approximation to $r$, denoted $\hat{r}$.  If we have successfully aligned the signals and estimated the channel differences, $\hat{r}$ will be a good perceptual estimate of $r$.  However, we can attempt to further enhance our estimate $\hat{r}$ using Wiener filtering.  This process examines the log-magnitude short-time Fourier transforms of $\hat{r}$ and $s$ (denoted $L[\hat{R}]$ and $L[S]$ respectively) and only retains the energy in those bins of $L[\hat{R}]$ which are substantially larger than those in $L[S]$.
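
A minimal sketch of the subtraction and masking, using a hard log-magnitude threshold as a stand-in for a full Wiener gain (the 6 dB threshold and the function name are our assumptions, not values from this page):

<code python>
import numpy as np
import scipy.signal

def subtract_and_mask(m, s, H, fs, n_fft=2048, threshold_db=6.0):
    """Equalized subtraction followed by a hard log-magnitude mask."""
    _, _, M = scipy.signal.stft(m, fs, nperseg=n_fft)
    _, _, S = scipy.signal.stft(s, fs, nperseg=n_fft)
    n = min(M.shape[1], S.shape[1])
    M, S = M[:, :n], S[:, :n]
    S_eq = H[:, np.newaxis] * S    # channel-normalized version of S
    R_hat = M - S_eq               # spectral-domain estimate of r
    # Keep only bins where the residual is substantially louder than
    # the (equalized) accompaniment
    level_db = 20 * np.log10((np.abs(R_hat) + 1e-12) /
                             (np.abs(S_eq) + 1e-12))
    R_hat = R_hat * (level_db > threshold_db)
    _, r_hat = scipy.signal.istft(R_hat, fs, nperseg=n_fft)
    return r_hat
</code>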
  
===== - Results =====
  
We find that we achieve good perceptual separation when the channel differences are small, and that Wiener filtering does not help in this case.  When the channel differences are large (e.g., different masterings on different media), the Wiener filter can improve the perceptual quality dramatically.