====== remixavier ======

| Affiliation | Columbia University |
| Code | [[https://github.com/craffel/remixavier|Github Link]] |
| Matlab | [[http://labrosa.ee.columbia.edu/~dpwe/resources/matlab/remixavier/|Output of 'Publish']] |
  
We propose a technique for removing certain sources from an audio mixture when a separate recording of those sources is available.

Denote the time-domain digitized signal of a piece of music as $m$. We assume we can obtain a signal $s$ which represents a recording of the same piece of music but only including a subset of the instruments included in the original recording. We then seek the signal $r$ which represents a recording of the instruments in $m$ which are not included in $s$.
  
If $m$ and $s$ are perfectly aligned in time and have no channel differences, we can retrieve $r$ by computing $r = m - s$ (see the sketch following the list below). However, this is rarely the case. Our algorithm therefore carries out the following steps:
  
  * Identifying the temporal alignment of $m$ and $s$, and resampling $s$ to match $m$'s timebase.
  * Estimating the channel differences present in $s$, and creating an equalized version.
  * Generating and enhancing an estimate of $r$.
  
We outline each of these steps in the following sections.
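
Before detailing the steps, the ideal case mentioned above can be illustrated with synthetic signals. This is a minimal Python/numpy sketch, purely illustrative and not part of the remixavier code:

<code python>
import numpy as np

fs = 22050
t = np.arange(2 * fs) / fs                  # two seconds of audio
s = 0.5 * np.sin(2 * np.pi * 440 * t)       # the separately available recording
r_true = 0.3 * np.sin(2 * np.pi * 660 * t)  # the part unique to the mixture
m = s + r_true                              # the full mix

# With perfect time alignment and identical channels,
# plain subtraction recovers r exactly.
r = m - s
assert np.allclose(r, r_true)
</code>

The alignment and equalization steps below are what make this subtraction workable on real recordings.
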
==== - Alignment ====
  
We use cross-correlation of the unequalized signals to find the temporal alignment. First we calculate a cross-correlation between the entire durations of $m$ and $s$, possibly downsampled (e.g. to as low as 1 kHz) to reduce the total computation. The global peak of this correlation is taken as the average time alignment, but small differences in the sampling rates (clock drift) will lead to changes in the effective time difference throughout the track. Although many digitally-mastered signals may have perfect time alignment, in our experience it is not unusual to see clock rate differences of 0.5% or more where analog processing (such as magnetic tape playback) is involved. For a 200 sec track, a 0.5% time skew will cause the relative timing to drift by a full second over the duration of the track.
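
A sketch of this global step in Python with numpy/scipy (the function name and defaults are our own illustration, not the remixavier implementation):

<code python>
import numpy as np
import scipy.signal

def coarse_offset(m, s, fs, fs_lo=1000):
    """Average time offset between m and s, in seconds, from a
    full-length cross-correlation of downsampled copies."""
    # Downsample (here to 1 kHz) so correlating the entire durations is cheap;
    # resample_poly expects integer up/down factors.
    m_lo = scipy.signal.resample_poly(m, fs_lo, fs)
    s_lo = scipy.signal.resample_poly(s, fs_lo, fs)
    xc = scipy.signal.correlate(m_lo, s_lo, mode='full')
    lags = scipy.signal.correlation_lags(len(m_lo), len(s_lo), mode='full')
    # The global peak of the correlation gives the average alignment.
    return lags[np.argmax(xc)] / fs_lo
</code>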
  
To detect such drift, we compute a cross-correlation of short segments of each signal, typically 8 sec segments every 4 sec, with correlation performed out to $\pm 2$ sec. For each segment, we find the peak correlation value and perform a linear fit to the relative timing implied by these peaks' locations across the song. This line represents the relative offset of the two recordings and their "drift", or the extent to which they have been recorded on different timescales. Once we have estimated the offset and drift, we remove samples and resample so that the signals are aligned in time. We find that repeating this operation can further correct residual timing errors, bringing the timebases to within 10 parts per million (or 2 ms drift over a 200 sec track).
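
The segment-wise correlation and linear fit might look like the following sketch, using the 8 sec / 4 sec / $\pm 2$ sec parameters from the text (naming and structure are our own, not the published code):

<code python>
import numpy as np
import scipy.signal

def estimate_offset_and_drift(m, s, fs, seg_dur=8.0, hop_dur=4.0, max_lag_dur=2.0):
    """Fit lag(t) ~ offset + drift * t from short-segment cross-correlations."""
    seg, hop, max_lag = int(seg_dur * fs), int(hop_dur * fs), int(max_lag_dur * fs)
    times, lags = [], []
    for start in range(max_lag, min(len(m), len(s)) - seg - max_lag, hop):
        m_seg = m[start:start + seg]
        # Search s within +/- max_lag samples of the same position.
        s_win = s[start - max_lag:start + seg + max_lag]
        xc = scipy.signal.correlate(s_win, m_seg, mode='valid')
        lags.append((np.argmax(xc) - max_lag) / fs)  # peak -> relative lag (sec)
        times.append((start + seg / 2) / fs)         # segment centre time (sec)
    drift, offset = np.polyfit(times, lags, 1)       # slope, intercept
    return offset, drift
</code>

Given these estimates, $s$ can be shifted by the offset and resampled by a factor close to $1/(1+\mathrm{drift})$ (e.g. via a rational approximation passed to scipy.signal.resample_poly), and, as noted above, repeating the whole procedure refines the result.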
  
==== - Channel Differences ====

To estimate the channel differences -- i.e., a difference in the stationary linear filtering between the two tracks -- we assume that there is some filter $h$ such that
  
$$m = h\ast s + r$$
  
Note that this does not always hold exactly (for example, when nonlinearities have been applied to $m$ and/or $s$). However, we find the approximation works well in practice. To estimate $h$, we first compute the magnitude of the short-time Fourier transform (STFT) of each signal, which gives
  
$$M = H \cdot S + R$$
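
The page stops at this model. One plausible estimator consistent with it -- shown purely as an illustration, not necessarily the authors' procedure -- is a per-frequency least-squares fit of $M$ against $S$ that treats $R$ as noise:

<code python>
import numpy as np
import scipy.signal

def estimate_channel_mag(m, s, fs, n_fft=2048):
    """Illustrative per-frequency estimate of |H| from magnitude STFTs,
    treating the residual R as noise: |H[k]| = sum_t M S / sum_t S^2."""
    _, _, M = scipy.signal.stft(m, fs, nperseg=n_fft)
    _, _, S = scipy.signal.stft(s, fs, nperseg=n_fft)
    n = min(M.shape[1], S.shape[1])            # use the overlapping frames
    Mmag, Smag = np.abs(M[:, :n]), np.abs(S[:, :n])
    return np.sum(Mmag * Smag, axis=1) / (np.sum(Smag ** 2, axis=1) + 1e-12)
</code>

Scaling each frame of $S$ by this estimate gives an equalized version of $s$ whose subtraction from $m$ leaves an estimate of $r$; the Github and Matlab links above contain the authors' actual implementation.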