====== Deep Unlearning ======

Music information retrieval (MIR) techniques rely upon accurate representations of acoustic content in order to produce high-quality results. Over the past few decades, most research has operated on hand-crafted features, which work well up to a point but may discard important information from the representation, thereby degrading performance.

In recent years, deep neural networks have emerged as an effective approach to automatically learning representations of complex signals such as images, video, and speech. Most recently, supervised (discriminative) training of deep networks has been demonstrated to outperform comparable unsupervised methods (such as restricted Boltzmann machines).

However, supervised training of features requires a large pool of accurately labeled data. While such data is relatively easy to come by for images, it can be problematic to obtain for music.

We propose to use artist recognition as a supervised proxy task for training deep representations of musical content. There are two key motivating factors for this idea:
  - Even if meta-data/tags are unavailable for a particular track, an artist identifier is almost always available; thus it becomes easier to obtain a large-scale training set for discriminative feature learning.
  - If we build features which can accurately characterize the acoustic signature of an artist, those features may well generalize to other tasks, such as semantic annotation or instrument recognition.

====== Implementation ======

Our implementation is written in Python, using the LibROSA library for low-level audio analysis, and Theano for feature learning.
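
The feature extraction step might look roughly like the following. This is a minimal sketch rather than the project's actual code: the file name, sample rate, FFT size, and hop length are illustrative assumptions, chosen so that 40 frames span roughly half a second.

<code python>
import librosa

# Load ~0.5 s of audio (file name and sample rate are assumptions).
y, sr = librosa.load('track.mp3', sr=22050, duration=0.5)

# 64-band Mel power spectrogram; a hop length of 256 yields ~44 frames
# per half second, which can then be trimmed to 64x40 patches.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                   hop_length=256, n_mels=64)
print(S.shape)  # (64, ~44)
</code>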

The model architecture is based upon the ''convolutional_mlp.py'' example from the DeepLearningTutorial, with the following modifications (a sketch follows the list):
  - The input layer operates on a short fragment of audio (~0.5 s) represented as a $64\times 40$ Mel power spectrogram (64 Mel bands by 40 frames).
  - Layer 1 consists of a bank of 2-dimensional convolutional filters. Each filter is convolved with the input layer, and the resulting filter responses are downsampled by spatial max-pooling.
  - Layer 2 consists of a linear transformation of the pooled filter responses, followed by a bank of rectified linear units.
  - Layer 3 is the output layer, which is implemented as a logistic regression classifier to predict which of $k$ known artists generated the input patch.
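
A hypothetical Theano rendering of this architecture is sketched below. The filter count, filter shape, pooling size, and hidden width are illustrative assumptions, not the values used in the project, and the code assumes a recent Theano API.

<code python>
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
from theano.tensor.signal.pool import pool_2d

rng = np.random.RandomState(0)
floatX = theano.config.floatX
n_filters, n_hidden, k = 16, 256, 10  # assumed sizes (k = number of artists)

# Input: a minibatch of 64x40 Mel patches, shaped (batch, 1, 64, 40).
X = T.tensor4('X')

# Layer 1: 2-d convolutional filter bank + spatial max-pooling.
W1 = theano.shared(rng.normal(scale=0.01,
                              size=(n_filters, 1, 5, 5)).astype(floatX))
conv_out = conv2d(X, W1)                                   # -> (batch, 16, 60, 36)
pooled = pool_2d(conv_out, ws=(2, 2), ignore_border=True)  # -> (batch, 16, 30, 18)

# Layer 2: linear transform of the pooled responses + rectified linear units.
n_pool = n_filters * 30 * 18
W2 = theano.shared(rng.normal(scale=0.01, size=(n_pool, n_hidden)).astype(floatX))
b2 = theano.shared(np.zeros(n_hidden, dtype=floatX))
hidden = T.maximum(T.dot(pooled.flatten(2), W2) + b2, 0)   # ReLU

# Layer 3: logistic regression over the k known artists.
W3 = theano.shared(np.zeros((n_hidden, k), dtype=floatX))
b3 = theano.shared(np.zeros(k, dtype=floatX))
p_y_given_x = T.nnet.softmax(T.dot(hidden, W3) + b3)

predict = theano.function([X], T.argmax(p_y_given_x, axis=1))
</code>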

The model is trained by stochastic gradient descent using a learning rate of $0.05$ and minibatches of 80 randomly selected input patches. The objective function is the cross-entropy of the output layer against the true label, combined with $\ell_2$-regularization of the filter weights and output layer parameters.
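
The objective and update rule might be expressed as follows. For brevity this standalone sketch applies them to a single softmax layer over pre-computed features; in the full model the same cost and updates would cover all three layers. The $\ell_2$ weight and feature dimension are assumptions.

<code python>
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
floatX = theano.config.floatX
n_features, k = 8640, 10  # assumed sizes
l2_weight = 1e-4          # assumed regularization strength

X = T.matrix('X')
y = T.ivector('y')        # integer artist labels

W = theano.shared(np.zeros((n_features, k), dtype=floatX))
b = theano.shared(np.zeros(k, dtype=floatX))
p_y_given_x = T.nnet.softmax(T.dot(X, W) + b)

# Cross-entropy against the true labels, plus an l2 penalty on the weights.
cost = (T.mean(T.nnet.categorical_crossentropy(p_y_given_x, y))
        + l2_weight * T.sum(W ** 2))

# Plain SGD updates with learning rate 0.05.
params = [W, b]
updates = [(p, p - 0.05 * g) for p, g in zip(params, T.grad(cost, params))]
train_step = theano.function([X, y], cost, updates=updates)

# One step on a random minibatch of 80 patches (dummy data).
Xb = rng.rand(80, n_features).astype(floatX)
yb = rng.randint(0, k, size=80).astype('int32')
print(train_step(Xb, yb))
</code>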

====== Our stuff ======

  * [[https://github.com/Theano/Theano|Theano]]
  * [[https://github.com/bmcfee/librosa|LibROSA]]

====== Authors ======
  * Brian McFee
  * Nicola Montecchio