====== Deep Unlearning ======

MIR techniques rely on accurate representations of acoustic content to produce high-quality results. Over the past few decades, most research has operated on hand-crafted features, which work well up to a point but may discard important information from the representation, thereby degrading performance. In recent years, deep neural networks have emerged as an effective approach to automatically learning representations of complex signals such as images, video, and speech. Most recently, supervised (discriminative) training of deep networks has been demonstrated to outperform comparable unsupervised methods (such as restricted Boltzmann machines). However, supervised training of features requires a large pool of accurately labeled data. While such data is relatively easy to come by for images, it can be problematic for music.

We propose to use artist recognition as a supervised proxy task for training deep representations of musical content. Two key factors motivate this idea:

  - Even if meta-data or tags are unavailable for a particular track, an artist identifier is almost always available; this makes it easier to obtain a large-scale training set for discriminative feature learning.
  - If we can build features which accurately characterize the acoustic signature of an artist, those features may well generalize to other tasks, such as semantic annotation or instrument recognition.

====== Implementation ======

Our implementation is written in Python, using the LibROSA library for low-level audio analysis and Theano for feature learning. The model architecture is based upon the ''convolutional_mlp.py'' example from the Deep Learning Tutorial (linked below), with the following modifications (see the sketches after this section):

  - The input layer operates on a short fragment of audio (~0.5s) represented as a $64\times 40$-dimensional Mel power spectrum.
  - Layer 1 consists of a bank of 2-dimensional convolutional filters. Each filter is convolved with the input layer, and the resulting filter responses are downsampled by spatial max-pooling.
  - Layer 2 consists of a linear transformation of the pooled filter responses, followed by a bank of rectified linear units.
  - Layer 3 is the output layer, implemented as a logistic regression classifier which predicts which of $k$ known artists generated the input patch.

The model is trained by mini-batch stochastic gradient descent with a learning rate of $0.05$ and batches of 80 randomly selected input patches. The objective function is the cross-entropy of the output layer against the true label, combined with $\ell_2$-regularization of the filter weights and output layer parameters.
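As a rough illustration of the input representation, here is a minimal sketch of extracting Mel power-spectrum patches with LibROSA. The sample rate, hop length, log-scaling, and the patch orientation (64 Mel bands by 40 frames) are our assumptions rather than details specified above, and the sketch assumes a recent LibROSA API (''librosa.feature.melspectrogram'', ''librosa.power_to_db'').

<code python>
# Minimal sketch of the input feature extraction; all parameters are assumptions.
import numpy as np
import librosa

def mel_patches(path, n_mels=64, patch_frames=40, sr=22050, hop_length=256):
    """Slice a track into (n_mels x patch_frames) Mel power-spectrum patches.

    With sr=22050 and hop_length=256, 40 frames cover roughly 0.46s of audio,
    close to the ~0.5s fragments described above.
    """
    y, sr = librosa.load(path, sr=sr)
    # Mel power spectrogram, shape (n_mels, n_frames).
    S = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length,
                                       n_mels=n_mels)
    # Log-scaling is a common normalization choice, not something specified here.
    S = librosa.power_to_db(S, ref=np.max)
    # Chop the spectrogram into non-overlapping 64x40 patches.
    n_patches = S.shape[1] // patch_frames
    return np.stack([S[:, i * patch_frames:(i + 1) * patch_frames]
                     for i in range(n_patches)])
</code>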
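Likewise, here is a minimal Theano sketch of the three layers and the regularized cross-entropy objective. The filter count and size, pooling shape, hidden width, number of artists, and the $\ell_2$ weight are placeholders invented for illustration; the actual settings live in the source code linked below.

<code python>
# Sketch of the conv -> pool -> ReLU -> softmax model; shapes are placeholders.
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
from theano.tensor.signal.pool import pool_2d

rng = np.random.RandomState(0)

def shared(shape):
    """Randomly initialized shared parameter tensor."""
    return theano.shared(rng.normal(scale=0.01, size=shape)
                            .astype(theano.config.floatX))

n_artists = 100      # k known artists (placeholder)
lr, l2 = 0.05, 1e-4  # learning rate from the text; the l2 weight is a guess

X = T.tensor4('X')   # (batch, 1, 64, 40) Mel patches
y = T.ivector('y')   # integer artist labels

# Layer 1: 2-d convolutional filter bank, then spatial max-pooling.
W1 = shared((32, 1, 5, 5))  # 32 filters of 5x5 (placeholder shape)
h1 = pool_2d(conv2d(X, W1), (2, 2), ignore_border=True)
# Valid convolution gives 60x36 maps; 2x2 pooling leaves (batch, 32, 30, 18).

# Layer 2: linear transform of the pooled responses + rectified linear units.
W2, b2 = shared((32 * 30 * 18, 256)), shared((256,))
h2 = T.maximum(0., T.dot(h1.flatten(2), W2) + b2)

# Layer 3: logistic regression over the k artists.
W3, b3 = shared((256, n_artists)), shared((n_artists,))
p = T.nnet.softmax(T.dot(h2, W3) + b3)

# Cross-entropy against the true labels, plus l2 regularization of the
# filter weights and the output-layer weights.
xent = -T.mean(T.log(p)[T.arange(y.shape[0]), y])
cost = xent + l2 * ((W1 ** 2).sum() + (W3 ** 2).sum())

# One SGD step; gradients are taken with respect to all parameters.
params = [W1, W2, b2, W3, b3]
updates = [(w, w - lr * g) for w, g in zip(params, T.grad(cost, params))]
train_step = theano.function([X, y], cost, updates=updates)
</code>

A training loop would then repeatedly call ''train_step'' on batches of 80 randomly selected patches, as described above.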

====== Our stuff ======

  * [[https://github.com/bmcfee/deep-artists|Source code]]
  * [[https://github.com/bmcfee/deep-artists/wiki/Model-architecture|Model architecture]]

====== Resources ======

  * [[http://deeplearning.net/tutorial/contents.html|Deep learning tutorial]]
  * [[https://github.com/Theano/Theano|Theano]]
  * [[https://github.com/bmcfee/librosa|LibROSA]]

====== Authors ======

  * Brian McFee
  * Nicola Montecchio