(If you didn't go through the steps in part 1, you can download this file: ftrslabs.mat, then load it into Matlab with load ftrslabs, to set up the ftrs and labs variables).
Now we have our training set nicely prepared in terms of feature vectors (the rows of ftrs) and their corresponding labels as singing or not (the values of labs, one for each row of ftrs, where 1 indicates singing). We're ready to build a classifier!
If we knew the probability distribution functions (PDFs) of the signal features under the two possible conditions, singing or no singing, we could classify a new test feature vector according to the likelihood ratio - the ratio of the values of the PDFs for the two classes at that point. In order to do this, we are going to estimate PDFs from the individual samples present in our training data. To get an idea of what we're trying to do, we'll first look at the data in just 2 dimensions (so we can plot it), but in practice we can use all 13 cepstra, or all 39 dimensions including the deltas and double-deltas.
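To make the decision rule concrete before we estimate anything from data, here is a toy sketch with two made-up one-dimensional Gaussians (the numbers are purely illustrative, not the lab's models):
% Toy illustration of the likelihood-ratio rule (illustrative numbers only)
mu1 = 0; v1 = 1;     % hypothetical "singing" class mean and variance
mu0 = 2; v0 = 1.5;   % hypothetical "no singing" class mean and variance
x = 1.2;             % a new test value to classify
p1 = exp(-(x-mu1)^2/(2*v1)) / sqrt(2*pi*v1);
p0 = exp(-(x-mu0)^2/(2*v0)) / sqrt(2*pi*v0);
% With equal priors, classify as "singing" when the likelihood ratio p1/p0
% exceeds 1 (equivalently, when its log exceeds 0):
isSinging = (p1/p0) > 1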
We will be using the Netlab functions for this part, so you will need to make sure that it is downloaded and installed on your machine, and that Matlab knows where to find it. You'll need to type something like this (substituting in the appropriate local path):
path('/homes/dpwe/matlab/netlab', path);
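If you are unsure whether the path took effect, you can ask Matlab where it is finding the Netlab routines we will use below; they should resolve to files inside your netlab directory:
% Check that Netlab is visible on the path
which gmm
which gmmem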
Now let's compare the distribution of the first two cepstra for all the frames labeled as sung versus all the others:
% Sort training data into sung and unsung frames
ddS = ftrs(labs==1,:);
ddM = ftrs(labs==0,:);
% Compare scatter plots of 1st 2 dimensions
% First with sung frames (red dots) in front:
subplot(221)
plot(ddM(:,1),ddM(:,2),'.b',ddS(:,1),ddS(:,2),'.r')
% then with unsung frames (blue dots) in front:
subplot(222)
plot(ddS(:,1),ddS(:,2),'.r',ddM(:,1),ddM(:,2),'.b')
% Heavily overlapped, but some difference...
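The scatter plots suggest the classes overlap heavily. If you want a quick numerical summary of the difference (this is optional), you can compare the per-class means and standard deviations of the first two dimensions:
% Per-class means (top row: sung, bottom row: unsung) of dims 1 and 2
[mean(ddS(:,1:2)); mean(ddM(:,1:2))]
% ... and per-class standard deviations
[std(ddS(:,1:2)); std(ddM(:,1:2))]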
Now we estimate a continuous PDF over these two datasets by fitting a weighted set of 2-dimensional Gaussians - a Gaussian Mixture Model or GMM. Using the standard Netlab procedure, we first initialize the Gaussian means using k-means clustering, then refine the models using Expectation-Maximization. Netlab makes this all pretty straightforward:
% Define how many dimensions we are going to use, and how many Gaussian components in our mixture
ndim = 2;
nmix = 5;
% Initialize diagonal-covariance models for both PDFs to be estimated
gmS = gmm(ndim,nmix,'diag'); % Singing GMM
gmM = gmm(ndim,nmix,'diag'); % Music GMM
options = foptions; % default optimization options
options(14) = 5; % 5 iterations of k-means in initialization
gmS = gmminit(gmS, ddS(:,1:2), options); % Initialize from sung data
Warning: Maximum number of iterations has been exceeded
gmM = gmminit(gmM, ddM(:,1:2), options); % Initialize from unsung data
Warning: Maximum number of iterations has been exceeded
% Now full EM re-estimation:
options = zeros(1, 18);
options(14) = 20; % Number of iterations
gmS = gmmem(gmS, ddS(:,1:2), options);
Warning: Maximum number of iterations has been exceeded
gmM = gmmem(gmM, ddM(:,1:2), options);
Warning: Maximum number of iterations has been exceeded
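The warnings simply report that we hit the iteration caps we set, which is expected here. If you want to peek inside the fitted models, the Netlab gmm structure stores the mixture weights and component means in named fields (and setting options(1) = 1 before calling gmmem will print the log-likelihood at each iteration, which is handy for checking convergence):
% Inspect the fitted singing model: mixture weights and component means
gmS.priors
gmS.centres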
To see what we have got, we need to sample the PDFs defined by the GMMs across a regular grid of values, then plot the result in 2D or 3D:
% Define sample points across the range in question
xx = linspace(-28,-8);
yy = linspace(-5,5);
[x,y] = meshgrid(xx,yy);
% x and y are 100x100 arrays defining the grid sample points.
% We'll 'unravel' them to 10,000 point vectors, evaluate the probability at each point,
% then re-ravel the result to 100x100:
ppS = gmmprob(gmS, [x(:),y(:)]);
ppS = reshape(ppS, 100, 100);
subplot(223)
imagesc(xx,yy,ppS)
axis xy
% Repeat for the unsung (music) model:
ppM = gmmprob(gmM, [x(:),y(:)]);
ppM = reshape(ppM, 100, 100);
subplot(224)
imagesc(xx,yy,ppM)
axis xy
% You can see the principal Gaussian peaks, as well as more differences between the two PDFs
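If you like, you can also overlay the fitted component means (stored in the models' centres fields) on these density images to see where each Gaussian sits:
% Overlay the component means (white crosses) on the density images
subplot(223); hold on; plot(gmS.centres(:,1), gmS.centres(:,2), 'w+'); hold off
subplot(224); hold on; plot(gmM.centres(:,1), gmM.centres(:,2), 'w+'); hold off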
Assuming equal prior likelihoods for each class, the decision rule is that points in feature space for which one PDF is larger are classified as belonging to that class. Thus, we can visualize the decision rule by plotting the two surfaces on top of one another in 3D:
% 3D plot in full window
subplot(111)
surf(xx,yy,ppM, 0*ppM)
% 4th argument to surf sets color for each point - make them all 0
% Plot sung-feature PDF GMM on same display, with colorvalue = 1
hold on
surf(xx,yy,ppS, 1+0*ppS)
% Where red surface pokes through, classification will be as sung frames
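Another way to visualize the same rule (again assuming equal priors) is as the zero contour of the log-likelihood ratio over the grid we just sampled; here is a sketch in a fresh figure, with the training points overlaid:
% Decision boundary as the zero contour of the log-likelihood ratio
figure
contour(xx, yy, log(ppS./ppM), [0 0], 'k')
hold on
plot(ddM(:,1), ddM(:,2), '.b', ddS(:,1), ddS(:,2), '.r')
hold off
% Points on the side where log(ppS/ppM) > 0 would be classified as sung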
Now we need to see how these PDF estimates fare as classifiers. One quick thing we can do is apply the classifier to the training data itself. This will give an optimistic result, since in general new test data won't look exactly like the training data, but it at least lets us confirm that the classifier is doing roughly what we expect. We calculate the log of the ratio of the likelihoods under each model for all the training frames, then see how many of the frames labeled "singing" have log-ratios greater than zero, and vice versa:
% Figure likelihoods of all training data under each model
lS = gmmprob(gmS, ftrs(:,[1 2]));
lM = gmmprob(gmM, ftrs(:,[1 2]));
% Hence, log-likelihood ratio
llrSM = log(lS./lM);
% If we call LLR > 0 "sung", how many do we get right?
mean( (llrSM > 0) == labs)
ans = 0.6676
% 66.7% - well, it's better than guessing (~50%)
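It is also worth splitting that figure by class, since the two kinds of error (missed singing versus false alarms on music-only frames) need not be balanced:
% Break the training accuracy down by class
pred = (llrSM > 0);
accSung = mean( pred(labs==1) == 1 )    % fraction of sung frames correct
accMusic = mean( pred(labs==0) == 0 )   % fraction of unsung frames correct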
Finally, to simplify matters, we can put all these stages into a function, traingmms, that takes the training features (restricted to whichever dimensions we want to use), their labels, and the number of Gaussian mixture components as inputs, then trains the two models and reports their accuracy on the training data:
% Redo 5 mixtures on 1st two dimensions
[M0,M1] = traingmms(ftrs(:,[1 2]), labs, 5);
Warning: Maximum number of iterations has been exceeded
Warning: Maximum number of iterations has been exceeded
Warning: Maximum number of iterations has been exceeded
Warning: Maximum number of iterations has been exceeded
Accuracy on training data = 66.1%
Elapsed time = 36.7439 secs
% How about 10 mixtures?
[M0,M1] = traingmms(ftrs(:,[1 2]), labs, 10);
Warning: Maximum number of iterations has been exceeded
Warning: Maximum number of iterations has been exceeded
Warning: Maximum number of iterations has been exceeded
Warning: Maximum number of iterations has been exceeded
Accuracy on training data = 66.8%
Elapsed time = 73.1332 secs
% Minimal accuracy gain for extra complexity/compute time
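For reference, here is a minimal sketch of what a function like traingmms might contain, simply wrapping up the steps above into one file (the copy provided with the practical may differ in its exact arguments and reporting; save it as traingmms.m if you want to experiment with it):
function [M0, M1] = traingmms(ftrs, labs, nmix)
% TRAINGMMS  Train per-class GMMs and report accuracy on the training data.
%   Minimal sketch of the procedure used in this practical.
t0 = tic;
ndim = size(ftrs, 2);
dd1 = ftrs(labs==1, :);   % sung frames
dd0 = ftrs(labs==0, :);   % unsung (music-only) frames
% Initialize diagonal-covariance mixtures with a few k-means iterations...
options = foptions;
options(14) = 5;
M1 = gmminit(gmm(ndim, nmix, 'diag'), dd1, options);
M0 = gmminit(gmm(ndim, nmix, 'diag'), dd0, options);
% ... then refine both models with EM
options(14) = 20;
M1 = gmmem(M1, dd1, options);
M0 = gmmem(M0, dd0, options);
% Evaluate on the training data itself
llr = log( gmmprob(M1, ftrs) ./ gmmprob(M0, ftrs) );
acc = mean( (llr > 0) == labs );
disp(sprintf('Accuracy on training data = %.1f%%', 100*acc));
disp(sprintf('Elapsed time = %.4f secs', toc(t0)));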
Investigate the classifier accuracy as the number of Gaussian mixture components and/or the number of feature dimensions is varied. You can even vary the amount of training data used by subselecting rows in ftrs and labs. Can you make a plot of accuracy versus system complexity? How about versus training time?
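As a starting point, here is a hedged sketch of one such experiment, sweeping the number of mixture components on the first two dimensions and re-computing training accuracy from the returned models (this assumes, as in the sketch above, that the second output of traingmms is the sung-frame model):
% Sweep the number of mixture components, recording accuracy and time
nmixes = [1 2 5 10 20];
acc = zeros(size(nmixes));
tt = zeros(size(nmixes));
for i = 1:length(nmixes)
  t0 = tic;
  [M0, M1] = traingmms(ftrs(:,[1 2]), labs, nmixes(i));
  tt(i) = toc(t0);
  llr = log( gmmprob(M1, ftrs(:,[1 2])) ./ gmmprob(M0, ftrs(:,[1 2])) );
  acc(i) = mean( (llr > 0) == labs );
end
subplot(211); plot(nmixes, 100*acc, '-o'); xlabel('number of mixtures'); ylabel('accuracy / %')
subplot(212); plot(nmixes, tt, '-o'); xlabel('number of mixtures'); ylabel('training time / secs')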