====== Differences ====== This shows you the differences between two versions of the page.
| 
    latentartists [2013/06/30 16:07] ben  | 
    
    latentartists [2013/06/30 17:09] (current) ben  | 
    ||
|---|---|---|---|
| Line 4: | Line 4: | ||
| - | Dataset - | + | ==== Dataset - ==== | 
| - | Assume a fixed vocabulary $V$, which in our experiments is a company internal list | + | |
| + | Assume a fixed vocabulary $V$, which in our experiments is a list compiled by the Echonest | ||
| of music related multiword terms. | of music related multiword terms. | ||
| - | Each item $d_i$, from an OO-programming point of view, has the following fields | + | Each item $d_i$ in our dataset $D$, from an OO-programming point of view, has the following fields | 
| * Artist Name | * Artist Name | ||
| * Echonest ID | * Echonest ID | ||
| - | * Echonest Genres | + | * Echonest Genres (used for qualitative evaluation) | 
| * ML unigram model $x_i$, treating a sample of reviews for this artist as a bag of terms $w \in V$ | * ML unigram model $x_i$, treating a sample of reviews for this artist as a bag of terms $w \in V$ | ||
| - | Approach - | + | $|V| = 3368$ | 
| + | |||
| + | $|D| = 23541$ | ||
| + | |||
| + | |||
| + | ==== Modeling Approach - ==== | ||
| Using Factor Analysis, each $x_i$ as | Using Factor Analysis, each $x_i$ as | ||
| Line 22: | Line 28: | ||
| $z_i \sim \mathcal{N}(0,\mathbf{I})$ | $z_i \sim \mathcal{N}(0,\mathbf{I})$ | ||
| + | $x_i \mid z_i \sim \mathcal{N}(Wz_i,\Psi)$ | ||
| + | |||
| + | ==== Hypothesis - ==== | ||
| + | |||
| + | Much work that discovers similarity through low-dimensional representations such as PCA or neural networks treats | ||
| + | each data point as a single point in space.  By taking the Bayesian approach described above we can not only embed data in | ||
| + | a low-dimensional space but also quantify our uncertainty about each dimension.  | ||
| + | |||
| + | ==== Method - ==== | ||
| + | |||
| + | The above model can be used to predict similar artists based on distance in the latent space.  The traditional  | ||
| + | approach would be to represent artist $d_i$ with its posterior mean $\mathbb{E}[z_i]$, and measure Euclidean distance. | ||
| + | Our alternative computes distance with KL divergence between full posteriors.  The posterior distribution is given as | ||
| + | |||
| + | $z_i \mid x_i \sim \mathcal{N}(\mathbb{E}[z_i],G)$ | ||
| + | |||
| + | where | ||
| + | |||
| + | $G = (I + W^T\Psi^{-1}W)^{-1}$ | ||
| + | |||
| + | The distance between two posteriors can be computed with KL divergence, which for multivariate Gaussians is given by | ||
| + | |||
| + | $KL(\mathcal{N}_0||\mathcal{N}_1) = \frac{1}{2}(\mathbb{E}[z_0] - \mathbb{E}[z_1])^T\Sigma^{-1}(\mathbb{E}[z_0] - \mathbb{E}[z_1])$ | ||
| + | |||
| + | if the covariance matrix $\Sigma$ is the same for both Gaussians.  This shows that if $\Sigma^{-1}$ is a multiple | ||
| + | of the identity matrix, the ranking retrieved will be the same as that of Euclidean distance between posterior means.  | ||
| + | In this model the posterior covariance $G$ does not depend on $x_i$, so $\Sigma = G$ for every artist. | ||
| + | |||
| + | We can find the artists similar to a given artist by computing its distance to all other artists under one of these | ||
| + | metrics and applying a threshold.  | ||
| + | |||
| + | ==== Evaluation - ==== | ||
| + | |||
| + | We evaluate prediction of similarity on the top 300 artists by Echonest "hotttness", a set we will call $\mathcal{H}$.  | ||
| + | We use the official similar-artist lists from the Echonest database for each artist as the ground truth, provided that these | ||
| + | similar artists are also in $\mathcal{H}$.  By varying the threshold on KL divergence or Euclidean distance we can trace out | ||
| + | an ROC curve. | ||
| + | |||
| + | Our results, shown in the ROC plots below, correspond to training on the full dataset and on only the top 1000 artists by hotttness.  | ||
| + | In both experimental setups the same top 300 artists are used for evaluation; the only difference is the amount of information available  | ||
| + | during training. | ||
| + | |||
| + | == Hottt 1000 == | ||
| + | |||
| + | {{::1000.jpg?600|}} | ||
| + | |||
| + | == Full Dataset == | ||
| + | |||
| + | {{::full.jpg?600|}} | ||
| + | The results do not support our hypothesis that taking uncertainty into account would create a more robust notion of similarity.  | ||
| + | While both methods capture the information in the Echonest similar-artist lists, the area under the ROC curve is clearly | ||
| + | greater for the simple Euclidean distance based approach.  | ||
| + | The reason that the experimental results do not match our intuition is unclear.  One possibility is that KL divergence | ||
| + | is not an appropriate metric for similarity.  | ||
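
The two distances compared above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the experiments: the function names (`posterior_params`, `kl_shared_cov`, `rank_neighbors`) are hypothetical, $\Psi$ is assumed diagonal, the data are assumed centered, and the posterior mean is taken to be the standard Factor Analysis result $\mathbb{E}[z_i] = G W^T \Psi^{-1} x_i$, which the page itself does not state.

```python
import numpy as np


def posterior_params(X, W, psi_diag):
    """Posterior means (one row per item) and the shared covariance G.

    X: (n, d) centered data, W: (d, k) loadings,
    psi_diag: (d,) diagonal of Psi (assumed diagonal here).
    """
    k = W.shape[1]
    Wt_psi_inv = W.T / psi_diag                    # W^T Psi^{-1}, shape (k, d)
    G = np.linalg.inv(np.eye(k) + Wt_psi_inv @ W)  # G = (I + W^T Psi^{-1} W)^{-1}
    means = X @ Wt_psi_inv.T @ G                   # row i is (G W^T Psi^{-1} x_i)^T
    return means, G


def kl_shared_cov(mu0, mu1, cov):
    """KL divergence between two Gaussians that share the covariance `cov`."""
    diff = mu0 - mu1
    return 0.5 * diff @ np.linalg.solve(cov, diff)


def rank_neighbors(means, query, cov=None):
    """Indices of all other items, nearest first.

    cov=None ranks by squared Euclidean distance between posterior means;
    otherwise by the shared-covariance KL divergence.
    """
    diffs = means - means[query]
    if cov is None:
        dist = np.einsum("ij,ij->i", diffs, diffs)
    else:
        dist = 0.5 * np.einsum("ij,ij->i", diffs @ np.linalg.inv(cov), diffs)
    order = np.argsort(dist, kind="stable")
    return order[order != query]


if __name__ == "__main__":
    # Synthetic stand-in for the unigram models; not the Echonest data.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(20, 3))
    psi = rng.uniform(0.5, 2.0, size=20)
    X = rng.normal(size=(10, 20))
    means, G = posterior_params(X, W, psi)
    print(rank_neighbors(means, 0))       # Euclidean ranking
    print(rank_neighbors(means, 0, G))    # KL ranking
```

Because $G$ is shared by all items, the KL ranking is a Mahalanobis ranking of the posterior means, and it can differ from the Euclidean one only to the extent that $G$ departs from a multiple of the identity.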