====== Differences ====== This shows you the differences between two versions of the page.
latentartists [2013/06/30 16:23] ben |
latentartists [2013/06/30 17:09] (current) ben |
||
---|---|---|---|
Line 4: | Line 4: | ||
- | Dataset - | + | ==== Dataset - ==== |
- | Assume a fixed vocabulary $V$, which in our experiments is a company internal list | + | |
+ | Assume a fixed vocabulary $V$, which in our experiments is a list compiled by the Echonest | ||
of music related multiword terms. | of music related multiword terms. | ||
Line 17: | Line 18: | ||
$|V| = 3368$ | $|V| = 3368$ | ||
+ | |||
$|D| = 23541$ | $|D| = 23541$ | ||
- | Modeling Approach - | + | ==== Modeling Approach - ==== |
Using Factor Analysis, each $x_i$ as | Using Factor Analysis, each $x_i$ as | ||
Line 28: | Line 30: | ||
$x_i \sim \mathcal{N}(Wz,\Psi)$ | $x_i \sim \mathcal{N}(Wz,\Psi)$ | ||
- | Hypothesis - | + | ==== Hypothesis - ==== |
Much work that discovers similarity through low-dimensional representations such as PCA or Neural Networks treat | Much work that discovers similarity through low-dimensional representations such as PCA or Neural Networks treat | ||
Line 34: | Line 36: | ||
a low dimensional space but also quantify our uncertainty about each dimension. | a low dimensional space but also quantify our uncertainty about each dimension. | ||
- | Method - | + | ==== Method - ==== |
The above model can be used to predict similar artists based on distance in the latent space. The traditional | The above model can be used to predict similar artists based on distance in the latent space. The traditional | ||
Line 46: | Line 48: | ||
$G = (I + W^T\Psi^{-1}W)^{-1}$ | $G = (I + W^T\Psi^{-1}W)^{-1}$ | ||
+ | Distance between two artists' posterior distributions can be computed with KL divergence, which for multivariate Gaussians is given as | |
+ | $KL(\mathcal{N}_0||\mathcal{N}_1) \propto (\mathbb{E}[z_0] - \mathbb{E}[z_1])\Sigma^{-1}(\mathbb{E}[z_0] - \mathbb{E}[z_1])^T + C$ | ||
- | Evaluation - | + | if the covariance matrix $\Sigma$ is the same for both Gaussians. This shows that if $\Sigma^{-1}$ is a multiple |
+ | of the identity matrix, the ranking retrieved will be the same as that of Euclidean distance between posterior means. | |
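For reference, the reduction above follows from the general KL divergence between two $k$-dimensional Gaussians. Here $\mu_i$ is shorthand (introduced only for this note) for the posterior mean $\mathbb{E}[z_i]$, written as a column vector:

$$KL(\mathcal{N}_0||\mathcal{N}_1) = \frac{1}{2}\left[\mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^T\Sigma_1^{-1}(\mu_1 - \mu_0) - k + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right]$$

With $\Sigma_0 = \Sigma_1 = \Sigma$ the trace term equals $k$ and the log-determinant term vanishes, leaving

$$KL(\mathcal{N}_0||\mathcal{N}_1) = \frac{1}{2}(\mu_0 - \mu_1)^T\Sigma^{-1}(\mu_0 - \mu_1)$$

If $\Sigma^{-1} = cI$ with $c > 0$, this is $\frac{c}{2}\|\mu_0 - \mu_1\|^2$, a monotone function of the Euclidean distance, so both metrics produce the same ranking.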
+ | |||
+ | We can find the artists that are similar to a given artist by computing its distance to all other artists using one of these | |
+ | metrics and applying a threshold. | |
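The two distance computations can be sketched as follows. This is a minimal illustration, not the code used for the experiments. It assumes the standard factor analysis posterior for centered data, $z|x \sim \mathcal{N}(GW^T\Psi^{-1}x, G)$, and a diagonal $\Psi$. The function names and the `psi_diag` argument are hypothetical.

```python
import numpy as np

def posterior(X, W, psi_diag):
    """Posterior over the latent z for each row of X (assumed centered).

    X: (n, d) data, W: (d, k) loadings, psi_diag: (d,) diagonal of Psi.
    Returns the posterior means (n, k) and the covariance G (k, k).
    G does not depend on x, so it is shared by every artist.
    """
    k = W.shape[1]
    Wt_psi_inv = W.T / psi_diag                 # W^T Psi^{-1}, shape (k, d)
    G = np.linalg.inv(np.eye(k) + Wt_psi_inv @ W)
    means = X @ Wt_psi_inv.T @ G                # row i is (G W^T Psi^{-1} x_i)^T
    return means, G

def kl_equal_cov(m0, m1, cov):
    """KL divergence between two Gaussians that share the covariance cov."""
    diff = m0 - m1
    return 0.5 * diff @ np.linalg.solve(cov, diff)

def euclidean(m0, m1):
    """Euclidean distance between two posterior means."""
    return np.linalg.norm(m0 - m1)

def similar_to(query, means, threshold, dist):
    """Indices of artists whose distance to artist `query` is <= threshold."""
    return [j for j in range(len(means))
            if j != query and dist(means[query], means[j]) <= threshold]
```

For the KL variant, `dist` could be `lambda a, b: kl_equal_cov(a, b, G)`. Under this assumed posterior the covariance is the same for every artist, so the KL ranking can differ from the Euclidean one only when `G` is not a multiple of the identity.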
+ | |||
+ | ==== Evaluation - ==== | ||
We evaluate prediction of similarity on the top 300 artists by Echonest "hotttness", a set we will call $\mathcal{H}$. | We evaluate prediction of similarity on the top 300 artists by Echonest "hotttness", a set we will call $\mathcal{H}$. | ||
We use the official similar artists from the Echonest database for each artist as the ground truth, provided that these | We use the official similar artists from the Echonest database for each artist as the ground truth, provided that these | ||
- | similar artists are also in $\mathcal{H}$. By varying the numeric thesh | + | similar artists are also in $\mathcal{H}$. By varying the threshold on KL divergence or Euclidean distance we can trace out |
+ | an ROC curve. | ||
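The threshold sweep can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, each (query artist, candidate artist) pair is assumed to have already been given a distance and a ground-truth label, and at least one positive and one negative pair are assumed to exist.

```python
def roc_points(distances, is_similar):
    """Trace ROC points by sweeping a threshold over pairwise distances.

    distances:  one float per (query, candidate) pair.
    is_similar: one bool per pair, True if the ground truth lists the pair.
    A pair is predicted similar when its distance is <= the threshold.
    Returns a list of (false positive rate, true positive rate) points.
    """
    pos = sum(is_similar)
    neg = len(is_similar) - pos
    pairs = sorted(zip(distances, is_similar))
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (d, label) in enumerate(pairs):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last pair at this distance,
        # so tied distances share one threshold.
        if i + 1 == len(pairs) or pairs[i + 1][0] != d:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Calling `auc(roc_points(...))` once with KL divergences and once with Euclidean distances yields the two areas that can then be compared.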
+ | |||
+ | Our results, shown in the ROC plots below, correspond to training on the full dataset and on only the top 1000 artists by hotttness. | |
+ | In both experimental setups the same top 300 artists are used for evaluation; the only difference is the amount of information available | |
+ | during training. | |
+ | |||
+ | == Hottt 1000 == | ||
+ | {{::1000.jpg?600|}} | ||
+ | == Full Dataset == | ||
+ | {{::full.jpg?600|}} | ||
+ | The results do not support our hypothesis that taking uncertainty into account would create a more robust notion of similarity. | ||
+ | While both methods capture the information in the Echonest similar artist lists, the area under the ROC curve is clearly | |
+ | greater for the simple approach based on Euclidean distance. | |
+ | The reason that the experimental results do not match our intuition is unclear. One possibility is that KL divergence | ||
+ | is not an appropriate metric for similarity. | ||