
====== Optimal Integration of Labels for Cal500 Dataset ======

| Authors | Dawen Liang |
| Affiliation | LabROSA, Columbia University |
| Code | [[https://github.com/dawenl/glad_cal500|Github Link]] |

[[http://cosmal.ucsd.edu/cal/projects/AnnRet/|Cal500]] is a widely used dataset for music tagging. Its tags cover instrumentation ("Electric Guitar"), genre ("Jazz"), emotion ("Happy"), usage ("For a Party"), etc. They were collected from human annotators and integrated by "majority voting" (the tags that most people annotated are kept). However, by taking into account the expertise of different annotators and the difficulty of different pieces, we can build a better statistical model for optimal label integration, one that jointly infers the true labels, the expertise of the annotators, and the difficulty of the songs. This work is primarily based on [[http://mplab.ucsd.edu/~jake/OptimalLabeling.pdf|this paper]] from NIPS 2009.

===== - Model =====

==== - Notation and model specification ====

$i\in\{1,2,\cdots,I\}$ indexes annotators and $j\in\{1,2,\cdots,J\}$ indexes songs. $L_{ij}$ denotes the label collected from annotator $i$ on song $j$, while $Z_{j}$ stands for the "true" label of the corresponding song.

For each annotator $i$, $\alpha_i \in (-\infty, +\infty)$ indicates his/her expertise: $\alpha_i = +\infty$ means the annotator always gives the correct label, $\alpha_i = -\infty$ means the annotator always gives the **opposite** label (perhaps intentionally), and $\alpha_i = 0$ means the annotator's labels carry no information. For each song $j$, $1/\beta_j \in [0, \infty)$ indicates the difficulty of annotating it correctly, i.e. the larger $\beta_j$ is, the easier the piece is to annotate correctly.

The probability that annotator $i$ correctly labels song $j$ is then

$P(L_{ij} = Z_j | \alpha_i, \beta_j) = \sigma(\alpha_i \beta_j)$

where $\sigma(\cdot)$ is the logistic function $\sigma(x) = \frac{1}{1+\exp(-x)}$, shown below:

{{::600px-logistic-curve_svg.png?200|}}

From the shape of the logistic function, we can see that for a fixed piece (fixed $\beta_j$), a more skilled annotator (larger $\alpha_i$) has a higher probability of giving the correct label. However, if the piece is difficult to label correctly ($\beta_j$ close to 0), the probability is bent towards 0.5 for every annotator.

==== - Inference ====

This model can be fit with the classic [[http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm|expectation-maximization (EM) algorithm]]. Put simply (a minimal code sketch follows the list):

  - Repeat the following until convergence:
    - E-step: treat $Z_j$ as a latent variable and "guess" its value, i.e. compute its posterior given the current $\alpha_i$ and $\beta_j$.
    - M-step: optimize $\alpha_i$ and $\beta_j$ based on the E-step's guess of $Z_j$.
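The following is a minimal sketch of this EM loop for a single binary tag, assuming the annotations are stored as a dense $I \times J$ matrix of 0/1 values. The function and variable names (''fit_glad'', ''prior'', the use of ''scipy.optimize'') are illustrative choices made here, not taken from the linked code, which may organize the computation differently (e.g. with analytic gradients or handling of missing annotations).

<code python>
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # the logistic function sigma(x)

def fit_glad(L, n_iters=50, prior=0.5):
    """EM for a single binary tag.

    L : (I, J) array of 0/1 annotations (annotator i, song j).
    Returns the posterior P(Z_j = 1), the expertise alpha_i, and beta_j.
    """
    I, J = L.shape
    alpha = np.ones(I)       # annotator expertise, unconstrained
    log_beta = np.zeros(J)   # parameterize beta_j = exp(log_beta_j) > 0
    mu = np.full(J, prior)   # posterior P(Z_j = 1)

    def loglik(alpha, log_beta):
        # log P(L_ij | Z_j = 1) and log P(L_ij | Z_j = 0) for every (i, j)
        s = expit(np.outer(alpha, np.exp(log_beta)))   # sigma(alpha_i * beta_j)
        s = np.clip(s, 1e-10, 1 - 1e-10)
        ll1 = np.where(L == 1, np.log(s), np.log(1 - s))
        ll0 = np.where(L == 0, np.log(s), np.log(1 - s))
        return ll1, ll0

    for _ in range(n_iters):
        # E-step: posterior over the latent true label Z_j
        ll1, ll0 = loglik(alpha, log_beta)
        a = np.log(prior) + ll1.sum(axis=0)        # unnormalized log P(Z_j = 1 | L)
        b = np.log(1 - prior) + ll0.sum(axis=0)    # unnormalized log P(Z_j = 0 | L)
        mu = 1.0 / (1.0 + np.exp(b - a))

        # M-step: maximize the expected complete-data log-likelihood
        def neg_q(params):
            l1, l0 = loglik(params[:I], params[I:])
            return -np.sum(mu * l1.sum(axis=0) + (1.0 - mu) * l0.sum(axis=0))

        res = minimize(neg_q, np.concatenate([alpha, log_beta]), method="L-BFGS-B")
        alpha, log_beta = res.x[:I], res.x[I:]

    return mu, alpha, np.exp(log_beta)
</code>

Under this setup, the per-label average expertise $\hat{\alpha}$ used in the results below would simply be ''alpha.mean()''.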
===== - Preliminary results =====

After fitting the model to Cal500, for each label we obtain $I$ values of $\alpha_i$ corresponding to the expertise of the $I$ annotators, and we can take their mean to obtain an "average" expertise $\hat{\alpha} = \frac{1}{I}\sum_i \alpha_i$ for that label. This can be understood as how well **on average** people can annotate this label: a larger $\hat{\alpha}$ means higher average expertise. I fit the model to the instrument-based and genre-based labels, as they are simple and easy to understand (and, for now, the model I implemented only supports binary labels).

==== - Instruments as solo vs. background ====

One interesting question is how good the annotators are at labeling instruments as "Solo" (e.g. "Piano Solo", "Electric Guitar Solo") as opposed to labeling instruments as background.

{{:comp.png?200|}}

The histogram above shows the distributions of average expertise $\hat{\alpha}$ for labeling instruments as background and as solo. We can see that there is no overlap, indicating the annotators are significantly better at annotating instruments as solo than as background.

==== - Difficulty of labeling different instruments ====

We can interpret the average expertise $\hat{\alpha}$ for the instrument-based tags as a reflection of how difficult the corresponding instruments are to label correctly. Below are the top 5 easiest instruments vs. the top 5 hardest instruments in terms of $\hat{\alpha}$:

^ Top 5 easiest ^ Top 5 hardest ^
| Ambient sounds | Drum set |
| Harmonica | Male Lead Vocal |
| Saxophone | Electric Guitar (Clean) |
| Horn | Tambourine |
| Violin | Sequencer |

The top 5 easiest instruments make a lot of sense, as they usually stand out clearly in the music. The top 5 hardest are more arguable, but they are certainly not easy to label in general.

==== - Genre ====

We can take a similar approach with the genre-based tags:

^ Genre (from easiest to hardest) ^
| Rock |
| World |
| Folk |
| Electronica |
| R&B |
| Pop |
| Bluegrass |
| Blues |
| Hip-hop/Rap |
| Country |
| Jazz |

Not surprisingly, Jazz is hard.

===== - Future work =====

  - At the moment, only binary labels are supported, but the model extends naturally to multinomial labels (see the sketch after this list).
  - Currently, each label is treated as completely independent. In the real world, however, different tags are clearly correlated (e.g. "Rock" is more positively correlated with "Electric Guitar (Distortion)" than with "Sampler"). This could be handled with an idea similar to the Correlated Topic Model ([[http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_774.pdf|CTM]]).
  - An interesting yet challenging problem would be to integrate noisy beat annotations to create better ground-truth data for beat tracking. The main difference is that beat annotations are no longer discrete categories but temporally-dependent series, which makes the problem much more difficult.
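Regarding the first item, one natural way to extend the likelihood from binary to $K$ possible label values (this is only a sketch of the extension, not something implemented in the current code) is to keep $\sigma(\alpha_i \beta_j)$ as the probability of the correct label and spread the remaining probability mass uniformly over the $K-1$ incorrect labels:

$P(L_{ij} = Z_j | \alpha_i, \beta_j) = \sigma(\alpha_i \beta_j)$ and $P(L_{ij} = k | Z_j, \alpha_i, \beta_j) = \frac{1 - \sigma(\alpha_i \beta_j)}{K - 1}$ for $k \neq Z_j$.

The E-step then computes a posterior over $K$ values of $Z_j$ instead of two, and the M-step is otherwise unchanged.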
