| Authors | Dawen Liang |
| Affiliation | LabROSA, Columbia University |
| Code | [[https://github.com/dawenl/glad_cal500|GitHub Link]] |

[[http://cosmal.ucsd.edu/cal/projects/AnnRet/|Cal500]] is a widely used dataset for music tagging. The tags it contains include instrumentation ("Electric Guitar"), genre ("Jazz"), emotion ("Happy"), usage ("For a Party"), etc. They were collected from human annotators and integrated by "majority voting" (the tags that most people annotated are kept). However, considering the varying expertise of different annotators and the varying difficulty of different pieces, we can come up with a better statistical model for optimal label integration, which would ideally infer the labels, as well as the expertise of the annotators and the difficulty of the songs. This work is primarily based on [[http://mplab.ucsd.edu/~jake/OptimalLabeling.pdf|this paper]] from NIPS 2009.
===== - Model =====
  
==== - Notation and model specification ====
$i\in\{1,2,\cdots,I\}$ is used to index annotators and $j\in\{1,2,\cdots,J\}$ is used to index songs. $L_{ij}$ represents the label collected from annotator $i$ on song $j$, while $Z_{j}$ stands for the "true" label of the corresponding song.
  
For each annotator $i$, $\alpha_i \in (-\infty, +\infty)$ is used to indicate his/her expertise. $\alpha_i = +\infty$ means the annotator always makes the correct label, while $\alpha_i = -\infty$ means the annotator always makes the **opposite** label (maybe intentionally). $\alpha_i = 0$ means the label from the annotator doesn't carry any information.
  
For each song $j$, $1/\beta_j \in [0, \infty)$ is used to indicate the difficulty of annotating it correctly, i.e. the larger $\beta_j$ is, the easier it is to annotate this song correctly.
  
Now we write the probability that annotator $i$ correctly labels song $j$ as:
  
$P(L_{ij} = Z_j | \alpha_i, \beta_j) = \sigma(\alpha_i \beta_j)$
  
where $\sigma(\cdot)$ is the logistic function $\sigma(x) = \frac{1}{1+\exp(-x)}$, which is shown below:
  
{{::600px-logistic-curve_svg.png?200|}}
  
From the shape of the logistic function, we can see that if an annotator is good at making correct annotations (larger $\alpha_i$), then for the same song (fixed $\beta_j$), he/she has a higher probability of making the right label.
However, if the song is difficult to label correctly ($\beta_j$ close to 0), it will bend the probability towards 0.5 for every annotator.
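
To make the behavior of $\sigma(\alpha_i \beta_j)$ concrete, here is a minimal sketch in Python (NumPy assumed; the function names are illustrative and not taken from the linked repository):

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_correct(alpha_i, beta_j):
    """Probability that annotator i labels song j correctly: sigma(alpha_i * beta_j)."""
    return sigmoid(alpha_i * beta_j)

# A skilled annotator (alpha = 3) on an easy song (beta = 2) is almost always right.
print(p_correct(3.0, 2.0))    # ~0.998

# The same annotator on a very hard song (beta close to 0) is near chance level.
print(p_correct(3.0, 0.01))   # ~0.507

# An adversarial annotator (alpha = -3) on an easy song is almost always wrong.
print(p_correct(-3.0, 2.0))   # ~0.002
</code>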
  
==== - Inference ====
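
The true labels $Z_j$ and the parameters $\alpha_i$, $\beta_j$ are inferred with EM, as in the referenced GLAD paper. The snippet below is only a rough sketch of that scheme under simplifying assumptions (a uniform prior $p(Z_j = 1) = 0.5$, plain gradient ascent in the M-step, and a crude positivity constraint on $\beta_j$); it is not the code from the linked repository:

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glad_em(L, n_iter=50, lr=0.01):
    """Rough EM sketch for a single binary tag.

    L: (I, J) array of observed labels in {0, 1}; rows = annotators, columns = songs.
    Returns the posterior p(Z_j = 1), expertise alpha (I,), and inverse difficulty beta (J,).
    """
    I, J = L.shape
    alpha = np.ones(I)   # start by trusting every annotator a little
    beta = np.ones(J)    # start with every song being moderately easy

    for _ in range(n_iter):
        # E-step: posterior over the true label Z_j, assuming a uniform prior.
        s = np.clip(sigmoid(np.outer(alpha, beta)), 1e-6, 1 - 1e-6)
        log_p1 = np.where(L == 1, np.log(s), np.log(1 - s)).sum(axis=0)  # log p(L | Z_j = 1)
        log_p0 = np.where(L == 0, np.log(s), np.log(1 - s)).sum(axis=0)  # log p(L | Z_j = 0)
        q = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

        # M-step: gradient ascent on the expected complete-data log-likelihood.
        for _ in range(20):
            s = np.clip(sigmoid(np.outer(alpha, beta)), 1e-6, 1 - 1e-6)
            t = np.where(L == 1, q, 1 - q)   # posterior prob. that annotator i was correct on song j
            grad_alpha = ((t - s) * beta).sum(axis=1)
            grad_beta = ((t - s) * alpha[:, None]).sum(axis=0)
            alpha += lr * grad_alpha
            beta = np.maximum(beta + lr * grad_beta, 1e-3)  # keep beta positive so 1/beta is defined

    return q, alpha, beta
</code>
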
===== - Preliminary results =====
  
After fitting the model to Cal500, for each label, we can obtain $I$ different $\alpha_i$ corresponding to the expertise of the $I$ annotators, and we can take the mean to obtain an "average" expertise $\hat{\alpha} = \frac{1}{I}\sum_i \alpha_i$ for the given label. This can be understood as how well **on average** people can annotate this label; a larger $\hat{\alpha}$ means higher average expertise.
  
I fit the model to instrument-based labels and genre-based labels as they are simple and easy to understand (plus, for now, the model I implemented only supports binary labels).
  
==== - Instruments as solo vs. background ====
  
One interesting thing to see is how good the annotators are at labeling instruments as "Solo" (e.g. "Piano Solo", "Electric Guitar Solo"), as opposed to just labeling an instrument as background.
  
{{:comp.png?200|}}
  
The histogram above shows the distributions of average expertise $\hat{\alpha}$ for labeling instruments as background and as solo. We can see that there is no overlap, indicating the annotators are significantly better at annotating instruments as solo than as background.
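
As a rough illustration of how such a comparison can be produced (assuming the fitted $\alpha_i$ are available for each tag; the grouping by whether the tag name contains "Solo" is a hypothetical convention, not necessarily how the Cal500 tags are named):

<code python>
import numpy as np
import matplotlib.pyplot as plt

def plot_expertise_histograms(alpha_per_tag):
    """alpha_per_tag: dict mapping an instrument tag name -> fitted (I,) array of alpha_i."""
    # Hypothetical grouping: a tag counts as a "solo" tag if its name contains the word "Solo".
    solo = [np.mean(a) for tag, a in alpha_per_tag.items() if "Solo" in tag]
    background = [np.mean(a) for tag, a in alpha_per_tag.items() if "Solo" not in tag]

    plt.hist(background, bins=10, alpha=0.5, label="instrument as background")
    plt.hist(solo, bins=10, alpha=0.5, label="instrument as solo")
    plt.xlabel(r"average expertise $\hat{\alpha}$")
    plt.ylabel("number of tags")
    plt.legend()
    plt.show()
</code>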
  
==== - Difficulty of labeling different instruments ====
  
We can interpret the average expertise $\hat{\alpha}$ for labeling instrument-based tags as a reflection of how difficult it is to label the corresponding instruments correctly. Below are the 5 simplest vs. the 5 hardest instruments in terms of $\hat{\alpha}$:
  
^ Top 5 simplest ^ Top 5 hardest ^
| Violin | Sequencer |
  
The top 5 simplest instruments make a lot of sense, as they usually stand out clearly in the music. The top 5 hardest are more arguable, but they are definitely not easy to label in general.
  
==== - Genre ====
  
We can take a similar approach with the genre-based tags:
  
^ Genre (from the simplest to the hardest) ^
| Rock |
| World |
| Folk |
| Electronica |
| R&B |
| Pop |
| Bluegrass |
| Blues |
| Hip-hop/Rap |
| Country |
| Jazz |

Not surprisingly, Jazz is hard.

===== - Future work =====
- At the moment, only binary labels are supported. But in fact, the model can easily be extended to handle multinomial labels.

- Currently, each individual label is treated as completely independent. However, in the real world it is natural to consider the correlation between different tags (e.g. "Rock" is definitely more positively correlated with "Electric Guitar (Distortion)" than with "Sampler"). This can be done with a similar idea to the Correlated Topic Model ([[http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_774.pdf|CTM]]).

- An interesting yet challenging problem would be to integrate noisy beat annotations to create better ground-truth data for beat tracking tasks. The main difference is that in beat annotation, the labels are no longer discrete categories; instead, they are temporally dependent series, which makes the problem much more difficult.