scrape-yt-match-fprint - Recovering fingerprinted audio from YouTube
Introduction
One use for audio fingerprints is to confirm that audio files at two locations are the same, at least modulo the kinds of channel distortions to which fingerprints are robust. This is particularly significant in situations where there are legal restrictions that prevent simply copying the files, as with the commercial audio commonly dealt with in Music IR.
Many commercial music tracks are, however, available via YouTube. If you want to hear a particular track, you can very often enter the artist and title into YouTube, and quickly locate several videos with different versions of the music as the soundtrack. Some may be low-quality or mislabeled, but usually you'll quickly find what you want.
So one approach to "distributing" music audio collections for research that avoids the copyright-infringing act of copying audio files is to distribute descriptions of the tracks, then let any interested researcher grab the audio from YouTube. However, there may be many different versions on YouTube, with more or less significant variations in performance, timing, or quality. This can be particularly important when trying to match audio to time-specific annotations (such as chord or structure transcriptions). Then, even something as innocuous as an extra couple of seconds of silence at the start of the track can disrupt the data.
To solve this, the original researcher can compute fingerprints over the source audio, then distribute this compact but discriminating information. In fact, by comparing the relative timings of individual landmark matches, it is also possible to determine the editing (trimming and resampling) needed to bring the local audio into temporal alignment with the reference audio used to create the fingerprints.
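For instance, suppose the landmark-matching stage has produced pairs of (reference time, query time) for hashes common to both recordings. A least-squares line fit through those pairs gives the resampling ratio (slope) and trim offset (intercept). A minimal sketch with made-up pairs (the file name and numbers are purely illustrative, not audfprint's internal format):

```shell
# Illustrative (reference_time, query_time) pairs for matched landmarks;
# here the query is simply the reference delayed by 2 seconds.
cat > pairs.txt <<'EOF'
10.0 12.0
20.0 22.0
30.0 32.0
40.0 42.0
EOF
# Least-squares fit ref_t = a * query_t + b: 'a' is the resampling
# ratio and 'b' the trim offset (negative b => trim the query start).
result=$(awk '{ sx += $2; sy += $1; sxx += $2*$2; sxy += $1*$2; n++ }
              END { a = (n*sxy - sx*sy) / (n*sxx - sx*sx);
                    b = (sy - a*sx) / n;
                    printf "rate %.4f offset %.2f", a, b }' pairs.txt)
echo "$result"
```

Here the fit reports a rate of 1.0 and an offset of -2 seconds, i.e., trim two seconds from the start of the query to align it with the reference.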
audfprint is my landmark-based robust audio fingerprinting tool. It has provisions to edit query audio and write out a version trimmed and scaled to synchronize within a few milliseconds to the reference audio described in the reference database.
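As a sketch of the basic workflow (the command forms follow my reading of the audfprint README; treat the exact flags and file names as assumptions to be checked against the current documentation):

```shell
# Build a fingerprint database from the reference collection ...
python audfprint.py new --dbase fpdbase.pklz reference/*.mp3
# ... then match a downloaded query against it, reporting the
# best-matching reference track and the number of filtered hash matches.
python audfprint.py match --dbase fpdbase.pklz downloaded/query.mp3
```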
Thus, to recreate an approximation of a reference audio set, a researcher simply needs to obtain a fingerprint database created from the original audio set, and a set of keywords (i.e., the artist and title) for each track. They can then query YouTube with those keywords, and use the fingerprinter to (a) check whether each downloaded track actually matches the original track, then (b) scale and trim the downloaded audio to line up with the original.
To help with this, I've created a small shell script, scrape-yt-match-fprint.sh, that takes a set of keywords as input, queries YouTube, downloads the top ten associated videos, checks each one against an audfprint fingerprint database, chooses the one with the greatest number of filtered hash matches (which we expect to be the closest match), then writes out a new version of that audio aligned to the fingerprint match. Owing to the way audfprint handles outputting aligned files, the name of the written file is taken from the fingerprint database, and should match the original filename.
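Schematically, the script's selection loop behaves something like the following dry-run sketch. The function names and hard-wired match counts are stand-ins invented for illustration (the real script invokes a YouTube downloader and audfprint); only the choose-the-highest-count control flow mirrors the described behavior:

```shell
# Stub standing in for downloading one video's audio from YouTube.
yt_download() { echo "downloading audio for $1"; }
# Stub standing in for an audfprint match: prints a fake filtered-hash count.
fprint_match() { case $1 in vid1) echo 10 ;; vid2) echo 57 ;; *) echo 31 ;; esac; }

best=""; best_count=0
for vid in vid1 vid2 vid3; do        # the real script tries the top ten results
  yt_download "$vid" > /dev/null
  count=$(fprint_match "$vid")
  if [ "$count" -gt "$best_count" ]; then
    best_count=$count; best=$vid
  fi
done
echo "best match: $best ($best_count filtered hash matches)"
```

In this toy run, vid2 wins with 57 filtered hash matches; the real script would then align and write out that video's audio.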