scrape-yt-match-fprint - Recovering fingerprinted audio from YouTube
Introduction
One use for audio fingerprints is to confirm that audio files at two locations are the same, at least modulo the kinds of channel distortions to which fingerprints are robust. This is particularly significant in situations where there are legal restrictions that prevent simply copying the files, as with the commercial audio commonly dealt with in Music IR.
Many commercial music tracks are, however, available via YouTube. If you want to hear a particular track, you can very often enter the artist and title into YouTube, and quickly locate several videos with different versions of the music as the soundtrack. Some may be low-quality or mislabeled, but usually you'll quickly find what you want.
So one approach to "distributing" music audio collections for research that avoids the copyright-infringing act of copying audio files is to distribute descriptions of the tracks, then let any interested researcher grab the audio from YouTube. However, there may be many different versions on YouTube, with more or less significant variations in performance, timing, or quality. This can be particularly important when trying to match audio to time-specific annotations (such as chord or structure transcriptions). Then, even something as innocuous as an extra couple of seconds of silence at the start of the track can disrupt the data.
To solve this, the original researcher can compute fingerprints over the source audio, then distribute this compact but discriminating information. In fact, by comparing the relative timings of individual landmark matches, it is also possible to determine the editing (trimming and resampling) needed to bring the local audio into temporal alignment with the reference audio used to create the fingerprints.
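For instance, suppose the landmark-matching stage has produced pairs of (reference time, query time) for hashes common to both recordings. A least-squares line fit through those pairs gives the resampling ratio (slope) and trim offset (intercept). A minimal sketch with made-up pairs (the file name and numbers are purely illustrative, not audfprint's internal format):

```shell
# Illustrative (reference_time, query_time) pairs for matched landmarks;
# here the query is simply the reference delayed by 2 seconds.
cat > pairs.txt <<'EOF'
10.0 12.0
20.0 22.0
30.0 32.0
40.0 42.0
EOF
# Least-squares fit ref_t = a * query_t + b: 'a' is the resampling
# ratio and 'b' the trim offset (negative b => trim the query start).
result=$(awk '{ sx += $2; sy += $1; sxx += $2*$2; sxy += $1*$2; n++ }
              END { a = (n*sxy - sx*sy) / (n*sxx - sx*sx);
                    b = (sy - a*sx) / n;
                    printf "rate %.4f offset %.2f", a, b }' pairs.txt)
echo "$result"
```

Here the fit reports a rate of 1.0 and an offset of -2 seconds, i.e., trim two seconds from the start of the query to align it with the reference.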
audfprint is my landmark-based robust audio fingerprinting tool. It has provisions to edit query audio and write out a version trimmed and scaled to synchronize within a few milliseconds to the reference audio described in the reference database.
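As a sketch of the basic workflow (the command forms follow my reading of the audfprint README; treat the exact flags and file names as assumptions to be checked against the current documentation):

```shell
# Build a fingerprint database from the reference collection ...
python audfprint.py new --dbase fpdbase.pklz reference/*.mp3
# ... then match a downloaded query against it, reporting the
# best-matching reference track and the number of filtered hash matches.
python audfprint.py match --dbase fpdbase.pklz downloaded/query.mp3
```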
Thus, to recreate an approximation of a reference audio set, a researcher simply needs to obtain a fingerprint database created from the original audio set, and a set of keywords (i.e., the artist and title) for each track. They can then query YouTube with those keywords, and use the fingerprinter to (a) check whether each downloaded track actually matches the original track, then (b) scale and trim the downloaded audio to line up with the original.
To help with this, I've created a small shell script, scrape-yt-match-fprint.sh, that takes a set of keywords as input, queries YouTube, downloads the top ten associated videos, checks each one against an audfprint fingerprint database, chooses the one with the greatest number of filtered hash matches (which we expect to be the closest match), then writes out a new version of that audio aligned to the fingerprint match. Owing to the way audfprint handles outputting aligned files, the name of the written file is taken from the fingerprint database, and should match the original filename.
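Schematically, the script's selection loop behaves something like the following dry-run sketch. The function names and hard-wired match counts are stand-ins invented for illustration (the real script invokes a YouTube downloader and audfprint); only the choose-the-highest-count control flow mirrors the described behavior:

```shell
# Stub standing in for downloading one video's audio from YouTube.
yt_download() { echo "downloading audio for $1"; }
# Stub standing in for an audfprint match: prints a fake filtered-hash count.
fprint_match() { case $1 in vid1) echo 10 ;; vid2) echo 57 ;; *) echo 31 ;; esac; }

best=""; best_count=0
for vid in vid1 vid2 vid3; do        # the real script tries the top ten results
  yt_download "$vid" > /dev/null
  count=$(fprint_match "$vid")
  if [ "$count" -gt "$best_count" ]; then
    best_count=$count; best=$vid
  fi
done
echo "best match: $best ($best_count filtered hash matches)"
```

In this toy run, vid2 wins with 57 filtered hash matches; the real script would then align and write out that video's audio.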