Fast k-NN using HDF5

We present how to directly use an HDF5 file holding features with a fast Nearest Neighbor library in Python.

This tutorial assumes you have pytables and scikits.ann.

During a training pass, you have extracted feature vectors and labels. Let's take the example of year prediction. In a HDF5 file, you have vectors of size 5 and corresponding years. They are organized in a group called 'data'. We have 1.5M feature vector in this case. In iPython:

In [6]: import hdf5_getters as GETTERS
In [7]: h5 = GETTERS.open_h5_file_read('/feats_year.h5')
In [11]: h5.root.data.year.shape
Out[11]: (1537581,)
In [12]: h5.root.data.feats.shape
Out[12]: (1537581, 5)

We know load the fast Approximate Nearest Neighbor library. It also does exact NN, which we use here.

In [8]: import scikits.ann as ANN

Usually, we create kdtree, used for fast NN queries, using a numpy array. But the good thing is that, with pytables, HDF5 columns look like numpy arrays! Therefore, we can do:

In [9]: kd = ANN.kdtree(h5.root.data.feats)

It takes some time to load (a minute or two), but then, check the time:

In [15]: %time kd.knn(np.random.rand(1,5)*100)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[16]: (array([[451046]], dtype=int32), array([[ 440.65787012]]))

It is instantaneous! Well, not really. But first, note that the output are the index of the closest element, and then its distance. K was implicitely 1, we can do 10-NN if we want:

In [17]: %time kd.knn(np.random.rand(1,5)*100,10)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[18]: 
(array([[ 959263,  478361,  795133, 1267893,  176398,  942215, 1346241,
         973526,  158917,  996840]], dtype=int32),
 array([[  599.48257344,   813.42850047,  1043.60168629,  1434.72259457,
         1464.13185926,  1557.32693239,  1604.61336171,  1715.19310994,
         1854.45981542,  1875.36760981]]))

And we can test more than one feature vector at a time, e.g 3:

In [21]: %time kd.knn(np.random.rand(3,5)*100)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[22]: 
(array([[ 661944],
       [  79454],
       [1115377]], dtype=int32),
 array([[ 268.81010349],
       [ 806.11565055],
       [ 575.00976952]]))

Only detail, for some reason, you must delete the kd tree before closing the HDF5 file if you want a clean exit:

In [23]: del kd
In [24]: h5.close()

Enjoy!