Classification Using Nearest Neighbors

Pairwise Distance Metrics

Categorizing query points based on their distance to points in a training data set can be a simple yet effective way of classifying new points. You can use various metrics to determine the distance, described next. Use pdist2 to find the distance between a set of data and query points.

Distance Metrics

Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors y1, y2, ...,ymy, the various distances between the vector xs and yt are defined as follows:

  • Euclidean distance

    d2st=(xsyt)(xsyt).

    The Euclidean distance is a special case of the Minkowski distance, where p = 2.

  • Standardized Euclidean distance

    d2st=(xsyt)V1(xsyt),

    where V is the n-by-n diagonal matrix whose jth diagonal element is (S(j))2, where S is a vector of scaling factors for each dimension.

  • Mahalanobis distance

    d2st=(xsyt)C1(xsyt),

    where C is the covariance matrix.

  • City block distance

    dst=n?j=1?xsjytj?.

    The city block distance is a special case of the Minkowski distance, where p = 1.

  • Minkowski distance

    dst=pGn?j=1?xsjytj?p.

    For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.

  • Chebychev distance

    dst=maxj{?xsjytj?}.

    The Chebychev distance is a special case of the Minkowski distance, where p = ∞.

  • Cosine distance

    dst=(1xsytG(xsxs)(ytyt)).

  • Correlation distance

    dst=1(xsxs)(ytyt)G(xsxs)(xsxs)G(ytyt)(ytyt),

    where

    xs=1n?jxsj

    and

    yt=1n?jytj.

  • Hamming distance

    dst=(#(xsjytj)/n).

  • Jaccard distance

    dst=#[(xsjytj)((xsj0)(ytj0))]#[(xsj0)(ytj0)].

  • Spearman distance

    dst=1(rsrs)(rtrt)G(rsrs)(rsrs)G(rtrt)(rtrt),

    where

    • rsj is the rank of xsj taken over x1j, x2j, ...xmx,j, as computed by tiedrank.

    • rtj is the rank of ytj taken over y1j, y2j, ...ymy,j, as computed by tiedrank.

    • rs and rt are the coordinate-wise rank vectors of xs and yt, that is, rs = (rs1, rs2, ... rsn) and rt = (rt1, rt2, ... rtn).

    • rs=1n?jrsj=(n+1)2.

    • rt=1n?jrtj=(n+1)2.