Categorizing query points based on their distance to points in a training data set can be a simple yet effective way of classifying new points. You can use various metrics to determine the distance, described next. Use pdist2 to find the distance between a set of data and query points.
Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors y1, y2, ...,ymy, the various distances between the vector xs and yt are defined as follows:
Euclidean distance
d2st=(xs−yt)(xs−yt)′.
The Euclidean distance is a special case of the Minkowski distance, where p = 2.
Standardized Euclidean distance
d2st=(xs−yt)V−1(xs−yt)′,
where V is the n-by-n diagonal matrix whose jth diagonal element is (S(j))2, where S is a vector of scaling factors for each dimension.
Mahalanobis distance
d2st=(xs−yt)C−1(xs−yt)′,
where C is the covariance matrix.
City block distance
dst=n?j=1?xsj−ytj?.
The city block distance is a special case of the Minkowski distance, where p = 1.
Minkowski distance
dst=pGn?j=1?xsj−ytj?p.
For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.
Chebychev distance
dst=maxj{?xsj−ytj?}.
The Chebychev distance is a special case of the Minkowski distance, where p = ∞.
Cosine distance
dst=(1−xsy′tG(xsx′s)(yty′t)).
Correlation distance
dst=1−(xs−‾xs)(yt−‾yt)′G(xs−‾xs)(xs−‾xs)′G(yt−‾yt)(yt−‾yt)′,
where
‾xs=1n?jxsj
and
‾yt=1n?jytj.
Hamming distance
dst=(#(xsj≠ytj)/n).
Jaccard distance
dst=#[(xsj≠ytj)∩((xsj≠0)∪(ytj≠0))]#[(xsj≠0)∪(ytj≠0)].
Spearman distance
dst=1−(rs−‾rs)(rt−‾rt)′G(rs−‾rs)(rs−‾rs)′G(rt−‾rt)(rt−‾rt)′,
where
rsj is the rank of xsj taken over x1j, x2j, ...xmx,j, as computed by tiedrank.
rtj is the rank of ytj taken over y1j, y2j, ...ymy,j, as computed by tiedrank.
rs and rt are the coordinate-wise rank vectors of xs and yt, that is, rs = (rs1, rs2, ... rsn) and rt = (rt1, rt2, ... rtn).
‾rs=1n?jrsj=(n+1)2.
‾rt=1n?jrtj=(n+1)2.