MATLAB: Nearest Neighbors Classification

Pairwise Distance Metrics

Categorizing query points based on their distance to points in a training data set can be a simple yet effective way of classifying new points. You can measure that distance with any of the metrics described next. Use pdist2 to compute the distances between a set of data points and a set of query points.
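
As a minimal sketch of the idea, the following computes all pairwise distances with pdist2 and then picks the closest training point for each query point. The matrices X and Y are hypothetical toy data, and the nearest-neighbor lookup with min is an illustration rather than a full classifier.

```matlab
% Hypothetical toy data: three training points and two query points in 2-D.
X = [0 0; 3 4; 6 8];   % training data, one observation per row
Y = [1 1; 5 7];        % query points, one observation per row

% D(i,j) is the Euclidean distance between X(i,:) and Y(j,:).
D = pdist2(X, Y, 'euclidean');

% For each query point (each column of D), find the closest training point.
[dmin, idx] = min(D, [], 1);
```

Here idx(j) is the row of X nearest to the jth query point, so a label attached to that row could be assigned to the query point.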

Distance Metrics

Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors $x_1, x_2, \ldots, x_{mx}$, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors $y_1, y_2, \ldots, y_{my}$, the various distances between the vectors $x_s$ and $y_t$ are defined as follows:

Euclidean distance

$$d_{st}^2 = (x_s - y_t)(x_s - y_t)'.$$

The Euclidean distance is a special case of the Minkowski distance, where p = 2.

Standardized Euclidean distance

$$d_{st}^2 = (x_s - y_t)\,V^{-1}\,(x_s - y_t)',$$

where V is the n-by-n diagonal matrix whose jth diagonal element is $(S(j))^2$, and S is a vector of scaling factors, one for each dimension.

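
The formula can be checked by hand on a small example. This sketch assumes the scaling factors S are the per-column standard deviations of a hypothetical data matrix X; other choices of S are possible.

```matlab
% Hypothetical data: two features on very different scales.
X  = [1 100; 2 300; 3 200];
xs = X(1,:);
yt = [2 200];

S = std(X, 0, 1);            % assumed scaling factors: per-column standard deviation
V = diag(S.^2);              % diagonal matrix with (S(j))^2 on the diagonal

delta = xs - yt;
d2 = delta / V * delta';     % (xs - yt) * inv(V) * (xs - yt)'
d  = sqrt(d2);

% Equivalent form: divide each coordinate difference by S(j) first.
dAlt = sqrt(sum((delta ./ S).^2));
```

Both features contribute equally here (d2 is 1 + 1) even though their raw differences are 1 and 100. With pdist2, the same scaling vector can be passed as the extra argument, as in pdist2(xs, yt, 'seuclidean', S).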
Mahalanobis distance

$$d_{st}^2 = (x_s - y_t)\,C^{-1}\,(x_s - y_t)',$$

where C is the covariance matrix.

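
To illustrate how the covariance matrix changes the result, the sketch below evaluates the formula for two points that are equally far from the origin in the Euclidean sense. The matrix C and the points are hypothetical values chosen for easy arithmetic.

```matlab
C  = [2 1; 1 2];     % hypothetical covariance matrix (positively correlated features)
yt = [0 0];
xa = [1  1];         % lies along the direction of correlation
xb = [1 -1];         % lies across the direction of correlation

% Squared distance (x - y) * inv(C) * (x - y)'
mahal2 = @(x, y) (x - y) / C * (x - y)';

da = sqrt(mahal2(xa, yt));   % sqrt(2/3)
db = sqrt(mahal2(xb, yt));   % sqrt(2)
```

Both points have Euclidean distance sqrt(2) from yt, but the point lying along the correlated direction is closer in the Mahalanobis sense.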
City block distance

$$d_{st} = \sum_{j=1}^{n} \left| x_{sj} - y_{tj} \right|.$$

The city block distance is a special case of the Minkowski distance, where p = 1.

Minkowski distance

$$d_{st} = \sqrt[p]{\sum_{j=1}^{n} \left| x_{sj} - y_{tj} \right|^{p}}.$$

For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.

Chebychev distance

$$d_{st} = \max_{j} \left\{ \left| x_{sj} - y_{tj} \right| \right\}.$$

The Chebychev distance is a special case of the Minkowski distance, where p = ∞.

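
The relationship between the Minkowski distance and its special cases can be sketched numerically. The vectors and the helper name minkowski are hypothetical, and p = 50 simply stands in for "large p".

```matlab
xs = [1 5 2];
yt = [4 1 2];
a  = abs(xs - yt);                  % coordinate-wise absolute differences: [3 4 0]

minkowski = @(p) sum(a.^p)^(1/p);   % Minkowski distance for exponent p

dCity  = minkowski(1);              % city block: 3 + 4 + 0
dEucl  = minkowski(2);              % Euclidean: sqrt(9 + 16)
dCheb  = max(a);                    % Chebychev: largest coordinate difference
dLarge = minkowski(50);             % approaches dCheb as p grows
```

As p increases, the largest coordinate difference dominates the sum, which is why the limit p = ∞ reduces to the maximum.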
Cosine distance

$$d_{st} = 1 - \frac{x_s y_t'}{\sqrt{(x_s x_s')(y_t y_t')}}.$$

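
A short sketch of the cosine formula on hypothetical vectors, which also shows that rescaling a vector leaves the distance unchanged because only the angle between the vectors matters:

```matlab
xs = [1 0];
yt = [1 1];

cosdist = @(x, y) 1 - (x * y') / sqrt((x * x') * (y * y'));

d1 = cosdist(xs, yt);       % 1 - 1/sqrt(2), the vectors are 45 degrees apart
d2 = cosdist(3 * xs, yt);   % same value: scaling xs does not change the angle
```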
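Correlation distance

$$d_{st} = 1 - \frac{(x_s - \bar{x}_s)(y_t - \bar{y}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'}\,\sqrt{(y_t - \bar{y}_t)(y_t - \bar{y}_t)'}},$$

where

$$\bar{x}_s = \frac{1}{n}\sum_{j} x_{sj}$$

and

$$\bar{y}_t = \frac{1}{n}\sum_{j} y_{tj}.$$
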
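
The correlation distance is the cosine formula applied to mean-centered vectors, so it ranges from 0 (perfect positive linear relationship) to 2 (perfect negative one). A minimal sketch with hypothetical vectors and helper names:

```matlab
xs = [1 2 3 4];

center   = @(v) v - mean(v);   % subtract the vector's own mean
corrdist = @(x, y) 1 - (center(x) * center(y)') / ...
    (sqrt(center(x) * center(x)') * sqrt(center(y) * center(y)'));

dSame = corrdist(xs, 2 * xs + 5);   % linearly related: distance close to 0
dOpp  = corrdist(xs, -xs);          % perfectly anti-correlated: distance close to 2
```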
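Hamming distance

$$d_{st} = \frac{\#(x_{sj} \neq y_{tj})}{n}.$$

Jaccard distance

$$d_{st} = \frac{\#\left[(x_{sj} \neq y_{tj}) \cap \left((x_{sj} \neq 0) \cup (y_{tj} \neq 0)\right)\right]}{\#\left[(x_{sj} \neq 0) \cup (y_{tj} \neq 0)\right]}.$$
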
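
The Hamming and Jaccard formulas differ only in which coordinates they count, which a small example makes concrete. The vectors below are hypothetical; the logical masks mirror the sets in the two formulas.

```matlab
xs = [1 0 0 2 3];
yt = [1 0 4 0 5];

differ  = xs ~= yt;                   % coordinates that differ: [0 0 1 1 1]
nonzero = (xs ~= 0) | (yt ~= 0);      % coordinates where either is nonzero: [1 0 1 1 1]

dHamming = sum(differ) / numel(xs);                 % 3 of 5 coordinates differ
dJaccard = sum(differ & nonzero) / sum(nonzero);    % 3 of the 4 nonzero coordinates differ
```

The second coordinate, where both vectors are zero, counts as a match for the Hamming distance but is ignored entirely by the Jaccard distance.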
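Spearman distance

$$d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'}\,\sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}},$$

where
- $r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mx,j}$, as computed by tiedrank.
- $r_{tj}$ is the rank of $y_{tj}$ taken over $y_{1j}, y_{2j}, \ldots, y_{my,j}$, as computed by tiedrank.
- $r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $y_t$, that is, $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$ and $r_t = (r_{t1}, r_{t2}, \ldots, r_{tn})$.
- $\bar{r}_s = \frac{1}{n}\sum_{j} r_{sj} = \frac{n+1}{2}$.
- $\bar{r}_t = \frac{1}{n}\sum_{j} r_{tj} = \frac{n+1}{2}$.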
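
The Spearman formula is the correlation formula applied to rank vectors. The sketch below makes one assumption that should be checked against your MATLAB release: it ranks the n coordinates within each vector, which is the reading under which the mean rank equals (n+1)/2. The vectors are hypothetical.

```matlab
xs = [10 20 30 40];
yt = [1 3 2 4];

rs = tiedrank(xs);            % ranks of the coordinates of xs: [1 2 3 4]
rt = tiedrank(yt);            % ranks of the coordinates of yt: [1 3 2 4]

n    = numel(xs);
rbar = (n + 1) / 2;           % mean rank, assuming ranks run over 1..n

rsc = rs - rbar;              % centered rank vectors
rtc = rt - rbar;
dSpearman = 1 - (rsc * rtc') / (sqrt(rsc * rsc') * sqrt(rtc * rtc'));
```

Because only ranks enter the formula, any monotonic rescaling of xs or yt (for example, taking logarithms of positive values) leaves the result unchanged.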
