Distance measures for Big Data

written 8.5 years ago by

teamques10 ★ 70k

Suppose we have a set of points, called a space. A distance measure on this space is a function d(x, y) that takes two points in the space as arguments and produces a real number, and satisfies the following axioms:

d(x, y) ≥ 0 (no negative distances).
d(x, y) = 0 if and only if x = y (distances are positive, except for the distance from a point to itself).
d(x, y) = d(y, x) (distance is symmetric).
d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).
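
As a rough illustration of the four axioms, the sketch below tests a candidate function on a small sample of points. The helper name check_axioms, the tolerance, and the sample values are assumptions made for this example. A finite check like this can only expose a violation; it cannot prove that a function is a distance measure on the whole space.

```python
from itertools import product

def check_axioms(d, points, tol=1e-12):
    """Return True if d satisfies the four axioms on every combination of sample points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                    # no negative distances
            return False
        if (d(x, y) <= tol) != (x == y):      # zero exactly when x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:      # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True

sample = [0.0, 1.0, 2.5, -3.0]
print(check_axioms(lambda a, b: abs(a - b), sample))    # True
print(check_axioms(lambda a, b: (a - b) ** 2, sample))  # False
```

The squared difference fails because it breaks the triangle inequality: with 0, 1 and 2.5, the direct value 6.25 exceeds 1 + 2.25.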

Types of Distance Measures:

Euclidean Distance

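The most familiar distance measure is the one we normally think of as “distance.” An n-dimensional Euclidean space is one where points are vectors of n real numbers. The Euclidean distance between two such points x = [x1, x2, ..., xn] and y = [y1, y2, ..., yn] is the square root of the sum of the squared differences in each dimension: d(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2).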
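
A minimal sketch of the usual (L2) Euclidean distance, which takes the square root of the summed squared differences between coordinates. The function name euclidean_distance and the use of plain Python lists are choices made for this example.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared difference in each dimension, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```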

Jaccard Distance

We define the Jaccard distance of sets by d(x, y) = 1 − SIM(x, y), where SIM(x, y) is the Jaccard similarity of x and y. That is, the Jaccard distance is 1 minus the ratio of the sizes of the intersection and union of sets x and y. We must verify that this function is a distance measure.
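
The definition can be sketched directly with Python sets. The function name jaccard_distance is chosen for this example, and treating two empty sets as being at distance 0 is an assumption, since the ratio is undefined when the union is empty.

```python
def jaccard_distance(x, y):
    """1 minus |intersection| / |union| of two collections treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Assumption: two empty sets are considered identical.
        return 0.0
    return 1.0 - len(x & y) / len(union)

# Intersection {3, 4} has size 2, union has size 6, so the distance is 1 - 2/6.
print(jaccard_distance({1, 2, 3, 4}, {3, 4, 5, 6}))
```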

Cosine Distance

The cosine distance between two points is the angle that the vectors to those points make. This angle will be in the range 0 to 180 degrees, regardless of how many dimensions the space has.
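
One way to compute this angle is to take the arc-cosine of the dot product divided by the product of the vector lengths. The sketch below assumes that approach; the function name cosine_distance and the choice to reject zero vectors (which have no direction) are assumptions for this example.

```python
import math

def cosine_distance(x, y):
    """Angle in degrees (0 to 180) between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.degrees(math.acos(cos_theta))

print(cosine_distance([1, 0], [0, 1]))   # perpendicular vectors, about 90 degrees
print(cosine_distance([1, 2], [2, 4]))   # same direction, about 0 degrees
```

Because only direction matters, a vector and any positive multiple of it are at angle 0.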

Edit Distance

This distance makes sense when points are strings. The distance between two strings x = x1x2···xn and y = y1y2···ym is the smallest number of insertions and deletions of single characters that will convert x to y.
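
With only insertions and deletions allowed, the edit distance equals the two string lengths added together minus twice the length of their longest common subsequence (LCS): every character outside the LCS is either deleted from x or inserted from y. The sketch below assumes that route, using a standard dynamic-programming table; the function name edit_distance and the example strings are chosen for illustration.

```python
def edit_distance(x, y):
    """Insert/delete-only edit distance, computed via the longest common subsequence."""
    # lcs[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    lcs = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            if cx == cy:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return len(x) + len(y) - 2 * lcs[len(x)][len(y)]

# LCS of "abcde" and "acfdeg" is "acde": delete b, insert f, insert g.
print(edit_distance("abcde", "acfdeg"))  # 3
```

Note that this differs from Levenshtein distance, which also allows substituting one character for another in a single step.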

Hamming Distance

Given a space of vectors, we define the Hamming distance between two vectors to be the number of components in which they differ. The Hamming distance cannot be negative, and if it is zero, then the vectors are identical. The distance does not depend on which of the two vectors we consider first.
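
Counting the differing components is a one-line comparison. In this sketch the function name hamming_distance is chosen for the example, and requiring both vectors to have the same length is an assumption, since the text does not say how to compare vectors of different lengths.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))

# The vectors differ in the 2nd, 4th and 5th components.
print(hamming_distance("10101", "11110"))  # 3
```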
