Similarity Measures

  • Author: Johannes Maucher

  • Last update: 15th February 2021

In many computer vision tasks it is necessary to compare the similarity of images and image regions. In the introduction to image features, global, subwindow and local image descriptors have been distinguished. Independent of the descriptor type, each descriptor is a numeric vector. Hence, in order to compare images or image regions we need methods to calculate the similarity of numeric vectors. There exists a wide range of different similarity measures; the most important of them are defined in this section.

Motivation: Given the following 3 vectors v1, v2 and v3, the question is:

Is v1 or v3 more similar to v2?

import numpy as np
np.set_printoptions(precision=3)
v1=np.array([1,5,1,1,1,7,1,1])
print("v1 = ",v1)
v2=np.array([1,1,5,1,1,1,7,1])
print("v2 = ",v2)
v3=np.array([7,5,1,1,1,1,1,1])
print("v3 = ",v3)
v1 =  [1 5 1 1 1 7 1 1]
v2 =  [1 1 5 1 1 1 7 1]
v3 =  [7 5 1 1 1 1 1 1]

The 3 vectors are visualized below:

from matplotlib import pyplot as plt
plt.figure(figsize=(10,12))
plt.subplot(3,1,1)
plt.bar(np.arange(0,8),v1,color="orange",alpha=0.5)
plt.title("Visualization of v1")
plt.subplot(3,1,2)
plt.bar(np.arange(0,8),v2,color="magenta",alpha=0.5)
plt.title("Visualization of v2")
plt.subplot(3,1,3)
plt.bar(np.arange(0,8),v3,color="cyan",alpha=0.5)
plt.title("Visualization of v3")
plt.show()
[Figure: bar-chart visualizations of v1, v2 and v3 (similarityMetrics_5_0.png)]

Similarity and Distance

The similarity between a pair of vectors \(\mathbf{u}=(u_0,u_1,\ldots,u_{N-1})\) and \(\mathbf{v}=(v_0,v_1,\ldots,v_{N-1})\), each of length \(N\), is denoted by

\[ s(\mathbf{u},\mathbf{v}). \]

The distance between the same pair of vectors is denoted by

\[ d(\mathbf{u},\mathbf{v}). \]

Distance and similarity are inverse to each other, i.e. the higher the distance, the lower the similarity, and vice versa. Below, the most important distance and similarity measures are described. Some of them are primarily defined as a distance metric (e.g. Euclidean distance), and a transformation function can be applied to compute the associated similarity. Others are primarily defined as a similarity metric (e.g. cosine similarity), and a transformation function can be applied to compute the associated distance.

In the context of this section the vectors represent histograms. However, the distance metrics described below can be applied to arbitrary numeric vectors.

We denote by

\[ H_q = \left(H_q(0),H_q(1),H_q(2),\ldots,H_q(N-1) \right) \]

the N-bin histogram of an image(-region) \(q\). The value \(H_q(i)\) is the frequency of the values in \(q\) that fall into the \(i\)-th bin.

Normalized Histogram:

A normalized representation \(H^n_q\) of histogram \(H_q\) can be obtained as follows:

(2)\[ H^n_q := \frac{1}{m_q} \left(H_q(0),H_q(1),H_q(2),\ldots,H_q(N-1) \right), \]

where

\[ m_q=\sum\limits_{i=0}^{N-1} H_q(i) \]
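
As a minimal illustration (reusing \(v1\) from the cells above), the normalization can be done directly with numpy; this is exactly the operation that is later applied before computing, e.g., the Bhattacharyya or Earth Mover distance:

v1norm = v1/np.sum(v1)   # divide each bin by the total count m_q
print(v1norm)            # the bins now sum up to 1
print(np.sum(v1norm))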

Common Distance- and Similarity Metrics

Euclidean Distance

The Euclidean distance between two histograms \(H_q\) and \(H_p\) is

\[ d_E(H_q,H_p)=\sqrt{\sum_i (H_q(i)-H_p(i))^2} \]

The corresponding similarity measure is

\[ s_E(H_q,H_p)=\frac{1}{d_E(H_q,H_p)+\epsilon}, \]

where \(\epsilon\) is a small value \(>0\), used to avoid zero-division.

Characteristics of the Euclidean distance \(d_E(H_q,H_p)\) are:

  • Value range: In general \([0,\infty[\).

  • All bins are weighted equally

  • Strong impact of outliers

  • Bin-by-bin comparison. Drawback: If two images are identical up to a small brightness change, then their histograms are similar up to a small shift along the x-axis (see \(v1\) and \(v2\) in the example above). However, this type of similarity cannot be measured by a bin-by-bin metric.
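
A quick sketch for \(v1\) and \(v2\) from above (the variable names and the value of \(\epsilon\) are chosen only for illustration); the explicit numpy formula and scipy's euclidean yield the same value:

from scipy.spatial.distance import euclidean
d12 = np.sqrt(np.sum((v1-v2)**2))     # explicit bin-by-bin formula
print(d12, euclidean(v1,v2))          # both approx. 10.198
eps = 1e-9
print("similarity:", 1.0/(d12+eps))   # derived similarity s_E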

Cosine Similarity

The Cosine similarity between two histograms \(H_q\) and \(H_p\) is

\[ s_C(H_q,H_p)= \frac{\sum_i H_q(i)\cdot H_p(i)}{\sqrt{\sum_i H_q(i)^2 \cdot \sum_i H_p(i)^2}} \]

The corresponding distance measure is

\[ d_C(H_q,H_p)=1-s_C(H_q,H_p) \]

Characteristics of the Cosine similarity \(s_C(H_q,H_p)\):

  • Value range: \([-1,1]\)

  • Bin-by-bin comparison

  • Regards the angle between the vectors: two vectors are maximally similar if they point in the same direction (the angle between them is 0°).
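
A quick sketch for \(v1\) and \(v2\) from above; note that scipy's cosine function returns the cosine distance \(d_C = 1 - s_C\), not the similarity:

from scipy.spatial.distance import cosine
sC = np.dot(v1,v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))  # cosine similarity
print("similarity:", sC)             # approx. 0.35
print("distance  :", cosine(v1,v2))  # 1 - similarity, approx. 0.65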

Pearson Correlation

The Pearson correlation (similarity measure) between two histograms \(H_q\) and \(H_p\) is

\[ s_P(H_q,H_p)= \frac{\sum_i (H_q(i)-\mu_q)\cdot (H_p(i)-\mu_p)}{\sqrt{\sum_i (H_q(i)-\mu_q)^2 \cdot \sum_i (H_p(i)-\mu_p)^2}}, \]

where

\[ \mu_q = \frac{1}{N} \sum\limits_{i=0}^{N-1} H_q(i) \]

is the mean value over all bins in \(H_q\).

The corresponding distance measure is

\[ d_P(H_q,H_p)=1-s_P(H_q,H_p) \]

Characteristics of the Pearson correlation \(s_P(H_q,H_p)\):

  • Value range: \([-1,1]\)

  • Bin-by-bin comparison

  • Ignores different offsets and is a measure of linear dependence
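
A quick sketch for \(v1\) and \(v2\) from above; scipy's correlation function directly returns the distance \(d_P = 1 - s_P\):

from scipy.spatial.distance import correlation
mu1, mu2 = np.mean(v1), np.mean(v2)
sP = np.sum((v1-mu1)*(v2-mu2))/np.sqrt(np.sum((v1-mu1)**2)*np.sum((v2-mu2)**2))
print("Pearson correlation :", sP)                  # approx. -0.316
print("correlation distance:", correlation(v1,v2))  # 1 - s_P, approx. 1.316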

Bray-Curtis Distance

The Bray-Curtis distance between two histograms \(H_q\) and \(H_p\) is

\[ d_{BC}(H_q,H_p)=\frac{\sum_i |H_q(i)-H_p(i)|}{\sum_i|H_q(i)+H_p(i)|} \]

The corresponding similarity measure is

\[ s_{BC}(H_q,H_p)=1-d_{BC}(H_q,H_p) \]

Characteristics:

  • Value range: \([0,1]\) for positive signals as in the case of histograms

  • Bin-by-bin comparison

  • More robust w.r.t. outliers than euclidean distance

  • Each bin difference is weighted equally
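
A quick sketch for \(v1\) and \(v2\) from above; scipy's braycurtis matches the explicit formula:

from scipy.spatial.distance import braycurtis
dBC = np.sum(np.abs(v1-v2))/np.sum(np.abs(v1+v2))  # explicit formula
print(dBC, braycurtis(v1,v2))                      # both approx. 0.556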

Canberra Distance

The Canberra distance between two histograms \(H_q\) and \(H_p\) is

\[ d_{Ca}(H_q,H_p)=\sum_i\frac{|H_q(i)-H_p(i)|}{|H_q(i)|+|H_p(i)|} \]

A corresponding similarity measure (normalized by the number of bins \(N\), so that it lies in \([0,1]\)) is

\[ s_{Ca}(H_q,H_p)=1-\frac{d_{Ca}(H_q,H_p)}{N} \]

Characteristics:

  • Value range: \([0,N]\), where \(N\) is the number of bins (each summand is at most 1)

  • Bin-by-bin comparison

  • Compared to Bray-Curtis, each bin difference is now weighted individually.
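
A quick sketch for \(v1\) and \(v2\) from above; the result exceeds 1, which illustrates the \([0,N]\) value range:

from scipy.spatial.distance import canberra
dCa = np.sum(np.abs(v1-v2)/(np.abs(v1)+np.abs(v2)))  # per-bin weighted differences
print(dCa, canberra(v1,v2))                          # both approx. 2.833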

Bhattacharyya Distance

The Bhattacharyya distance between two normalized histograms \(H^n_q\) and \(H^n_p\) is

\[ d_{Ba}(H^n_q,H^n_p)= \sqrt{1-\sum_i \sqrt{H^n_q(i) \cdot H^n_p(i)}} \]

Characteristics:

  • Value range: \([0,1]\)

  • Bin-by-bin comparison

  • Requires normalized inputs

  • Statistically motivated for measuring the similarity between probability distributions. Here it is applied to the univariate case, but it is also applicable to multivariate distributions.
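
A minimal numpy sketch (there is no dedicated scipy function for this metric), using \(v1\) and \(v3\) from above; the histograms are normalized first, as required:

v1n, v3n = v1/np.sum(v1), v3/np.sum(v3)
BC = np.sum(np.sqrt(v1n*v3n))   # Bhattacharyya coefficient
dBa = np.sqrt(1-BC)
print(dBa)                      # approx. 0.388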

Chi-Square Distance

The Chi-Square (\(\chi^2\)) distance between two histograms \(H_q\) and \(H_p\) is

\[ d_{\chi^2}(H_q,H_p)=\sum_i \frac{(H_q(i)-H_p(i))^2}{H_q(i)+H_p(i)} \]

The corresponding similarity measure is

\[ s_{\chi^2}(H_q,H_p)=\frac{1}{d_{\chi^2}(H_q,H_p)+\epsilon}, \]

where \(\epsilon\) is a small value \(>0\), used to avoid zero-division.

Characteristics:

  • Value range: \([0,\infty[\)

  • Bin-by-bin comparison

  • Each bin is weighted individually

  • Statistically motivated for measuring the similarity between probability distributions. Here it is applied to the univariate case, but it is also applicable to multivariate distributions.
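
A minimal numpy sketch for \(v1\) and \(v2\) from above; bins in which both histograms are zero must be excluded to avoid division by zero (here all bins are non-zero):

mask = (v1+v2) > 0   # skip empty bins
dChi2 = np.sum((v1[mask]-v2[mask])**2/(v1[mask]+v2[mask]))
print(dChi2)         # approx. 14.333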

Histogram Intersection

The Intersection (histogram intersection) between two histograms \(H_q\) and \(H_p\) is a similarity measure:

\[ s_{IS}(H_q,H_p)=\sum_i \min(H_q(i),H_p(i)) \]

The corresponding distance measure is

\[ d_{IS}(H_q,H_p)=\frac{1}{s_{IS}(H_q,H_p)+\epsilon}, \]

where \(\epsilon\) is a small value \(>0\), used to avoid zero-division.

Characteristics:

  • Histograms shall be normalized

  • Value range of \(s_{IS}\): \([0,1]\) for normalized histograms

  • Bin-by-bin comparison

  • Measures intersection of both histograms

  • Introduced in [SB91] for color histogram comparison
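
A minimal numpy sketch for \(v1\) and \(v2\) from above; the sum of the bin-wise minima is the similarity, and its inverse yields a distance (this mirrors the mindist helper used in the comparison at the end of this section, which uses \(\epsilon=1\)):

sIS = np.sum(np.minimum(v1,v2))   # intersection similarity
dIS = 1.0/(1+sIS)                 # derived distance, with eps = 1
print(sIS, dIS)                   # 8 and approx. 0.111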

Earth Mover Distance (Wasserstein Distance)

All of the distance measures introduced so far ignore shift-similarity. Shift-similar means that one histogram is a shifted version of the other. This is actually the case for vectors \(v1\) and \(v2\) in the example above. The previously defined distance measures compare bin-by-bin. Therefore, they are not able to detect shift-similarity.

The Earth Mover Distance (EMD) has been introduced in [RubnerTomasiGuibas98]. It is not a bin-by-bin measure and it is capable of detecting shift-similarity. The drawback of the EMD is that it is quite complex to calculate.

The EMD between two histograms measures the amount of work that is necessary to transform one histogram into the other, where work is the product of

\[ \mbox{amount of pieces that must be shifted} \cdot \mbox{distance of corresponding shift}. \]

The idea is sketched in the pictures below:

The EMD requires the two histograms to be normalized, i.e. both histograms contain the same amount of earth.

\[ Work = \sum (\mbox{ distance moved }) \cdot (\mbox{ amount moved }) \]
(3)\[ Work = \sum\limits_{a=1}^m \sum\limits_{b=1}^n d_{a,b} \cdot f_{a,b} \]

where \(m\) is the number of bins in histogram \(H_p\), \(n\) is the number of bins in histogram \(H_q\), \(f_{a,b}\) is the amount that is moved from bin \(a\) of \(H_p\) to bin \(b\) of \(H_q\), and \(d_{a,b}\) is the distance between these two bins.

The Earth Mover Distance \(d_{emd}(H_q,H_p)\) between histogram \(H_q\) and histogram \(H_p\) is the minimum possible amount of work required to convert \(H_p\) into \(H_q\), divided by the sum over all \(f_{a,b}\):

(4)\[ d_{emd}(H_q,H_p) = \frac{\sum\limits_{a=1}^m \sum\limits_{b=1}^n d_{a,b} \cdot f_{a,b}}{\sum\limits_{a=1}^m \sum\limits_{b=1}^n f_{a,b}}, \]

where the flows \(f_{a,b}\) are chosen such that the total work is minimal.

The EMD is also called Wasserstein-Distance (see Wikipedia). A nice example for calculating the EMD between two 2-dimensional distributions can be found here.

Calculation of EMD (Wasserstein-Distance) with scipy

In order to calculate the Wasserstein-distance, the scipy method wasserstein_distance can be applied. This function calculates a variant of the EMD:

  1. The histograms are internally normalized, such that in both histograms the sum over all bins is 1 (as in equation (2)).

  2. The work is calculated as in equation (3), i.e. it is not normalized as in equation (4).

Below, we apply this scipy-method to calculate the Wasserstein-distance between \(v1\), \(v2\) and \(v3\):

print("v1 = ",v1)
print("v2 = ",v2)
print("v3 = ",v3)
v1 =  [1 5 1 1 1 7 1 1]
v2 =  [1 1 5 1 1 1 7 1]
v3 =  [7 5 1 1 1 1 1 1]

In order to calculate the EMD, the vectors \(v1\), \(v2\) and \(v3\) alone are not sufficient, because these vectors contain only the frequencies of 8 different values; the values themselves must also be defined. Here, we assume that

  • the first frequency (1 in \(v1\)) is the frequency of the value \(0\) in the first distribution

  • the second frequency (5 in \(v1\)) is the frequency of the value \(1\) in the first distribution

  • the last frequency (1 in \(v1\)) is the frequency of the value \(7\) in the first distribution

We assume that for all 3 distributions we have the same set of 8 values and define them below:

values=list(range(8))
values
[0, 1, 2, 3, 4, 5, 6, 7]

Then the pairwise Wasserstein-distances between \(v1\), \(v2\) and \(v3\) can be calculated as follows:

from scipy.stats import wasserstein_distance
dw12=wasserstein_distance(values,values,v1,v2)
print("Wasserstein distance between v1 and v2 : %2.3f"%dw12)
dw13=wasserstein_distance(values,values,v1,v3)
print("Wasserstein distance between v1 and v3 : %2.3f"%dw13)
dw23=wasserstein_distance(values,values,v2,v3)
print("Wasserstein distance between v2 and v3 : %2.3f"%dw23)
Wasserstein distance between v1 and v2 : 0.556
Wasserstein distance between v1 and v3 : 1.667
Wasserstein distance between v2 and v3 : 2.222

Let’s check the internal calculation by the example of \(v1\) and \(v2\):

  • As mentioned above, the scipy-method internally first normalizes the histograms:

v1norm=1/np.sum(v1)*v1
v1norm
array([0.056, 0.278, 0.056, 0.056, 0.056, 0.389, 0.056, 0.056])
v2norm=1/np.sum(v2)*v2
v2norm
array([0.056, 0.056, 0.278, 0.056, 0.056, 0.056, 0.389, 0.056])
  • Next, in order to transform the normalized \(v1\) into the normalized \(v2\),

    1. an amount of \(4/18\) must be shifted over a distance of \(1\) from bin 1 to bin 2.

    2. an amount of \(6/18\) must be shifted over a distance of \(1\) from bin 5 to bin 6.

  • The calculated Wasserstein distance is then

\[ \frac{4}{18} \cdot 1 + \frac{6}{18} \cdot 1 = \frac{10}{18} \approx 0.556 \]
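
The same value can also be verified via a well-known property of the one-dimensional Wasserstein distance: for two normalized histograms on the same, unit-spaced support it equals the L1 distance between their cumulative sums (a quick check, reusing v1norm and v2norm from above):

print(np.sum(np.abs(np.cumsum(v1norm)-np.cumsum(v2norm))))   # approx. 0.556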

Application of different Distance Metrics

Below, we will apply all the distance measures introduced above to the pairwise similarity determination between our example vectors \(v1\), \(v2\) and \(v3\).

The scipy package scipy.spatial.distance contains methods for calculating the Euclidean, Pearson-correlation, Canberra and Bray-Curtis distance. As already mentioned, the EMD is implemented as wasserstein_distance() in scipy.stats. The remaining distance measures, as well as a corresponding normalization function, are defined in the following Python code.

def normalize(a):
    # scale histogram a, such that its bins sum up to 1
    s=np.sum(a)
    return a/s

def mindist(a,b):
    # distance derived from the histogram-intersection similarity sum(min(a,b))
    return 1.0/(1+np.sum(np.minimum(a,b)))

def bhattacharyya(a,b):
    # this is actually the Hellinger distance, which is a modification of Bhattacharyya
    # Implementation according to http://en.wikipedia.org/wiki/Bhattacharyya_distance
    anorm=normalize(a)
    bnorm=normalize(b)
    BC=np.sum(np.sqrt(anorm*bnorm))  # Bhattacharyya coefficient
    if BC > 1:
        # numerical round-off can push BC slightly above 1
        return 0
    else:
        return np.sqrt(1-BC)

def chi2(a,b):
    # chi-square distance; bins in which both histograms are zero are excluded
    idx=np.nonzero(a+b)
    af=a[idx]
    bf=b[idx]
    return np.sum((af-bf)**2/(af+bf))
from scipy.spatial.distance import *
from scipy.stats import wasserstein_distance
import pandas as pd
methods=[euclidean,correlation,canberra,braycurtis,mindist,bhattacharyya,chi2,wasserstein_distance]
methodName=["euclidean","correlation","canberra","braycurtis","intersection","bhattacharyya","chi2","emd"]
distDF=pd.DataFrame(index=methodName,columns=["v1-v2", "v1-v3", "v2-v3"])
values = np.arange(0,len(v1))  # bin values, required only by wasserstein_distance
for j,m in enumerate(methods):
    if m is wasserstein_distance:
        # wasserstein_distance has a different signature: it also expects the bin values
        distDF.loc[methodName[j],"v1-v2"]=wasserstein_distance(values,values,v1,v2)
        distDF.loc[methodName[j],"v1-v3"]=wasserstein_distance(values,values,v1,v3)
        distDF.loc[methodName[j],"v2-v3"]=wasserstein_distance(values,values,v2,v3)
    else:
        distDF.loc[methodName[j],"v1-v2"]=m(v1,v2)
        distDF.loc[methodName[j],"v1-v3"]=m(v1,v3)
        distDF.loc[methodName[j],"v2-v3"]=m(v2,v3)
distDF
                   v1-v2     v1-v3      v2-v3
euclidean      10.198039  8.485281  10.198039
correlation     1.316456  0.911392   1.316456
canberra        2.833333  1.500000   2.833333
braycurtis      0.555556  0.333333   0.555556
intersection    0.111111  0.076923   0.111111
bhattacharyya   0.485132  0.387907   0.485132
chi2           14.333333  9.000000  14.333333
emd             0.555556  1.666667   2.222222

The absolute values of different distance measures cannot be compared with each other. However, for each distance measure we can analyse the relative distances.

As can be seen, all distance measures except EMD calculate the same distance between \(v1\) and \(v2\) as between \(v2\) and \(v3\). This is because these distance measures perform a bin-by-bin comparison, and these bin-by-bin comparisons are identical for both vector pairs. Moreover, all metrics except EMD find that \(v1\) is closer to \(v3\) than to \(v2\). This contradicts the subjective assessment: subjectively one would say that \(v2\) is closer to \(v1\), because \(v2\) is just a one-step right-shift of \(v1\).

As will be demonstrated in the next section, the performance of, e.g., a CBIR system strongly depends on the choice of the distance metric. Which metric performs best for a given task and given data cannot be determined in advance; it must be determined empirically.