Variance and the sum of squared pairwise differences

The variance $\sigma^2$ of a set of $n$ values $x_1, x_2, ..., x_n$ is usually expressed in terms of squared differences between those values and the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ of those values.

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
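As a concrete reference point, the definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `mean` and `population_variance` and the sample data are our own choices, and we assume integer inputs so that `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean of integer values, computed exactly."""
    return Fraction(sum(values), len(values))

def population_variance(values):
    """Average squared deviation from the mean (divides by n, not n - 1)."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / len(values)

sample = [1, 2, 4, 7]  # illustrative data, not from the text
print(mean(sample))                 # 7/2
print(population_variance(sample))  # 21/4
```

Here the squared deviations sum to 21, so the variance is 21/4.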

However, the sum of squared differences $(x_i - \bar{x})^2$ between the values and the mean can also be expressed in terms of the sum of squared pairwise differences $(x_i - x_j)^2$ among the values themselves, without reference to the mean $\bar{x}$.

In particular, we want to show that

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2$.

To get an expression involving $\bar{x}$, we subtract and add $\bar{x}$ inside the squared difference in the right-hand sum, which leaves its value unchanged, and then expand the result:

$\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} [(x_i - \bar{x}) - (x_j - \bar{x})]^2$

$= \sum_{i=1}^{n} \sum_{j=1}^{n} [(x_i - \bar{x})^2 - 2 (x_i - \bar{x}) (x_j - \bar{x}) + (x_j - \bar{x})^2]$

$= \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x})^2 - 2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) (x_j - \bar{x}) + \sum_{i=1}^{n} \sum_{j=1}^{n} (x_j - \bar{x})^2$

Since the squared difference in the first term does not depend on $j$, the inner sum simply adds the same quantity $n$ times, and the first term can be rewritten as

$\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} n (x_i - \bar{x})^2 = n \sum_{i=1}^{n} (x_i - \bar{x})^2$

Since the squared difference in the third term does not depend on $i$, the third term can be rewritten as

$\sum_{i=1}^{n} \sum_{j=1}^{n} (x_j - \bar{x})^2 = n \sum_{j=1}^{n} (x_j - \bar{x})^2 = n \sum_{i=1}^{n} (x_i - \bar{x})^2$

where in the last step we renamed the summation index $j$ to $i$, which does not change the value of the sum. So the third term is identical to the first term.
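The simplification of the first and third terms can be checked numerically. The following is a minimal sketch on made-up integer data; the variable names are our own, and `Fraction` is used only to avoid floating-point rounding.

```python
from fractions import Fraction

values = [1, 2, 4, 7]  # illustrative data, not from the text
n = len(values)
x_bar = Fraction(sum(values), n)

# Single sum of squared deviations from the mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in values)

# First term: the summand depends only on the outer index i.
first_term = sum((xi - x_bar) ** 2 for xi in values for xj in values)

# Third term: the summand depends only on the inner index j.
third_term = sum((xj - x_bar) ** 2 for xi in values for xj in values)

print(sum_sq_dev)  # 21
print(first_term)  # 84
print(third_term)  # 84
```

Both double sums equal $n = 4$ times the single sum, as the derivation says.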

We now turn to the second term, $-2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) (x_j - \bar{x})$. We can bring the difference $(x_i - \bar{x})$ out of the inner sum, since it does not depend on the index $j$. This gives us

$-2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) (x_j - \bar{x}) = -2 \sum_{i=1}^{n} (x_i - \bar{x}) [\sum_{j=1}^{n} (x_j - \bar{x})]$

The sum $\sum_{j=1}^{n} (x_j - \bar{x})$ can then be rewritten as

$\sum_{j=1}^{n} (x_j - \bar{x}) = \sum_{j=1}^{n} x_j - \sum_{j=1}^{n} \bar{x}$

$= \sum_{j=1}^{n} x_j - n \bar{x}$

By definition $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$, so $\sum_{j=1}^{n} x_j = n \bar{x}$, and we then have

$\sum_{j=1}^{n} (x_j - \bar{x}) = \sum_{j=1}^{n} x_j - n \bar{x} = n \bar{x} - n \bar{x} = 0$

We can then substitute this result into the second term as follows:

$-2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) (x_j - \bar{x}) = -2 \sum_{i=1}^{n} (x_i - \bar{x}) [\sum_{j=1}^{n} (x_j - \bar{x})]$

$= -2 \sum_{i=1}^{n} (x_i - \bar{x}) \cdot 0 = -2 \sum_{i=1}^{n} 0 = 0$
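The vanishing of the second term can also be seen on a small example. This sketch again assumes made-up integer data and uses exact rational arithmetic; the names `deviations`, `deviation_sum`, and `cross_term` are ours.

```python
from fractions import Fraction

values = [1, 2, 4, 7]  # illustrative data, not from the text
n = len(values)
x_bar = Fraction(sum(values), n)

# Deviations from the mean: -5/2, -3/2, 1/2, 7/2.
deviations = [x - x_bar for x in values]

# The deviations always sum to zero.
deviation_sum = sum(deviations)

# Second term of the expansion, evaluated as a full double sum.
cross_term = -2 * sum(di * dj for di in deviations for dj in deviations)

print(deviation_sum)  # 0
print(cross_term)     # 0
```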

Now that we know the value of all three terms we have

$\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2$

$= \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x})^2 - 2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) (x_j - \bar{x}) + \sum_{i=1}^{n} \sum_{j=1}^{n} (x_j - \bar{x})^2$

$= n \sum_{i=1}^{n} (x_i - \bar{x})^2 + 0 + n \sum_{i=1}^{n} (x_i - \bar{x})^2$

$= 2n \sum_{i=1}^{n} (x_i - \bar{x})^2$

so that

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2$

which is what we set out to prove.
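A quick numerical check of the identity may be reassuring. The sketch below compares both sides for a few integer data sets; the function names `sum_squared_deviations` and `pairwise_form` are hypothetical helpers of ours, and integer inputs are assumed so that `Fraction` gives exact results.

```python
from fractions import Fraction

def sum_squared_deviations(values):
    """Left-hand side: sum of squared deviations from the mean."""
    x_bar = Fraction(sum(values), len(values))
    return sum((x - x_bar) ** 2 for x in values)

def pairwise_form(values):
    """Right-hand side: double sum of squared pairwise differences over 2n."""
    n = len(values)
    total = sum((xi - xj) ** 2 for xi in values for xj in values)
    return Fraction(total, 2 * n)

values = [1, 2, 4, 7]  # illustrative data, not from the text
print(sum_squared_deviations(values))  # 21
print(pairwise_form(values))           # 21
```

For these values the double sum of squared pairwise differences is 168, and dividing by $2n = 8$ gives 21. Note that the pairwise form never computes the mean.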

However, we can further simplify this identity. Since $(x_i - x_j) = 0$ when $i = j$ and $(x_i - x_j)^2 = (x_j - x_i)^2$, each pair with $i \neq j$ is counted twice in the double sum, so we can consider only differences with $i < j$ (i.e., elements above the diagonal, if we consider the pairwise comparisons to form a matrix):