- Let:
  - $C_{i,1}, C_{i,2}, \ldots, C_{i,k}$ be the centroids (medians) that achieve the minimum cost $\mathrm{cost}(X_i, X_i)$, for $i = 1, 2, \ldots, m$
  - $C^*_1, C^*_2, \ldots, C^*_k$ be the centroids (medians) that achieve the minimum cost $\mathrm{cost}(S, S)$
- The following picture tries to make it less abstract:

  [Figure: the red dots are the elements of $X_1$, the blue dots are the elements of $X_2$, and all dots together are the elements of $S$.]
- We do not need to compute the cost expression exactly; we will instead bound it using Theorem 1.
- In order to use Theorem 1, we first compute the cost of the set of weighted centers $X^*$ when the centers are restricted to the points $C^*_1, C^*_2, \ldots, C^*_k$:

  $$f(X^*, C^*_1, C^*_2, \ldots, C^*_k) = \sum_{C_{i,j} \in X^*} \Bigl( \min_{h = 1, 2, \ldots, k} w_{i,j} \times d(C_{i,j}, C^*_h) \Bigr) \qquad (1)$$
- The input set $X^*$ consists of the cluster centers $C_{i,j}$
- The minimizing point $C^*_h$ is the centroid in the optimal solution for $S$ (the one achieving $\mathrm{cost}(S, S)$) that is closest to the cluster center $C_{i,j}$ (because we minimize over $C^*_1, C^*_2, \ldots, C^*_k$)
- $w_{i,j}$ = number of elements $x \in X_i$ associated with the center $C_{i,j}$ (= the weight of the centroid $C_{i,j}$ in $X_i$)
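As a rough illustration of Equation (1), the sketch below evaluates the weighted cost of a small set of weighted centers against a fixed set of candidate centers. The one-dimensional points, the absolute-difference distance, and the names `weighted_cost`, `x_star` and `c_star` are assumptions made for this example only; they do not come from the notes.

```python
def weighted_cost(weighted_points, centers):
    """Sum over weighted points of weight * distance to the nearest center.

    This mirrors Equation (1), with d(a, b) = |a - b| on the real line.
    """
    return sum(w * min(abs(p - c) for c in centers) for p, w in weighted_points)


# Hypothetical X*: (center C_{i,j}, weight w_{i,j}) pairs.
x_star = [(1.0, 3), (10.0, 2), (4.0, 4)]

# Hypothetical C*_1, C*_2 (k = 2).
c_star = [2.0, 9.0]

total = weighted_cost(x_star, c_star)
print(total)
```

Each weighted center contributes its weight times the distance to the nearest of the candidate centers, which is exactly the term inside the sum of Equation (1).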
- Define the following notation to simplify the expression:
  - $c(x)$ = the closest of $C_{i,1}, C_{i,2}, \ldots, C_{i,k}$ to element $x$, for $x \in X_i$ (i.e., $c(x)$ is the center associated with the data point $x$ in the restricted set $X_i$)
  - $C^*(x)$ = the closest of $C^*_1, C^*_2, \ldots, C^*_k$ to element $x$ (i.e., $C^*(x)$ is the center associated with the data point $x$ in the full set $S$)
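The two maps $c(x)$ and $C^*(x)$ can be sketched as nearest-center lookups. In the snippet below the partitions, the local centers and the global centers are made-up values chosen only to show the lookup; they are not claimed to be optimal medians, and the helper names (`nearest`, `c`, `c_star`) are hypothetical.

```python
def nearest(x, centers):
    """Return the center closest to x under d(a, b) = |a - b|."""
    return min(centers, key=lambda center: abs(x - center))


# Hypothetical partition X_1, X_2 of S and centers C_{i,j} for each part.
partitions = {1: [0.0, 1.0, 2.0, 9.0], 2: [3.0, 10.0, 11.0]}
local_centers = {1: [1.0, 9.0], 2: [3.0, 10.0]}

# Hypothetical centers C*_1, C*_2 for the full set S.
global_centers = [2.0, 10.0]


def c(x, i):
    """c(x): the center of partition X_i that x is associated with."""
    return nearest(x, local_centers[i])


def c_star(x):
    """C*(x): the center of the full-set solution that x is associated with."""
    return nearest(x, global_centers)


for i, part in partitions.items():
    for x in part:
        print(x, c(x, i), c_star(x))
```

Note that `c` needs the partition index `i`, because a point is only ever compared against the centers of its own partition, whereas `c_star` always looks at the same $k$ centers.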
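Using these 2 definitions, Equation (1) can be re-written and bounded as follows:
- The original Equation (1):

  $$f(X^*, C^*_1, C^*_2, \ldots, C^*_k) = \sum_{C_{i,j} \in X^*} \Bigl( \min_{h = 1, 2, \ldots, k} w_{i,j} \times d(C_{i,j}, C^*_h) \Bigr) \qquad (1)$$
- Since $w_{i,j}$ is the number of elements $x$ with $c(x) = C_{i,j}$, the term $w_{i,j} \times \min_h d(C_{i,j}, C^*_h)$ equals $\sum_{\{x :\, c(x) = C_{i,j}\}} \min_h d(c(x), C^*_h)$. Because $C^*(x)$ is one of the points $C^*_h$, each minimum is at most $d(c(x), C^*(x))$, so:

  $$f(X^*, C^*_1, C^*_2, \ldots, C^*_k) \le \sum_{C_{i,j} \in X^*} \Bigl( \sum_{\{x :\, c(x) = C_{i,j}\}} d(c(x), C^*(x)) \Bigr) \qquad (2)$$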
- Fact (triangle inequality):

  $$d(c(x), C^*(x)) \le d(x, c(x)) + d(x, C^*(x))$$

Hence:

  $$\sum_{\{x :\, c(x) = C_{i,j}\}} d(c(x), C^*(x)) \le \sum_{\{x :\, c(x) = C_{i,j}\}} \bigl( d(x, c(x)) + d(x, C^*(x)) \bigr)$$
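Therefore:

  $$f(X^*, C^*_1, C^*_2, \ldots, C^*_k) \le \sum_{C_{i,j} \in X^*} \Bigl( \sum_{\{x :\, c(x) = C_{i,j}\}} \bigl( d(x, c(x)) + d(x, C^*(x)) \bigr) \Bigr)$$

Summing over the elements that belong to each center, and then over all centers, is the same as summing over all elements of the original set $S$. So we can replace the double sum $\sum_{C_{i,j} \in X^*} \sum_{\{x :\, c(x) = C_{i,j}\}}$ by $\sum_{x \in S}$:

  $$f(X^*, C^*_1, C^*_2, \ldots, C^*_k) \le \sum_{x \in S} \bigl( d(x, c(x)) + d(x, C^*(x)) \bigr) = \sum_{x \in S} d(x, c(x)) + \sum_{x \in S} d(x, C^*(x))$$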
Since $c(x)$ is the center associated with the data point $x$ in its restricted set $X_i$, and $C^*(x)$ is the center associated with $x$ in the full set $S$, we have that (here $C$ and $C^*$ denote costs, not centers):

  $$\sum_{x \in S} d(x, c(x)) = \text{cost of the solution using the partitioned sets} = C$$

  $$\sum_{x \in S} d(x, C^*(x)) = \text{cost of the solution using the complete set} = C^*$$
Therefore:

  $$f(X^*, C^*_1, C^*_2, \ldots, C^*_k) \le C + C^* \qquad (3)$$
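Inequality (3) can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: the points are made-up integers on a line, the distance is the absolute difference, and the medians are found by brute force over all $k$-subsets (the helper `best_medians` is hypothetical and only practical for tiny inputs).

```python
from itertools import combinations


def dist(a, b):
    return abs(a - b)


def cost(points, centers):
    """Sum of distances from each point to its nearest center."""
    return sum(min(dist(x, c) for c in centers) for x in points)


def best_medians(points, k):
    """Brute force: the k centers chosen from `points` with minimum cost."""
    return min(combinations(points, k), key=lambda cs: cost(points, cs))


k = 2
partitions = [[0, 1, 2, 9, 10], [3, 4, 20, 21, 22]]  # made-up X_1, X_2
S = [x for part in partitions for x in part]

x_star = []  # weighted centers (C_{i,j}, w_{i,j})
C = 0        # total cost of the solutions on the partitioned sets
for part in partitions:
    centers = best_medians(part, k)
    C += cost(part, centers)
    for cj in centers:
        # w_{i,j}: number of points of this partition assigned to cj
        w = sum(1 for x in part
                if min(centers, key=lambda c: dist(x, c)) == cj)
        x_star.append((cj, w))

global_centers = best_medians(S, k)
C_opt = cost(S, global_centers)  # the cost called C* in the text

# Left-hand side of (3): weighted cost of X* against the centers C*_h.
f = sum(w * min(dist(c, g) for g in global_centers) for c, w in x_star)

print(f, C, C_opt, f <= C + C_opt)
```

The final comparison holds for any choice of points, since it relies only on the triangle inequality; the brute-force search is there just to produce concrete centers to plug in.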
- Now we can apply Theorem 1 (the set $Q$ is $\{C^*_1, C^*_2, \ldots, C^*_k\}$):

  $$f(X^*, X^*) \le 2 \times f(X^*, C^*_1, C^*_2, \ldots, C^*_k) \le 2(C + C^*)$$

  where the last step uses inequality (3).
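The way Theorem 1 is used here (choosing the centers from the weighted points themselves costs at most twice as much as using an arbitrary center set $Q$ of the same size) can be illustrated on a small weighted set. The data, the absolute-difference distance and the names below are assumptions for this sketch; the statement of Theorem 1 is taken from how it is applied above, not restated from its original formulation.

```python
from itertools import combinations


def weighted_cost(weighted_points, centers):
    """Sum of weight * distance to the nearest center, with d(a, b) = |a - b|."""
    return sum(w * min(abs(p - c) for c in centers) for p, w in weighted_points)


# Hypothetical weighted set X*: (point, weight) pairs.
x_star = [(1, 3), (9, 2), (3, 2), (21, 3)]

# Arbitrary center set Q; its points need not belong to X*.
Q = [4, 20]
k = len(Q)

# f(X*, X*): best k centers chosen among the points of X* (brute force).
points = [p for p, _ in x_star]
best = min(combinations(points, k), key=lambda cs: weighted_cost(x_star, cs))
restricted = weighted_cost(x_star, best)

# f(X*, Q): cost when the centers are the given set Q.
unrestricted = weighted_cost(x_star, Q)

print(restricted, unrestricted, restricted <= 2 * unrestricted)
```

In the derivation, $Q$ is the set $\{C^*_1, \ldots, C^*_k\}$, so the right-hand side is then bounded further by inequality (3).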