Clustering problem on streaming data

Definitions and Notations

Definition: cost of k-medians

Given an instance (S, k) of the k-median problem, where S is a set of data points and k is the number of medians to choose,
with a distance metric d(·,·)
The cost of the k medians C₁, C₂, ..., C_k is:

f(S, C₁, C₂, ..., C_k) = ∑_{x ∈ S} ( min_{i = 1, 2, ..., k} d(x, C_i) )
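The cost formula above can be evaluated directly. The following is a minimal sketch, assuming points are coordinate tuples and d is the Euclidean distance; the function name `kmedian_cost` and the sample data are made up for illustration.

```python
import math

def kmedian_cost(points, centers, d=math.dist):
    """f(S, C_1, ..., C_k): sum over all points of the distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) for x in points)

# Illustrative data: two tight groups on a line
S = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(kmedian_cost(S, centers))  # 2.0
```

Each point contributes only the distance to its closest center, so the two points sitting on a center contribute 0 and the other two contribute 1 each.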

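Continuous clustering: the medians may be any points of the metric space
Discrete clustering: the medians must be chosen from the data points in S
Definition: Cost(S, Q) = the minimum of f(S, C₁, C₂, ..., C_k) over all choices of medians C₁, C₂, ..., C_k ∈ Q
Cost of continuous clustering: Cost(S, Q), where Q is the whole metric space
Cost of discrete clustering: Cost(S, S)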

Clustering data in input streams

The problem of clustering input data from a stream is also handled by partitioning.

First, the authors present a clustering algorithm that has a very small space requirement:

Divide the input data into pieces (segments)
Cluster each piece separately
Then cluster all the centers of the resulting clusters in one batch

This algorithm is called Small-Space in the paper.

Next they present an incremental center clustering algorithm:

Divide the input data into pieces (segments)
Cluster each piece separately
Cluster the centers of the resulting clusters incrementally

This algorithm is called Smaller-Space in the paper.

The Small-Space Algorithm

Description of the Small-Space (S) Algorithm:

1. Divide the input stream into m disjoint pieces X₁, X₂, ..., X_m
2. For each X_i (i = 1, 2, ..., m), find O(k) centers (cluster centroids) in X_i
   (So we form O(k) clusters per piece)
   There will be O(k×m) centers in total
3. Let X' be the O(k×m) centers obtained in step 2
   Weight each center by the number of points in its cluster
4. Cluster X' to obtain k centers
The weighted centers are computed by using:

Analysis of the Small-Space Algorithm
- Before analyzing the Small-Space algorithm, we need some relationships between the discrete and the continuous clustering problems.

Theorem 1: bound of cost

Theorem 1

Given an instance of the k-median problem (S, k) and a set Q of candidate centers
Then:

Cost(S, S) ≤ 2 × Cost(S, Q)

Note: the intention is that Q is a larger set than S, so there are more possibilities to pick better centers...

Proof:

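Consider the optimal k-median solution with centroids restricted to points in Q: C₁, C₂, ..., C_k ∈ Q
The cost of this solution is:

Cost(S, Q) = ∑_{x ∈ S} ( min_{i = 1, 2, ..., k} d(x, C_i) )

Now replace each centroid C_i with the nearest data point C'_i ∈ S in its cluster
The cost of this solution is:

∑_{x ∈ S} ( min_{i = 1, 2, ..., k} d(x, C'_i) )

For a point x in the cluster of C_i, the distance d(x, C'_i) can be bounded by:

d(x, C'_i) ≤ d(x, C_i) + d(C_i, C'_i) ≤ 2 × d(x, C_i)

because:
- the first step is the triangle inequality
- C'_i is the data point in the cluster that is nearest to C_i, so d(C_i, C'_i) ≤ d(x, C_i)

Therefore:

Cost(S, S) ≤ ∑_{x ∈ S} ( min_{i = 1, 2, ..., k} d(x, C'_i) ) ≤ 2 × ∑_{x ∈ S} ( min_{i = 1, 2, ..., k} d(x, C_i) ) = 2 × Cost(S, Q)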
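The bound of Theorem 1 can be checked on a tiny instance by brute force. This is only a sketch: the function `cost`, the four corner points, and the extra candidate point (1, 1) are illustrative assumptions, and trying every combination of medians is feasible only for very small inputs.

```python
import math
from itertools import combinations

def cost(S, Q, k):
    """Cost(S, Q): best k-median cost of S when the medians must come from Q (brute force)."""
    return min(
        sum(min(math.dist(x, c) for c in centers) for x in S)
        for centers in combinations(Q, k)
    )

S = [(0, 0), (2, 0), (0, 2), (2, 2)]   # corners of a square
Q = S + [(1, 1)]                        # Q also contains the middle of the square
cost_SS = cost(S, S, 1)                 # median restricted to the data points
cost_SQ = cost(S, Q, 1)                 # median may also be the middle point
print(cost_SS, cost_SQ, cost_SS <= 2 * cost_SQ)
```

With the middle point available the cost drops, but restricting the median to a data point costs less than twice as much, as the theorem promises.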

Theorem 2: cost of the intermediate step to the Small-Space algorithm

Theorem 2

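Consider an arbitrary partition of the input data S into m sets: X₁, X₂, ..., X_m
Then:

∑_{i = 1, 2, ..., m} Cost(X_i, X_i) ≤ 2 × Cost(S, S)

where:
- Cost(X_i, X_i) is the cost of a k-median problem on the piece X_i
- Cost(S, S) is the cost of the k-median problem on the complete set S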

Proof:

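From Theorem 1 (with X_i as the data set and S as the set Q):

Cost(X_i, X_i) ≤ 2 × Cost(X_i, S)     for i = 1, 2, ..., m

Therefore:

∑_{i = 1, 2, ..., m} Cost(X_i, X_i) ≤ 2 × ( Cost(X₁, S) + Cost(X₂, S) + ... + Cost(X_m, S) )
≤ 2 × Cost(S, S)

(The last step holds because the medians that achieve Cost(S, S) are also candidate medians for every piece X_i, so the pieces together cost at most Cost(S, S).)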

At this point the algorithm holds k×m medians, k from each of X₁, X₂, ..., X_m...
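Theorem 2 can also be checked numerically. The sketch below assumes 1-D points with d(a, b) = |a − b|, random data with a fixed seed, and Cost(S, S) taken as the k-median cost on the full set; the function `cost` and all parameter values are illustrative choices, not part of the paper.

```python
import random
from itertools import combinations

def cost(X, Q, k):
    """Cost(X, Q): brute-force optimal k-median cost of X with medians drawn from Q."""
    return min(
        sum(min(abs(x - c) for c in centers) for x in X)
        for centers in combinations(Q, k)
    )

rng = random.Random(0)                      # fixed seed so the run is repeatable
k, m = 2, 3
pieces = [[rng.uniform(0, 100) for _ in range(4)] for _ in range(m)]
S = [x for X in pieces for x in X]          # the pieces partition S

lhs = sum(cost(X, X, k) for X in pieces)    # sum of Cost(X_i, X_i)
mid = sum(cost(X, S, k) for X in pieces)    # sum of Cost(X_i, S)
rhs = cost(S, S, k)                         # Cost(S, S)
print(lhs <= 2 * mid, mid <= rhs, lhs <= 2 * rhs)
```

The three comparisons correspond to the two steps of the proof and to the theorem itself.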

Intermezzo...

Let us review the Small-Space Algorithm:

1. Divide the input stream into m disjoint pieces X₁, X₂, ..., X_m
2. For each X_i (i = 1, 2, ..., m), find O(k) centers (cluster centroids) in X_i
   (So we form O(k) clusters per set X_i)

   There will be O(k×m) centers in total

   Theorems 1 and 2 relate to the cost of the solution at this point in the algorithm
3. Let X' be the O(k×m) centers obtained in step 2
   Weight each center by the number of points in its cluster
4. Cluster X' to obtain k centers
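The four steps reviewed above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it assumes 1-D points, splits the input into contiguous chunks, and uses a brute-force discrete k-median (the hypothetical helper `cluster`) where the paper would use an approximation algorithm.

```python
from itertools import combinations

def cluster(points, weights, k):
    """Brute-force weighted discrete k-median on 1-D points.
    Returns the chosen centers and the total weight assigned to each center."""
    k = min(k, len(points))
    def f(cs):
        return sum(w * min(abs(x - c) for c in cs) for x, w in zip(points, weights))
    best = min(combinations(points, k), key=f)
    assigned = {c: 0 for c in best}
    for x, w in zip(points, weights):
        assigned[min(best, key=lambda c: abs(x - c))] += w
    return list(assigned), list(assigned.values())

def small_space(S, k, m):
    size = -(-len(S) // m)                                     # ceil(len(S) / m)
    pieces = [S[i:i + size] for i in range(0, len(S), size)]   # step 1: m pieces
    centers, weights = [], []
    for X in pieces:
        cs, ws = cluster(X, [1] * len(X), k)                   # step 2: k centers per piece
        centers += cs                                          # step 3: X' = weighted centers
        weights += ws
    final, _ = cluster(centers, weights, k)                    # step 4: cluster X'
    return final

S = [0.0, 1.0, 2.0, 100.0, 101.0, 102.0, 1.0, 2.0, 3.0, 101.0, 102.0, 103.0]
final = small_space(S, k=2, m=2)
print(final)
```

Only one piece plus the k×m weighted centers need to be held at any time, which is where the space saving comes from.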

The next theorem will bound the cost of clustering the cluster centers ....

Theorem 3: cost of the solution found by Small-Space (bounding the cost of clustering the cluster centers)

Theorem 3:

Let:

C = ∑_{i = 1, 2, ..., m} Cost(X_i, X_i)
(I.e., C is the sum of the costs of the partitioned solutions)
C^* = Cost(S, S) ≥ ∑_{i = 1, 2, ..., m} Cost(X_i, S)
(I.e., C^* is the cost of the solution of the complete problem)

And let:

Let X^* be a weighted instance of the k-median problem solved by the Small-Space algorithm by clustering the clusters
The cost of this solution is: f(X^*, X^*)
(because after clustering the data items, we only know the locations of the cluster centers)

Claim:

The Small-Space algorithm produces a k-median solution with a cost of at most 2×(C + C^*)

Proof:

Let:

C_{i,1}, C_{i,2}, ..., C_{i,k} be the centroids (medians) that achieve the minimum cost Cost(X_i, X_i), for i = 1, 2, ..., m
C^*₁, C^*₂, ..., C^*_k be the centroids (medians) that achieve the minimum cost Cost(S, S)

The following picture tries to make it less abstract:
The red dots are elements in X₁
The blue dots are elements in X₂
All dots are elements in S
We do not need to compute the cost expression exactly:
We will instead bound this expression using Theorem 1

In order to use Theorem 1, we first compute the cost of the set of weighted centers X^* when the centers are restricted to these points: C^*₁, C^*₂, ..., C^*_k:

f(X^*, C^*₁, C^*₂, ..., C^*_k) = ∑_{C_{i,j} ∈ X^*} ( min_{h = 1, 2, ..., k} w_{i,j} × d(C_{i,j}, C^*_h) ) ......... (1)

The input set X^* consists of the cluster centers C_{i,j}
The minimizing point C^*_h is the centroid in the solution for Cost(S, S) that is closest to the cluster center C_{i,j} (because we minimize over C^*₁, C^*₂, ..., C^*_k)
w_{i,j} = number of elements x ∈ X_i associated with the center C_{i,j} (= the weight of the centroid C_{i,j} in X_i)

Define these notations to simplify the expression:

c(x) = the closest of C_{i,1}, C_{i,2}, ..., C_{i,k} to element x, for x ∈ X_i
(I.e., c(x) is the center associated with the data point x in the restricted set X_i)
C^*(x) = the closest of C^*₁, C^*₂, ..., C^*_k to element x
(I.e., C^*(x) is the center associated with the data point x in the full set S)

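Using these 2 definitions, Equation 1 can be re-written as:

The original Equation (1):

f(X^*, C^*₁, C^*₂, ..., C^*_k) = ∑_{C_{i,j} ∈ X^*} ( min_{h = 1, 2, ..., k} w_{i,j} × d(C_{i,j}, C^*_h) )

Re-written as (w_{i,j} counts the points x with c(x) = C_{i,j}, and C^*(x) is one of the C^*_h):

f(X^*, C^*₁, C^*₂, ..., C^*_k) = ∑_{C_{i,j} ∈ X^*} ( ∑_{x: c(x) = C_{i,j}} min_{h = 1, 2, ..., k} d(c(x), C^*_h) )
                              ≤ ∑_{C_{i,j} ∈ X^*} ( ∑_{x: c(x) = C_{i,j}} d(c(x), C^*(x)) ) ......... (2)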

Fact (triangle inequality)

d(c(x), C^*(x)) ≤ d(x, c(x)) + d(x, C^*(x))

Hence:

w_{i,j} × min_{h = 1, 2, ..., k} d(C_{i,j}, C^*_h) ≤ ∑_{x: c(x) = C_{i,j}} d(c(x), C^*(x)) ≤ ∑_{x: c(x) = C_{i,j}} ( d(x, c(x)) + d(x, C^*(x)) )

Therefore:

f(X^*, C^*₁, C^*₂, ..., C^*_k) ≤ ∑_{C_{i,j} ∈ X^*} ( ∑_{x: c(x) = C_{i,j}} ( d(x, c(x)) + d(x, C^*(x)) ) )

When we sum over all elements that belong to all centers, we are actually summing over all elements in the original set
So we can replace "∑_{C_{i,j} ∈ X^*} ( ∑_{x: c(x) = C_{i,j}} ... )" by "∑_{x ∈ S}":

                                    = ∑_{x ∈ S} ( d(x, c(x)) + d(x, C^*(x)) )
                                    = ∑_{x ∈ S} d(x, c(x)) + ∑_{x ∈ S} d(x, C^*(x))

Since:

c(x) = the closest of C_{i,1}, C_{i,2}, ..., C_{i,k} to element x, for x ∈ X_i
(I.e., c(x) is the center associated with the data point x in the restricted set X_i)
C^*(x) = the closest of C^*₁, C^*₂, ..., C^*_k to element x
(I.e., C^*(x) is the center associated with the data point x in the full set S)

We have that:

∑_{x ∈ S} d(x, c(x)) = cost of solution using partitioned sets = C

∑_{x ∈ S} d(x, C^*(x)) = cost of solution using complete set = C^*

Therefore:

f(X^*, C^*₁, C^*₂, ..., C^*_k) ≤ C + C^* ................ (3)

Now we can apply Theorem 1:

f(X^*, X^*) ≤ 2 × f(X^*, C^*₁, C^*₂, ..., C^*_k) (the set Q is {C^*₁, C^*₂, ..., C^*_k})
≤ 2 × (C + C^*) (by Equation (3))
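The bound of Theorem 3 can be verified on a small random instance. The sketch below assumes 1-D points with d(a, b) = |a − b| and optimal (brute-force) clustering at every step; the helper `best_medians` and all parameter values are made up for illustration.

```python
import random
from itertools import combinations

def best_medians(points, weights, candidates, k):
    """Brute force: the k medians from `candidates` with minimum weighted cost, and that cost."""
    def f(cs):
        return sum(w * min(abs(x - c) for c in cs) for x, w in zip(points, weights))
    cs = min(combinations(candidates, k), key=f)
    return list(cs), f(cs)

rng = random.Random(1)                     # fixed seed so the run is repeatable
k, m = 2, 3
pieces = [[rng.uniform(0, 100) for _ in range(5)] for _ in range(m)]
S = [x for X in pieces for x in X]

C = 0.0                                    # C = sum of Cost(X_i, X_i)
X_star, W_star = [], []                    # the weighted instance X^*
for X in pieces:
    cs, piece_cost = best_medians(X, [1] * len(X), X, k)
    C += piece_cost
    for cj in cs:                          # w_{i,j} = number of points whose c(x) is cj
        X_star.append(cj)
        W_star.append(sum(1 for x in X if min(cs, key=lambda c: abs(x - c)) == cj))

_, C_star = best_medians(S, [1] * len(S), S, k)           # C^* = Cost(S, S)
_, f_star = best_medians(X_star, W_star, X_star, k)       # f(X^*, X^*)
print(f_star, 2 * (C + C_star))
```

The first printed value should never exceed the second, whatever seed or partition is used.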

Post-Script....
- The paper also presents an approximate clustering step that uses an approximation algorithm from another publication
- I have omitted this step to save time

Smaller-space: Recursive small space

Algorithm

Smaller-Space( S, i )
{
   if ( i == 0 )
   {
      stop;     // Done
   }

   Divide S into k disjoint pieces: X₁, X₂, ..., X_k
   Cluster each piece X_j;
   Let X^* = set of k×m cluster centers;
   Call Smaller-Space( X^*, i-1 );
}
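The recursion can be sketched as runnable code under a few assumptions that are not in the pseudocode: the number of pieces is a separate parameter `m`, each piece is clustered into k weighted centers, the base case clusters whatever is left into k centers, and the helper `cluster` is a brute-force stand-in for a real clustering routine on 1-D points.

```python
from itertools import combinations

def cluster(points, weights, k):
    """Brute-force weighted discrete k-median on 1-D points.
    Returns the chosen centers and the total weight assigned to each center."""
    k = min(k, len(points))
    def f(cs):
        return sum(w * min(abs(x - c) for c in cs) for x, w in zip(points, weights))
    best = min(combinations(points, k), key=f)
    assigned = {c: 0 for c in best}
    for x, w in zip(points, weights):
        assigned[min(best, key=lambda c: abs(x - c))] += w
    return list(assigned), list(assigned.values())

def smaller_space(points, weights, k, m, level):
    if level == 0:                              # assumed base case: final clustering
        return cluster(points, weights, k)
    size = -(-len(points) // m)                 # ceil(len(points) / m)
    centers, cweights = [], []
    for s in range(0, len(points), size):       # cluster each piece separately
        cs, ws = cluster(points[s:s + size], weights[s:s + size], k)
        centers += cs
        cweights += ws
    return smaller_space(centers, cweights, k, m, level - 1)   # recurse on the centers

S = [float(i % 2 * 1000 + i % 7) for i in range(32)]   # two well-separated groups
final, final_w = smaller_space(S, [1] * len(S), k=2, m=2, level=2)
print(final, final_w)
```

The weights travel with the centers through every level, so the final weights still add up to the number of input points.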

Example:
- Input data
- Partition into k sets:
- Determine the cluster centers:
- Repeat with cluster center as input data:

Clustering algorithm for data streams

The approach is similar to the Smaller-Space algorithm

Stream clustering algorithm:

1. Input the first m data points
   (m is a user parameter)
2. Cluster the m data points into 2k level 1 cluster centers
3. Repeat the above steps until we have processed m²/(2k) data points
   We will now have m/(2k) × 2k = m level 1 cluster centers
4. Cluster these m level 1 cluster centers into 2k level 2 cluster centers
5. And so on...
   In general: whenever m level i cluster centers have accumulated, cluster them into 2k level (i+1) cluster centers
6. At the end of the stream, cluster all centers into k final centers
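The level-by-level scheme above can be sketched as follows. This is a toy version under stated assumptions: 1-D points, a brute-force helper `cluster` in place of the paper's clustering routine, m > 2k so that each level shrinks, and no compression step. The names `stream_cluster` and `levels` are hypothetical.

```python
from itertools import combinations

def cluster(points, weights, k):
    """Brute-force weighted discrete k-median on 1-D points.
    Returns the chosen centers and the total weight assigned to each center."""
    k = min(k, len(points))
    def f(cs):
        return sum(w * min(abs(x - c) for c in cs) for x, w in zip(points, weights))
    best = min(combinations(points, k), key=f)
    assigned = {c: 0 for c in best}
    for x, w in zip(points, weights):
        assigned[min(best, key=lambda c: abs(x - c))] += w
    return list(assigned), list(assigned.values())

def stream_cluster(stream, k, m):
    assert m > 2 * k, "each level must shrink: m points become at most 2k centers"
    levels = [([], [])]                  # levels[i] = (points, weights) waiting at level i
    for x in stream:
        levels[0][0].append(x)           # level 0 holds raw data points, weight 1 each
        levels[0][1].append(1)
        i = 0
        while len(levels[i][0]) >= m:    # m items at level i -> 2k centers at level i+1
            cs, ws = cluster(levels[i][0], levels[i][1], 2 * k)
            levels[i] = ([], [])
            if i + 1 == len(levels):
                levels.append(([], []))
            levels[i + 1][0].extend(cs)
            levels[i + 1][1].extend(ws)
            i += 1
    # End of stream: cluster everything that is still buffered into k final centers
    pts = [p for ps, _ in levels for p in ps]
    wts = [w for _, ws in levels for w in ws]
    return cluster(pts, wts, k)[0]

stream = (float(i % 2 * 1000 + i % 5) for i in range(64))   # generator: points arrive one by one
final = stream_cluster(stream, k=2, m=8)
print(final)
```

With m = 8 and k = 2, every 8 raw points become 4 level 1 centers, and every 8 level 1 centers (that is, m²/(2k) = 16 raw points) become 4 level 2 centers, matching the counts in the steps above. No level ever buffers more than m + 2k items.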

Graphically:

Cluster first m data points:
Then the next m data points:
Until you have m cluster points:
Cluster the m cluster points into 2k level 2 cluster points:
Repeat, next m data points:
And so on...

Postlude...
- I have left out the error analysis because I omitted a compression step in their algorithm.
  This compression step refers to another publication that we did not discuss.
- As a result, the error expressions would not make sense if I presented them verbatim.