SLDPC: Towards Second Order Learning for Detecting Persistent Clusters in Data Streams

Research on data stream clustering has so far focused mainly on adapting algorithms designed for static datasets to data streams and on improving the adapted algorithms. Such algorithms perform first-order learning, from data to clusters. This paper raises a new question of second-order learning of cluster models from data streams and presents a learning algorithm that detects persistent clusters from consecutive clustering snapshots in data streams. In this work, we first collect a sequence of cluster snapshots, namely the output clusters at selected query points, and then identify the persistent clusters within a given time frame. The algorithm is evaluated on collections of synthetic datasets. The experimental results demonstrate the effectiveness of the algorithm in detecting such persistent clusters.


I. INTRODUCTION
Data stream clustering is defined as the grouping of data in light of frequently arriving new data chunks, in order to understand underlying group patterns that may change over time [1]. Depending on the approach taken, either incremental learning or two-phase learning, existing algorithms for data stream clustering either present an up-to-current-time view of clusters [2] or generate a view of clusters at a user query point [3]. However, no historic trail of the output cluster models is kept over time. If such a trail were maintained, the persistence of certain clusters could be analysed through second-order learning, i.e. learning persistent clusters by mining the clustering outputs collected over a sequence of time points. The stability of output clusters in static datasets has been well researched, and offers several advantages, including predicting the correct number of clusters, detecting a lack of structure in the dataset, and assessing the quality of clustering algorithms [4] [5]. However, identifying persistent clusters in data streams has not been properly studied.
Many potential applications can benefit from finding persistent clusters in data streams. Tracking social media events such as birthdays, tracking objects such as cars and rockets in video footage, using CCTV cameras to identify abnormal objects such as unattended bags against a stable background, and monitoring patients in hospitals are only a few of many possible examples.
In this paper, we first define the problem of second-order learning of persistent clusters in data streams. In the problem context, we argue that data stream clustering can happen on two levels of processing. At the first level, also known as the online layer, the first-order learning of clusters is performed by using existing algorithms such as the prototype-based algorithm EINCKM presented in [6], and the clustering results are saved. At the second level, also known as the offline layer, second-order learning algorithms can be deployed to detect the patterns of persistent clusters from the saved clustering results produced at the first level. The paper then presents a simple second-order learning (SLDPC) algorithm for this purpose. The algorithm was evaluated on a selected collection of datasets using various measurements. Experimental results show that the proposed algorithm is capable of detecting the correct persistent clusters. The modular structure of the proposed algorithm makes it easy to accommodate future improvements and parallelisation.
The rest of this paper is organised as follows. Section 2 reviews the state of the art of related work in the current literature. Section 3 formally defines the problem of persistent cluster detection and illustrates it with examples. Section 4 describes the proposed SLDPC algorithm. Section 5 presents an evaluation of the algorithm's performance through experiments using synthesised datasets. Section 6 concludes the work and outlines possible future directions of this research.

A. Clustering of Clusters and Clustering Ensemble
In many pattern recognition problems, we are dealing with cluster analysis of existing clusters [7], such as multi-resolution granularity in hierarchical clustering [8]. Existing clusters are groups of pattern samples which, due to a priori knowledge, are known to belong to the same cluster. Kandt [9] presented a system named SEISMO for detecting seismic activity using several sensor networks. The characteristics of each detection and the time interval between detections are used to group a subset of detections together (subclusters), which correspond to an event (cluster). Although clustering of clusters is a kind of second-order learning, it is more about grouping data at different granularities of abstraction than about finding persistent clusters among existing ones.
Cluster ensemble has emerged as a prominent way of improving the robustness, stability, and accuracy of clusters [10]. It is a process of merging multiple clustering models into a single consolidated clustering [11]. Fathzadeh and Mokhtari [12] presented an ensemble fuzzy C-Means (SEFCM) algorithm for data streams. The divide-and-conquer method comprises three stages: 1) divide the data stream into smaller blocks; 2) cluster every block using the ensemble clustering (EFCM) algorithm; and 3) combine the resulting clusters using single linkage to find global clusters across the block clusters. Cluster ensemble is more about the consensus of different clustering models than about discovering persistent and stable clusters, although its inputs can be seen as clusters.

B. Stable Clusters
Stability of clusters in static data means that, when multiple datasets are sampled from the same distribution, the clustering algorithm is expected to behave in the same way and produce similar results [13]. In other words, to find stable clusters in static data, we need to further analyse the clustering results from each sample to identify the stable ones. There is some degree of similarity, but also a big difference, between stable clusters in static data clustering and persistent clusters in data stream clustering [14] [15]. In static clustering, the stability of clusters is defined over different versions of a clustering, i.e. the stability is not defined over a time period. In data stream clustering, however, the data may evolve over time, i.e. some old clusters may disappear and some new clusters may emerge (the concept drift principle [16]). Our objective is therefore different: we already have a number of clusterings obtained at different query points, and the aim is to discover the clusters that stay relatively fixed.

C. Clustering Learning Approaches
Learning approaches in data stream mining form a wide area of research. Generally, learning approaches can be divided into two categories [17]: instance-incremental (or incremental learning) methods that learn from each example as it arrives, and batch-incremental (or two-phase learning) methods that gather examples in batches to train models. He et al. [18] proposed a general adaptive incremental learning framework that is capable of learning from continuous raw data, accumulating experience over time, and using such knowledge to improve future learning and prediction performance for classification purposes. However, that work focuses on improving the output model to adapt to new incoming data chunks, whereas we aim to identify persistent clusters through consecutive clustering snapshots.
There is a big difference between two-phase learning [3] and second-order learning for data stream clustering. Two-phase learning algorithms try to discover final clusters from many prototype micro-clusters. This is still first-order learning, from data to clusters. Second-order learning algorithms, on the other hand, try to find persistent clusters that exist throughout a sequence of consecutive clustering results. In other words, they take as input a sequence of clustering results and identify as output the clusters that persist over the whole time period.

D. Data Stream Clustering Algorithm EINCKM
EINCKM is an incremental prototype-based algorithm for clustering data streams [6]. It consists of a generic modular framework that comprises three main steps: Build Clusters, Merge, and Prune. Build Clusters applies the K-Means method to identify the clusters from input data chunks, Merge may combine the newly discovered clusters with some existing ones, and Prune identifies outlier objects and removes out-of-date data points. The algorithm applies a simple heuristic-based strategy to estimate the number of clusters, a radius-based scheme to combine overlapping clusters, and a variance-based technique to detect outliers. However, this algorithm is designed to perform first-order clustering. In other words, it gives the up-to-current-time snapshot of the clustering results.

III. PROBLEM DESCRIPTION
Informally, persistent clusters are those that do not change much and persist over a period throughout a series of clustering results as new data chunks arrive. More precisely, persistent clusters can be defined in the following way. Let the clustering result at a time point i, known as a snapshot, be a set of clusters. Let a sequence of such clustering snapshots be given, together with a time frame within that sequence and user-defined thresholds on:
• Centroid change margin, representing the maximum distance allowed between the centroid of a cluster and that of its counterpart in another snapshot.
• Size change margin, representing the maximum amount of change in cluster size between the two clusters.
• Variance (radius) change margin, representing the maximum amount of change in cluster radius between the two clusters.
Then the persistent clusters are those clusters which exist within the given time frame and whose cross-snapshot differences (i.e. changes) in centroid, size, and variance are less than or equal to the respective margins. Mining such persistent clusters is an automatic process of discovering all persistent clusters as defined.
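The definition above can be sketched as a predicate over one cluster tracked through the snapshots of a time frame. This is a minimal illustration under stated assumptions rather than the paper's algorithm: the names `ClusterState` and `is_persistent` and the three margin arguments are hypothetical, the margins are taken as absolute values, and differences are assumed to be measured between consecutive snapshots.

```python
from dataclasses import dataclass
from math import dist  # Euclidean distance, Python 3.8+


@dataclass
class ClusterState:
    """State of one cluster in one snapshot (hypothetical structure)."""
    centroid: tuple
    size: int
    radius: float


def is_persistent(track, centroid_margin, size_margin, radius_margin):
    """Return True if every consecutive pair of states stays within all margins.

    `track` lists the states of the same cluster, one per snapshot in the
    time frame. A cluster missing from a snapshot has no complete track
    and should not be passed in.
    """
    for prev, curr in zip(track, track[1:]):
        if dist(prev.centroid, curr.centroid) > centroid_margin:
            return False
        if abs(curr.size - prev.size) > size_margin:
            return False
        if abs(curr.radius - prev.radius) > radius_margin:
            return False
    return True


# Illustrative data only: one cluster that barely moves, one that jumps.
stable = [ClusterState((0.0, 0.0), 100, 1.0),
          ClusterState((0.1, 0.0), 104, 1.1),
          ClusterState((0.1, 0.1), 101, 1.0)]
drifting = [ClusterState((0.0, 0.0), 100, 1.0),
            ClusterState((3.0, 4.0), 100, 1.0)]
```

With margins of 0.5 (centroid), 10 (size), and 0.2 (radius), `stable` satisfies the predicate while `drifting` fails on the centroid test, since its centroid moves by a distance of 5.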
Fig. 1 shows an example of persistent cluster discovery. In the first cluster snapshot (Fig. 1(a)), there are three initial clusters. In the second snapshot (Fig. 1(b)), the three clusters from snapshot one persist with little change, but the snapshot also shows the creation of a new cluster (cluster 4). In the third snapshot (Fig. 1(c)), the three persistent clusters remain, with little change to their centroids, sizes, and variances. However, cluster 4 disappears, indicating that it is only a temporary cluster. At the same time, a new cluster (i.e. the new cluster 4) emerges. In the final snapshot (Fig. 1(d)), the three persistent clusters remain in place, but cluster 4 of the previous snapshot disappears because it is a temporary cluster. Finally, three persistent clusters are obtained as the output of the discovery process. Table 1 summarises the snapshots and indicates the persistent clusters (those whose Persistency-Tag has the highest value in the last snapshot). Note that N represents the number of data points in each cluster, μ is the centroid vector, R is the radius of each cluster, and Persistency-Tag is the counter of repeated clusters.

IV. THE PROPOSED SLDPC ALGORITHM
The general contextual framework of the basic SLDPC algorithm is depicted in Fig. 2. Working on the snapshots produced by the online EINCKM algorithm, the main stages of the SLDPC algorithm are as follows:
• Receive consecutive clustering snapshots and user parameters as inputs.
• Define the persistent clusters for each consecutive pair of clustering snapshots using the merging strategy.
• Find the final output persistent clusters through the consecutive clustering snapshots.
Algorithm 1 presents the pseudo-code description of the basic SLDPC algorithm. The main input is a sequence of consecutive snapshots of existing cluster summaries C. Each cluster summary is a tuple (N, LS, LSS), where N is the number of data points, LS is the linear sum of the data points, and LSS is the sum of the squared data points. The inputs also include the user-defined thresholds, namely the centroid change margin, the size change margin, and the radius change margin (cf. Sec. 3). The output is PS, the set of persistent clusters. For the user's convenience, the algorithm takes the first threshold as a number of standard deviations from the mean of a cluster. The user therefore chooses a real number (e.g. 1, 1.5, 2, etc.), and the algorithm multiplies this number by the standard deviation of the cluster to determine the absolute distance threshold between the two centroids. For the same reason, the size change margin is normally expressed as a percentage of change relative to the size of the cluster being compared against. Similarly, the radius change margin is expressed as a percentage of change relative to the radius of that cluster.
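To make the summary-based comparison concrete, the sketch below derives a centroid and a radius from an (N, LS, LSS) tuple and applies the three relative thresholds to a pair of summaries. It rests on several assumptions that the text does not confirm: LSS is taken as a per-dimension sum of squares, the radius is the root-mean-square distance to the centroid and doubles as the cluster's standard deviation, the earlier summary is the reference for all relative thresholds, and the function names and default values are hypothetical.

```python
import math


def centroid(n, ls):
    """Centroid from a summary: linear sum divided by the point count."""
    return [s / n for s in ls]


def radius(n, ls, lss):
    """Root-mean-square distance of the points to the centroid.

    Assumes `lss` holds per-dimension sums of squares.
    """
    var = sum(q / n - (s / n) ** 2 for s, q in zip(ls, lss))
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding


def same_cluster(a, b, k_std=2.0, size_pct=0.2, radius_pct=0.2):
    """Compare summaries a (earlier) and b (later), each a tuple (N, LS, LSS).

    The default values are illustrative only, not the paper's defaults.
    """
    (na, lsa, lssa), (nb, lsb, lssb) = a, b
    ra, rb = radius(na, lsa, lssa), radius(nb, lsb, lssb)
    moved = math.dist(centroid(na, lsa), centroid(nb, lsb))
    return (moved <= k_std * ra
            and abs(nb - na) <= size_pct * na
            and abs(rb - ra) <= radius_pct * ra)


# Four corner points of a 2 x 2 square, then the same square shifted
# by 0.5 and by 10 along the first axis (illustrative data).
a = (4, [4.0, 4.0], [8.0, 8.0])
b = (4, [6.0, 4.0], [13.0, 8.0])
c = (4, [44.0, 4.0], [488.0, 8.0])
```

Here `same_cluster(a, b)` holds because the centroid shift of 0.5 is well inside two radii, whereas `same_cluster(a, c)` fails because a shift of 10 is not. Working from the summaries alone means the raw data points never need to be revisited.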

V. EVALUATION AND EXPERIMENTAL RESULTS
The SLDPC algorithm is meant to work with any data stream clustering solution and with any dataset of any dimensionality, as long as the output clusters can be represented in the summary form expected by the algorithm. However, to ease the verification of the results, we decided to use the EINCKM algorithm for clustering data streams because the algorithm describes the clustering … Table 2 summarises four scenarios with different numbers of persistent and non-persistent clusters (known as temporary clusters). In the first three scenarios, there are both temporary and persistent clusters. Fig. 4 shows that for DS1 there are three persistent and three temporary clusters (first scenario), and Fig. 5 illustrates the evolution of the clusters through a sequence of four snapshots.
In the fourth scenario, however, all clusters are temporary; an extreme case of concept drift where nothing is persistent.
To evaluate correctness, we used three common evaluators: purity, entropy, and the sum of squared errors (SSE). Purity was used in [19], entropy in [20], and SSE in [3]. Purity refers to the proportion of the data points belonging to a known cluster that are assigned as members of a cluster by the algorithm. The higher the purity (in the range [0, 1]), the more certain it is that the algorithm has found the original clusters and the better the algorithm is [21]. Entropy reflects the number of data points from different known clusters in the original dataset that are assigned to a cluster by the algorithm. The value of this measure lies in [0, log N], where N is the number of known clusters involved. The smaller the entropy, the fewer members of the known clusters are mixed in the clusters discovered by the algorithm, and the better the clustering algorithm is [22]. SSE is a commonly used cluster quality measure that evaluates the compactness of the resulting clusters. Low SSE scores indicate better clustering results, as the clusters contain less internal variation [21].
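The three measures can be sketched as follows. These are the common textbook forms and may differ in detail from the formulas used in the experiments; the function names are hypothetical, entropy is computed in base 2 and averaged over clusters weighted by cluster size, and labels are assumed to be given as parallel lists.

```python
import math
from collections import Counter


def _class_counts(true_labels, pred_labels):
    """Map each predicted cluster to a Counter of true labels inside it."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return clusters


def purity(true_labels, pred_labels):
    """Fraction of points that fall in the majority true class of their cluster."""
    clusters = _class_counts(true_labels, pred_labels)
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)


def entropy(true_labels, pred_labels):
    """Size-weighted average of per-cluster class entropy (base 2)."""
    clusters = _class_counts(true_labels, pred_labels)
    total = len(true_labels)
    result = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        result += (size / total) * h
    return result


def sse(points, pred_labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    groups = {}
    for x, p in zip(points, pred_labels):
        groups.setdefault(p, []).append(x)
    total = 0.0
    for members in groups.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(x, mean) ** 2 for x in members)
    return total
```

For two known clusters of two points each, a perfect assignment gives purity 1 and entropy 0, while putting all four points into one cluster gives purity 0.5 and entropy 1, the upper bound for two known clusters in base 2.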
Efficiency was measured as the time, in seconds, taken by the algorithm to complete the clustering task.
MATLAB 2017b was used to implement the SLDPC algorithm and the experiment framework. For the first, second, and third scenarios, we split a given dataset into two parts: the persistent clusters and the temporary clusters. We selected data chunks randomly from the persistent clusters and snapshot-wise data points from the temporary clusters. The idea behind the random selection of data points is to investigate the behaviour of the algorithm when there is no control over the sequence of data points, i.e. we did not select specific data points from particular groups in the original datasets. To minimise the effect of the random choice of data points, the experiments were repeated 100 times and the average was calculated.
All the experiments were run on a machine equipped with a 2.30 GHz 4-core Intel(R) Core(TM) i5-4590 CPU and 16 GB of memory. The operating system was Windows 7.

A. Experimental Results
Fig. 6 illustrates the performance evaluation of the SLDPC algorithm. As shown in Fig. 6, the differences between the scenarios across the synthetic datasets are only marginal; the algorithm performs consistently across the synthesised datasets in all scenarios. Fig. 6(a) shows that the level of purity is high across all scenarios when comparing the persistent output clusters from the SLDPC algorithm against the known persistent clusters in the ground truth (the synthesised datasets with known clusters). This is caused by the stringent merge strategy deployed in both the EINCKM and SLDPC algorithms and by the exclusion of some data points as outliers through the filtering technique in EINCKM. With a small number of persistent clusters, e.g. in scenario 3, the level of purity is lower than in the scenarios with more persistent clusters. Both the entropy and the SSE measurements, shown in Fig. 6(b) and (c), are relatively low due to the effective pruning strategy of the

B. Discussion
The most noticeable feature of the SLDPC algorithm is its simplicity and efficiency in discovering persistent clusters. The main principle behind the algorithm is to maintain a vote for each cluster. Only the clusters with sufficient votes remain as persistent clusters. The constraint of the basic algorithm is that it assumes the input clusters are represented as cluster summaries, which tend to apply only to spherical clusters. The algorithm therefore works well with prototype-based and model-based algorithms. Currently, the algorithm might not apply to cluster inputs that are represented in other forms (such as the data-point-based representation of clusters produced by density-based algorithms).
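The voting principle can be illustrated with a small sketch. It is a simplification made for illustration and not Algorithm 1 itself: clusters are matched on centroid distance alone, only clusters present in the first snapshot are tracked, and the names `vote_persistent`, `max_shift`, and `min_votes` are hypothetical.

```python
from math import dist


def vote_persistent(snapshots, max_shift, min_votes):
    """Keep first-snapshot clusters that collect at least min_votes votes.

    Each snapshot is a list of centroids (tuples). A tracked cluster earns
    one vote per later snapshot that contains a centroid within max_shift
    of its last matched position.
    """
    tracks = [{"centroid": c, "votes": 0} for c in snapshots[0]]
    for snapshot in snapshots[1:]:
        if not snapshot:
            continue  # an empty snapshot gives nobody a vote
        for track in tracks:
            nearest = min(snapshot, key=lambda c: dist(c, track["centroid"]))
            if dist(nearest, track["centroid"]) <= max_shift:
                track["centroid"] = nearest  # follow small drift
                track["votes"] += 1
    return [t["centroid"] for t in tracks if t["votes"] >= min_votes]


# Illustrative data: the cluster near the origin survives all three
# snapshots, the one near (5, 5) vanishes in the last snapshot.
snapshots = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(0.1, 0.0), (5.0, 5.1), (9.0, 9.0)],
    [(0.1, 0.1), (20.0, 20.0)],
]
```

Requiring two votes over these three snapshots keeps only the cluster near the origin; lowering `min_votes` to one also keeps the cluster that vanished in the last snapshot, which shows how the vote threshold controls how strict the notion of persistence is.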
Regarding the threshold parameters, we set default values for the centroid, size, and radius change margins. Deciding the centroid change margin is not trivial. There are a number of ways to define it. For instance, we could use the absolute distance between two centroids, but this number is very hard for the user to find. By referring to the normal distribution and the statistical theory of significant differences, we decided to rely on the number of standard deviations (STDs) to determine this particular threshold. Setting this threshold is challenging and therefore needs further investigation. We set the default values of the cluster size change and the cluster radius change heuristically. However, such default values may not suit a particular dataset, and hence we leave it to the user to define appropriate thresholds.
We understand the importance of the threshold values to the final output of persistent clusters in the problem definition. To take this consideration further, we can introduce two more threshold parameters. The first additional parameter is the number of snapshots within the time frame. This parameter allows the discovery of persistent clusters not across all the snapshots in the snapshot sequence, but rather among a chosen number of snapshots of the sequence. Another additional threshold parameter we can introduce is the persistency rate, which specifies the rate of persistency across the snapshots; the persistent clusters do not have to appear in every snapshot, but only in a given percentage of the snapshots. Both parameters are meant to increase the flexibility of the algorithm in producing persistent clusters that are variants of the standard definition.
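One possible reading of these two relaxations is sketched below. The paper does not specify which snapshots of the sequence are considered or whether the rate is a lower bound, so the sketch assumes the most recent snapshots and an "at least" comparison; the function and parameter names are hypothetical.

```python
def meets_persistency_rate(presence, window, rate):
    """Check a cluster against a snapshot window and a persistency rate.

    presence: list of booleans, one per snapshot, True where the cluster
    was matched. Only the most recent `window` snapshots are counted
    (an assumption of this sketch).
    """
    if window <= 0 or not 0.0 <= rate <= 1.0:
        raise ValueError("window must be positive and rate within [0, 1]")
    recent = presence[-window:]
    if len(recent) < window:
        return False  # not enough snapshots observed yet
    return sum(recent) / window >= rate


# Illustrative data: the cluster is missing from the second snapshot only.
presence = [True, False, True, True, True]
```

With a window of four snapshots and a rate of 0.75, this cluster qualifies because it appears in three of the last four snapshots; with a window of five and a rate of 0.9 it does not.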

VI. CONCLUSION AND FUTURE WORKS
This paper posed the problem of second-order learning of persistent clusters in data streams, and presented the SLDPC algorithm for detecting such persistent clusters by analysing a sequence of snapshots of clustering results. The key idea of the algorithm is to assign a vote to clusters that do not change much, and then collect those clusters. The evaluation results have shown that the algorithm produces correct and good-quality clusters with low time complexity. The algorithm emphasises simplicity and adaptability for future improvement.
Our future work will focus on enhancing the algorithm. Firstly, we will work towards tailoring the algorithm to suit other cluster input representations. Secondly, we will investigate introducing degrees of fuzziness into the user-defined thresholds and, if possible, reducing the need for user-defined thresholds. Finally, we will further investigate discovering patterns of periodic change in cluster models, in addition to persistency. Discovering periodic changes and persistent hidden group patterns can have a wide range of applications, for example in the study of climate change.

APPENDIX I
The following tables show the details and specify the distribution of each of the three synthesized datasets.

Fig. 2. Outline of the SLDPC algorithm

Algorithm 1 Second Order Learning
Inputs:
- Consecutive accumulated cluster snapshots summary.
- User parameters:
  : Time frame
  : Threshold of moving the centroids // by default
  : Threshold of cluster size change // by default
  : Threshold of cluster radius change // by default
Outputs:
PS: Persistent Clusters;
Algorithm Steps:
1. Repeat for each pair of consecutive snapshots
   // dist is a distance function such as Euclidean
   // calculate the percentage differences of sizes
   // calculate the percentage differences of radii

TABLE 1. Snapshot description