%\documentclass[10pt,a4paper]{article} %\documentclass{jss} \documentclass[nojss]{jss} \usepackage[utf8]{inputenc} \usepackage[english]{babel} %\usepackage{a4wide} %\setlength{\parskip}{0.5ex plus0.1ex minus0.1ex} %\setlength{\parindent}{0em} %\usepackage[round,longnamesfirst]{natbib} %\usepackage{hyperref} %%% for tabulars %\usepackage{rotating} %\usepackage{multirow} %%% for hanging paragraph %\usepackage{hanging} %%% double spacing % \usepackage{setspace} % \doublespacing %\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\class}[1]{\mbox{\textsf{#1}}} \newcommand{\func}[1]{\mbox{\texttt{#1()}}} %\newcommand{\code}[1]{\mbox{\texttt{#1}}} \newcommand{\pkg}[1]{\strong{#1}} \newcommand{\samp}[1]{`\mbox{\texttt{#1}}'} %\newcommand{\proglang}[1]{\textsf{#1}} \newcommand{\set}[1]{\mathcal{#1}} \newcommand{\vect}[1]{\mathbf{#1}} \DeclareTextFontCommand{\emph}{\normalfont} %\usepackage{Sweave} %\VignetteIndexEntry{stream: Introduction to the package} %% publication information %% NOTE: This needs to filled out ONLY IF THE PAPER WAS ACCEPTED. %% If it was not (yet) accepted, leave them commented. %% \Volume{13} %% \Issue{9} %% \Month{September} %% \Year{2004} %% \Submitdate{2004-09-29} %% \Acceptdate{2004-09-29} \author{ Michael Hahsler\\Southern Methodist University \And Matthew Bola\~nos\\Microsoft Corporation \AND John Forrest\\Microsoft Corporation } \title{Introduction to \pkg{stream}: An Extensible Framework for Data Stream Clustering Research with \proglang{R}} \Plainauthor{Michael Hahsler, Matthew Bolanos, John Forrest} \Plaintitle{Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R} \Shorttitle{Introduction to \pkg{stream}} %% an abstract and keywords \Abstract{In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce \pkg{stream}, a research tool that includes modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of \pkg{stream} is that it seamlessly integrates with the large existing infrastructure provided by \proglang{R}. In addition to data handling, plotting and easy scripting capabilities, \proglang{R} also provides many existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., \proglang{C/C++}, \proglang{Java} and \proglang{Python}). %\pkg{stream} also supports the use of the recently introduced methods %to efficiently access large data stored in secondary memory %(e.g., with packages~\pkg{ff} and \pkg{bigmemory}). In this paper we describe the architecture of \pkg{stream} and focus on its use for data stream clustering research. \pkg{stream} was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks like classification and frequent pattern mining. 
} \Keywords{data streams, data mining, clustering} \Plainkeywords{data streams, data mining, clustering}

\Address{Michael Hahsler\\
Computer Science\\
Lyle School of Engineering\\
Southern Methodist University\\
P.O. Box 750122 \\
Dallas, TX 75275-0122\\
E-mail: \email{mhahsler@lyle.smu.edu}\\
URL: \url{http://lyle.smu.edu/~mhahsler}

Matthew Bola\~nos\\
Microsoft Corporation\\
One Microsoft Way\\
Redmond, WA 98052-7329\\
E-mail: \email{mbolanos@curiouscrane.com}

John Forrest\\
Microsoft Corporation\\
One Microsoft Way\\
Redmond, WA 98052-7329\\
E-mail: \email{jforrest@microsoft.com}
}

\begin{document}

\vfill

{\bf Note:} A previous version of this manuscript was published in the
\emph{Journal of Statistical Software} \citep{stream:Hahsler:2017}.\\

%\maketitle

%% Add TOC (not with jss style)
%\clearpage
\tableofcontents
\clearpage

%\sloppy

<>=
options(width = 75, digits = 3, prompt = 'R> ', scipen = 3)
@

\section{Introduction}

Typical statistical and data mining methods (e.g., clustering, regression,
classification and frequent pattern mining) work with ``static'' data sets,
meaning that the complete data set is available as a whole to perform all
necessary computations. Well-known methods like $k$-means clustering, linear
regression, decision tree induction and the APRIORI algorithm to find frequent
itemsets scan the complete data set repeatedly to produce their
results~\citep{stream:Hastie+Tibshirani+Friedman:2001}. However, in recent
years more and more applications need to work with data which are not static,
but are the result of a continuous data generating process which is likely to
evolve over time. Some examples are web click-stream data, computer network
monitoring data, telecommunication connection data, readings from sensor nets
and stock quotes. These types of data are called data streams and dealing with
data streams has become an increasingly important area of
research~\citep{stream:Babcock:2002,stream:Gaber:2005,stream:Aggarwal:2007}.
Early on, the statistics community also recognized the importance of the
emerging field of statistical analysis of massive data
streams~(see~\cite{stream:NRC:2004}).

A data stream can be formalized as an ordered sequence of data points
$$Y=\langle \vect{y}_1, \vect{y}_2, \vect{y}_3, \ldots\rangle,$$
where the index reflects the order (either by explicit time stamps or just by
an integer reflecting the arrival order). The data points themselves are often
simple vectors in multidimensional space, but can also contain nominal/ordinal
variables, complex information (e.g., graphs) or unstructured information
(e.g., text). The characteristic of continually arriving data points
introduces an important property of data streams which also poses the greatest
challenge: the size of a data stream is potentially unbounded. This leads to
the following requirements for data stream processing algorithms:

\begin{itemize}
\item {Bounded storage:} The algorithm can only store a very limited amount of
data to summarize the data stream.
\item {Single pass:} The incoming data points cannot be permanently stored and
need to be processed immediately in the order in which they arrive.
\item {Real-time:} The algorithm has to process data points on average at
least as fast as the data is arriving.
\item {Concept drift:} The algorithm has to be able to deal with a data
generating process which evolves over time (e.g., distributions change or new
structure in the data appears).
\end{itemize}

Most existing algorithms designed for static data are not able to satisfy all
these requirements and thus are only usable if techniques like sampling or
time windows are used to extract small, quasi-static subsets. While these
approaches are important, new algorithms to deal with the special challenges
posed by data streams are needed and have been introduced over the last
decade.

Even though \proglang{R} represents an ideal platform to develop and test
prototypes for data stream mining algorithms, \proglang{R} currently has only
very limited infrastructure for data streams. The following are some packages
available from the Comprehensive R Archive
Network\footnote{\url{http://CRAN.R-project.org/}} related to streams:

\begin{description}
\item[Data sources:] Random numbers are typically created as streams (see
e.g., \pkg{rstream}~\citep{stream:Leydold:2012} and
\pkg{rlecuyer}~\citep{stream:Sevcikova:2012}). Financial data can be obtained
via packages like \pkg{quantmod}~\citep{stream:Ryan:2013}. Intra-day price and
trading volume can be considered a data stream. For Twitter, a popular
micro-blogging service, packages like
\pkg{streamR}~\citep{stream:Barbera:2014} and
\pkg{twitteR}~\citep{stream:Gentry:2013} provide interfaces to retrieve live
Twitter feeds.

\item[Statistical models:] Several packages provide algorithms for iteratively
updating statistical models, typically to deal with very large data. For
example, \pkg{factas}~\citep{stream:Bar:2014} implements iterative versions of
correspondence analysis, PCA, canonical correlation analysis and canonical
discriminant analysis. For clustering, \pkg{birch}~\citep{stream:Charest:2012}
implements BIRCH, a clustering algorithm for very large data sets. The
algorithm maintains a clustering feature tree which can be updated in an
iterative fashion. Although BIRCH was not developed as a data stream
clustering algorithm, it first introduced some characteristics needed for
efficiently handling data streams. Unfortunately, the \pkg{birch} package is
no longer maintained and was recently removed from CRAN.
\pkg{rEMM}~\citep{stream:Hahsler+Dunham:2014} implements a stand-alone version
of a pure data stream clustering algorithm enhanced with a methodology to
model a data stream's temporal structure.
%The clustering part of this algorithm called DBSTREAM
%(threshold nearest neighbor) is also available in the \pkg{stream} framework.
Very recently \pkg{RMOA}~\citep{stream:Wijffels:2014} was introduced. The
package interfaces data stream classification algorithms from the MOA
framework (see existing tools discussed in
Section~\ref{sec:background:moa}); however, the package focuses not on data
streams but on static data sets that do not fit into main memory.

\item[Distributed computing frameworks:] With the development of
Hadoop\footnote{\url{http://hadoop.apache.org/}}, distributed computing
frameworks to solve large-scale computational problems have become very
popular. \pkg{HadoopStreaming}~\citep{stream:Rosenberg:2012} is available to
use map and reduce scripts written in \proglang{R} within the
\proglang{Java}-based Hadoop framework. However, contrary to the word
streaming in its name, \pkg{HadoopStreaming} does not refer to data streams.
Like Hadoop itself, \pkg{HadoopStreaming} is used for batch processing, and
streaming in the name refers only to the internal usage of pipelines for
``streaming'' the input and output between the Hadoop framework and the
\proglang{R} scripts used.
A distributed framework for real-time computation is
Storm\footnote{\url{http://storm.incubator.apache.org/}}. Storm builds on the
idea of constructing a computing topology by connecting spouts (data stream
sources) with a set of bolts (computational units).
\pkg{RStorm}~\citep{stream:Kaptein:2013} provides an environment to prototype
bolts in \proglang{R}. Spouts are represented as data frames. Bolts developed
in \pkg{RStorm} currently cannot be used directly in Storm, but this is
planned for the future~\citep{stream:Kaptein:2014}.
%At the time of writing this paper,
%the topology has a single spout which only reads data from a static data.frame.
\end{description}

Even in the stream-related packages discussed above, data is still represented
by data frames or matrices, which is suitable for static data but not ideal
for representing streams.

In this paper we introduce the package \pkg{stream}~\citep{stream:stream:2014}
which provides a framework to represent and process data streams and use them
to develop, test and compare data stream algorithms in \proglang{R}. We
include an initial set of data stream generators and data stream clustering
algorithms in this package with the hope that other researchers will use
\pkg{stream} to develop, study and improve their own algorithms.

The paper is organized as follows. We briefly review data stream mining in
Section~\ref{sec:mining}. In Section~\ref{sec:design} we cover the basic
design principles of the \pkg{stream} framework. Sections~\ref{sec:dsd},
\ref{sec:dst} and \ref{sec:evaluation} introduce details about creating data
stream sources, performing data stream mining tasks, and evaluating data
stream clustering algorithms, respectively. Each of the three sections
includes example code. Section~\ref{sec:example} provides comprehensive
examples: an experimental comparison of several data stream clustering
algorithms and clustering a large, high-dimensional data set.
%Extending the framework with new data stream sources and algorithms is briefly
%described in Section~\ref{sec:extension}
%and Section~\ref{sec:conclusion} concludes the paper.

%%%
\section{Data stream mining} \label{sec:mining}

Due to advances in data gathering techniques, it is often the case that data
is no longer viewed as a static collection, but rather as a potentially very
large dynamic set, or stream, of incoming data points. The most common data
stream mining tasks are clustering, classification and frequent pattern
mining \citep{stream:Aggarwal:2007,stream:Gama:2010}. In this section we will
give a brief introduction to these data stream mining tasks. We will focus on
clustering, since this is also the current focus of package \pkg{stream}.

\subsection{Data stream clustering} \label{sec:background:dsc}

Clustering, the assignment of data points to (typically $k$) groups such that
points within each group are more similar to each other than to points in
different groups, is a very basic unsupervised data mining task. For static
data sets, methods like $k$-means, $k$-medoids, hierarchical clustering and
density-based methods, among others, have been
developed~\citep{stream:Jain:1999}. Many of these methods are available in
tools like \proglang{R}; however, the standard algorithms need access to all
data points and typically iterate over the data multiple times. This
requirement makes these algorithms unsuitable for large data streams and led
to the development of data stream clustering algorithms.
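To make the contrast concrete, the following sketch compares the two modes of
operation. The object names \code{complete_data}, \code{stream} and
\code{stream_clusterer} are hypothetical placeholders, and the chunk is not
evaluated:

<<eval = FALSE>>=
## Batch clustering: the complete data set must be in memory and is
## scanned repeatedly until the algorithm converges.
km <- kmeans(complete_data, centers = 3)

## Stream clustering: points are consumed in arrival order in a single
## pass and only a bounded summary (e.g., micro-clusters) is kept.
update(stream_clusterer, stream, n = 100)
@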
Over the last 10 years many algorithms for clustering data streams have been
proposed (see \cite{stream:Silva:2013} for a current survey).
% \citep[e.g.,][]{stream_clust:Guha:2003,
% stream_clust:Aggarwal:2003,
% stream_clust:Aggarwal:2004,
% stream_clust:Cao:2006,
% stream_clust:Tasoulis:2006,
% stream_clust:Tasoulis:2007,
% stream_clust:Udommanetanakit:2007,
% stream:Tu:2009,
% stream:Wan+Ng+Dang+Yu+Zhang:2009,
% stream_clust:Kranen:2011}.
Most data stream clustering algorithms deal with the problems of unbounded
stream size and the requirement of real-time processing in a single pass by
using the following two-stage online/offline approach introduced
by~\cite{stream_clust:Aggarwal:2003}.

\begin{enumerate}
\item {Online:} Summarize the data using a set of $k^\prime$~micro-clusters
organized in a space efficient data structure which also enables fast look-up.
Micro-clusters were introduced for \emph{CluStream} by
\cite{stream_clust:Aggarwal:2003} based on the idea of cluster features
developed for clustering large data sets with the \emph{BIRCH}
algorithm~\citep{stream:Zhang:1996}. Micro-clusters are representatives for
sets of similar data points and are created using a single pass over the data
(typically in real time when the data stream arrives). Micro-clusters are
often represented by cluster centers and additional statistics such as weight
(local density) and dispersion (variance). Each new data point is assigned to
its closest (in terms of a similarity function) micro-cluster. Some algorithms
use a grid instead and micro-clusters are represented by non-empty grid cells
(e.g., \emph{D-Stream} by \cite{stream:Tu:2009} or \emph{MR-Stream} by
\cite{stream:Wan+Ng+Dang+Yu+Zhang:2009}). If a new data point cannot be
assigned to an existing micro-cluster, a new micro-cluster is created. The
algorithm might also perform some housekeeping (merging or deleting
micro-clusters) to keep the number of micro-clusters at a manageable size or
to remove information outdated due to a change in the stream's data generating
process.

\item {Offline:} When the user or the application requires a clustering, the
$k^\prime$ micro-clusters are reclustered into $k \ll k^\prime$ final
clusters, sometimes referred to as macro-clusters. Since the offline part is
usually not regarded as time critical, most researchers use a conventional
clustering algorithm where micro-cluster centers are regarded as
pseudo-points. Typical reclustering methods involve $k$-means or clustering
based on the concept of reachability introduced by
\emph{DBSCAN}~\citep{Ester96adensity-based}. The algorithms are often modified
to also take the weight of the micro-clusters into account.
\end{enumerate}

%A first data stream clustering algorithm called \emph{STREAM} was proposed by
%\cite{stream_clust:O'Callaghan:2002} \citep[see also][]{stream_clust:Guha:2003}.
%The algorithm attacks the $k$-medians
%problem by dividing the data stream into pieces, clusters each piece
%individually and then iteratively reclusters the resulting centers to obtain a
%final clustering.
%
%Starting with \emph{CluStream}~\citep{stream_clust:Aggarwal:2003}
%most modern data stream clustering algorithms separate the clustering process into two parts.
%An online component which aggregates the
%data stream in real-time into summaries often called micro-clusters
%(an extension of cluster feature vectors used by BIRCH~\citep{stream_clust:Zhang:1996})
%and
%an offline component which uses only the summaries to create a final clustering.
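In \pkg{stream}, this two-stage approach can be expressed directly with
\code{DSC_TwoStage} (described in Section~\ref{sec:dst}), which combines any
available online (micro-clustering) and offline (macro-clustering) algorithm.
The following sketch, not evaluated here, pairs D-Stream with $k$-means
reclustering; this particular combination is just one possible choice:

<<eval = FALSE>>=
library("stream")
set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 2)

## online component: summarize the stream into micro-clusters;
## offline component: recluster the micro-clusters into k = 3
## macro-clusters when a clustering is requested.
twostage <- DSC_TwoStage(micro = DSC_DStream(gridsize = .1),
  macro = DSC_Kmeans(k = 3))
update(twostage, stream, n = 500)
@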
%The offline component is typically only executed on demand and uses %traditional clustering %algorithms, such as $k$-means or the density-based method \emph{DBSCAN}~\citep{stream:Ester:1996}. %Summarizing the %incoming data points into micro-clusters ensures that the input to the offline %component is constrained to a finite space. %To maintain a finite number of micro-clusters, a pruning function is often %associated within the summarization process. The goal of the pruning process is %to discard micro-clusters that have not enough data points assigned to them %or became obsolete. %The latter case occurs when the structure of %the data stream changes over time which is known as concept drift %\citep{stream:Masud+Chen+Khan+Aggarwal+Gao+Han+Thuraisingham:2010}. %%% FIXME: check reference % %In CluStream \citep{stream_clust:Aggarwal:2003} micro-clusters can be deleted %and merged and permanently stored at different points in time to allow to %create final clusterings (recluster micro-clusters with $k$-means) for %different time frames. %\cite{stream_clust:Kriegel:2003} and %\cite{stream_clust:Tasoulis:2007} present variants of the density based method %{\em OPTICS} \citep{stream_clust:Ankerst:1999} suitable for streaming data. %\cite{stream_clust:Aggarwal:2004} introduce {\em HPStream} which finds %clusters that are well defined in different subsets of the dimensions %of the data. The set of dimensions for each cluster can evolve over time %and a fading function is used to discount the influence of older data points %by fading the entire cluster structure. %\cite{stream_clust:Cao:2006} introduce {\em DenStream} which maintains %micro-clusters in real time and uses a variant of %GDBSCAN \citep{stream_clust:Sander:1998} to produce a final clustering %for users. %\cite{stream_clust:Tasoulis:2006} present {\em WSTREAM,} which uses %kernel density estimation to find rectangular windows to represent clusters. %The windows can move, contract, expand and be merged over time. %More recent density-based data stream clustering algorithms are %{\em D-Stream} \citep{stream_clust:Tu:2009} and %{\em MR-Stream} \citep{stream_clust:Wan:2009}. %{\em D-Stream} uses an online %component to map each data point into a predefined grid and then uses an %offline component to cluster the grid based on density. %{\em MR-Stream} facilitates the discovery of clusters %at multiple resolutions by using a %grid of cells that can dynamically be sub-divided into more cells using a tree %data structure. %\citep{stream:Aggarwal:2009}, threshold Nearest Neighbor (tNN) %One of the most challenging aspects of clustering is how to evaluate how well %an algorithm has performed. There are a number of metrics used to measure the %performance of traditional clustering algorithms %\citep{stream:Manning+Raghavan+Schtze:2008}, but they are often used as an %estimate of the performance rather than a guaranteed figure. Many of the %available metrics require comparison to a true classification of the data so %that it can be determined if incoming data points are being clustered into the %appropriate groups. Common metrics include purity, precision, recall, entropy, %etc. The MOA framework uses many of these traditional clustering metrics, and %additional stream clustering metrics to evaluate the performance on stream %clustering algorithms. %In \pkg{stream}, our goal with data stream clustering is to separate the online %component from each data stream clustering algorithm and use it as its own %entity. 
%We can then compare the performance of the online components of each
%algorithm when paired with a selected offline component. This is a feature
%unique to the \pkg{stream} framework. We focus on the online component of the
%algorithms because \proglang{R} already contains definitions for many of the
%offline components used, and the novelty of many of the algorithms is in the
%online component. Section \ref{sec:design} discusses what data stream
%clustering algorithms are currently available in the framework, and how they
%can be operated upon.

The most popular approach to adapt to concept drift (changes of the data
generating process over time) is to use the exponential fading strategy first
introduced for \emph{DenStream} by~\cite{stream_clust:Cao:2006}. Micro-cluster
weights are faded in every time step by a factor of $2^{-\lambda}$, where
$\lambda >0$ is a user-specified fading factor. This way, new data points
%with a weight of one
have more impact on the clustering and the influence of older points gradually
disappears. Alternative models use sliding or landmark windows. Details of
these methods as well as other data stream clustering algorithms are discussed
in the survey by \cite{stream:Silva:2013}.

\subsection{Outlier detection} \label{sec:background:outlier_dtct}

Outlier detection in data streams is a popular task, often used for risk
management, e.g., fraud and intrusion detection. From the end-user point of
view, outliers are important and meaningful data points that stand out from
the usual populations (clusters) found in data
streams \citep{stream:Silva:2013}. This differentiation between clusters and
outliers can be based on statistical properties or on density.\\
Special outlier detectors are built to detect such outliers in big data
streams. Detecting small statistical (or density) differences between clusters
and outliers is the key feature of a good outlier detector. Outlier detectors
that can recognize outliers with smaller statistical or density variations are
better. This leads to the notion of the outlier detector breaking point, which
is the smallest statistical or density variation for which the outlier
detector can still detect outliers.\\
End-user interest in outliers requires that detected outliers are reported
back to the user in a limited time frame, which is directly correlated with
data stream velocity.\\
Such a time constraint requires a more fine-grained approach than the
previously described two-stage approach. For outlier detectors, such as
Continuous Outlier Detection (COD), Micro-cluster Continuous Outlier Detection
(MCOD)~\citep{kontaki2016efficient}, and Statistical Hierarchical Clustering
(SHC)~\citep{krleza2020shc}, this means processing each data point retrieved
from the input data stream in two steps. In the first step, the outlier
detector tries to classify the input data point. If the data point does not
belong to any known cluster (population), the detector must decide in the
second step whether the data point represents a new outlier.\\
The evolving nature of the input data stream can cause outliers to become
inliers, which is an issue when assessing outlier detection correctness. In
such cases, outlier detectors must have outlier tracking capabilities, which
allow users to re-check each outlier individually and determine whether a
previously detected outlier is still an outlier or has evolved into an inlier
in the meantime.
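The classify-then-decide processing of a single point can be illustrated with
a minimal, package-independent sketch. It assumes Euclidean distance to the
cluster centers and a fixed distance threshold; real detectors such as MCOD or
SHC instead use statistical or density-based criteria and maintain evolving
micro-clusters. All names below (\code{classify_point}, \code{centers}) are
hypothetical:

<<eval = FALSE>>=
## Step 1: try to assign the point to its nearest known cluster.
## Step 2: if it is too far from all clusters, flag it as an outlier.
classify_point <- function(point, centers, threshold) {
  d <- apply(centers, 1, function(ctr) sqrt(sum((point - ctr)^2)))
  if (min(d) <= threshold) list(cluster = which.min(d), outlier = FALSE)
  else list(cluster = NA, outlier = TRUE)
}

centers <- rbind(c(0, 0), c(5, 5))                  # two known clusters
classify_point(c(.2, -.1), centers, threshold = 1)  # inlier of cluster 1
classify_point(c(2.5, 2.5), centers, threshold = 1) # candidate outlier
@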
\subsection{Other popular data stream mining tasks} \label{sec:background:dscl}

%\subsection{Classification} \label{sec:background:dscl}
Classification, learning a model in order to assign labels to new, unlabeled
data points, is a well-studied supervised machine learning task. Methods
include naive Bayes, $k$-nearest neighbors, classification trees, support
vector machines, rule-based classifiers and many
more~\citep{stream:Hastie+Tibshirani+Friedman:2001}. However, as with
clustering, these algorithms need access to the complete training data several
times and thus are not suitable for data streams with constantly arriving new
training data and concept drift.

Several classification methods suitable for data streams have been developed.
Examples are \emph{Very Fast Decision Trees (VFDT)} \citep{stream:Domingos:2000}
using Hoeffding trees, the time window-based \emph{Online Information Network
(OLIN)} \citep{stream:Last:2002} and \emph{On-demand Classification}
\citep{stream:Aggarwal:2004} based on micro-clusters found with the
data-stream clustering algorithm
CluStream~\citep{stream_clust:Aggarwal:2003}. For a detailed discussion of
these and other methods we refer the reader to the survey by
\cite{stream:Gaber:2007}.

%\cite{stream:Last:2002} introduces \emph{OLIN,} an online classification
%system, which instead of all data only uses a training window with the most
%recent data to learn a classifier. The size of the training window and the
%frequency of creating a new classification model are adjusted to compensate for
%the current rate of concept drift. Since OLIN only requires the
%data in the current training window it can be used for data streams.

%An interesting new
%novel class detection: www.cs.uiuc.edu/~hanj/pdf/pakdd10i\_mmasud.pdf

%\subsection{Frequent pattern mining}
Another common data stream mining task is frequent pattern mining. The aim of
frequent pattern mining is to enumerate all frequently occurring patterns
(e.g., itemsets, subsequences, subtrees, subgraphs) in large transaction data
sets. Patterns are then used to summarize the data set and can provide
insights into the data. Although finding all frequent patterns in large data
sets is a computationally expensive task, many efficient algorithms have been
developed for static data sets. A prime example is the \emph{APRIORI}
algorithm \citep{arules:Agrawal:1993} for frequent itemsets. However, these
algorithms use breadth-first or depth-first search strategies which require
passing over each transaction (i.e., data point) several times and thus make
them unusable for the case where transactions arrive and need to be processed
in a streaming fashion. Algorithms for frequent pattern mining in streams are
discussed in the surveys by \cite{stream:Jin:2007}, \cite{stream:Cheng:2008}
and \cite{stream:Vijayarani:2012}.

% Add regression and outlier detection.

%\pagebreak[4]

\subsection{Existing tools} \label{sec:background:moa}

MOA\footnote{\url{http://moa.cms.waikato.ac.nz/}} (short for Massive Online
Analysis) is a framework implemented in \proglang{Java} for stream
classification, regression and
clustering \citep{stream:Bifet+Holmes+Kirkby+Pfahringer:2010}. It was the
first experimental framework to provide easy access to multiple data stream
mining algorithms, as well as to tools for generating data streams that can be
used to measure and compare the performance of different algorithms.
Like WEKA~\citep{stream:Witten:2005}, a popular collection of machine learning
algorithms, MOA is also mainly developed by the University of Waikato and its
graphical user interface (GUI) and workflow are similar to those of WEKA.
%
%The workflow in MOA consists of three main steps:
%\begin{enumerate}
%\item Selection of the data stream model.
%\item Selection of the learning algorithm(s) and evaluation measure.
%\item Run the algorithm and inspect the results.
%\end{enumerate}
%
%Similar to WEKA, MOA uses a very appealing graphical user interface.
Classification results are shown as text, while clustering results have a
visualization component that shows both the evolution of the clustering (in
two dimensions) and various performance metrics over
time~\citep{stream:Kranen:2010}.

SAMOA\footnote{\url{http://yahoo.github.io/samoa/}} (Scalable Advanced Massive
Online Analysis) is a recently introduced tool for distributed stream mining
with Storm or the Apache S4 distributed computing platform. Similar to MOA, it
is implemented in \proglang{Java} and supports the basic data stream mining
tasks of clustering, classification and frequent pattern mining. Some MOA
clustering algorithms are interfaced in SAMOA. SAMOA currently does not
provide a GUI.

Another distributed processing framework and streaming machine learning
library is Jubatus\footnote{\url{http://jubat.us/en/}}. It is implemented in
\proglang{C++} and supports classification, regression and clustering. For
clustering it currently supports $k$-means and Gaussian Mixture Models
(version 0.5.4).

Commercial data stream mining platforms include IBM InfoSphere Streams and
Microsoft StreamInsight (part of MS SQL Server). These platforms aim at
building applications using existing data stream mining algorithms rather than
developing and testing new algorithms.

MOA is currently the most complete framework for data stream clustering
research and it is an important pioneer in experimenting with data stream
algorithms. MOA's advantages are that it interfaces with WEKA, already
provides a set of data stream classification and clustering algorithms and it
has a clear \proglang{Java} interface to add new algorithms or use the
existing algorithms in other applications. A drawback of MOA and the other
frameworks for \proglang{R} users is that, for all but very simple
experiments, custom \proglang{Java} code has to be written. Also, using MOA's
data stream mining algorithms together with the advanced capabilities of
\proglang{R} to create artificial data and to analyze and visualize the
results is currently very difficult and involves running code and copying data
manually. The recently introduced
R~package~\pkg{RMOA}~\citep{stream:Wijffels:2014} interfaces MOA's data stream
classification algorithms; however, it focuses on processing large data sets
that do not fit into main memory and not on data streams.

\section{The stream framework} \label{sec:design}

The \pkg{stream} framework provides an \proglang{R}-based alternative to MOA
which seamlessly integrates with the extensive existing \proglang{R}
infrastructure. Since \proglang{R} can interface code written in many
different programming languages (e.g., \proglang{C/C++}, \proglang{Java},
\proglang{Python}), data stream mining algorithms in any of these languages
can be easily integrated into \pkg{stream}.
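Both \pkg{stream} and its MOA interface add-on \pkg{streamMOA} (discussed
below) are released on CRAN, so assuming a standard \proglang{R} setup they
can be installed in the usual way (not run here):

<<eval = FALSE>>=
## install the released versions from CRAN
install.packages(c("stream", "streamMOA"))
@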
\pkg{stream} is based on several packages including
\pkg{fpc}~\citep{stream:Hennig:2014},
\pkg{clue}~\citep{stream:Hornik:2013},
\pkg{cluster}~\citep{stream:Maechler:2014},
\pkg{clusterGeneration}~\citep{stream:Qiu+Joe:2009},
\pkg{MASS}~\citep{stream:Venables+Ripley:2002},
\pkg{proxy}~\citep{stream:Meyer+Buchta:2010}, and others. The \pkg{stream}
extension package \pkg{streamMOA}~\citep{stream:streamMOA:2014} also
interfaces the data stream clustering algorithms already available in MOA
using the \pkg{rJava} package by \cite{stream:Urbanek:2013}.
%Other than MOA, \pkg{stream}
%can incorporate any algorithm which is written in a
%language interfaceable by \proglang{R}.

We will start with a very short example to make the introduction of the
framework and its components easier to follow. After loading \pkg{stream}, we
create a simulated data stream with data points drawn from three random
Gaussians in 2D space. Note that we set the random number generator seed every
time we create simulated data sets to get reproducible results.

\begin{figure}[tb]
\centering
\includegraphics[width=.5\linewidth]{stream-simple_kmeans_reclustering}
\caption{Data stream clustering result of D-Stream on a simple simulated data
set with three random Gaussians. Micro-clusters are shown as circles and
macro-clusters are shown as crosses (size represents weight).}
\label{figure:simple_kmeans_reclustering}
\end{figure}

<>=
library("stream")
set.seed(1000)
stream_orig <- DSD_Gaussians(k = 3, d = 2)
@

Next, we apply a filter to transform the stream. Here, we will scale the
stream to z-scores. We use a pipe here, but \code{DSF_Scale()} can also be
called with the input stream as its first argument.

<>=
stream <- stream_orig %>% DSF_Scale()
@

Now, we create an instance of the density-based data stream clustering
algorithm D-Stream which uses grid cells as micro-clusters. We specify the
grid cell size (\code{gridsize}) as .5 and require that the density of a grid
cell (\code{Cm}) be at least 1.2 times the average cell density for the cell
to become a micro-cluster. Then we update the model with the next 500 data
points from the stream.

<>=
dstream <- DSC_DStream(gridsize = .5, Cm = 1.2)
update(dstream, stream, n = 500)
@

Finally, we perform reclustering using $k$-means with three clusters and plot
the resulting micro- and macro-clusters (see
Figure~\ref{figure:simple_kmeans_reclustering}).

<>=
km <- DSC_Kmeans(k = 3)
recluster(km, dstream)
plot(km, stream, type = "both")
@

As shown in this example, the \pkg{stream} framework consists of three main
components:
\begin{enumerate}
\item {Data stream data (DSD)} simulates or connects to a data stream.
\item {Data stream filter (DSF)} applies one or more transformations
(filters) to the stream.
\item {Data stream task (DST)} performs a data stream mining task. In the
example above, we performed a data stream clustering (DSC) task twice.
\end{enumerate}

Figure \ref{figure:workflow} shows a high-level view of the interaction of the
components. We start by creating a DSD object and a DST object. Then the DST
object starts receiving data from the DSD object. At any time, we can obtain
the current results from the DST object. DSTs can implement any type of data
stream mining task (e.g., classification or clustering).
%In the following we will concentrate on clustering
%since \pkg{stream} currently focuses on this type of task.
%, but the
%framework is implemented such that classification, frequent pattern mining
%or any other task can be added easily in the future.
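For example, the current state of the D-Stream clusterer created above can be
inspected at any point while the stream is being processed. The following is
not evaluated here; for D-Stream, the centers are the centers of the dense
grid cells and the weights are the cell densities:

<<eval = FALSE>>=
## inspect the current result held by the DST object
get_centers(dstream)  # micro-cluster centers
get_weights(dstream)  # micro-cluster weights
@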
\begin{figure}[tb]
\centering
\includegraphics[width=.9\linewidth]{architecture}
\caption{A high-level view of the \pkg{stream} architecture.}
\label{figure:workflow}
\end{figure}

Since stream mining is a relatively young field and many advances are expected
in the near future, the object oriented framework in \pkg{stream} was
developed with easy extensibility in mind. We are using the
\proglang{S3}~class system~\citep{stream:Chambers:1992} throughout and, for
performance reasons, the \proglang{R}-based algorithms are implemented using
reference classes. The framework provides a lightweight interface definition
(i.e., an abstract class) for each of the core components which can be easily
implemented to create new data stream types or to interface new data stream
mining algorithms. Developers can also extend the infrastructure with new data
mining tasks. Details for developers interested in extending \pkg{stream} can
be found in the package's vignette and manual
pages~\citep{stream:stream:2014}. In the following we will concentrate on
describing the aspects of the framework which are important to users
interested in dealing with data streams and performing data stream mining
tasks in \proglang{R}.

\section{Data stream data (DSD)} \label{sec:dsd}

\subsection{Introduction}

The first step in the \pkg{stream} workflow is to select a data stream
implemented as a data stream data (DSD) object. This object can be a
management layer on top of a real data stream, a wrapper for data stored in
memory or on disk, or a generator which simulates a data stream with known
properties for controlled experiments. Figure~\ref{figure:dsd} shows the
relationship (inheritance hierarchy) of the DSD classes as a UML class
diagram~\citep{stream:Fowler:2003}. All DSD classes extend the abstract base
class~\code{DSD}. There are currently two types of DSD implementations:
classes which implement \proglang{R}-based data streams~(\code{DSD_R}) and
MOA-based stream generators~(\code{DSD_MOA}) provided in \pkg{streamMOA}. Note
that abstract classes define interfaces and only implement common
functionality. Only implementation classes can be used to create objects
(instances). This mechanism is not enforced by S3, but is implemented in
\pkg{stream} by providing, for all abstract classes, constructor functions
which produce an error.

The package~\pkg{stream} currently provides the following set of DSD
implementations:

\begin{itemize}
\item Simulated streams with static structure.
\begin{itemize}
\item\code{DSD_BarsAndGaussians} generates two uniformly filled rectangular
clusters and two Gaussian clusters with different densities.
\item \code{DSD_Gaussians} generates randomly placed static clusters with
random multivariate Gaussian distributions. Allows generating and marking
outliers for outlier detectors.
\item\code{DSD_mlbenchData} provides streaming access to machine learning
benchmark data sets found in the \code{mlbench}
package~\citep{stream:Leisch:2010}.
\item\code{DSD_mlbenchGenerator} interfaces the generators for artificial data
sets defined in the \code{mlbench} package.
\item\code{DSD_Target} generates a ball in circle data set.
\item\code{DSD_UniformNoise} generates uniform noise in a $d$-dimensional
(hyper) cube.
\end{itemize}

\item Simulated streams with concept drift.
\begin{itemize}
\item\code{DSD_Benchmark}, a collection of simple benchmark problems including
splitting and joining clusters, and changes in density or size.
This collection is intended to grow into a comprehensive benchmark set used
for algorithm comparison.
\item\code{DSD_MG}, a generator to specify complex data streams with concept
drift. The shape as well as the behavior of each cluster over time (changes in
position, density and dispersion) can be specified using keyframes (similar to
keyframes in animation and film making) or by mathematical functions.
\item\code{DSD_RandomRBFGeneratorEvents} (\pkg{streamMOA}) generates streams
using radial basis functions with noise. Clusters move, merge and split.
\end{itemize}

\item Connectors to real data and streams.
\begin{itemize}
\item \code{DSD_Memory} provides a streaming interface to static, matrix-like
data (e.g., a data frame, a matrix) in memory which represents a fixed portion
of a data stream. Matrix-like objects also include large objects potentially
stored on disk like \code{ffdf} from
package~\pkg{ff}~\citep{stream:Adler:2014} or \code{big.matrix} from
package~\pkg{bigmemory}~\citep{stream:Kane:2013}. Any matrix-like object which
implements at least row subsetting with \code{"["} and \code{dim()} can be
used. Using these, stream mining algorithms (e.g., clustering) can be
performed on data that does not fit into main memory. In addition,
\code{DSD_Memory} can directly create a static copy of a portion of another
DSD object to be replayed in experiments several times.
\item \code{DSD_ReadCSV} reads data line by line in text format from a file or
an open connection and makes it available in a streaming fashion. This way
data that is larger than the available main memory can be processed.
Connections can be used to read from real-time data streams.
\item \code{DSD_ReadDB} provides an interface to an open result set from a SQL
query to a relational database. Any of the many database management systems
with a \pkg{DBI} interface~\citep{stream:DBI:2014} can be used.
%(SQLite, mySQL, Oracle, SQLServer, etc.) can be used.
\end{itemize}

\item In-flight stream operations.
\begin{itemize}
\item \code{DSD_ScaleStream} can be used to standardize (centering and
scaling) data in a data stream in-flight.
\end{itemize}
\end{itemize}

\begin{figure}
\centering
\includegraphics[width=\linewidth]{dsd_uml}
\caption{Overview of the data stream data (DSD) class structure.}
\label{figure:dsd}
\end{figure}

All DSD implementations share a simple interface consisting of the following
two functions:

\begin{enumerate}
\item A creator function. This function typically has the same name as the
class. By definition the function name starts with the prefix \code{DSD_}. The
list of parameters depends on the type of data stream it creates. The most
common input parameters for the creation of DSD classes for clustering are
\code{k}, the number of clusters (i.e., dense areas), and \code{d}, the number
of dimensions. A full list of parameters can be obtained from the help page
for each class. The result of this creator function is not a data set but an
object representing the stream's properties and its current state.

\item A data generating function\\
\code{get_points(x, n = 1, outofpoints = c("stop", "warn", "ignore"),
info = TRUE, ...)}.\\
This function is used to obtain the next data point (or next \code{n} data
points) from the stream represented by object~\code{x}. Parameter
\code{outofpoints} controls how to deal with a stream which runs out of points
(the stream source does not provide more points at this time). For
\code{"warn"} and \code{"ignore"} all (possibly zero) available points are
returned.
For clustering data, the data points are returned as a data frame with each
row representing a single data point. For other types of data streams (e.g.,
transaction data for frequent pattern mining), the returned points might be
represented in a different, appropriate way (e.g., as a list).
\end{enumerate}

In addition to these core functions, several utility functions like
\code{print()}, \code{plot()} and \code{write_stream()} (to save a part of a
data stream to disk) are provided by \pkg{stream} for class \code{DSD} and are
available for all data stream sources. Different data stream implementations
might have additional functions implemented. For example,
\code{DSD_Gaussians}, \code{DSD_Memory} and \code{DSD_ReadCSV} provide
\code{reset_stream()} to reset the position in the stream to its beginning.

%Following this simple interface,
%other data stream implementations can be easily added.
%This will be discussed in Section~\ref{sec:extension}.

Next we give some examples of how to manage data streams using \pkg{stream}.
In Section~\ref{examples:ds} we start by creating a data stream using
different implementations of the DSD class. The second example in
Section~\ref{examples:disk} shows how to save and read stream data to and from
disk. Section~\ref{examples:replay} gives examples of how to reuse the same
data from a stream in order to perform comparison experiments with multiple
data stream mining algorithms on exactly the same data. All examples contain
the complete code necessary for replication.

\subsection{Example: Creating a data stream} \label{examples:ds}

%In this example, we focus on implementations of the DSD class to model
%data streams.

<>=
library("stream")
set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 3, noise = .05, p = c(.5, .3, .1))
stream
@

After loading the \pkg{stream} package
%(and setting a seed for the random number generator to make the experiments reproducible),
we call the creator function for the class \code{DSD_Gaussians} specifying the
number of clusters as $k=3$ and a data dimensionality of $d=3$ with an added
noise of 5\% of the generated data points. Each cluster is represented by a
multivariate Gaussian distribution with a randomly chosen mean (cluster
center) and covariance matrix.

New data points are requested from the stream using \code{get_points()}. When
a new data point is requested from this generator, a cluster is chosen
randomly (using the probability weights in \code{p}) and then a point is drawn
from the multivariate Gaussian distribution given by the mean and covariance
matrix of the cluster. Noise points are generated in a bounding box from a
$d$-dimensional uniform distribution. The following instruction requests
$n = 10$ new data points.

<>=
p <- get_points(stream, n = 10)
p
@

The result is a data frame containing the data points as rows. For evaluation
it is often important to know the ground truth, i.e., from which cluster each
point was created. Many generators also return the ground truth (class or
cluster label) and other information as columns starting with \code{.}. Note
that the data was created by a generator with 5\% noise. Noise points do not
belong to any cluster and thus have a class label of \code{NA}.

Next, we plot 500 points from the data stream to get an idea about its
structure.

<>=
plot(stream, n = 500)
@

The resulting scatter plot matrix is shown in Figure~\ref{figure:static}. The
assignment values are automatically used to distinguish between clusters using
color and different plotting symbols.
Noise points are plotted as gray dots. The data can also be projected on its
first two principal components using \code{method="pc"}.

<>=
plot(stream, n = 500, method = "pc")
@

\begin{figure}[t]
\centering
\includegraphics[width=.8\linewidth]{stream-static}
\caption{Plotting 500 data points from the data stream.}
\label{figure:static}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=.4\linewidth]{stream-static_pc}
\caption{Plotting 500 data points from the data stream projected onto its
first two principal components.}
\label{figure:static_pc}
\end{figure}

Figure~\ref{figure:static_pc} shows the projected data.

\pkg{stream} also supports data streams which contain concept drift. Several
examples of such data stream generators are collected in
\code{DSD_Benchmark}. We create an instance of the first benchmark generator
which creates two clusters moving in two-dimensional space. One moves from top
left to bottom right and the other one moves from bottom left to top right.
Both clusters overlap when they meet exactly in the center of the data space.

<>=
set.seed(1000)
stream <- DSD_Benchmark(1)
stream
@

To show concept drift, we request four times 250 data points from the stream
and plot them. To fast-forward in the stream we request 1400 points in between
the plots and ignore them.

<>=
for(i in 1:4) {
  plot(stream, 250, xlim = c(0, 1), ylim = c(0, 1))
  tmp <- get_points(stream, n = 1400)
}
@

<>=
plot(stream, 250, xlim = c(0, 1), ylim = c(0, 1))
arrows(.15, .85, .85, .15, col = rgb(.8, .8, .8, .6), lwd = 10)
arrows(.15, .15, .85, .85, col = rgb(.8, .8, .8, .6), lwd = 10)
tmp <- get_points(stream, n = 1400)
@

<>=
plot(stream, 250, xlim = c(0, 1), ylim = c(0, 1))
arrows(.15, .85, .85, .15, col = rgb(.8, .8, .8, .6), lwd = 10)
arrows(.15, .15, .85, .85, col = rgb(.8, .8, .8, .6), lwd = 10)
tmp <- get_points(stream, n = 1400)
@

<>=
plot(stream, 250, xlim = c(0, 1), ylim = c(0, 1))
arrows(.15, .85, .85, .15, col = rgb(.8, .8, .8, .6), lwd = 10)
arrows(.15, .15, .85, .85, col = rgb(.8, .8, .8, .6), lwd = 10)
tmp <- get_points(stream, n = 1400)
@

<>=
plot(stream, 250, xlim = c(0, 1), ylim = c(0, 1))
arrows(.15, .85, .85, .15, col = rgb(.8, .8, .8, .6), lwd = 10)
arrows(.15, .15, .85, .85, col = rgb(.8, .8, .8, .6), lwd = 10)
@

\begin{figure}
\centering
\begin{minipage}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-moa1}
\\(a) Position 1
\end{minipage}
\begin{minipage}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-moa2}
\\(b) Position 1650
\end{minipage}
\\
\begin{minipage}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-moa3}
\\(c) Position 3300
\end{minipage}
\begin{minipage}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-moa4}
\\(d) Position 4950
\end{minipage}
\caption{Data points from \code{DSD\_Benchmark(1)} at different positions in
the stream. The two arrows are added to highlight the direction of movement.}
\label{figure:dsd_bench}
\end{figure}

Figure \ref{figure:dsd_bench} shows the four plots where clusters move over
time. Arrows are added to highlight the direction of cluster movement. An
animation of the data can be generated using \code{animate_data()}. We use
\code{reset_stream()} to start the animation at the beginning of the stream.

<>=
reset_stream(stream)
animate_data(stream, n = 10000, horizon = 100, xlim = c(0, 1), ylim = c(0, 1))
@

Animations are recorded using package \pkg{animation}~\citep{stream:Xie:2013}
and can be replayed using \code{ani.replay()}.
<>=
library("animation")
animation::ani.options(interval = .1)
ani.replay()
@

Animations can also be saved as an animation embedded in an HTML document or
an animated image in the Graphics Interchange Format (GIF) which can easily be
used in presentations.

<>=
saveHTML(ani.replay())
saveGIF(ani.replay())
@

More formats for saving the animation are available in
package~\pkg{animation}.
%To see the life animation, we refer the reader to the example code in
%the manual page for \code{animate_data}.

\subsection{Example: Advanced statistical data streams} \label{examples:advanced_stat}

\code{DSD_Gaussians} has capabilities to generate more complex statistical
data streams. In the previous examples, we used simple cluster and outlier
generating capabilities and Euclidean distance for their separation.

\subsubsection{Maximal variance and space limitations}

If we do not predefine covariance matrices using the \code{sigma} parameter,
\code{DSD_Gaussians} can randomly generate covariance matrices. The maximal
variance used to generate covariance matrices can be limited, together with a
space limit within which the clusters must fit.

<<>>=
library("stream")
set.seed(1000)
stream1 <- DSD_Gaussians(k = 3, d = 2,
  variance_limit = c(0, 0.01),
  space_limit = c(0, 5))
stream2 <- DSD_Gaussians(k = 3, d = 2,
  variance_limit = c(.05, .1),
  space_limit = c(0, 5))
@

Next, we plot 1000 points from the data stream, which can be seen in
Figure \ref{figure:dsd_gauss_limits}.

<>=
plot(stream1, 1000)
@

<>=
plot(stream2, 1000)
@

\begin{figure}
\begin{minipage}{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-dsd-lim1}
\\(a) Variance between 0 and 0.01
\end{minipage}
\begin{minipage}{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-dsd-lim2}
\\(b) Variance between .05 and 0.1
\end{minipage}
\caption{Data points from \code{DSD\_Gaussians} with maximal variance and
space limits.}
\label{figure:dsd_gauss_limits}
\end{figure}

As seen in Figure \ref{figure:dsd_gauss_limits}b, clusters can overlap when
the maximal variance limit is high.

\subsubsection{Keeping clusters sufficiently separated}

To keep clusters from overlapping, we can use two separation distance
measures: Euclidean and Mahalanobis (the statistical distance). While
Euclidean distance can be used to some extent, it might not keep clusters
cleanly separated at all times, since cluster size highly depends on the
related covariance matrix. This is why we use the statistical distance
(Mahalanobis) to control cluster separation.

<<>>=
library("stream")
set.seed(1000)
stream1 <- DSD_Gaussians(k = 5, d = 2,
  variance_limit = c(0.01, 0.03),
  space_limit = c(0, 7),
  separation_type = "Mahalanobis",
  separation = 3)
@

<<>>=
set.seed(1000)
stream2 <- DSD_Gaussians(k = 5, d = 2,
  variance_limit = c(0.01, 0.03),
  space_limit = c(0, 7),
  separation_type = "Mahalanobis",
  separation = 10)
@

Plots comprising 1000 points from the data stream can be seen in
Figure \ref{figure:dsd_gauss_mahasep}.
<>=
plot(stream1, 1000)
@

<>=
plot(stream2, 1000)
@

\begin{figure}
\begin{minipage}{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-dsd-ms1}
\\(a) Mahalanobis separation = 3
\end{minipage}
\begin{minipage}{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-dsd-ms2}
\\(b) Mahalanobis separation = 10
\end{minipage}
\caption{Data points from \code{DSD\_Gaussians} with different Mahalanobis
separation values.}
\label{figure:dsd_gauss_mahasep}
\end{figure}

\subsection{Example: Reading and writing data streams} \label{examples:disk}

Although data streams are potentially unbounded by definition and thus storing
the complete stream is infeasible, it is often useful to store parts of a
stream on disk. For example, a small part of a stream with an interesting
feature can be used to test how a new algorithm handles this particular case.
\pkg{stream} has support for reading and writing parts of data streams through
\proglang{R} connections which provide a set of functions to interface
file-like objects including files, compressed files, pipes, URLs or
sockets~\citep{stream:RIO:2011}.

We start the example by creating a DSD object.

<<>>=
library("stream")
set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 5)
@

Next, we write 100 data points to disk using \code{write_stream()}.

<>=
write_stream(stream, "data.csv", n = 100, sep = ",")
@

\code{write_stream()} accepts a DSD object, and then either a connection or a
file name. The instruction above creates a new file called \code{data.csv}.
The \code{sep} parameter defines how the dimensions in each data point (row)
are separated. Here a comma is used to create a comma separated values file.
The actual writing is done by \proglang{R}'s \code{write.table()} function and
additional parameters are passed on. Data points are requested blockwise
(defaults to 100,000 points) from the stream and then written to the
connection. This way, the only restriction on the size of the written stream
is the limitation at the receiving end (e.g., the available storage). Finally,
parameters \code{class} and \code{write_outliers} can be used to control
writing of the class information and outlier marks. These two details are
stored in fields named \code{class} and \code{outlier}, respectively, and can
be read again.

The \code{DSD_ReadCSV} object is used to read a stream from a connection or a
file. It reads only the specified number of data points at a time using the
\code{read.table()} function. Since the data is removed from memory after it
has been read and processed (e.g., by a data stream clustering algorithm), we
can efficiently process files larger than the available main memory in a
streaming fashion. In the following example we create a data stream object
representing data stored as a compressed CSV-file in the package's examples
directory.

<<>>=
file <- system.file("examples", "kddcup10000.data.gz", package = "stream")
stream_file <- DSD_ReadCSV(gzfile(file),
  take = c(1, 5, 6, 8:11, 13:20, 23:41, .class = 42), k = 7)
stream_file
@

Using \code{take} we define which columns should be used as data and mark the
column which contains the ground truth assignment with \code{.class}. We also
specify the true number of clusters \code{k}. Ground truth and number of
clusters do not need to be specified if they are not available or no
evaluation is planned. Note that at this point no data has been read in.
Reading only occurs when \code{get_points} is called.
%\code{DSD_ReadCSV} objects are just like any other DSD object in that you
%can call \code{get_points()} to retrieve data points from the data stream.

<<>>=
get_points(stream_file, n = 5)
@

For clustering it is often necessary to normalize data first. Streams can be
scaled and centered in-flight using \code{DSD_ScaleStream}. The scaling and
centering factors are computed from a set of points (by default 1000) from the
beginning of the stream.

<<>>=
stream_scaled <- DSD_ScaleStream(stream_file, center = TRUE, scale = TRUE)
get_points(stream_scaled, n = 5)
@

%Looping over the data several times and
%resetting the position in the \code{DSD_ReadCSV} to the file's beginning
%is possible with \code{reset_stream()}
%and will described in the next example.

\newpage

\subsection{Example: Replaying a data stream} \label{examples:replay}

An important feature of \pkg{stream} is the ability to replay portions of a
data stream. With this feature we can capture a special feature of the data
(e.g., an anomaly) and then adapt our algorithm and test if the change
improved the behavior on exactly that data. Also, this feature can be used to
conduct experiments where different algorithms need to be compared using
exactly the same data.

There are several ways to replay streams. As described in the previous
section, we can write a portion of a stream to disk with \code{write_stream()}
and then use \code{DSD_ReadCSV} to read the stream portion back every time it
is needed. However, often the interesting portion of the stream is small
enough to fit into main memory or might be already available as a matrix or a
data frame in \proglang{R}. In this case we can use the DSD class
\code{DSD_Memory} which provides a stream interface for matrix-like objects.

For illustration purposes, we use the data for four major European stock
market indices available in \proglang{R} as a matrix-like time series object.

<<>>=
data("EuStockMarkets", package = "datasets")
head(EuStockMarkets)
@

Next, we create a \code{DSD_Memory} object. The number of true clusters $k$ is
unknown.

<<>>=
replayer <- DSD_Memory(EuStockMarkets, k = NA)
replayer
@

Every time we get a point from \code{replayer}, the stream moves to the next
position (row) in the data.

<<>>=
get_points(replayer, n = 5)
replayer
@

Note that the stream is now at position 6. The stream only has 1854 points
left and the following request for more than the available number of data
points results in returning all remaining points and a warning.

<<>>=
points <- get_points(replayer, n = 2000)
dim(points)
@

Note that this behavior can be changed to an error or to ignoring the problem
with the parameter \code{outofpoints} when constructing table-based data
streams like \code{DSD_Memory}, \code{DSD_ReadStream}, and \code{DSD_ReadDB}.

\code{DSD_Memory} and \code{DSD_ReadCSV} can also be created to loop
indefinitely, i.e., start over once the last data point is reached. This is
achieved by passing \code{loop = TRUE} to the creator function. The current
position in the stream for those two types of DSD classes can also be reset to
the beginning of the stream or, for \code{DSD_Memory}, to an arbitrary
position via \code{reset_stream()}. Here we set the stream to position 100.

<<>>=
reset_stream(replayer, pos = 100)
replayer
@

\code{DSD_Memory} also accepts other matrix-like objects.
\code{DSD_Memory} also accepts other matrix-like objects. This includes data shared between processes or data that is too large to fit into main memory represented by memory-mapped files using \code{ffdf} objects from package~\pkg{ff}~\citep{stream:Adler:2014} or \code{big.matrix} objects from package~\pkg{bigmemory}~\citep{stream:Kane:2013}. In fact, any object that provides basic matrix functions like \code{dim()} and subsetting with \code{"["} can be used.

\section{Data stream task (DST)} \label{sec:dst}

After choosing a DSD class to use as the data stream source, the next step in the workflow is to define a data stream task (DST). In \pkg{stream}, a DST refers to any data mining task that can be applied to data streams. The design is flexible enough for future extensions, including tasks that are currently unknown. Figure~\ref{figure:dst} shows the class hierarchy for DST. It is important to note that the DST base class is shown merely for conceptual purposes and is not directly visible in the code. The reason is that the actual implementations of data stream operators (DSO), clustering (DSC), classification (DSClass) or frequent pattern mining (DSFPM) are typically quite different and the benefit of sharing methods would be minimal.

DST classes implement mutable objects which can be changed without creating a copy. This is more efficient, since otherwise a new copy of all data structures used by the algorithm would be created for processing each data point. Mutable objects can be implemented in \proglang{R} using environments or the recently introduced reference class construct (see package~\pkg{methods} by the \cite{stream:R:2005}). Alternatively, pointers to external data structures in \proglang{Java} or \proglang{C/C++} can be used to create mutable objects.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{dst_uml}
\caption{Overview of the data stream task (DST) class structure with subclasses for data stream operators (DSO), clustering (DSC), classification (DSClass) and frequent pattern mining (DSFPM).}
\label{figure:dst}
\end{figure}

We will restrict the following discussion to data stream clustering (DSC) since \pkg{stream} currently focuses on this task. \pkg{stream} currently provides moving windows and sampling from a stream as data stream operators (DSO). The operators provide simple functionality which can be used by other tasks and we will discuss them in the context of clustering. Packages which cover the other tasks using the \pkg{stream} framework are currently under development.

\subsection{Introduction to data stream clustering (DSC)}

Data stream clustering algorithms are implemented as subclasses of the abstract class \code{DSC} (see Figure~\ref{figure:dst}). First, we differentiate between the interfaces for clustering algorithms. \code{DSC_R} provides a native \proglang{R} interface, while \code{DSC_MOA} (available in \pkg{streamMOA}) provides an interface to algorithms implemented for the \proglang{Java}-based MOA framework. DSCs implement the online process as subclasses of \code{DSC_Micro} (since they produce micro-clusters) and the offline process as subclasses of \code{DSC_Macro}. To implement the typical two-stage process in data stream clustering, \pkg{stream} provides \code{DSC_TwoStage} which can be used to combine any available micro- and macro-clustering algorithm.

The following functions can be used for objects of subclasses of DSC:

\begin{itemize}
\item A creator function which creates an empty clustering. Creator function names by definition start with the prefix \code{DSC_}.
\item \code{update(dsc, dsd, n = 1, verbose = FALSE, ...)} which accepts a DSC object and a DSD object. It requests \code{n} data points from \code{dsd} and adds them to the clustering in \code{dsc}.

\item \code{nclusters(x, type = c("auto", "micro", "macro"), ...)} returns the number of clusters currently in the DSC object. This is important since the number of clusters is not fixed for most data stream clustering algorithms. DSC objects can contain several clusterings (e.g., micro- and macro-clusters) at the same time. The default value for \code{type} is \code{"auto"} and results in \code{DSC_Micro} objects returning micro-cluster information and \code{DSC_Macro} objects returning macro-cluster information. Most \code{DSC_Macro} objects also store micro-clusters and using \code{type} these can also be retrieved. Some \code{DSC_Micro} implementations also have a built-in reclustering procedure and \code{type} allows the user to retrieve macro-cluster information for these as well. Trying to access cluster information that is not available in the clustering results in an error. \code{type} is also available for many other functions.

\item \code{get_centers(x, type = c("auto", "micro", "macro"), ...)} returns the centers of the clusters of the DSC object. Depending on the clustering algorithm the centers can be centroids, medoids, centers of dense grids, etc.

\item \code{get_weights(x, type = c("auto", "micro", "macro"), ...)} returns the weights of the clusters in the DSC object \code{x}. How the weights are calculated depends on the clustering algorithm. Typically they are a function of the number of points assigned to each cluster.

\item \code{get_assignment(dsc, points, type = c("auto", "micro", "macro"),} \\ \code{method = c("auto", "model", "nn"), ...)} returns a cluster assignment vector indicating to which cluster each data point in \code{points} would be assigned. The assignment can be determined by the model (e.g., the point falls inside the radius of the micro-cluster) or via nearest neighbor assignment (\code{"nn"}). \code{method = "auto"} selects model-based assignment if available and otherwise defaults to nearest neighbor assignment. Note that model-based assignment might result in some points not being assigned to any cluster (i.e., an assignment value of \code{NA}) which indicates a noise data point.

\item \code{get_copy(x)} creates a deep copy of a DSC object. This is necessary since clusterings are represented by mutable objects (\proglang{R}-based reference classes or external data structures). Calling this function results in an error if a mechanism for creating a deep copy is not available for the used DSC implementation.

\item \code{plot(x, dsd = NULL, ..., method = "pairs", dim = NULL,}\\ \code{type = c("auto", "micro", "macro", "both", "outliers", "all")}\\ (see the manual page for more available parameters) plots the centers of the clusters and marks detected outliers. There are three available plot methods: \code{"pairs"}, \code{"scatter"} and \code{"pc"}. Method \code{"pairs"} is the default method and produces a matrix of scatter plots that plots all attributes against one another (this method defaults to a regular scatter plot for \code{d = 2}). Method \code{"scatter"} takes the attributes specified in \code{dim} (the first two if \code{dim} is unspecified) and plots them in a scatter plot. Lastly, method \code{"pc"} performs Principal Component Analysis (PCA) on the data and projects the data onto a 2-dimensional plane for plotting.
Parameter \code{type} controls the plotting of cluster and outlier markings. The user can select to plot micro-clusters (\code{micro}), macro-clusters (\code{macro}), both micro- and macro-clusters (\code{both}), outliers (\code{outliers}), or everything (\code{all}). If a DSD object is provided as \code{dsd}, then some example data points are plotted in the background in light gray.

\item \code{print(x, ...)} prints common attributes of the DSC object. This includes a short description of the underlying algorithm and the number of clusters that have been calculated.
\end{itemize}

The abstract class \code{DSC_Outlier} can be added anywhere between the abstract class \code{DSC} and a concrete clusterer implementation class to indicate that the clusterer has additional outlier detection capabilities. The following functions can be used for objects of subclasses of \code{DSC_Outlier}:

\begin{itemize}
\item \code{clean_outliers(x, ...)} instructs the outlier detector to clean up the outlier list.
\item \code{get_outlier_positions(x, ...)} returns the positions of the currently detected outliers.
\item \code{recheck_outlier(x, outlier_correlated_id, ...)} re-checks whether the previously detected outlier identified by \code{outlier_correlated_id} is still an outlier (\code{TRUE}) or has become an inlier in the meantime (\code{FALSE}).
\item \code{noutlier(x, ...)} returns the number of currently detected outliers.
\item \code{print(x, ...)} prints out DSC object details, including the detected outliers.
\item \code{get_assignment(x, points, type = c("auto", "micro", "macro"),}\\ \code{method = c("auto", "nn", "model"), outlier_threshold = 0.05, ...)} returns a data frame with cluster assignments for the data points in the \code{points} argument. In addition, the attributes \code{outliers} and \code{outliers_corrid} are returned with the assignment data frame. Attribute \code{outliers} contains the outlier marks, while attribute \code{outliers_corrid} contains the outlier identifiers.
\end{itemize}

All single-pass clusterers must have the abstract class \code{DSC_SinglePass} anywhere between the abstract class \code{DSC} and a concrete clusterer class. \code{DSC_SinglePass} indicates that the clusterer updates its model automatically while processing each data point retrieved from the input data stream (\code{DSD}). Single-pass clusterers must override the implementations of the following methods:

\begin{itemize}
\item \code{update(dsc, dsd, n = 1, verbose = FALSE, ...)}
\item \code{get_assignment(dsc, points, type = c("auto", "micro", "macro"),}\\ \code{method = c("auto", "nn", "model"), ...)}
\end{itemize}

\begin{figure}
\centering
\includegraphics[width=\linewidth]{interaction}
\caption{Interaction between the DSD and DSC classes.}
\label{figure:interaction}
\end{figure}

Figure~\ref{figure:interaction} shows the typical use of \code{update()} and other functions. Clustering on a data stream~(DSD) is performed with \code{update()} on a DSC object. This is typically done with a \code{DSC_Micro} object which will perform its online clustering process and the resulting micro-clusters are available from the object after clustering (via \code{get_centers()}, etc.). Note that DSC classes implement mutable objects and thus the result of \code{update()} does not need to be reassigned to its name.

%For evaluation, the clusters to which data points would be assigned can be
%obtained using \code{get_assignment()}.
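As a minimal sketch of this interaction (not evaluated here; it uses the Gaussian stream generator and the D-Stream clusterer, both discussed in more detail below), micro-clusters are created and inspected as follows. A complete worked example is given in Section~\ref{examples:clustering_ds}.

<<eval = FALSE>>=
stream <- DSD_Gaussians(k = 3, d = 2)
dstream <- DSC_DStream(gridsize = .1)
# the DSC object is mutable: update() alters it in place,
# so no reassignment of the result is needed
update(dstream, stream, n = 500)
nclusters(dstream, type = "micro")
head(get_centers(dstream))
@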
Reclustering (the offline component of data stream clustering) is performed with
\begin{center}
\code{recluster(macro, micro, type = "auto", ...)},
\end{center}
where \code{micro} and \code{macro} are objects of class \code{DSC}. Here the centers in \code{micro} are used as pseudo-points by the \code{DSC_Macro} object \code{macro}. After reclustering, the macro-clusters can be inspected (using \code{get_centers()}, etc.) and the assignment of micro-clusters to macro-clusters is available via \code{microToMacro()}.

The following data stream clustering algorithms are currently available:

\begin{itemize}
%\item\code{DSC_BIRCH} uses the first pass of the BIRCH (balanced iterative reducing and
%clustering using hierarchies) algorithm by \cite{stream_clust:Zhang:1996}. It generates
%a cluster feature (CF) tree and the leave notes are used as micro-clusters.
%\item
%StreamKM++ \citep{stream:Ackermann+Lammersen+Maertens+Raupach:2010}
\item\code{DSC_CluStream} (\pkg{streamMOA}) interfaces the MOA implementation of the \emph{CluStream} algorithm by \cite{stream_clust:Aggarwal:2003}. The algorithm maintains a user-specified number of micro-clusters. The number of clusters is held constant by merging and removing clusters. The suggested reclustering method is weighted $k$-means.

\item\code{DSC_ClusTree} (\pkg{streamMOA}) interfaces the MOA implementation of the \emph{ClusTree} algorithm by \cite{stream:Kranen+Assent+Baldauf+Seidl:2009}. The algorithm organizes micro-clusters in a tree structure for faster access and automatically adapts micro-cluster sizes based on the variance of the assigned data points. Either $k$-means or reachability from DBSCAN can be used for reclustering.

\item\code{DSC_DenStream} (\pkg{streamMOA}) interfaces MOA's implementation of the \emph{DenStream} algorithm by \cite{stream_clust:Cao:2006}. DenStream estimates the density of micro-clusters in a user-specified neighborhood. To suppress noise, it also organizes micro-clusters based on their weight as core and outlier micro-clusters. Core micro-clusters are reclustered using reachability from DBSCAN.

\item\code{DSC_DStream} implements the \emph{D-Stream} algorithm by \cite{stream:Chen:2007}. D-Stream uses a grid to estimate density in grid cells. For reclustering, adjacent dense cells are merged to form macro-clusters. Alternatively, the concept of attraction between grid cells can be used for reclustering \citep{stream:Tu:2009}.
%\item
%CobWeb \citep{stream:Fisher:1987}

\item\code{DSC_Sample} provides a clustering interface to the data stream operator \code{DSO_Sample}. It selects a user-specified number of representative points from the stream via \emph{Reservoir Sampling}~\citep{Vitter:1985}. It keeps an unbiased sample of all data points seen thus far using the algorithm by \cite{stream:McLeod:1983}. For evolving data streams it is more appropriate to bias the sample toward more recent data points. For biased sampling, the method called Algorithm~2.1 by \cite{stream:Aggarwal:2006} is also implemented.

\item\code{DSC_DBSTREAM}~\citep{hahsler:Hahsler2016b} implements an extension of the simple data stream clustering algorithm called \emph{threshold nearest-neighbors (tNN)} which was developed for package~\pkg{rEMM} by \cite{stream:Hahsler+Dunham:2014,stream:Hahsler+Dunham:2010b}. Micro-clusters are defined by a fixed radius (threshold) around their center. Reachability from DBSCAN is used for reclustering.

\item\code{DSC_Window} provides a clustering interface to the data stream operator \code{DSO_Window}.
It implements the sliding window and the dampened window models~\citep{stream:Zhu:2002} which keep a user-specified number (window length) of the most recent data points of the stream. For the dampened window model, data points in the window have a weight that decreases exponentially with age.
\end{itemize}

Although the authors of most data stream clustering algorithms suggest a specific reclustering method, in \pkg{stream} any available method can be applied. For reclustering, the following clustering algorithms are currently available as subclasses of \code{DSC_Macro}:

\begin{itemize}
\item \code{DSC_DBSCAN} interfaces the weighted version of DBSCAN~\citep{Ester96adensity-based} implemented in package \pkg{dbscan}~\citep{stream:Hahsler:2015b}.

\item \code{DSC_Hierarchical} interfaces \proglang{R}'s \code{hclust} function.

\item \code{DSC_Kmeans} interfaces \proglang{R}'s $k$-means implementation and a version of $k$-means where the data points (micro-clusters) are weighted by the micro-cluster weights, i.e., a micro-cluster representing more data points has more weight.

\item \code{DSC_Reachability} uses DBSCAN's concept of reachability for micro-clusters. Two micro-clusters are directly reachable if they are closer than a user-specified distance \code{epsilon} from each other (they are within each other's \code{epsilon}-neighborhood). Two micro-clusters are reachable and therefore assigned to the same macro-cluster if they are connected by a chain of directly reachable micro-clusters. Note that this concept is related to hierarchical clustering with single linkage and the dendrogram cut at the height of \code{epsilon}.
\end{itemize}

For outlier detection, the following clustering algorithm is currently available as a subclass of \code{DSC_SinglePass} and \code{DSC_Outlier}:

\begin{itemize}
\item \code{DSC_MCOD} (\pkg{streamMOA}) interfaces the MOA implementation of the \emph{MCOD} algorithm by \cite{kontaki2016efficient}. This algorithm performs micro-clustering and outlier detection. For macro-clustering, an additional macro-clustering algorithm is needed to improve the clustering results.
\end{itemize}

All single-pass and outlier examples are given in the \pkg{streamMOA} package, since \code{DSC_MCOD} is currently the only implemented algorithm that supports these capabilities.

Some data stream clustering algorithms without outlier detection capabilities create small clusters for noise or outliers in the data. \pkg{stream} provides \code{prune_clusters(dsc, threshold = .05, weight = TRUE)} to remove a given percentage (given by \code{threshold}) of the clusters with the least weight. The percentage is either computed based on the number of clusters (e.g., remove 5\% of the number of clusters) or based on the total weight of the clustering (e.g., remove enough clusters to reduce the total weight by 5\%). The default \code{weight = TRUE} is based on the total weight. The resulting clustering is a static copy (\code{DSC_Static}). Further clustering cannot be performed with this object, but it can be used as input for reclustering and for evaluation. Pruning is also available in many macro-clustering algorithms as parameter \code{min_weight} which excludes all micro-clusters with a weight less than the specified value before reclustering.

To specify a full data stream clustering process with an arbitrarily chosen online and offline algorithm, \pkg{stream} implements a special DSC class called \code{DSC_TwoStage} which can combine any \code{DSC_Micro} and \code{DSC_Macro} implementation into a two-stage process.
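For illustration, the following sketch (not evaluated here; it assumes a DSD object \code{stream} as created in the previous examples) combines reservoir sampling as the online component with weighted $k$-means as the offline component. The same constructor pattern is used in the comparison study in Section~\ref{sec:example}.

<<eval = FALSE>>=
# a two-stage clusterer: reservoir sampling (micro) reclustered
# with weighted k-means (macro); it is updated like any other DSC
twostage <- DSC_TwoStage(micro = DSC_Sample(k = 100),
  macro = DSC_Kmeans(k = 3, weighted = TRUE))
update(twostage, stream, n = 1000)
@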
%How to use \code{DSC_TwoStage} will be introduced
%in a more elaborate example in Section~\ref{examples:full}.

In the following section we give a short example of how to cluster a data stream.

%\pagebreak[1]

\subsection{Example: Clustering a data stream} \label{examples:clustering_ds}

In this example we show how to cluster data using DSC implementations. First, we create a data stream (three Gaussian clusters in two dimensions with 5\% noise).

<<>>=
library("stream")
set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
@

Next, we prepare the clustering algorithm. Here we use \code{DSC_DStream} which implements the D-Stream algorithm~\citep{stream:Tu:2009}. D-Stream assigns points to cells in a grid. For this example we use a grid size of 0.1.

<<>>=
dstream <- DSC_DStream(gridsize = .1, Cm = 1.2)
dstream
@

After creating an empty clustering, we are ready to cluster data from the stream using the \code{update()} function. Note that \code{update()} will implicitly alter the mutable DSC object so no reassignment is necessary.

<<>>=
update(dstream, stream, n = 500)
dstream
@

After clustering 500 data points, the clustering contains \Sexpr{nclusters(dstream)} micro-clusters. Note that the implementation of D-Stream has built-in reclustering and therefore also shows macro-clusters. The first few micro-cluster centers are:

<<>>=
head(get_centers(dstream))
@

It is often helpful to visualize the results of the clustering operation.

<>=
plot(dstream, stream)
@

For the grid-based D-Stream algorithm there is also a second type of visualization available which shows the used dense and transitional grid cells as gray squares.

<>=
plot(dstream, stream, grid = TRUE)
@

\begin{figure}
\centering
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-cluster}
\\(a)
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-cluster-grid}
\\(b)
\end{minipage}
\caption{Plotting the micro-clusters produced by D-Stream together with the original data points. Shown as (a) micro-clusters and as (b) dense grid cells.}
\label{figure:cluster}
\end{figure}

The resulting plots are shown in Figure~\ref{figure:cluster}. In Figure~\ref{figure:cluster}(a) the micro-clusters are plotted in red on top of gray data points. The size of the micro-clusters indicates the weight, i.e., the number of data points represented by each micro-cluster. In Figure~\ref{figure:cluster}(b) the micro-clusters are shown as dense grid cells (density is coded with gray values).

%\pagebreak[4]
\newpage

\section{Evaluation of data stream clustering}\label{sec:evaluation}

\subsection{Introduction}

Evaluation of data stream mining is an important issue. The evaluation of conventional clustering is discussed extensively in the literature and many evaluation criteria are available. For an overview we refer the reader to the popular books by \cite{clust:Jain:1988} and \cite{Kaufman:1990}. However, this evaluation only measures how well the algorithm learns static structure in the data. Data streams often exhibit concept drift and it is important to evaluate how well the algorithm is able to adapt to these changes. The evaluation of data stream clustering is still in its infancy. The current state of the evaluation of data stream mining methods including clustering is described in the books by \cite{stream:Aggarwal:2007} and \cite{stream:Gama:2010}, and the papers by \cite{stream:Kremer:2011} and \cite{stream_clust:Gama:2013}.
%% and \cite{stream_clust:Bifet:2015}.

In the following we will discuss how \pkg{stream} can be used to evaluate clustering algorithms in terms of learning static structures and clustering dynamic streams.

\subsection{Evaluation of clustering static data streams}

Evaluation of how well an algorithm is able to learn static structures in a data stream which does not exhibit concept drift is performed in \pkg{stream} via
\begin{center}
\code{evaluate_static(dsc, dsd, measure, n = 100, type = c("auto", "micro", "macro"),}\\
\code{assign = "micro", assignmentMethod = c("auto", "model", "nn"),}\\
\code{noise = c("class", "exclude"), ...)},
\end{center}
where \code{dsc} is the evaluated clustering. \code{n} data points are taken from \code{dsd} and used for evaluation. The evaluation measure is specified in \code{measure}. Several measures can be specified as a vector of character strings.

For evaluation, the points are assigned to the clusters in the clustering in \code{dsc} using \code{get_assignment()}. By default the points are assigned to micro-clusters, but it is also possible to assign them to macro-cluster centers instead (\code{assign = "macro"}). New points can be assigned to clusters by the rule used in the clustering algorithm (\code{assignmentMethod = "model"}) or using nearest-neighbor assignment (\code{"nn"}). If the assignment method is set to \code{"auto"} then model assignment is used when available and otherwise nearest-neighbor assignment is used.

The initial assignments are aggregated to the level specified in \code{type}. For example, for a macro-clustering, the initial assignments will be made by default to micro-clusters and then these assignments will be translated into macro-cluster assignments using the micro- to macro-cluster relationships stored in the clustering and available via \code{microToMacro()}. This separation between assignment and evaluation type is especially important for data with non-spherical clusters where micro-clusters are linked together in chains produced by a macro-clustering algorithm based on hierarchical clustering with single-link or reachability.

How noise is handled is controlled by \code{noise}. Noise points in the data can be considered as forming their own class. This is typically appropriate for external validity measures; however, for some internal validity measures using noise points is problematic since the noise data points will not form a compact cluster and thus negatively affect measures like the sum of squares. Therefore, for some internal measures, it is more consistent to exclude noise points.
%The user will be notified by this fact with a warning.

Clustering evaluation measures can be categorized into internal and external cluster validity measures. Internal measures evaluate properties of the clustering. A simple measure to evaluate the compactness of (spherical) clusters in a clustering is the within-cluster sum of squares, i.e., the sum of squared distances between each data point and the center of its cluster (method \code{"SSQ"}). External measures use the ground truth (i.e., the true partition of the data into groups) to evaluate the agreement of the partition created by the clustering algorithm with a known true partition. In the following we will enumerate the evaluation measures (passed on as \code{measure}) available in \pkg{stream}.
We will not describe each measure here since most of them are standard measures which can be found in many text books \citep[e.g.,][]{clust:Jain:1988,Kaufman:1990} or in the documentation supplied with the packages~\pkg{fpc}~\citep{stream:Hennig:2014}, \pkg{clue}~\citep{stream:Hornik:2013} and \pkg{cluster}~\citep{stream:Maechler:2014}. Measures currently available for \code{evaluate_static()} (method names are given in quotation marks and the package that implements the evaluation measure is shown in parentheses) include:

\begin{itemize}
\item Information items.
\begin{itemize}
\item \code{"numMicroClusters"} Number of micro-clusters
\item \code{"numMacroClusters"} Number of macro-clusters
\item \code{"numClasses"} Number of classes (i.e., groups in the ground truth)
\end{itemize}

\item Noise-related items.
\begin{itemize}
\item \code{"noisePredicted"} Number of data points predicted as noise
\item \code{"noiseActual"} Number of data points which are actually noise
\item \code{"noisePrecision"} Precision of noise prediction (i.e., the number of correctly predicted noise points over the total number of points predicted as noise)
\end{itemize}

\item Internal evaluation measures.
\begin{itemize}
\item \code{"SSQ"} Within-cluster sum of squares. Assigns each non-noise point to its nearest center from the clustering and calculates the sum of squares
\item \code{"silhouette"} Average silhouette width (actual noise points which stay unassigned by the clustering algorithm are removed; regular points that are unassigned by the clustering algorithm form their own noise cluster) (\pkg{cluster})
\item \code{"average.between"} Average distance between clusters (\pkg{fpc})
\item \code{"average.within"} Average distance within clusters (\pkg{fpc})
\item \code{"max.diameter"} Maximum cluster diameter (\pkg{fpc})
\item \code{"min.separation"} Minimum cluster separation (\pkg{fpc})
\item \code{"ave.within.cluster.ss"} A generalization of the within-cluster sum of squares (half the sum of the within-cluster squared dissimilarities divided by the cluster size) (\pkg{fpc})
\item \code{"g2"} Goodman and Kruskal's Gamma coefficient (\pkg{fpc})
\item \code{"pearsongamma"} Correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters (\pkg{fpc})
\item \code{"dunn"} Dunn index (minimum separation over maximum diameter) (\pkg{fpc})
\item \code{"dunn2"} Minimum average dissimilarity between two clusters over maximum average within-cluster dissimilarity (\pkg{fpc})
\item \code{"entropy"} Entropy of the distribution of cluster memberships (\pkg{fpc})
\item \code{"wb.ratio"} Ratio of \code{average.within} over \code{average.between} (\pkg{fpc})
\end{itemize}

\item External evaluation measures.
\begin{itemize}
\item \code{"precision"}, \code{"recall"}, \code{"F1"}. A true positive (TP) decision assigns two points in the same true cluster also to the same cluster, a true negative (TN) decision assigns two points from two different true clusters to two different clusters. A false positive (FP) decision assigns two points from two different true clusters to the same cluster. A false negative (FN) decision assigns two points from the same true cluster to two different clusters.
$$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
$$\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
The F1 measure is the harmonic mean of precision and recall.
\item \code{"purity"} Average purity of clusters.
The purity of each cluster is the proportion of the points of the majority true group assigned to it \citep{stream_clust:Cao:2006}.
\item \code{"Euclidean"} Euclidean dissimilarity of the memberships
%(See Dimitriadou, Weingessel and Hornik (2002))
(\pkg{clue})
\item \code{"Manhattan"} Manhattan dissimilarity of the memberships (\pkg{clue})
\item \code{"Rand"} Rand index
%(see Rand (1971))
(\pkg{clue})
\item \code{"cRand"} Rand index corrected for chance
%(see Hubert and Arabie (1985))
(\pkg{clue})
\item \code{"NMI"} Normalized Mutual Information
%(see Strehl and Ghosh (2002))
(\pkg{clue})
\item \code{"KP"} Katz-Powell index
%(see Katz and Powell (1953))
(\pkg{clue})
\item \code{"angle"} Maximal cosine of the angle between the agreements (\pkg{clue})
\item \code{"diag"} Maximal co-classification rate (\pkg{clue})
\item \code{"FM"} Fowlkes and Mallows's index
%(see Fowlkes and Mallows (1983))
(\pkg{clue})
\item \code{"Jaccard"} Jaccard index (\pkg{clue})
\item \code{"PS"} Prediction Strength
%(see Tibshirani and Walter (2005))
(\pkg{clue})
%\item \code{"corrected.rand"} corrected Rand index (\pkg{fpc}),
\item \code{"vi"} Variation of Information (VI) index (\pkg{fpc})
\end{itemize}

\item Outlier evaluation measures.
\begin{itemize}
\item \code{"OutlierJaccard"}. A variant of the Jaccard index that can be used to assess outlier detection correctness \citep{krleza2020shc}. It can be applied only to data streams that mark outliers (for example \code{DSD_Gaussians}). The most sensible use of this measure is with outlier detectors, i.e., clusterers that inherit from \code{DSC_Outlier}. However, the \code{OutlierJaccard} index can also be calculated for clusterers that enclose outliers in small clusters which were \textbf{not} pruned by \code{prune_clusters()}. A true positive (TP) decision is made for outliers that were both marked by the input data stream generator and found by the clusterer. A false positive (FP) decision is made for data points that were \textbf{not} marked by the input data stream generator, yet the clusterer marked them as outliers. An undetected (UND) decision is made for outliers that were marked by the input data stream generator, yet the clusterer failed to recognize them as outliers. The \code{OutlierJaccard} index is calculated as follows:
$$\mathrm{OJI} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{UND}}$$
\end{itemize}
\end{itemize}

\code{evaluate_static()} is appropriate if the data stream does not evolve significantly from the data that is used to learn the clustering to the data that is used for evaluation. The approach described next might be more appropriate for streams which exhibit significant concept drift.

\subsection{Evaluation of clustering of dynamic data streams}

For dynamic data streams it is important to evaluate how well the clustering algorithm is able to adapt to concept drift which results in changes in the cluster structure. \cite{stream_clust:Aggarwal:2003} have introduced an evaluation scheme for data stream clustering which addresses these issues. In this approach a horizon is defined as a number of data points. The data stream is split into consecutive horizons. After a horizon is clustered, the points in the next horizon are each assigned to the closest centroid and the sum of squares is reported as an internal measure of cluster quality. Later on, this scheme was used by others (e.g., by \cite{stream:Tu:2009}).
\cite{stream_clust:Cao:2006} and \cite{stream:Wan+Ng+Dang+Yu+Zhang:2009} also use this scheme for the external measure of average purity of clusters. Here, for each (micro-) cluster the dominant true cluster label is determined and the proportion of points with the dominant label is averaged over all clusters. This type of evaluation strategy is called prequential since new data is always used first for evaluation and afterwards to update the model. A recent detailed analysis of prequential error estimation for classification can be found in the work by~\cite{stream_clust:Gama:2013} and \cite{stream_clust:Bifet:2015}.

Obviously, algorithms which can better adapt to the changing stream will achieve better evaluation values. However, it is important to mention that choosing the horizon inappropriately for the stream may impact the evaluation. Consider, for example, a fast changing stream and a very long horizon. In this case the evaluation data might not have much similarity to the data used for clustering and thus the evaluation will produce meaningless results. For fast evolving streams a shorter horizon, or even a horizon of length one, needs to be used. Longer horizons have the advantage that evaluation can usually be performed more efficiently for larger batches of points.

This prequential evaluation strategy is implemented in \pkg{stream} as function \code{evaluate_stream()}. It shares most parameters with \code{evaluate_static()} and
%in addition to the methods sum of squared (\code{"SSQ"}) and purity (\code{"purity"})
all evaluation measures for \code{evaluate_static()} described above can be used.

\subsection{Evaluation of clustering done by single-pass clusterers}

Single-pass clusterers and outlier detectors perform classification and a model update for every single data point retrieved from the input data stream, which makes their evaluation more fine-grained. Evaluation of single-pass clusterers is a special case of the evaluation for dynamic data streams with a horizon of one. For \code{evaluate_stream(..., horizon = 1000, ...)} this means that a single-pass clusterer processes each of the 1000 data points in the batch individually, performing both the model update and returning the assignment. From the caller's point of view, such an evaluation is the same as for all other clustering algorithms in the \pkg{stream} package.

%%% FIXME: CMM
%%% short overview in Silva:2013

%\section{Examples} \label{sec:examples}
%Providing a framework for rapid prototyping new data stream mining algorithms and
%comparing them experimentally is the main purpose of
%\pkg{stream}. In this section we give several
%increasingly complex
%examples of how to use \pkg{stream}.
%First, in Section~\ref{examples:ds} we start with creating a data stream using
%different implementations of the DSD class.
%The second example in Section~\ref{examples:disk} shows how to save and read stream data
%to and from disk.
%Section~\ref{examples:replay} gives examples for how to reuse the same data
%from a stream
%in order to perform comparison experiments with multiple data stream mining
%algorithms on exactly the same data.
%We show how to cluster data streams in Section~\ref{examples:clustering_ds} and
%to evaluate cluster algorithms in Section~\ref{examples:evaluation}.
%Finally, reclustering examples are given in Section~\ref{examples:recluster}.
%All presented examples contain the complete code necessary to replicate the
%examples.
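As a sketch (not evaluated here), a single-pass clusterer is evaluated with exactly the same call as any other clustering algorithm; \code{dsc_single_pass} below is a placeholder for a concrete \code{DSC_SinglePass} implementation such as \code{DSC_MCOD} from \pkg{streamMOA}.

<<eval = FALSE>>=
# prequential evaluation; a single-pass clusterer internally assigns
# and then learns each of the 1000 points in a horizon individually
ev <- evaluate_stream(dsc_single_pass, stream, measure = "purity",
  n = 5000, horizon = 1000)
@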
%
%Finally, the last example introduces the use of data stream clustering
%algorithms with a detailed
%comparison of two algorithms from start to finish by first running the online
%components, then using a weighted $k$-means algorithm
%to re-cluster the
%micro-clusters generated by each algorithm into final clusters.

%\pagebreak[4]

\subsection{Example: Evaluating clustering results} \label{examples:evaluation}

In this example we will show how to calculate evaluation measures, first on a stream without concept drift and then on an evolving stream. First, we prepare a data stream and create a clustering.

<<>>=
library("stream")
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
dstream <- DSC_DStream(gridsize = .1)
update(dstream, stream, n = 2000)
@

The \code{evaluate_static()} function takes a DSC object containing a clustering and a DSD object with evaluation data to compute several quality measures for the clustering.

<<>>=
evaluate_static(dstream, stream, n = 100)
@

The number of points taken from \code{dsd} and used for the evaluation is passed on as parameter \code{n}. Since no evaluation measure is specified, all available measures are calculated. We use only a small number of points for evaluation since calculating some measures is computationally quite expensive. Individual measures can be calculated using the \code{measure} argument.

<<>>=
evaluate_static(dstream, stream, measure = c("purity", "crand"), n = 500)
@

Note that this second call of \code{evaluate_static()} uses a new and larger set of 500 evaluation data points from the stream and thus the results may vary slightly from the first call. Purity of the micro-clusters is high since each micro-cluster only covers points from the same true cluster; however, the corrected Rand index is low because the points of each true cluster are split into several micro-clusters. We will see in one of the following examples that reclustering improves the corrected Rand index.

To evaluate how well a clustering algorithm can adapt to an evolving data stream, \pkg{stream} provides \code{evaluate_stream()} to perform prequential evaluation with a given horizon.
%Following
%the evaluation scheme developed by
%\cite{stream_clust:Aggarwal:2003}, we define
%an evaluation horizon as
%a number of data points.
Each data point in the horizon is assigned to clusters to evaluate how well it fits into the clustering (internal evaluation) or how well its assignment agrees with the known true cluster labels (external evaluation). Average evaluation measures for each horizon are returned. Afterwards, the clustering is updated with the points in the horizon.

The following example evaluates D-Stream on an evolving stream created with \code{DSD_Benchmark}. This data stream was introduced in Figure~\ref{figure:dsd_bench} on page~\pageref{figure:dsd_bench} and contains two Gaussian clusters moving from left to right with their paths crossing in the middle. We modify the default decay parameter \code{lambda} of D-Stream since the data stream evolves relatively quickly and then perform the evaluation over 5000 data points with a horizon of 100.

<<>>=
set.seed(1000)
stream <- DSD_Benchmark(1)
dstream <- DSC_DStream(gridsize = .05, lambda = .01)
ev <- evaluate_stream(dstream, stream,
  measure = c("numMicroClusters", "purity"), n = 5000, horizon = 100)
head(ev)
@

Note that the first row in the results contains \code{NA} for the purity measure. This is the case since we started the evaluation with a new, empty clustering and no prior clusters were available for evaluating the first horizon.
<>=
plot(ev[ , "points"], ev[ , "purity"], type = "l",
  ylab = "Avg. Purity", xlab = "Points")
@

\begin{figure}
\centering
\includegraphics[width=.7\linewidth]{stream-evaluation}
\caption{Micro-cluster purity of D-Stream over an evolving stream.}
\label{figure:evaluation}
\end{figure}

Figure~\ref{figure:evaluation} shows the development of the average micro-cluster purity (how well each micro-cluster only represents points of a single group in the ground truth) over 5000 data points in the data stream. Purity drops significantly shortly before point 3000 because the two true clusters overlap for a short period of time.

To analyze the clustering process, we can visualize the clustering using \code{animate_cluster()}. To recreate the previous experiment, we reset the data stream and create a new empty clustering.

<>=
set.seed(1000)
stream <- DSD_Benchmark(1)
dstream <- DSC_DStream(gridsize = .05, lambda = .01)
r <- animate_cluster(dstream, stream, horizon = 100, n = 5000,
  measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))
@

\begin{figure}[tb]
\centering
\includegraphics[width=.45\linewidth]{eval}
\caption{Result of animated clustering with evaluation.}
\label{figure:eval}
\end{figure}
%%% save image 5x7

Figure~\ref{figure:eval} shows the result of the clustering animation with purity evaluation. The whole animation can be recreated by executing the code above. The animation can also be replayed and saved using package \pkg{animation}.

% <>=
% library(animation)
% animation::ani.options(interval=.1)
% ani.replay()
% saveHTML(ani.replay())
% @

\subsection{Example: Evaluating reclustered DSC objects}\label{examples:recluster}

This example shows how to recluster a DSC object after creating it and how to perform evaluation on the macro-clusters. First, we create a data stream, create a DSC micro-clustering object and cluster 1000 points.

<<>>=
library("stream")
set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
dstream <- DSC_DStream(gridsize = .05, Cm = 1.5)
update(dstream, stream, n = 1000)
dstream
@

Although the data contains three clusters, the built-in reclustering of D-Stream (joining adjacent dense grid cells) only produces two macro-clusters. The reason for this can be found by visualizing the clustering.

<>=
plot(dstream, stream, type = "both")
@

Figure~\ref{figure:recluster}(a) shows the micro- and macro-clusters produced by D-Stream. Micro-clusters are shown as red circles while macro-clusters are represented by large blue crosses. Cluster symbol sizes are proportional to the cluster weights. We see that D-Stream's reclustering strategy which joins adjacent dense grid cells is not able to separate the two overlapping clusters in the top part of the plot.

Micro-clusters produced with any clustering algorithm can be reclustered by the \code{recluster()} method with any macro-clustering algorithm (sub-classes of \code{DSC_Macro}) available in \pkg{stream}. Some supported macro-clustering models that are typically used for reclustering are $k$-means, hierarchical clustering, and reachability. We use weighted $k$-means since we want to separate the overlapping Gaussian clusters.
<>=
km <- DSC_Kmeans(k = 3, weighted = TRUE)
recluster(km, dstream)
km
plot(km, stream, type = "both")
@

\begin{figure}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-recluster}
\\(a)
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{stream-recluster2}
\\(b)
\end{minipage}
\caption{A data stream clustered with D-Stream using the (a) built-in reclustering strategy, and (b) reclustered with weighted $k$-means and $k=3$.}
\label{figure:recluster}
\end{figure}

Figure~\ref{figure:recluster}(b) shows that weighted $k$-means on the micro-clusters produced by D-Stream separated the three clusters correctly.

Evaluation on a macro-clustering model automatically uses the macro-clusters. For evaluation, \code{n} new data points are requested from the data stream and each is assigned to its nearest micro-cluster. This assignment is translated into macro-cluster assignments and evaluated using the ground truth provided by the data stream generator.

<<>>=
evaluate_static(km, stream, measure = c("purity", "crand", "SSQ"), n = 1000)
@

Alternatively, the new data points can also be directly assigned to the closest macro-cluster.

<<>>=
evaluate_static(km, stream, measure = c("purity", "crand", "SSQ"), n = 1000,
  assign = "macro")
@

In this case the evaluation measures purity and corrected Rand slightly increase, since D-Stream produces several micro-clusters covering the area between the top two true clusters (see the micro-clusters in Figure~\ref{figure:recluster}). Each of these micro-clusters contains a mixture of points from the two clusters but has to assign all its points to only one macro-cluster, resulting in some error. Assigning the points directly to the macro-cluster centers splits these points better and therefore decreases the number of incorrectly assigned points. The sum of squares decreases because the data points are now directly assigned to minimize this type of error.

Other evaluation methods can also be used with a clustering in \pkg{stream}. For example, we can calculate and plot silhouette information~\citep{Kaufman:1990} using the functions available in \pkg{cluster}. We take 100 data points and find the assignment to macro-clusters in the data stream clustering. For a \code{DSC_Micro} implementation like D-Stream, the data points are assigned by default to micro-clusters and then this assignment is translated to macro-cluster assignments.

<<>>=
points <- get_points(stream, n = 100)
assignment <- get_assignment(dstream, points, type = "macro")
assignment
@

\begin{figure}
\centering
\includegraphics[width=.5\linewidth]{stream-silhouette}
\caption{Silhouette plot for the D-Stream clustering with two macro-clusters and a cluster ($j = 0$) representing the unassigned points.}
\label{figure:silhouette}
\end{figure}

Note that D-Stream uses a grid for assignment and that points which do not fall inside a dense (or connected transitional) cell are not assigned to a cluster, which is represented by a value of \code{NA}. For the following silhouette calculation we replace the \code{NA}s with 0 to make the unassigned (noise) points their own cluster. Note also that the silhouette is only calculated for a small number of points and not the whole stream.

<>=
assignment[is.na(assignment)] <- 0L
library("cluster")
plot(silhouette(assignment, dist = dist(points)))
@

Figure~\ref{figure:silhouette} shows the silhouette plot for the macro-clusters produced by D-Stream.
The top cluster ($j=0$) represents the points not assigned to any cluster by the algorithm (predicted noise) and thus is expected to have a large negative silhouette. Cluster $j=2$ comprises the two overlapping real clusters and thus has lower silhouette values than cluster $j=1$. Other visual evaluation methods can be used in a similar way. %\section{Extending the stream framework} \label{sec:extension} % %Since stream mining is a relatively young field and many advances are %expected in the near future, %the object oriented framework in \pkg{stream} is developed with easy %extensibility in mind. Implementations for data streams (DSD) and %data stream mining tasks (DST) can be easily added by implementing a small %number of core functions. The actual implementation can be written %in either \proglang{R}, \proglang{Java}, %\proglang{C}/\proglang{C++} or any other programming language %which can be interfaced by \proglang{R}. %In the following we discuss how to extend \pkg{stream} with new DSD and DST %implementations. %%In the following we discuss how to extend DSD, DST and how to interface %%algorithms from other frameworks in \pkg{stream}. % %\subsection{Adding a new data stream source (DSD)} % %The class hierarchy in Figure~\ref{figure:dsd} (on page~\pageref{figure:dsd}) %is implemented %using the S3 class system~\citep{stream:Chambers:1992}. %Class membership and the inheritance hierarchy is %represented by a vector %of class names stored as the object's class attribute. For example, an object of %class \code{DSD_Gaussians} will have the class attribute vector %\code{c("DSD_Gaussians", "DSD_R", "DSD")} indicating that %the object is an \proglang{R} implementation of DSD. This allows %the framework to implement all common functionality as functions at the level %of \code{DSD} and \code{DSD_R} and only a minimal set of functions %is required to implement a new data stream source. %Note that the class attribute has to contain a vector of all parent classes %in the class diagram in bottom-up order. % %For a new DSD implementation only the following two functions need to be %implemented: %\begin{enumerate} %\item A creator function (with a name starting with the prefix \code{DSD_}) and %\item the \code{get_points()} method. %\end{enumerate} %The creator function creates an object of the appropriate %\code{DSD} subclass. Typically this S3 object contains a list of all parameters, %an open \proglang{R} connection and/or an environment or a reference class %for storing state information (e.g., the current position in the stream). %Standard parameters are \code{d} and \code{k} for the number of dimensions of %the created data and the true number of clusters, respectively. %In addition an element called \code{"description"} should be provided. This element %is used by \code{print()}. % %The implemented \code{get_points()} needs to dispatch for the class %and create as the output a data frame containing the new data points as %rows. Also, if the ground truth (true cluster assignment as an integer vector; %noise is represented by \code{NA}) is available, then this can be attached to %the data frame as an attribute called \code{"assignment"}. % %For a very simple example, we show here the implementation of %\code{DSD_UniformNoise} available in the package's source code %in file \code{DSD_UniformNoise.R}. This generator creates noise points %uniformly distributed in a $d$-dimensional hypercube with a given range. 
% %<>= %DSD_UniformNoise <- function(d = 2, range = NULL) { % if(is.null(range)) range <- matrix(c(0, 1), ncol = 2, nrow = d, % byrow = TRUE) % structure(list(description = "Uniform Noise Data Stream", d = d, % k = NA_integer_, range = range), % class = c("DSD_UniformNoise", "DSD_R", "DSD")) % } % %get_points.DSD_UniformNoise <- function(x, n = 1, % assignment = FALSE, ...) { % data <- as.data.frame(t(replicate(n, % runif(x$d, min = x$range[ , 1], max = x$range[ , 2])))) % if(assignment) attr(data, "assignment") <- rep(NA_integer_, n) % data %} %@ % %The constructor only stores the description, the dimensionality and the range %of the data. %For this data generator \code{k}, the number of true clusters, is not applicable. %Since all data is random, there is also no need to store a state. The %\code{get_points()} implementation creates $n$ random points and if %assignments are needed attaches a vector with the appropriate %number of \code{NA}s indicating that the data points are all noise. %Several more complicated examples are available in the package's source code %directory in files starting with \code{DSD_}. % %\subsection{Adding a new data stream tasks (DST)} % %To add a new data stream mining tasks (e.g., frequent pattern mining), %a new package with %a subclass hierarchy %similar to the hierarchy in Figure~\ref{figure:dst} %(on page~\pageref{figure:dst}) for data stream %clustering (DSC) can be easily added. This new package can take full %advantage of the already existing infrastructure in \pkg{stream}. We plan %to provide add-on packages to \pkg{stream} for frequent pattern mining %and data stream classification in the near future. % %Next we discuss how to interface an existing algorithm with \pkg{stream}. %We concentrate again on clustering, but interfacing algorithms %for other types of tasks is similar. %To interface an existing clustering algorithm with \pkg{stream}, %\begin{enumerate} %\item a creator function (typically named after the algorithm and % starting with \code{DSC_}) which created the clustering object, %\item an implementation of the actual cluster algorithm, and %\item accessors for the clustering %\end{enumerate} %are needed. The implementation depends on the interface that is used. %Currently an \code{R} interface is available as \code{DSC_R} and %a MOA interface is implemented in \code{DSC_MOA} (in \pkg{streamMOA}). %The implementation for %\code{DSC_MOA} takes care of all MOA-based clustering algorithms and we will %concentrate here on the \proglang{R} interface. % %For the \proglang{R} interface, the clustering class needs to contain %the elements \code{"description"} and \code{"RObj"}. The description needs %to contain a character string describing the algorithm. RObj is expected to be %a reference class object and %contain the following methods: %\begin{enumerate} %\item \code{cluster(newdata, ...)}, where \code{newdata} is a data frame %with %new data points. %\item For micro-clusters: \code{get_microclusters(...)} and % \code{get_microweights(...)} %\item %For macro-clusters: \code{get_macroclusters(...)}, \code{get_macroweights} %and \\ \code{microToMacro(micro, ...)} which does micro- to macro-cluster %matching. %\end{enumerate} % %Note that these are methods for reference classes and do not contain the %called object in the parameter list. Neither of these methods are called directly %by the user. 
%Figure~\ref{figure:interaction} (on page~\pageref{figure:interaction})
%shows that the function \code{update()}
%is used to cluster data points, and \code{get_centers()} and \code{get_weights()}
%are used to obtain the clustering. These user facing functions call internally
%the methods in RObj via the \proglang{R} interface in class \code{DSC_R}.
%
%For a comprehensive example of a clustering algorithm implemented in \proglang{R},
%we refer the reader to \code{DSC_DStream} (in file \code{DSC_DStream.R}) in the
%package's \code{R} directory.
%
%%\subsection{Interfacing Algorithms from Other Frameworks}
%%TODO
%
%\pagebreak[1]
%
\newpage

\section{Example applications} \label{sec:example}

\subsection{Experimental comparison of different algorithms}

Providing a framework for rapid prototyping of new data stream mining algorithms and comparing them experimentally is the main purpose of \pkg{stream}. In this section we give a more elaborate example of how to perform a comparison between several algorithms.

First, we set up a static data set. We extract 1500 data points from the Bars and Gaussians data stream generator with 5\% noise and put them into a \code{DSD_Memory}. This object is used to replay the same part of the data stream for each algorithm. We will use the first 1000 points to learn the clustering and the remaining 500 points for evaluation.

<>=
set.seed(1000)
library("stream")
stream <- DSD_BarsAndGaussians(noise = .05) %>% DSD_Memory(n = 1500)
stream
plot(stream)
@

\begin{figure}
\centering
\includegraphics[width=.5\linewidth]{stream-data_bng}
\caption{Bars and Gaussians data set.}
\label{figure:data_bng}
\end{figure}

Figure~\ref{figure:data_bng} shows the structure of the data set. It consists of four clusters, two Gaussians and two uniformly filled, slightly rotated rectangular clusters. The Gaussian and the bar to the right have $1/3$ the density of the other two clusters.

We initialize four algorithms from \pkg{stream}. We choose the parameters experimentally so that each algorithm produces approximately 100 micro-clusters.

<<>>=
algorithms <- list(
  'Sample' = DSC_TwoStage(micro = DSC_Sample(k = 100),
    macro = DSC_Kmeans(k = 4)),
  'Window' = DSC_TwoStage(micro = DSC_Window(horizon = 100),
    macro = DSC_Kmeans(k = 4)),
  'D-Stream' = DSC_DStream(gridsize = .7, Cm = 1.5),
  'DBSTREAM' = DSC_DBSTREAM(r = .45)
)
@

The algorithms are reservoir sampling reclustered with weighted $k$-means, the sliding window reclustered with weighted $k$-means, and D-Stream and DBSTREAM with their built-in reclustering strategies. We store the algorithms in a list for easier handling and then cluster the same 1000 data points with each algorithm. Note that we have to reset the stream each time before we cluster with a new algorithm.

<<>>=
for(a in algorithms) {
  reset_stream(stream)
  update(a, stream, n = 1000)
}
@

We use \code{nclusters()} with \code{type = "micro"} to inspect the number of micro-clusters.

<<>>=
sapply(algorithms, nclusters, type = "micro")
@

To inspect micro-cluster placement, we plot the calculated micro-clusters on a sample of the original data.
<>=
op <- par(no.readonly = TRUE)
layout(mat = matrix(1:length(algorithms), ncol = 2))
for (a in algorithms) {
  reset_stream(stream)
  plot(a, stream, main = description(a), type = "micro")
}
par(op)
@

\begin{figure}[t]
\centering
\includegraphics{stream-microclusters}
\caption{Micro-cluster placement for different data stream clustering algorithms.}
\label{figure:microclusters}
\end{figure}

Figure~\ref{figure:microclusters} shows the micro-cluster placement by the different algorithms. Micro-clusters are shown as red circles and their size is proportional to each cluster's weight. Reservoir sampling and the sliding window select some data points as micro-clusters and also include a few noise points. D-Stream and DBSTREAM suppress noise well and concentrate the micro-clusters on the real clusters. D-Stream is grid-based and thus the micro-clusters are regularly spaced. DBSTREAM produces a slightly less regular pattern.

It is also interesting to compare the assignment areas for micro-clusters created by different algorithms. The assignment area is the area around the center of a micro-cluster in which points are considered to belong to the micro-cluster. The specific clustering algorithm decides how points which fall inside the assignment areas of several micro-clusters are assigned (e.g., assign the point to the closest center). To show the assignment areas we add \code{assignment = TRUE} to \code{plot()}. We also disable showing micro-cluster weights to make the plot less cluttered.

<>=
op <- par(no.readonly = TRUE)
layout(mat = matrix(1:length(algorithms), ncol = 2))
for (a in algorithms) {
  reset_stream(stream)
  plot(a, stream, main = description(a),
    assignment = TRUE, weight = FALSE, type = "micro")
}
par(op)
@

\begin{figure}[tb]
\centering
\includegraphics{stream-microclusters_assignment}
\caption{Micro-cluster assignment areas for different data stream clustering algorithms.}
\label{figure:microclusters_assignment}
\end{figure}

Figure~\ref{figure:microclusters_assignment} shows the assignment areas. For regular micro-cluster-based algorithms the assignment areas are shown as dotted circles around the micro-cluster centers. For example, for DBSTREAM the assignment areas of all micro-clusters have exactly the same radius. D-Stream uses a grid for assignment and thus the grid is shown. Reservoir sampling and the sliding window do not have assignment areas and data points are always assigned to the nearest micro-cluster.

To compare the cluster quality, we can check, for example, the micro-cluster purity. Note that we set the stream to position 1001 since we have used the first 1000 points for learning and we want to use data points not seen by the algorithms for evaluation.

<<>>=
sapply(algorithms, FUN = function(a) {
  reset_stream(stream, pos = 1001)
  evaluate_static(a, stream,
    measure = c("numMicroClusters", "purity"),
    type = "micro", n = 500)
})
@

We need to be careful with the comparison of these numbers, since they depend heavily on the number of micro-clusters, with more clusters leading to a better value. We can compare purity here since we have set the clustering parameters such that the numbers of micro-clusters are very close. All algorithms produce very good values for purity for this data set with its reasonably well separated clusters.

Next, we compare macro-cluster placement. D-Stream and DBSTREAM have built-in reclustering strategies. D-Stream joins adjacent dense grid cells to form macro-clusters and DBSTREAM joins micro-clusters reachable by overlapping assignment areas.
<<macroclusters, fig=TRUE, include=FALSE>>=
op <- par(no.readonly = TRUE)
layout(mat = matrix(1:length(algorithms), ncol = 2))
for (a in algorithms) {
  reset_stream(stream)
  plot(a, stream, main = description(a))
}
par(op)
@

\begin{figure}[tb]
\centering
\includegraphics{stream-macroclusters}
\caption{Macro-cluster placement for different data stream clustering algorithms.}
\label{figure:macroclusters}
\end{figure}

Figure~\ref{figure:macroclusters} shows the macro-cluster placement.
Sampling and the sliding window use $k$-means reclustering and therefore
produce exactly four clusters. However, the placement is off, splitting a
true cluster and missing one of the less dense clusters. D-Stream and
DBSTREAM identify the two denser clusters correctly, but split the
lower-density clusters into multiple pieces.

<<>>=
sapply(algorithms, FUN = function(a) {
  reset_stream(stream, pos = 1001)
  evaluate_static(a, stream,
    measure = c("numMacroClusters", "purity", "SSQ", "cRand", "silhouette"),
    n = 500, assign = "micro", type = "macro")
})
@

The evaluation measures at the macro-cluster level reflect the findings
from the visual analysis of the clustering, with D-Stream and DBSTREAM
producing the best results. Note that D-Stream and DBSTREAM leave some
points unassigned that are not noise points, which has a negative effect on
the average silhouette width.
%This is shown with a warning and these points form their own cluster
%for calculating the within sum of squares and the average silhouette width.

%\section{Experimental comparison using an evolving data stream}
%\label{examples:full_evolving}

Comparing algorithms on evolving streams is similarly easy in \pkg{stream}.
For the following example we again use \code{DSD_Benchmark} with two moving
clusters crossing each other's path (see Section~\ref{examples:ds}). First,
we create a fixed stream with 5000 data points.

<<>>=
set.seed(0)
stream <- DSD_Memory(DSD_Benchmark(1), n = 5000)
@

Next, we again initialize a list of clustering algorithms. Note that this
time we use $k = 2$ for reclustering the sample and the sliding window. We
also use a sample biased toward newer data
points~\citep{stream:Aggarwal:2006}, since otherwise outdated data points
would result in outdated clusters. For the sliding window, D-Stream and
DBSTREAM we use a faster decay (\code{lambda = .01}; see the note below),
since the clusters in the data stream move very quickly.

<<>>=
algorithms <- list(
  'Sample + k-means' = DSC_TwoStage(micro = DSC_Sample(k = 100,
    biased = TRUE), macro = DSC_Kmeans(k = 2)),
  'Window + k-means' = DSC_TwoStage(micro = DSC_Window(horizon = 100,
    lambda = .01), macro = DSC_Kmeans(k = 2)),
  'D-Stream' = DSC_DStream(gridsize = .1, lambda = .01),
  'DBSTREAM' = DSC_DBSTREAM(r = .05, lambda = .01)
)
@
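To get a feeling for the decay parameter, note that the fading is
exponential (assuming the fading function $f(t) = 2^{-\lambda t}$ used by
the clusterers above), so a data point loses half of its influence every
$1/\lambda$ time steps:
$$f(t) = 2^{-\lambda t} \quad\Rightarrow\quad f(1/\lambda) = \frac{1}{2}.$$
With \code{lambda = .01}, a point's weight is therefore halved after 100
new data points have arrived.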
We apply \code{evaluate_stream()} to each of the clustering algorithms, and
we evaluate and cluster 5000 data points using the prequential evaluation
method with a horizon of 100 points. The chosen evaluation measure is the
corrected Rand index. This produces a list with $5000/100 = 50$ evaluations
for each algorithm.

<<>>=
evaluation <- lapply(algorithms, FUN = function(a) {
  reset_stream(stream)
  evaluate_stream(a, stream, horizon = 100, n = 5000, measure = "cRand",
    type = "macro", assign = "micro")
})
@

%reset_stream(stream)
%dsc <- DSC_DBSTREAM(r=.1, lambda=.01)
%dsc <- DSC_DStream(gridsize=.08, lambda=.01)
%dsc <- DSC_TwoStage(micro=DSC_Sample(k=100, biased=TRUE), macro=DSC_Kmeans(k=2))
%dsc <- DSC_TwoStage(micro=DSC_Window(horizon=100, lambda=.01), macro=DSC_Kmeans(k=2))
%animate_cluster(dsc, stream, horizon=100, n=5000, measure="crand", type="macro", assign = "micro", plot.args=list(assign=T, ylim=c(0,1), type="both"))

To plot the results, we first get the positions at which the evaluation
measure was calculated from the first element in the evaluation list and
then extract a matrix with the corrected Rand index values. Note that the
first evaluation values are again \code{NA} since we start with empty
clusterings.

<<>>=
cRand <- sapply(evaluation, FUN = function(x) x[ , "cRand"])
head(cRand)
@

We visualize the development of the evaluation measure over the stream as a
line plot and add a boxplot comparing the distributions.

<<dynamic, fig=TRUE, include=FALSE>>=
pos <- evaluation[[1]][ , "points"]
matplot(pos, cRand, type = "l", lwd = 1)
legend("bottomleft", legend = names(evaluation),
  col = 1:6, lty = 1:6, lwd = 1)
@

% #barplot(colMeans(cRand), las=2)

<<dynamic_box, fig=TRUE, include=FALSE>>=
boxplot(cRand, las = 2, cex.axis = .8)
@

\begin{figure}
\centering
\begin{minipage}{.7\linewidth}
\centering
\includegraphics[width=1\linewidth]{stream-dynamic}
\end{minipage}
\begin{minipage}{.27\linewidth}
\centering
\includegraphics[width=1\linewidth]{stream-dynamic_box}
\end{minipage}
\caption{Evaluation of data stream clustering of an evolving stream.}
\label{figure:data_bng2}
\end{figure}

Figure~\ref{figure:data_bng2} shows the corrected Rand index for the four
data stream clustering algorithms over the evolving data stream. All
algorithms show that separating the two clusters is impossible around
position 3000, when the two clusters overlap. D-Stream and DBSTREAM perform
equally well, while biased sampling and the sliding window achieve only a
lower corrected Rand index. This is easily explained by the fact that these
two algorithms cannot detect noise and thus have to assign noise points to
one of the clusters, resulting in the lower index. The behavior of the
individual clustering algorithms can be visually analyzed using
\code{animate_cluster()}.
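For example, the following call (a sketch, not executed here) replays the
stream and animates DBSTREAM's clustering while tracking the corrected Rand
index prequentially:

<<eval=FALSE>>=
## animate the clustering and prequential evaluation for DBSTREAM
reset_stream(stream)
animate_cluster(algorithms[["DBSTREAM"]], stream, horizon = 100, n = 5000,
  measure = "cRand", type = "macro", assign = "micro")
@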
The \pkg{stream} framework allows us to easily create many experiments by
using different data and by matching different clustering and reclustering
algorithms. An example of a study for clustering large data sets using an
earlier version of \pkg{stream} can be found in~\cite{hahsler:Bolanos2012}.

\subsection{Clustering a real data set}

In this section we show how to cluster the well-known and widely used KDD
Cup'99 data set. The data set was created for the Third International
Knowledge Discovery and Data Mining Tools Competition and contains
simulated network traffic with a wide variety of intrusions. The data set
contains 4,898,431 data points and we use the 34 numeric features for
clustering. The data set is available from the UCI Machine Learning
Repository~\citep{Bache+Lichman:2013} and we stream the data directly from
there. We use the first 1000 data points to center and scale the
observations in the data stream in flight.

<<eval=FALSE>>=
library("stream")
con <- gzcon(
  url(paste0("http://archive.ics.uci.edu/ml/machine-learning-databases/",
    "kddcup99-mld/kddcup.data.gz")))
stream <- DSD_ReadCSV(con, take = c(1, 5, 6, 8:11, 13:20, 23:42),
  class = 42, k = 7)
stream2 <- DSD_ScaleStream(stream, n = 1000)
@

Next, we set up D-Stream with slightly modified values for \code{gaptime}
(an increased number of points after which obsolete micro-clusters are
removed) and \code{lambda} (faster fading), and cluster the next 4 million
data points.

<<eval=FALSE>>=
dstream <- DSC_DStream(gridsize = .5, gaptime = 10000L, lambda = .01)
update(dstream, stream2, n = 4000000, verbose = TRUE)
@

\begin{figure}
\centering
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=1\linewidth]{mcs}
\end{minipage} \\ (a) \\
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=1\linewidth]{classes}
\end{minipage} \\ (b) \\
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=1\linewidth]{time}
\end{minipage} \\ (c)
\caption{Clustering 4 million data points of the KDD Cup'99 data set with
D-Stream.}
\label{figure:intrusion}
\end{figure}

In stream clustering, each data point is processed individually, and we
have recorded some key statistics averaged over intervals of 1000 points.
Figure~\ref{figure:intrusion}(a) shows the number of micro-clusters used by
the algorithm. This number is directly related to the memory used by the
algorithm. For the 34-dimensional data set used here, each micro-cluster
occupies 416 bytes of storage, leading to a maximal memory requirement of
less than 5MB for this clustering (a maximum of 12,039 micro-clusters is
reached at the end of the first quarter of the stream). The number of
micro-clusters varies significantly over the stream. This behavior can be
explained by changes in the distribution of the data.
Figure~\ref{figure:intrusion}(b) shows the number of different classes
(normal and different types of intrusions) in each 1000-point segment. It
is easy to see that the number of micro-clusters is directly related to the
number of different classes in the data.

Figure~\ref{figure:intrusion}(c) reports the clustering speed in points per
second. Here we use \proglang{R}~3.1.2 on Linux~3.16.0-28 with an Intel i5
processor at 1.9GHz and 8GB of memory, and the algorithm is implemented as
a mixture of \proglang{R} and \proglang{C++} code using the \pkg{Rcpp}
interface package~\citep{Eddelbuettel:2011,Eddelbuettel:2013}. The speed
varies significantly between 7,559 and 384,600~points per second with an
average throughput of 280,200~points per second (this measure excludes
delays caused by the network connection). The throughput remains very high
for a long stretch between points 1.5 and 3.5~million. It is easy to see
that the performance is inversely related to the number of micro-clusters,
since more micro-clusters increase the search time for updates. Clustering
the 4 million data points took a total of 65 seconds. In comparison,
$k$-means clustering using \code{kmeans} (in package \pkg{stats}) with
eight clusters (the number of classes) took 186 seconds and used at its
peak 80\% of the 8GB of available main memory (the whole data set is stored
in memory).
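Since \code{DSD_ReadCSV} reads from an open connection, the stream should
be closed once it is no longer needed. A minimal sketch (not executed
here):

<<eval=FALSE>>=
## close the connection underlying the KDD Cup'99 stream
close_stream(stream)
@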
\newpage

\section{Conclusion and future work} \label{sec:conclusion}

Package \pkg{stream} is a data stream modeling framework for \proglang{R}
that provides both a variety of data stream generation tools and a
component for performing data stream mining tasks. The flexibility offered
by the framework allows the user to create a multitude of easily
reproducible experiments to compare the performance of algorithms on these
tasks. While \proglang{R} is not an ideal environment for processing
high-throughput streams in real time, \pkg{stream} provides an
infrastructure to develop and test such algorithms. \pkg{stream} can be
used directly for applications where new points arrive at slower speeds
(less than 100,000 points per second, depending on the algorithm). Another
important application of \pkg{stream} is processing, point by point, data
that would otherwise not fit into main memory.

The presented infrastructure can be extended by adding new data sources and
algorithms, or by defining whole new data stream mining tasks. We have
abstracted each component to require only a small set of functions defined
in each base class. Writing the framework in \proglang{R} means that
developers can design components either directly in \proglang{R}, or
implement them in \proglang{Java}, \proglang{Python} or
\proglang{C}/\proglang{C++} and then write a small \proglang{R} wrapper, as
we did for some MOA algorithms in \pkg{streamMOA}. This approach makes it
easy to experiment with a multitude of algorithms in a consistent way.
Currently, \pkg{stream} focuses on the data stream clustering and outlier
detection tasks, but we are working on adding classification (e.g., using
the algorithms interfaced by \pkg{RMOA}~\citep{stream:Wijffels:2014}) and
frequent pattern mining algorithms as extensions of the base \code{DST}
class.

\section*{Acknowledgments}

Matthew Bola\~nos and John Forrest worked on \pkg{stream} when they were
undergraduate students at the Lyle School of Engineering at SMU. Both were
supported in part by the U.S. National Science Foundation as a research
experience for undergraduates (REU) under contract number IIS-0948893. Part
of this work was also supported by the National Human Genome Research
Institute under contract number R21HG005912.

%\pagebreak[4]

\bibliography{stream}

\end{document}