Institute of Computing Science, Poznan University of Technology, Poland
Adaptive ensembles for evolving data streams – combining block-based and on-line solutions
The rapid development of the information technology facilities collecting big data sets which cause challenges for their storage and processing. In particular, one faces the difficulties with massive volumes of data in the form of data streams. Compared to the static environments mining data streams implies new requirements for algorithms, such as constraints on memory usage, restricted processing time, and one scan of incoming examples. The other critical issue is reacting to concept drifts. These requirements make mining evolving data streams a complex problem which asks for new dedicated solutions.
As standard approaches are not appropriate for learning classifiers from changing data streams, several new algorithms have been introduced in the last years. Ensembles are among the most often studied classifiers. Due to their modularity, they provide a natural way of adapting to changes by modifying their structure, either by retraining ensemble members, replacing old component classifiers with new ones, or updating rules for aggregating component predictions. Most of current ensembles do not contain a drift detector and react to changes in the adaptive way. They can be further divided into block-based and online approaches.
This talk starts with a general overview of the current data stream ensembles. Then, we study differences between block-based and on-line ensembles with respect to: different reaction to various types of drifts, time and memory requirements and strategies to learn component classifiers. We hypothesize that it is still possible to develop new types of hybrid ensembles that combine the most beneficial properties of these both types of approaches. In the next part of this talk we present experiences from using such two algorithms, recently developed in our group.
The first algorithm, called Accuracy Updated Ensemble (AUE), is a more block-based oriented proposal. It includes elements of incremental updating of component ensembles and a new aggregation rule. Its experimental evaluation shows that it provides the better classification accuracy than other state-of-the-art algorithms, also with acceptable time and memory usage.
The AUE ensemble is, then, generalized into its completely incremental version, called the On-line Accuracy Updated Ensemble. Its experimental evaluation shows advantages with respect to the faster reaction to several types of drifts. Finally, we discuss open research directions for constructing ensembles from complex data streams.