Posts

Showing posts from 2014

Data Analytics Plus Big Data: A Win-win Strategy

We live in a dynamic world where new technologies arise, dominate for the blink of an eye, and then rapidly fall into oblivion. Apart from academia, it is interesting to check what the current trends in industry are. Or, even better, to see the whole picture without discriminating. In this regard, Google Trends is a handy analytics tool that allows us to track our terms of interest. The following chart shows the search interest for Machine Learning and Data Analytics. On a similar scale, I depicted the search interest for Data Analytics and Big Data. As is clearly noticeable, knowing both disciplines can open many doors in the industry. Finally, among both top and rising queries, Hadoop is the most popular term.
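For readers who want to reproduce this kind of chart programmatically, here is a minimal sketch using the unofficial pytrends package (not the tool used for the chart in this post; the keyword list and timeframe are illustrative assumptions):

```python
# Minimal sketch: fetch relative search interest with pytrends,
# an unofficial Google Trends wrapper (illustrative only).
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=0)

# Hypothetical keywords and timeframe chosen for illustration.
keywords = ['Machine Learning', 'Data Analytics', 'Big Data']
pytrends.build_payload(kw_list=keywords, timeframe='2010-01-01 2014-12-31')

# interest_over_time() returns a pandas DataFrame indexed by date,
# with one 0-100 relative-interest column per keyword.
interest = pytrends.interest_over_time()
print(interest.tail())

# A quick comparison plot of the three series (requires matplotlib).
interest[keywords].plot()
```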

Gradient Boosting: Fostering accuracy even further

As in many real-world situations, union makes algorithms stronger. With this philosophy in mind, ensemble methods combine several weak classifiers into one that is much stronger in terms of accuracy. In the last post we went through a primer with Random Forest, so the next cornerstone is gradient boosting. I have mentioned Gradient Boosting many times in this blog, but I only commented on the fundamental ideas without going into the details. In this entry I will share my two cents. First, a little bit of history: recall the Kaggle-Higgs competition. The top scores on the leaderboard have been obtained with distinct forms of gradient boosting, and XGBoost is directly responsible for many of them. The question is, then: how does this algorithm work? Figure 1. A high-level description of the Gradient Boosting method I programmed. Informally, Gradient Boosting generates a sequence of classifiers in the form of an additive expansion, that is…
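The excerpt stops short of the details, but the additive-expansion idea can be illustrated with a minimal sketch (not the implementation described in Figure 1): squared-error gradient boosting for regression, where each new tree is fit to the residuals, i.e. the negative gradient of the loss.

```python
# Minimal gradient boosting sketch for regression with squared error.
# Illustrative only; not the method from Figure 1 of the post.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Build the additive expansion F(x) = F0 + lr * sum_m h_m(x)."""
    f0 = np.mean(y)                       # initial constant model
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction        # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)            # weak learner fit to the residuals
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    prediction = np.full(X.shape[0], f0)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction
```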

Toward ensemble methods: A primer with Random Forest

The Kaggle-Higgs competition has attracted my attention very much lately. In this particular challenge, the goal is to generate a predictive model out of a bunch of data taken from distinct detectors of the LHC. The numbers are as follows: 250,000 properly labeled instances, 30 real-valued features (containing missing values), and two classes, namely background {b} and signal {s}. The site also provides a test set consisting of 550,000 unlabeled instances. There are nearly 1,300 participants as I write this post, and many distinct methods are being put to the test to win the challenge. Very interestingly, ensemble methods occupy the head of the leaderboard, and the surprise is XGBoost, a gradient boosting method that makes use of binary trees. After checking the computational horsepower of the XGBoost algorithm myself, I decided to take a closer look at ensemble methods. To start with, I implemented a Random Forest, an algorithm that consists of many independent…
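As a rough illustration of the idea (a sketch, not the implementation mentioned in the post): each tree is grown on a bootstrap sample of the training data, a random subset of features is considered at each split, and the forest predicts by majority vote.

```python
# Minimal Random Forest sketch: bootstrap samples + randomized trees + majority vote.
# Illustrative only; not the implementation referenced in the post.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, rng=np.random.default_rng(0)):
    trees = []
    n = X.shape[0]
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample
        tree = DecisionTreeClassifier(max_features='sqrt')   # random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(X, trees):
    # Majority vote; assumes non-negative integer class labels (e.g. 0/1).
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    return np.array([np.bincount(votes[:, j]).argmax() for j in range(votes.shape[1])])
```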

Boosting the accuracy rate by means of AdaBoost

Several machine learning techniques foster cooperation between subsolutions, obtaining an accurate outcome by combining many of them. Michigan-style LCSs are one of these families. However, the best known are those that implement Boosting, and AdaBoost is the most successful of them (or, at least, the most studied one and the first to implement the ideas of Boosting). AdaBoost generates accurate predictions by combining several weak classifiers (also referred to as “the wisdom of the crowd”). The most outstanding fact about AdaBoost is that it is a remarkably simple algorithm. The key to its success lies in the combination of many weak classifiers: these are very limited and their error rate is only slightly better than random guessing (hence their name). The typical implementation uses decision stumps (i.e., one-level binary trees) that minimize the error between the prediction and the ground truth (that is, the desired outcome), so the classifiers have the form "if variable_…
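To make the excerpt concrete, here is a minimal sketch of the discrete AdaBoost loop with decision stumps (labels assumed to be in {-1, +1}; an illustration, not the exact implementation discussed in the post): misclassified instances get their weights increased so the next stump focuses on them, and the final prediction is a weighted vote.

```python
# Minimal AdaBoost sketch with decision stumps; labels in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = X.shape[0]
    weights = np.full(n, 1.0 / n)                 # uniform instance weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # weak learner: a decision stump
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)     # stump's weight in the final vote
        weights *= np.exp(-alpha * y * pred)      # up-weight misclassified instances
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```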

The challenge of learning from rare cases

An important challenge is learning from domains that do not have the same proportion of classes, that is, learning from problems that contain class imbalances (Orriols-Puig, 2008). Figure 1 shows a toy example of this issue. It is challenging because (1) in many real-world problems we cannot assume a balanced distribution of classes and (2) traditional machine learning algorithms cannot induce accurate models in such domains. Oftentimes the key knowledge to solve a problem that previously eluded solution is hidden in patterns that are rare. To tackle this issue, practitioners rely on re-sampling techniques, that is, algorithms that pre-process the data sets and either (1) add synthetic instances of the minority pattern to the original data or (2) eliminate instances from the majority class. The first type is called over-sampling and the latter, under-sampling. Figure 1. Our unbalanced data set. Black dots are the majority class (0) and red…
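As a minimal illustration of both strategies, here is a plain sketch using random re-sampling (more sophisticated techniques such as SMOTE generate genuinely synthetic minority instances rather than duplicates):

```python
# Minimal re-sampling sketch: random over-sampling of the minority class
# and random under-sampling of the majority class.
import numpy as np

def random_over_sample(X, y, minority_label, rng=np.random.default_rng(0)):
    """Duplicate random minority instances until both classes have the same size."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_under_sample(X, y, minority_label, rng=np.random.default_rng(0)):
    """Drop random majority instances until both classes have the same size."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]
```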

Learning Classifier System Open Repository

It is time to announce the Learning Classifier System Open Repository (LCSOR), a meeting point to foster the popularization of these exciting evolutionary computation techniques. In this repository I will upload distinct implementations of different Learning Classifier Systems. To start with, I uploaded a first version of the sUpervised Classifier System (UCS). Note that I created a page for this content under the LCSOR label. I hope practitioners will enjoy this experience.

Robust on-line neural learning classifier system for data streams

The ever-increasing integration of technology in the different areas of science and industry has led practitioners to design applications that generate stupendous amounts of data on-line (e.g., smart sensors or network monitoring, just to mention two real cases). Extracting information from these data is key in order to gain a better understanding of the processes that the data describe. However, learning from these data poses new challenges to traditional data mining techniques, which are not designed to deal with data streams: data in which concepts and noise vary over time. In this regard, I am proud to present the supervised neural constructivist system (SNCS), a neural Michigan-style Learning Classifier System that has been designed to provide a fast reaction capacity and adaptability to the distinct possible changes in concept (concept drifts) under varying noise levels. Inheriting the intrinsically on-line fashion of Michigan-style learning classifier systems…
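SNCS itself is not reproduced here, but the on-line (prequential, test-then-train) setting that such stream learners face can be sketched with any incremental classifier; the snippet below uses scikit-learn's SGDClassifier purely as a stand-in, and the stream with an abrupt concept drift is an illustrative assumption.

```python
# Prequential (test-then-train) sketch on a synthetic data stream with concept drift.
# SGDClassifier is only a stand-in for an incremental learner such as SNCS.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])

correct, seen = 0, 0
for t in range(20000):
    x = rng.normal(size=(1, 2))
    # Illustrative abrupt concept drift: the decision boundary flips at t = 10000.
    concept = 1 if t < 10000 else -1
    y = np.array([int(concept * (x[0, 0] + x[0, 1]) > 0)])

    if seen > 0:                                  # test first ...
        correct += int(model.predict(x)[0] == y[0])
    model.partial_fit(x, y, classes=classes)      # ... then train
    seen += 1

print('prequential accuracy:', correct / (seen - 1))
```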