CHAPTER 1 Introduction
Predictive analytics and data mining have been growing in popularity in recent years. In the introduction we define the terms “data mining” and “predictive analytics” and their taxonomy. This chapter covers the motivation for and the need for data mining, introduces key algorithms, and presents a roadmap for the rest of the book.
CHAPTER 2 Data Mining Process
Successfully uncovering patterns using data mining is an iterative process. Chapter 2 provides a framework to solve the data mining problem. The five-step process outlined in this chapter provides guidelines on gathering subject matter expertise; exploring the data with statistics and visualization; building a model using data mining algorithms; testing the model and deploying it in a production environment; and finally reflecting on new knowledge gained in the cycle. As data mining practices have evolved over the years, various academic and commercial bodies have put forward different frameworks for the data mining process, such as the Cross Industry Standard Process for Data Mining (CRISP-DM) and knowledge discovery in databases (KDD). These frameworks exhibit common characteristics, and hence we will be using a generic framework closely resembling the CRISP-DM process.
CHAPTER 3 Data Exploration
Data exploration, also known as exploratory data analysis (EDA), provides a set of simple tools to achieve a basic understanding of the data. The results of data exploration can be extremely useful in grasping the structure of the data, the distribution of the values, and the presence of extreme values and interrelationships within the data set. Descriptive statistics is the process of condensing key characteristics of the data set into simple numeric metrics. Some of the common metrics used are mean, standard deviation, and correlation. Visualization is the process of projecting the data, or parts of it, into Cartesian space or into abstract images. In the data mining process, data exploration is leveraged in many different steps including preprocessing, modeling, and interpretation of results.
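As a rough illustration of the descriptive statistics mentioned above, the following sketch computes the mean, the sample standard deviation, and the Pearson correlation using only the Python standard library. The two attributes (`rooms` and `price`) and their values are hypothetical toy data invented for this example; they are not taken from any data set used in the book.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each attribute.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical toy attributes (assumed values, for illustration only).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [15.0, 18.0, 24.0, 30.0, 33.0]

summary = {
    "mean_rooms": mean(rooms),
    "std_rooms": stdev(rooms),  # sample standard deviation (n - 1 denominator)
    "corr_rooms_price": pearson_correlation(rooms, price),
}
print(summary)
```

A correlation close to 1 for these toy values indicates that the two attributes rise together, which is the kind of interrelationship data exploration is meant to surface before modeling.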
CHAPTER 4 Classification
In classification or class prediction, we try to use the information from the predictors or independent variables to sort the data samples into two or more distinct classes or buckets. Classification is the most widely used data mining task in business. There are several ways to build classification models. In this chapter, we will discuss and show the implementation of six of the most commonly used classification algorithms: decision trees, rule induction, k-nearest neighbors, naïve Bayesian, artificial neural networks, and support vector machines. We conclude this chapter with building ensemble classification models and a discussion on bagging, boosting, and random forests.
CHAPTER 5 Regression
This chapter covers two of the most popular function-fitting algorithms. The first is the well-known linear regression method used commonly for numeric prediction. We describe briefly the basics of regression and explain with the classic Boston Housing data set how to implement linear regression in RapidMiner. We also include a discussion on feature selection and provide some checkpoints for correctly implementing linear regression. The second is the more recent logistic regression method used for classification. We explain the basic concepts behind calculation of the logit and how this is used to transform a discrete label variable into a continuous function so that function-fitting methods may be applied. We close with two methods of implementing logistic regression using RapidMiner and show how to use the MetaCost operator to improve the performance of a classification method (such as logistic regression).
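To make the idea of the logit concrete, here is a minimal sketch, assuming the standard definition of the logit as the natural logarithm of the odds, p / (1 - p). The function names `logit` and `inverse_logit` are our own choices for this illustration and do not correspond to any RapidMiner operator.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the log-odds, an unbounded continuous value."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a log-odds value back to a probability (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))


# Probabilities on either side of 0.5 map to log-odds on either side of 0.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 4))
```

Because the log-odds can take any real value, a linear function of the predictors can be fitted to it, and the inverse transformation converts the fitted value back to a class probability.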
CHAPTER 6 Association Analysis
In the retail industry, market basket analysis explores the relationship between products by considering the co-occurrence of purchases in previous transactions. Association analysis is a generalization of applications like market basket analysis and is now commonly applied in clickstream analysis, cross-selling recommendation engines, and information security. Association analysis is an unsupervised data mining technique where there is no target variable to predict. Instead, the algorithm reviews each transaction containing a number of items (products) and extracts useful relationship patterns among the items in the form of rules. The challenge in association analysis is to differentiate significant observations from spurious rules. The Apriori and FP-Growth algorithms offer efficient approaches to extract these rules from large data sets in the transaction logs.
CHAPTER 7 Clustering
Clustering is an unsupervised data mining technique where the records in a data set are organized into different logical groupings. The groupings are formed so that records inside the same group are more similar to one another than to records outside the group. Clustering has a wide variety of applications ranging from market segmentation to customer segmentation, electoral grouping, web analytics, and outlier detection. Clustering is also used as a data compression technique and as a data preprocessing technique for supervised data mining tasks. Many different data mining approaches are available to cluster the data; they are based on proximity between the records, density in the data set, or novel applications of neural networks. K-means clustering, density clustering, and self-organizing map techniques are reviewed in the chapter along with implementations using RapidMiner.
CHAPTER 8 Model Evaluation
This chapter describes three commonly used tools for evaluating the performance of a classification algorithm. We first introduce the confusion matrix and provide the definitions for several terms that are used in conjunction, such as sensitivity, specificity, recall, etc. We then describe how to construct receiver operating characteristic (ROC) curves and show when it would be appropriate to use them along with the area under the curve (AUC) concept. Finally we present lift and gain charts, and show how to construct and interpret them. The RapidMiner implementation includes step-by-step processes for building each of these three very useful evaluation tools.
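The confusion matrix terms mentioned above can be sketched in a few lines. The example below assumes a binary problem with labels 1 (positive) and 0 (negative); the label vectors are hypothetical toy data, and the function name `confusion_counts` is our own.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical labels (assumed values, for illustration only).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # also called recall or true positive rate
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)

print(tp, fp, tn, fn)
print(sensitivity, specificity, precision)
```

Sensitivity and recall are two names for the same quantity, which is why both appear in the list of terms built on the confusion matrix.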
CHAPTER 9 Text Mining
This chapter provides a detailed look into the emerging area of text mining and text analytics. It starts with a background of the origins of text mining and provides the motivation for this fascinating topic using the example of IBM's Watson, the Jeopardy!-winning computer program that was built almost entirely using concepts from text and data mining. The chapter introduces some key concepts important in the area of text analytics such as TF-IDF scores. Finally it describes two hands-on case studies in which the reader is shown how to use RapidMiner to address problems like document clustering and automatic gender classification based on text content.
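As an illustration of the TF-IDF score mentioned above, the sketch below uses one common formulation: term frequency is the share of a document's tokens that match the term, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. Several variants of these formulas exist, so this is an assumed definition and not necessarily the one a given tool applies. The three tokenized documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


# Hypothetical tokenized documents (assumed, for illustration only).
documents = [
    ["data", "mining", "patterns"],
    ["text", "mining", "words"],
    ["data", "mining", "models"],
]

scores = tf_idf(documents)
for doc_scores in scores:
    print({term: round(score, 4) for term, score in doc_scores.items()})
```

Under this formulation a term that appears in every document, such as "mining" here, scores zero, while a term unique to one document scores highest within that document.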
CHAPTER 10 Deep Learning
CHAPTER 11 Recommender Engines
Recommendation engines are a class of machine learning techniques that predict a user's preference for an item. There is a wide range of techniques available to build a recommendation engine. This chapter discusses the most common methods, starting with collaborative filtering and content-based filtering, and their implementation using a practical data set. The advent of the digital economy has exponentially increased the choices of items available to customers, which can be overwhelming. Personalized recommendation lists help by narrowing the choices to a few items relevant to a particular user and aid users in making the final consumption decision. Recommendation lists, created by recommendation engines, are one of the most prolific uses of machine learning in everyday experience.
CHAPTER 12 Time Series Forecasting
This chapter provides a high-level overview of time series forecasting and related analysis. It starts by pointing out the clear distinction between standard supervised predictive models and time series forecasting models. It provides a basic introduction to the different time series methods, ranging from data-driven moving averages to exponential smoothing, and also discusses model-driven forecasts including polynomial regression and lag-series-based ARIMA methods. Finally it explains how to implement lag-series-based forecasts using the Windowing operation in RapidMiner. It points out that the implementation of time series in RapidMiner is based on a hybrid concept of transforming series data into “cross-sectional” data, which is the standard data format for supervised predictive models.
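The transformation of series data into "cross-sectional" data can be sketched as follows. This is a minimal illustration of the general windowing idea, not a reproduction of RapidMiner's Windowing operation; the function `window_series`, its parameters `window_size` and `horizon`, and the series values are all assumptions made for this example.

```python
def window_series(series, window_size, horizon=1):
    """Turn a series into (lagged features, label) rows.

    Each row uses `window_size` consecutive values as features and the
    value `horizon` steps after the window as the label to predict.
    """
    rows = []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        features = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        rows.append((features, label))
    return rows


# Hypothetical series (assumed values, for illustration only).
series = [10, 12, 13, 15, 18, 21]

rows = window_series(series, window_size=3)
for features, label in rows:
    print(features, "->", label)
```

Each resulting row is an ordinary example with a fixed set of attributes and a label, so any supervised learner that predicts a numeric value can be trained on it.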
CHAPTER 13 Anomaly Detection
Anomaly detection is the process of finding outliers in the data set. Outliers are the data objects that stand out amongst other objects in the data set and do not conform to the normal behavior in a data set. Anomaly detection is a data mining application that combines multiple data mining tasks like classification, regression, and clustering. The target variable to be predicted is whether a transaction is an outlier or not. Since clustering tasks identify outliers as a cluster, distance-based and density-based clustering techniques can be used in anomaly detection tasks.
CHAPTER 14 Feature Selection
This chapter introduces a preprocessing step that is critical for a successful predictive modeling exercise: feature selection. Feature selection is known by several alternative terms such as attribute weighting, dimension reduction, and so on. There are two main styles of feature selection: filtering the key attributes before modeling (filter style) or selecting the attributes during the process of modeling (wrapper style). We discuss a few filter-based methods such as PCA, information gain, and chi-square, and a couple of wrapper-type methods like forward selection and backward elimination.
CHAPTER 15 Getting Started with RapidMiner
The final chapter is the kickstarter for the main tools that one would need to become familiar with in building predictive analytics models using RapidMiner. We start out by introducing the basic graphical user interface for the program and cover the basic data exploration, transformation, and preparation steps for analysis. We also cover the process of optimizing model parameters using RapidMiner’s built-in optimization tools. With this high-level overview, one can go back to any of the earlier chapters to learn about a specific technique and understand how to use RapidMiner to build models using that machine learning algorithm.