Forgot Password?

DATA MINING FOR BEGINNERS - XLMiner

This tutorial aims to introduce the practical aspects of data mining to beginners. It is assumed that the reader is aware of the very basic techniques of data mining – at least, theoretically. At the end of this tutorial, the reader will be aware of the basic procedures of data mining in XLMiner – an add-in for MS Excel.

What is Data Mining?

Definition: Data Mining (DM) is the extraction of hidden, predictive information patterns from large databases.

Data Mining is especially useful now-a-days when there is massive amount of data and identifying the useful portions of it can be a tedious job in itself. DM allows us to be proactive about situations rather than retrospective – this means that we can now try and predict the future trends rather than identifying them after they have already taken place.

Where should my experience with DM start?

For a newbie, all the DM techniques – though very easy to understand theoretically – can be quite baffling practically. You might think that the question – “So where do I DO Data Mining?” is very illogical, but in fact, it is a very good one. So how and where should your first encounter with DM be?

The answer is astonishingly simple. I bet that you have Microsoft Office on your system. MS Excel is the perfect place to wet your feet as far as DM is concerned. Excel is a very under-estimated application. People just think that it is a spreadsheet and that it does some math using macros. However, there is an add-in available for MS Excel that allows us to perform Data Mining on the data sets. This add-in is called XLMiner.

http://www.resample.com/xlminer/images/xl_miner_logo.gif
Data Mining Add-In For Excel

 

XLMiner has been developed by Resampling Stats. Inc. Resampling Stats is located in Arlington, Virginia, USA. In the summer of 2006 it was merged into statistics.com , LLC. It usually makes and markets software that are related to statistics.

XLMiner is marketed by Cytel Inc which also deals in statistical software
A 30-day free evaluation copy of XLMiner can be downloaded from
http://www.resample.com/xlminer/download.shtml

XLMiner – an Introduction

XLMiner is the only tool in Excel to do data mining – it includes capabilities that allow a miner to work with partitioning, neural networks, classification and regression trees, association rules, nearest neighbours, etc. Moreover, it is an excellent DM “get started” tool. It can be called a Business Intelligence tool.
XLMiner provides solutions that are statistical as well as machine learning oriented. Hence, there are numerous ways to try to solve a problem and it is the task of a miner to determine which method would be most appropriate to his problem.

Capabilities of XLMiner

XLMiners offers the following capabilities.

Partitioning: -

As the name suggests, partitioning means dividing the data into partitions that do not overlap. To mine a large data base, it is divided into three partitions: training, validation and test.

There are two ways of partitioning in XLMiner:

  1. Random:The training set is chosen randomly from among any of the sets available in the main database. The following screenshot shows a very simple set of data I have used as an example here.

Description of the sample data: The following data set contains the names of the students (NAME) , their year of passing out (YEAR), the percentage scored by them (%) and their mentor during that passing out year (MENTOR).

Sample Data
To partition this data randomly, do the following:

Choose XLMiner -> Partition Data -> Standard Partition


The above window will appear
The “Variables in the partitioned data” specify which variables (columns) will be present in the data sets. You can choose them from “Variables”.
There are three ways the partitioning percentages can be chosen:

When you click on OK, XLMiner processes the data and then partitions it.

The Output Sheet
When you click on “Training Data” or “Validation Data” or “Test Data”, the “Selected variables” shows the appropriate data set. In the above screenshot, the “Training data” is selected.

  1. User-defined Partitioning:

In this method, an extra partitioning column is added to the data. Its values can be any one of the following:
t – Training Data
v – Validation Data
s – Test Data
At least one row in the data has to have a v value in this column. The new data looks like this:

New Data set – with training, validation and test rows labelled in the “VAR” column

When you select VAR as the partitioning variable and click OK, the data is partitioned.


The Output Sheet

 

Data Utilities: -

SAMPLING:
When huge amounts of data are involved, statisticians prefer taking a sample of the data that represents the entire database. However, such a representative sample is very difficult to obtain. XLMiner provides sampling facilities.

Choose XLMiner -> Data Utilities -> Sample From Worksheet
Description of the sample data: The following data set contains the names of the students (NAME) , their year of passing out (YEAR), the percentage scored by them (%) and their mentor during that passing out year (MENTOR).


Sampling in XLMiner can be done in two ways:


Output of Simple Random Sampling

XLMiner allows the choice of a stratum variable which has less than 30 distinct values only. The records can be picked from the strata in three ways:


Output of Stratified Random Sampling

Prediction: -

XLMiner provides four different techniques for prediction. They are as follows:

  1. Multiple Linear Regression

This procedure allows us to perform linear regression on a particular dataset. In linear regression, X is said to be a predictor variable (independent) and Y is said to be a prediction (dependent) variable. We predict the Y value for a set of X values after observing the X and Y values of a known data set.
Y= b0 + b1X1 + b2X2+ .... + bkXk+ e 

b = coefficients to assign weight
e = error
Example:
Description of sample data: The data set consists of the name of the student (NAME), the year of passing out (YEAR), the percentage of marks obtained (%), the name of the student’s mentor (MENTOR), the grade in subject of SUPW (SUPW) and the grade in internal assessment (INTERNAL).

Sample Data
We then partition the data using random partitioning.

Partitioned Data
Now select the Prediction -> Multiple Linear Regression from the XLMiner menu.

Select Multiple Linear Regression

 

The “Input Variables” are the independent variables and the “Output variables” are the dependent variables.  The variable that assigns a weight to each of the records is mentioned in the “weight variable”.
The following dialog box is used to select the various statistics to be displayed along with the output.

The outputs are as follows:


 

  1. K-Nearest Neighbours

The k-NN technique is like the regression technique – here, the nearest neighbours to a particular object have more weight than the distant ones. An object is classified according to a vote by its neighbours. It is then classified to the class most common in its k-neighbours.

  1. Regression Tree:

A single output (prediction) variable, which should be numerical, and one or more input (predictor) variables exist. The input variables can be a mixture of continuous and categorical variables. A regression tree is a decision tree where each node of the tree tests the value of the predictor variable to determine the prediction variable. The leaf nodes of the tree contain the output variables. 
Regression trees are not used for classification; rather, they are used to approximate real-valued functions.

  1. Neural Networks

The model for the artificial neural network is made when the records from the data base are processed one at a time and the computed value of their output is then compared with the actual value. The difference is again taken into account and fed back to the model so as to perfect it. This can go on for many iterations.
XLMiner offers the architecture of multilayer feed-forward for modelling a neural network.
After partitioning the data (as done in the previous example of prediction), select XLMiner -> Prediction -> Neural Network (Multilayer Feedback). The training of the model is a three step process.
Step 1:

 

Step 2:

 

Step 3:

The output is as follows:

Affinity – Association Rules

 

Association rules are used to find out the interesting and useful relationships between data that occur frequently enough to be called a pattern (or a trend) and hence, can be formulated into a rule. Each of these rules has an if-then structure with an antecedent and a consequent and has three properties associated with it – support, confidence and lift.
Support is the number of records that contain both the antecedent and consequent i.e. the number of records for which the rule holds true.
Confidence is the ratio of the support to the number of the records where the antecedent occurs (i.e. a ratio of the number of records where the rule holds true to the total number of records where antecedent occurs).
The third parameter is the lift.
Lift = confidence/ (ratio of the number of records containing the consequent to the total number of records)
Example:

Description of the sample data: The following data set is a very simplified version of the data of an Anti-Virus firm. FREEWARE, BASIC, MEAN and PRO are the anti-virus solutions it offers to its customers.

Select XLMiner -> Affinity -> Association Rules...

Note: Make sure that you choose the data range from column B instead of column A as column A contains non-binary values and the input data format is specified as binary.
The output is as follows:

Once you click on a cell, it shows the corresponding rule in the yellow part of the sheet.

Classification

  1. Classification Tree/Decision Tree: -

These trees are very useful to classify/predict outcomes. They generate simple rules that can easily be translated to a natural query language. The decision trees work by binary recursive partitioning – i.e. they keep on classifying a record by checking whether it meets the criteria at a node or not. Since the partitioning is binary, it is essential that the nodes be divided such that they represent mutually exclusive conditions.
To create a classification tree, first partition the data randomly as already described above. Then,  from the XLMiner menu, choose Classification -> Classification Tree.
Step 1:

Step 2:

Step 3:

Output:




  1. Classification using Naive Bayes: -

If you ever did probability in school, you will be aware of the Bayes Theorem. This theorem is applicable to independent events only, i.e. the value of one variable will not affect that of the others. If there are say, 10 variables that a classification technique has to consider, the Bayes theorem does classification by taking each variable into account separately.

 

  1. K-nearest neighbours: -

In k-nearest neighbours classification (k-NN), for each record, the k-nearest neighbours (nearness is defined by the Euclidean distance to the record in question) are identified and the class a majority of them belong to is determined. The original record is also attributed to the same class.

Example: (using the same data set as used in classification tree example)
Do a standard partition of the data as explained above. Then, from the XLMiner menu, choose Classification -> k-Nearest Neighbours.
Step 1:

Step 2:

The output is as follows:




  1. Logistic regression: -

This form of regression is very similar to the linear regression but is done when the output variable can have only two values (like true/false, 0/1) and the input variable can have any type of value – categorical, numerical, continuous, etc.

  1. Discriminant analysis: -

Discriminant analysis is used to find the linear combination of features that separate two classes. Though discriminant analysis is related to regression techniques mentioned earlier on in the tutorial, the difference lies in the fact that linear regression deals with numerical data whereas discriminant analysis deals with categorical data.

Time Series Analysis

Time series analysis is done to understand the distribution of data points over time and such an analysis is useful for the purpose of prediction – for example, the future trends over time can be predicted by information extracted from the past performance.
In XLMiner, there are two exploratory techniques used for time series analysis:

  1. ACF – Auto Correlation Function:

XLMiner finds the correlation between variables with different lags. For example, a lag of 1 compares variable 1 with variable 2 and variable 2 with variable 3 and so on. A lag of 2 compares variable 1 with variable 3 and variable 2 with variable 4 and so on. When we plot this multitude of sets, we can determine whether a correlation exists between variables if some of the plots lie beyond the two confidence levels.

  1. PACF - Partial Auto Correlation Function:

The PACF is used to find out and plot the partial autocorrelations of a time series dataset.
The steps that are usually followed in time series analysis are:

 

Data Reduction and Exploration

Data exploration is an approach to analyze data for the purpose of formulating hypothesis that can be worth testing.

  1. PRINCIPAL COMPONENTS ANALYSIS: -

The PCA is a mathematical function that is used to transform a number of correlated variables into a smaller number of uncorrelated variables. These uncorrelated variables are called Principal Components. Thus, we get a data set which has a lesser number of variables but the variability of data is maintained since the first principle component takes into consideration the maximum amount of variation in data and others after it consider slightly lesser amounts of variability into account. 

  1. CLUSTER ANALYSIS: -

Cluster analysis is also called data segmentation. Its primary objective is to assign objects to the same clusters such that those within a cluster have marked similarities and those in different clusters have marked differences.
Cluster analysis can be done in two ways:

  1. k-Means Clustering: -

In k-means clustering, the clustering procedure begins with a single cluster that is successively split into two clusters. This continues till the required number of clusters is obtained.

  1. Hierarchical Cluster Analysis: -

Hierarchical clustering itself can be done in two ways – agglomerative and divisive clustering. In agglomerative clustering, as the name suggests, distinct objects are combined to form a group of objects having some similarities. In divisive clustering, objects are grouped into finer groups successively.
Description of Sample Data: The sample data set consists of six fields – NAME (of a state), ID, A, B, C and D, where the last four are certain parameters on the basis of which the states are to be clustered to find out the similarities or differences between them.
Step 1:

Step 2:

Step 3:

The output is as follows:



 

 

Charting

XLMiner provides a charting feature too to chalk up graphs and charts for a more visual and convenient representation of data.

  1. HISTOGRAM: -

A histogram is a bar graph. It has frequency of occurrence on the Y axis and the variable to be examined on the X axis.
Select XLMiner -> Charts -> Histogram.
The following example will show the distribution of the sales of a company.

Output:

  1. BOX PLOT: -  

A box plot is an efficient method of displaying a five member data summary. The five members are:

Also, the box plot is not affected by outliers – i.e. inconsistent or aberrant data.


We can also have a box plot of two Y variables that have the same or different X values.

  1. MATRIX PLOT: -

A matrix plot is like a scatter graph – it is used to summarize the relationship between several variables in a matrix of true X-Y plots. In XLMiner, we can compare up to maximum 8 variables.

Output:

Case Study

 

Let us begin with the simplest and most common data set– the following data is the record of the stock prices of a company registered on a stock exchange.
Description: The following data set is a set of 25 days’ stock prices of a stock registered on the stock market. In this case study, we aim to train a model that will help to predict the adjusted closing rate (using the technique of multiple linear regression MLR) of the stock with volume of the stock as the weight variable and the Open, High, Low and Close rates as the input variables.
Step 1:

Step 2:

The output is as follows:
   
Notice the difference between the predicted value and the actual value. The model seems to be acceptable since the values are close enough.