Data Mining with MS Excel Data Mining
Even though the advancement in the field of data mining has been stupendous, it still remains a complex task to be performed by experts.
In order to bridge the gap between the common user and the complex data mining process, Microsoft has introduced a new and efficient data mining tool ,the Microsoft SQL Server 2005 Data Mining Add-Ins for Office 2007 putting data mining within the reach of every user or desktop.
The add-in provides an easy way to reap the benefits of the data mining by harnessing the sophisticated data mining algorithms of Microsoft SQL Server 2005 Analysis Services within the familiar Microsoft Office environment at every desktop.
We shall now dig into the more intricate details of this new tool. The add-in leverages the power of SQL Server Analysis Services; the user need not have any prior knowledge of data mining. We can use this add-in to gain valuable insight into complex sets of data with just a few mouse clicks directly in Microsoft Excel.
Who can use this add-in?
This add-in combines the powerful mining engine of SQL Server Analysis Service and the intuitive and user –friendly interface of Microsoft Excel .This add-in can be used by any person with the basic knowledge Excel and no prior experience in data mining is necessary. The add-in can be used to perform data-mining using a few clicks and the add-in employs advanced mining algorithm and also eliminate the difficult task of configuring the SQL server .For those with past experience in data mining, the add-in could be used to perform very complex and accurate data mining with ease.
This add-in can be used by SME business managers who want to keep track of their company’s growth, measure its performance on a day-to-day basis. It can also be used by financial officers who wish to perform simple or complex analysis of the incoming and outgoing cash/credit (flow of money) , or could be used by a provision merchant who wishes to forecast sales, analyze previous buying habits of customers or by supermarkets where data mining could prove very useful in improving the sales. The add-in adequately serves the purpose of data mining if any researcher with a non-computer background or students desire to do data mining .This tool can allow colleges or schools to analyze which topics are more difficult or easy, which teaching techniques are better, the performance of individual students so that more attention could be paid towards weak students . It can also be used to analyze the monthly expenses at home and forecast of examine the monthly expenditure (as which month was expenditure on luxury more, what are the items that are unnecessary or costly) etc or it can be employed by anyone who just wishes to learn more about data mining or analyze some data to find interesting patterns in it.
The add-in can be downloaded from the following link: DOWNLOAD LINK.
The software pre-requisites for using the add-in are:
- Microsoft Office 2007 installed.
- Microsoft SQL Server 2005 or above installed.
- Microsoft .NET 2.0 framework or higher (for SQL server 2008 only).
- Microsoft PowerShell (for SQL server 2008 only)

Once the add-in is installed, you can see the DATA MINING tab in the EXCEL ribbon.
The tab contains different options like:
- Data Preparation
- Explore Data.
- Clean Data.
- Partition Data.
- Data Modeling.
- Classify.
- Estimate.
- Cluster.
- Associate.
- Forecast.
- Advanced.
- Accuracy and Validation
- Accuracy Chart.
- Classification Matrix.
- Profit Chart.
- Model Usage
- Browse.
- Query.
- Management.
- Connection
- No Connection.
- Trace.
- Help.
Data Preparation:
As the name suggests this block deals with preparing the data for mining, converting it to the proper format. The data preparation is the most important part of the data mining process as data can only be analyzed if it is structured in a proper format if accurate reports are our goal. This is done by the three tools provided for this purpose for this purpose:
- Explore Data: This tool helps us to create a histogram for any column in the table.
- Clean Data : Using this tool we can specify maximum and minimum values for data that we require in particular column,
- Explore Data: This tool uses a given column from the table and plots histogram .The histogram provides us insight on the distribution of data and the occurrence of a set of values enabling us to explore which discrete value of group of values dominate our data set .
For example:

Here we have used the tool to explore the Income column of the data set. We can see that maximum of the customers have income between the range of 30000 to 50000 and very few people have income in the range 150000-170000, so that we may market our product accordingly.
If required we can add this data as a column in our table.

Here, we can see how maximum of the customers are have ages between 35 years – 40 years with most customers 40 years old.
- Clean Data:
- Outliers: This tool helps to identify outlying values or rare values that exist beyond a give value or below it within the table which may be exceptions thus making the table data inconsistent. After detecting outliers we may choose to change their values to average or null.


Here we apply the tool to the income table in detect the values that are rare or discrete and to change values beyond or above a given values. Here we wish to set the values bet min (27580) and max (144500) and choose to delete the values of records outside these limits to specified limits causing 164 records that were out of limits to get deleted automatically. We may choose any of the above options available.
- Re-label Data:
This tool helps us identify all the unique values that exist in a column .It also allow us to set new labels to these unique values. For example, in the Gender column we have “Male “ and “Female” but want it to be denoted by “M” and “F” we may re-label it using this tools.

Here we have “Number of children” with unique values as 1,2,3,4 and five. We may choose to change these original labels “one”,”two”,”three”,”four”,”five” respectively by using the re-label tool as above.

- Partition data:
This tools helps us partition data into two parts(one for training and one for testing) or different blocks which are discrete or overlapping and then analyze each one and decide on the most accurate conclusion .The three types of partitions are
i) Training and Testing: Based on given ratios the tool divides the data set into two random partitions, the “Training set” and the “Testing set”. The training set will be used to create a mining model and based on it this model will be applied on the testing set to test the accuracy of the model generated using Accuracy and Validation tools.

Here we specify training data to contain 60 % of randomly selected records while remaining is contained in testing partition.

This is the testing set. (40%)

Ii) Random Sampling:
Suppose we contain 1000 records in a data set and wish to analyze a percentage or number of rows, we can this tool to either partition a subset of rows we quantify or based on percentage it will calculate the number of rows to partition.

For example we may specify 20 rows of 1000 rows to be made a partition or 15% (150) rows as a partition. Note that the rows contained in the partition re randomly selected from the entire table and every record has an equal probability of ending in the portioned set. This partition will be copied on a new worksheet.
Here we have selected 70% data to be split to a new worksheet
iii) Oversampling:
In oversampling we can create a partition that contains a balance of a particular column.


For example we might select to have 40%unmarried and 60% married people in our partition. Here we select a partition of 30 rows containing randomly selected people in a ratio of 30% married and rest single.
Data modeling:
This part of tools is the actual tools that will perform the data mining. This provides tools that can provide us with various data mining models using complex algorithms in built in SQL Server without the hassle of configuring the server.
It also has options that can be used to configure our model further or select a desired data mining algorithm.

- Classify: This tool allows us to build a decision tree using the data that is present in your worksheet. A decision tree is a model that extracts patterns that predict individual values of one column based on other columns.
We can choose the column whose value has to be predicted and also choose the columns that have to be used during classification.

Here we can choose to use a temporary model or a persistent model during the analysis. It must be noted that unchecking “Use temporary model” means selecting to use persistent model that causes all current temporary models created by SQL Server Analysis Services to be destroyed.
The “Drill Through” option specifies whether the training data developed during analysis must be saved and associated with the data patterns formed from the data set.
Here we can see how a decision tree structure has been built using the table data which can help us deduce patterns in the data. It utilizes the Microsoft Decision Tree Algorithm.
- Estimate:
This tool creates an estimation model using the data available in the table.
The patterns created while developing the estimate model is used to predict continuous values on one column based on the values of other column. Using these results, we can understand how value of one column is being affected by the values of other column and the factors contributing to the change in its values.
This tool also uses the Microsoft Decision Tree Algorithm.

- Cluster:
This tool helps us create clustering model which can be used to detect which group of columns share similar characteristics. This tool can be used to mine data of all kinds.

Here we can see various clusters connected to each other. The base color specifies the percentage value of the selected .Here we have chosen Education in “Shading variable”. Hence Cluster two is dark as it has a 52% of customers as bachelors while cluster 6 has just 1% bachelors and hence white.
- Associate:
This tools creates Association Rules based model that uses data from the excel table. This model analyzes the data to detect items that appear together in transaction and is most suitable for giving recommendations to buy other related products based on the products they have brought and is mostly used in online shopping and market basket analysis.
It employs the Microsoft Association Algorithm and finds patterns (associations) between different items of the data set. The data provided to the Associate must have its Identifier attribute (ID) sorted and the associate must be informed which I the ID column and the columns containing he items for transaction.

This figure shows % association between the various items in the Item Sets.

This Dependency network shows which item is dependent on which other item/items. For example the customers who bought Bikes also bought Fenders (see above figure for association percentages)
- Forecast:
This tool is used to create a Forecasting model that uses data available in the excel worksheet to analyze and predicts the future values with continuous data. It requires us to specify the column that provides the time stamp if any (like dates, time).
It employs Microsoft Time Series Algorithm to form predictive patterns and is usually used in product sales analysis, stock market analysis, company growth etc
Here we can see that the Forecast tool has plot a graph that predicts the sales in 11th month of 2004 by mining sales data of the previous months
- Advanced:
This has two options:
- Create mining model:
This option allows us to create new mining models using the data available in excel tables and it gives us an option to choose the mining algorithm to be used for the creation of the new mining model.
The various algorithms available to be used for data mining are
- Microsoft Decision trees
- Microsoft Clustering
- Microsoft Linear Regression
- Microsoft Logistic Regression
- Microsoft Naïve Bayes.
- Microsoft Neural Networks
- Microsoft sequence Clustering.
- Microsoft Time Sequence.

We can select any algorithm to create the mining model.
- Add model to structure: This tool is used to add an already developed mining model to a structure and create a new mining model for that structure.

We can use any of the previously developed models for adding to the structure.
This concludes the Data modeling part.
Accuracy and Validation:
In this part, we can find tools that can be used to test and validate our mining models. It is important that we know how well the mining models developed by us work with real world data, and by checking their accuracy we can validate the mining models.
- Accuracy:
This tool helps us to apply previously developed mining model on a set of real world data so that we can see how well it performs. The mining model chosen must be quite similar to the data to be tested.
Here we apply a previously designed mining model to a new set of data and check how close the predicted values are to the values in the ideal model to know if the model is accurate enough or it still needs to be trained to give better results.

The above accuracy result shows comparison between the ideal and predicted value.
- Classification Matrix:
This tools applies the previously mined model to a new data set and compares the values predicted by the mining model and compares it with the actual model and then gives a result in terms of percentage how correctly the model predicted and also how much wrong values it predicted.

In this above Classification Matrix, we can see that the mining model when applied to the new data set predicted about 69.20% of the values correctely.If attained values are less than the expected accuracy values, and then we must train the mining model better.
- Profit Chart:
This tools when used to apply a previously developed mining model to a new set of data plots a graph of the expected profit that would occur if the mining model was used i.e. the increase of conversion of potential customers who have been approached who become customers. The graph generally raises high until a point but then begins to fall as the number of customers starts increasing.

Here we can see that the profit would first increase i.e. If only 1-15% of the customers that are predicted by the mining model are approached; chance that they respond is high. But this profit begins to reduce as the number of customers begins to increase.
Query:
The Query Model tool lets you use the existing mining models to make predictions using the data in an Excel table. This process of applying new data to an existing model to predict trends is called a prediction query.
Management:
This opens a wizard that helps us manage the previously developed models by doing required changes to the mining models.
- Renaming a mining model or structure
- Deleting a mining model or structure
- Clearing a mining model or structure
- Processing a mining structure, using either new or existing data
- Exporting or import a mining model or structure.

The Various options in the Mining Models management tool.
Connection:

Default Host:
This is the configuration wizard that we use to connect the Data Mining Add-in in Excel to the SQL Server analysis Services.
Trace:
Clicking this open a tracer window with logs all the data that has been sent to the server for analysis.
Help:

The tool opens a Compiled HTML Help file that contains description of all the features of the Add-in and details of how to use the tools.
Conclusion:
The Microsoft SQL Server Data mining add-in for Microsoft Excel provides users with an easy to use interface that is capable of performing complex data mining tasks with ease. The add-in can be extremely useful for both, people who just want to get more out of their data and also for those interested in serious data mining.