A machine learning algorithm processes a large dataset to find the model that best fits the observed data, and can also draw on prior knowledge of the system when learning or forecasting. Data mining is an integral part of searching and understanding electronically stored data.
Graduate student Quan Sun’s win at the University of California San Diego student data mining competition highlights Weka. Sun says he used open source software to win the elite “hard” category for graduate students. In fact, says Sun, at least half the competitors used the software, called Weka, which he describes as the “Microsoft Word of data mining”.
Waikato Environment for Knowledge Analysis (WEKA), developed at the University of Waikato, New Zealand, is a collection of machine learning algorithms, together with data preprocessing tools that prepare inputs for these algorithms. The tool is written in Java and runs on both Windows and Linux. It is open source software available under the terms of the GNU General Public License. Inputs to the machine learning algorithms implemented in WEKA take the form of relational tables in Attribute-Relation File Format (ARFF).
Some of the key features of WEKA
- WEKA has built-in format converters that convert a dataset from other formats, such as a spreadsheet, to ARFF. It also includes filters to delete specified attributes from the dataset.
- WEKA trains and tests learning algorithms that perform classification and regression, and also allows users to create their own classifiers interactively.
- WEKA lets the user work with clusters and their instances. In addition, it provides access to several methods for attribute selection, which can be run either on the full dataset or under cross-validation.
Working with WEKA is simple, mainly because of its GUI Explorer. Classification is not cumbersome: the user selects the attributes to be related and the algorithm to be used. The results include a matrix of the correctly classified and the misclassified data. The classification error, with its mean and standard deviation, is also displayed. WEKA helps realize the goal of data mining by predicting missing values and validating that the predicted values are correct. It also lets users develop and analyze new machine learning algorithms.
What can Weka do?
Preprocess – Weka has file format converters for spreadsheets, C4.5 file formats and serialized instances. It can also open a URL and use HTTP to download an ARFF file from the Web, or open a database using JDBC and retrieve instances using SQL. It also provides a list of filters to delete specified attributes from a dataset.
Direct Hit!
Applies To: Researchers in Data Mining and Artificial Intelligence
USP: Applying machine learning algorithms for data mining
Primary Link: www.cs.waikato.ac.nz/ml/weka
Search Engine Keywords: Machine Learning, Data Mining
Classify – Weka trains and tests learning schemes that perform classification or regression. The classifiers can be divided into Bayesian, trees, rules, functions and lazy. It also builds a linear regression model and allows the user to build their own classifiers interactively. It also provides options for a number of meta learners.
Cluster – Weka shows the clusters and the number of instances in the cluster. Thereafter it determines the majority class in each cluster and gives the confusion matrix.
Associate – Weka contains three algorithms for determining association rules: Apriori, Predictive Apriori and filtered associators. It has no methods for evaluating such rules.
Attribute Selection – Weka gives access to several methods for attribute selection, which involves an attribute evaluator and a search method. Attribute selection can be performed using the full training set or cross-validation.
Figure: In the Preprocess tab, you can view the attributes in the input file, the properties of the selected attribute, and a visualisation of the class distribution for each attribute.
Figure: Building a Naïve Bayes classifier with 10-fold cross-validation. The correctly classified instances can be viewed by right-clicking on the classifier in the Results window.
Visualization – It displays a matrix of two-dimensional scatter plots of each pair of attributes.
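The Cluster item above describes finding the majority class in each cluster and tabulating a confusion matrix. The following is a minimal sketch of that idea in plain Python, not Weka's own code; the function name, the cluster assignments and the class labels are all made up for illustration, and Weka's exact procedure may differ.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Map each cluster to its majority class and build a confusion matrix.

    Returns (majority, matrix) where majority[cluster] is the most common
    class in that cluster and matrix[cluster][cls] counts instances.
    """
    matrix = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        matrix[cluster][label] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in matrix.items()}
    return majority, matrix


# Hypothetical cluster assignments and true classes for six instances.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["young", "young", "old", "old", "old", "young"]
majority, matrix = classes_to_clusters(clusters, labels)

# Instances whose class differs from their cluster's majority class.
errors = sum(
    count
    for cluster, counts in matrix.items()
    for cls, count in counts.items()
    if cls != majority[cluster]
)
print(majority, errors)
```

Each row of `matrix` is one cluster, so reading across a row shows how the true classes are spread within that cluster.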
Preparing input
A major share of the effort in data mining and machine learning goes into preparing the input. To analyze data using Weka, you need to prepare it in the Attribute-Relation File Format (ARFF) and then load it in the Explorer. Spreadsheets, Comma Separated Value (CSV) files and databases can be converted to ARFF. An ARFF file uses the @relation, @attribute and @data tags to represent the dataset name, the attribute information and the values respectively.
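To make the three ARFF tags concrete, here is a small sketch that turns CSV text into ARFF text using only the Python standard library. It is an illustration rather than Weka's own converter: the function name `csv_to_arff`, the column names and the sample values are assumptions, and it does not handle missing values or quote values that contain spaces or commas.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row + data rows) into ARFF text.

    Simplification: a column is declared numeric if every value parses
    as a float; otherwise it becomes a nominal attribute whose allowed
    values are the distinct strings seen in that column.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            nominal = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Hypothetical word-frequency data in the spirit of the age example.
sample_csv = """
lol,mortgage,age_group
12.5,0.0,teens
0.4,7.1,thirties
9.8,0.2,teens
"""
arff_text = csv_to_arff(sample_csv, "age")
print(arff_text)
```

The printed text starts with the @relation line, lists one @attribute line per column, and ends with the comma-separated rows under @data.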
Classifying data
Weka is best used through its graphical user interface, called ‘Explorer’, rather than through the command-line interface. The other two interfaces are the ‘Knowledge Flow Interface,’ which supports design configuration for streamed data processing, and the ‘Experimenter,’ which helps users compare a variety of learning techniques. In this example, we use an ARFF file named age.arff. Its attributes are a few selected words, and its @data section contains the number of occurrences of each word per 10,000 words in a dataset of blogs written by bloggers belonging to various age groups.
1. Open the file you want to analyze using the Open file option in the Preprocess tab of the Weka Explorer, i.e. open the age data file, age.arff.
2. Once the input file has been opened, all attributes in the input file are shown in the Attributes window. Properties of the selected attribute, such as attribute name, attribute type and number of missing values, are displayed in the ‘Selected Attribute’ window. Here, you can select the attributes that you want to include in the working relation, e.g. for age prediction.
3. Select the classifier algorithm in the Classify tab. In this example, we selected Naïve Bayes with 10-fold cross-validation. Next, click on Start. The result is displayed in the Classifier Output window, as shown in the figure.
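The 10-fold cross-validation chosen in step 3 can be sketched in a few lines. This is an illustration of the general technique, not Weka's implementation: to stay short it evaluates a simple majority-class baseline instead of Naïve Bayes, and the function names and the labels are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k=10, seed=0):
    """Estimate accuracy of a majority-class baseline with k-fold CV.

    Each fold is held out once for testing while the remaining folds
    are used for "training" (here: finding the most common label).
    """
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for held_out in folds:
        test = set(held_out)
        train_labels = [labels[i] for i in range(len(labels)) if i not in test]
        prediction = Counter(train_labels).most_common(1)[0][0]
        correct += sum(1 for i in held_out if labels[i] == prediction)
    return correct / len(labels)


# Hypothetical labels: 14 "teens" and 6 "thirties" instances.
labels = ["teens"] * 14 + ["thirties"] * 6
print(cross_validate(labels, k=10))
```

Every instance is tested exactly once, by a model that never saw it during training, which is what makes the resulting accuracy an estimate of performance on unseen data.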
Figure: Output of the Naïve Bayes classifier on the Age dataset, in terms of errors, accuracy by class and confusion matrix.
Figure: View of an ARFF dataset, which consists of a list of instances, with the attribute values for each instance separated by commas.
Analyzing the result
The result first shows a summary of the dataset, followed by the algorithm used to analyze it. It then gives the predictive performance of the machine learning algorithm applied to the dataset. After that, the confusion matrix shows the number of instances classified correctly and the number misclassified. The classification error is reported as the mean absolute error and the root mean squared error of the class probability estimates.
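The quantities described above can be computed by hand on a toy example. The sketch below assumes one common convention, in which each class probability estimate is compared with a 0/1 indicator of the true class and the errors are averaged over all instances and classes; this has not been checked against Weka's exact computation, and the class names and probabilities are invented.

```python
import math


def evaluate(true_labels, prob_estimates, classes):
    """Return accuracy, confusion matrix, MAE and RMSE.

    prob_estimates[i] maps each class to its predicted probability for
    instance i. Errors are measured against a 0/1 indicator of the true
    class and averaged over all instances and classes.
    """
    confusion = {actual: {pred: 0 for pred in classes} for actual in classes}
    abs_sum = sq_sum = 0.0
    for actual, probs in zip(true_labels, prob_estimates):
        predicted = max(classes, key=lambda c: probs[c])
        confusion[actual][predicted] += 1
        for c in classes:
            diff = probs[c] - (1.0 if c == actual else 0.0)
            abs_sum += abs(diff)
            sq_sum += diff * diff
    n_terms = len(true_labels) * len(classes)
    correct = sum(confusion[c][c] for c in classes)
    return {
        "accuracy": correct / len(true_labels),
        "confusion": confusion,
        "mae": abs_sum / n_terms,
        "rmse": math.sqrt(sq_sum / n_terms),
    }


classes = ["teens", "thirties"]
truth = ["teens", "teens", "thirties", "thirties"]
probs = [
    {"teens": 0.9, "thirties": 0.1},
    {"teens": 0.4, "thirties": 0.6},  # misclassified as "thirties"
    {"teens": 0.2, "thirties": 0.8},
    {"teens": 0.3, "thirties": 0.7},
]
result = evaluate(truth, probs, classes)
print(result)
```

In the confusion matrix, rows are the actual classes and columns the predicted ones, so the diagonal holds the correctly classified instances.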
Processing huge datasets
If the dataset is very large, running to a few thousand attributes and a few hundred thousand records, Weka may run into an ‘OutOfMemory’ exception. Most Java virtual machines allocate a maximum amount of memory to Java programs that is much less than the available RAM. However, we can extend the memory available to the virtual machine by setting appropriate options, such as the JVM's -Xmx maximum heap size option. Alternatively, Weka offers several filters for re-sampling a dataset and generating a new dataset of reduced size. Besides, some schemes can be trained incrementally rather than only in batch mode, unlike most classifiers, which require all the data before they can be trained. With such a classifier, the dataset is loaded incrementally and fed to the classifier instance by instance.
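Incremental training, as described above, means the model only ever holds running statistics, never the whole dataset. The following is a minimal sketch of that idea for a Naïve Bayes model over nominal attributes. It is not Weka's implementation; the class name, the generator standing in for a large file, and the data are all hypothetical.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # value_counts[(attribute index, value, class)] -> count
        self.value_counts = defaultdict(int)
        self.values_seen = defaultdict(set)

    def update(self, attributes, label):
        """Fold a single instance into the counts; nothing else is stored."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        """Return the class with the highest Laplace-smoothed log score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for i, value in enumerate(attributes):
                numerator = self.value_counts[(i, value, label)] + 1
                denominator = count + len(self.values_seen[i])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def stream_instances():
    """Stand-in for reading a large file row by row (hypothetical data)."""
    yield ("high", "no"), "teens"
    yield ("high", "no"), "teens"
    yield ("low", "yes"), "thirties"
    yield ("low", "yes"), "thirties"
    yield ("high", "yes"), "teens"


model = IncrementalNaiveBayes()
for attributes, label in stream_instances():
    model.update(attributes, label)
print(model.predict(("low", "yes")))
```

Because `update` touches only a handful of counters, memory use depends on the number of distinct attribute values and classes, not on the number of instances streamed through.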
For large-scale data processing, Apache Mahout, an open source, Apache-licensed collection of scalable machine learning libraries, is worth a look.
Read more at Computerworld.