Data mining

Data mining (DM), also known as Knowledge Discovery in Databases or Knowledge Discovery and Data Mining (both abbreviated KDD), is the process of automatically searching large volumes of data for patterns. Data mining is a fairly recent topic in computer science, but it applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.

Definition
Data mining can be defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases". Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts. It is usually associated with a business or other organization's need to identify trends.

Data mining involves analysing data to reveal patterns or relationships, sorting through large amounts of data, and picking out relevant pieces of information or recurring patterns, for example extracting statistical information from a data set.

A simple example of data mining is its use in a retail sales department. If a store tracks the purchases of a customer and notices that the customer buys a lot of silk shirts, the data mining system will make a correlation between that customer and silk shirts. The sales department will look at that information and may begin direct mail marketing of silk shirts to that customer, or it may instead try to get the customer to buy a wider range of products. In this case, the data mining system used by the retail store discovered information about the customer that was previously unknown to the company.

Another widely used (though hypothetical) example is that of a very large North American chain of supermarkets. Through intensive analysis of the transactions and the goods bought over a period of time, analysts found that beer and diapers were often bought together. Though explaining this relationship might be difficult, taking advantage of it should not be hard (e.g. by placing the high-profit diapers next to the high-profit beers). This technique is often referred to as Market Basket Analysis.
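The beer-and-diapers idea can be illustrated with a minimal sketch of pairwise market basket analysis. The baskets below are invented for illustration, and the function name `pair_rules`, the `min_support` threshold and the choice of support, confidence and lift as measures are assumptions of this sketch rather than anything taken from a particular retailer or product.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets containing both
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]        # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)   # vs. P(rhs) overall
            rules.append((lhs, rhs, support, confidence, lift))
    # Strongest associations (highest lift) first.
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Hypothetical transactions, invented for this example.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "chips"],
    ["milk", "diapers"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence, lift in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than their individual frequencies would suggest; a lift near 1 means the pairing is no more common than chance.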

In statistical analyses in which there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, in which the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is searched heuristically rather than exhaustively. With the advent of parallel computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all-subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of plant data.
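As a rough illustration of exhaustive regression, the sketch below fits an ordinary least squares model for every one of the 2^k subsets of k = 3 candidate predictors and ranks them. The data are invented, the helper names are hypothetical, and ranking by residual variance (RSS divided by the residual degrees of freedom) is only one assumed choice of criterion; real analyses often use other criteria.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)               # normal equations (X'X) beta = X'y
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def all_subsets(predictors, y):
    """Fit all 2**k subsets; return (score, subset) pairs, best (lowest) first."""
    k, n = len(predictors), len(y)
    results = []
    for size in range(k + 1):
        for subset in combinations(range(k), size):
            residual = rss([predictors[j] for j in subset], y)
            score = residual / (n - size - 1)   # assumed criterion, see lead-in
            results.append((score, subset))
    return sorted(results)


# Invented data: y depends on x0 only; x1 and x2 are irrelevant.
x0 = [float(v) for v in range(1, 11)]
x1 = [1.0, -1.0] * 5
x2 = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, -0.05, 0.1]
y = [3.0 + 2.0 * a + e for a, e in zip(x0, noise)]

ranked = all_subsets([x0, x1, x2], y)
for score, subset in ranked:
    print(subset, round(score, 4))
```

The number of fits doubles with each added predictor, which is why the exhaustive search is only practical for modest k.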

Generally, data mining (also called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for a long time. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while reducing the cost.

For example, one mythical Midwest grocery chain used the data mining capacity of Oracle software to analyze local purchasing patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. They could also make sure that beer and diapers were sold at full price on Thursdays.

Data dredging
Used in the technical context of data warehousing and analysis, the term "data mining" is neutral. However, it sometimes has a more pejorative usage that implies imposing patterns (and particularly causal relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlation is more properly criticized as "data dredging" in the statistical literature. Another term for this misuse of statistics is data fishing.

Used in this latter sense, data dredging implies scanning the data for any relationships and then, when one is found, coming up with an interesting explanation for it. The problem is that large data sets invariably contain some striking relationships that are peculiar to that data and arise purely by chance. Any conclusions reached this way are therefore likely to be highly suspect. Even so, some exploratory work is always required in any applied statistical analysis to get a feel for the data, so the line between good statistical practice and data dredging is sometimes less than clear.
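A small simulation can make the danger concrete. The sketch below is an illustration under assumed settings (20 observations, 200 candidate columns, a fixed seed, all chosen arbitrarily): it generates an outcome and many columns of pure noise, then reports the strongest correlation found. None of the columns has any real relationship to the outcome, yet the best one usually looks impressive.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def strongest_chance_correlation(n_rows=20, n_columns=200, seed=1):
    """Largest |r| between a random outcome and many unrelated random columns."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]  # pure noise
        best = max(best, abs(pearson(column, outcome)))
    return best


best = strongest_chance_correlation()
print(f"strongest correlation found among unrelated columns: {best:.2f}")
```

Searching more columns, or using fewer observations, makes the strongest chance correlation larger still, which is why a relationship found by scanning should be checked on data that played no part in finding it.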

One common approach to evaluating the fitness of a model generated via data mining techniques is called cross-validation. Cross-validation is a technique that produces an estimate of generalization error based on resampling. In simple terms, the general idea behind cross-validation is that dividing the data into two or more separate subsets allows one subset to be used to evaluate the generalizability of the model learned from the other subset(s). A data subset used to build a model is called a training set; the subset used for evaluation is called the test set. Common cross-validation techniques include the holdout method, k-fold cross-validation, and the leave-one-out method.
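The k-fold variant can be sketched in a few lines. In this illustrative example the model is a simple one-variable least squares line and the error measure is mean squared error; both are assumptions made to keep the sketch short, and the function names and data are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cross_validate(xs, ys, k=5, seed=0):
    """Average test-set mean squared error over k train/test splits."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Build the model on the training set only...
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        # ...and measure its error on the held-out test set.
        mse = sum((ys[i] - (slope * xs[i] + intercept)) ** 2
                  for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Invented data: a straight line plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

print("estimated generalization error (MSE):", cross_validate(xs, ys, k=5))
```

Setting k equal to the number of observations gives the leave-one-out method, while a single split into one training set and one test set corresponds to the holdout method.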

Another pitfall of using data mining is that it may lead to discovering correlations that exist due to chance rather than due to an underlying relationship. "There have always been a considerable number of people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it." However, when properly done, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has also been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.

Most data mining efforts are focused on developing highly detailed models of some large data set. Other researchers have described an alternate method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data.

Privacy concerns
There are also privacy concerns associated with data mining - specifically regarding the source of the data analyzed. For example, if an employer has access to medical records, they may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut costs for insurance, but it creates ethical and legal problems.

Data mining government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns.

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs exhibiting harmful interactions. Since any particular combination may occur in only 1 out of 1000 people, a great deal of data would need to be examined to discover such an interaction. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.
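The scale involved can be estimated with some rough arithmetic. The sketch below assumes that patients are independent and that the 1-in-1000 figure is the share of people taking a given combination; the target of 100 exposed patients and the 95% confidence level are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def records_for_expected_cases(one_in, cases):
    """Records needed so the expected number of people on the combination is `cases`."""
    return cases * one_in


def records_for_at_least_one(one_in, confidence=0.95):
    """Smallest n with P(at least one person on the combination) >= confidence."""
    p = 1.0 / one_in
    # P(none in n records) = (1 - p)**n, so solve (1 - p)**n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


needed_for_100 = records_for_expected_cases(one_in=1000, cases=100)
needed_for_one = records_for_at_least_one(one_in=1000, confidence=0.95)
print("records for about 100 exposed patients:", needed_for_100)
print("records for a 95% chance of seeing at least one:", needed_for_one)
```

Because only a fraction of the exposed patients would show any reaction at all, the number of records needed to detect an interaction reliably is larger again.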

Essentially, data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.

Combinatorial game data mining

Since the early 1990s, oracles (also called tablebases) have become available for certain combinatorial games: for example 3x3 chess from any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes and hex. This has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. The pattern recognition involved is at too high a level of abstraction for known statistical pattern recognition algorithms, or any other algorithmic approach, to be applied; at least, no one knows how to do it yet (as of January 2005). The method used instead is the scientific method in full: extensive experimentation with the tablebases, combined with intensive study of tablebase answers to well-designed problems and with knowledge of prior art (that is, pre-tablebase knowledge), leading to flashes of insight. Berlekamp (in dots-and-boxes and other games) and John Nunn (in chess endgames) are notable examples of people doing this work, though they were not and are not involved in tablebase generation.

Notable uses of data mining

 * Data mining has been cited as the method by which the U.S. Army unit Able Danger supposedly identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al Qaeda cell operating in the U.S. more than a year before the attack.
 * See also: Able Danger, U.S. Army intelligence had detected 9/11 terrorists year before, says officer.
 * As with economic models that have successfully predicted 10 of the last 3 recessions, one would need to know which other names came up on the "possible members" list before being confident that this was not an exercise in data dredging.
 * It has been suggested that both the CIA and its Canadian counterpart, CSIS, have also put this method of interpreting data to work, although they have not said how.

Structured Data Mining

 * Database mining
 * Relational data mining
 * Database
 * Document warehouse
 * Data warehouse
 * Graph mining
 * Molecule mining
 * Sequence mining
 * Data stream mining
 * Tree mining
 * Text mining
 * Web mining

Supervised learning

 * Artificial neural network
 * Decision tree
 * Linear discriminant analysis
 * Logit (in reference to logistic regression)
 * Naive Bayes
 * Nearest neighbor (pattern recognition)
 * Neural network
 * Quadratic classifier
 * Random forest
 * Support Vector Machine

Unsupervised learning

 * Apriori algorithm
 * Data clustering
 * Self-organizing map

Dimensionality reduction

 * Feature selection
 * Information gain
 * Feature extraction
 * Principal components analysis

Application areas

 * Business intelligence
 * Business performance management
 * Discovery Science
 * Loyalty card
 * Cheminformatics
 * Quantitative structure-activity relationship
 * Bioinformatics

Software

 * YALE is a free tool for machine learning and data mining
 * R programming language - R is a statistical environment and programming language that is well suited to machine learning and data mining
 * Microsoft Analysis Services - Microsoft SQL Server 2005 contains a full suite of data mining algorithms and tools integrated with the database, OLAP, Reporting, ETL pipeline, and the development environment.
 * Weka - Open source data mining software written in Java
 * Neural network software
 * Java Data Mining
 * Teradata Warehouse Miner - Teradata provides data mining tools for data exploration, data preprocessing, analytic modelling, scoring and deployment within a database.

General references

 * Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7
 * Kurt Thearling, An Introduction to Data Mining
 * Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, Wiley Interscience, ISBN 0-471-05669-3
 * Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (2000), ISBN 1-55860-552-5
 * Yike Guo and Robert Grossman, editors: High Performance Data Mining: Scaling Algorithms, Applications and Systems, Kluwer Academic Publishers, 1999.

Software (external)

 * IlliMine - Open source data mining project written in C++
 * Tanagra - Open source data mining and statistical software
 * Orange - Open source Python toolkit for data mining and machine learning
 * MDR - Open source Java software for detecting attribute interactions using the multifactor dimensionality reduction (MDR) method
 * Pimiento - A Java-based application framework for text mining
 * KDB2000 - A free C++ tool which integrates database access, data preprocessing, transformation techniques and a full range of data mining algorithms
 * LingPipe - Java text data mining API distributed with source
 * InfoCodex - Data mining application with a linguistic database
