IT giants like Google, Facebook, and Twitter harvest their users’ data to feed their advanced AI algorithms. Their objectives may be purely commercial, but the convenience and personalised experience users gain in return have helped these businesses scale to enormous heights.
Whether a business wants to identify its target customer segments, increase its sales, enter a new market, or revamp its business strategy, data is vital. Naturally, extracting information and deriving valuable insights from data is an essential stage in any business decision. This is where data mining comes into the picture.
What Is Data Mining?
Data mining is the process of discovering anomalies, patterns, and correlations within large sets of raw data to extract useful information.
In today’s world, data is everywhere. It is constantly being generated from tasks as simple as a click on a website. With data mining, a business can:
- Boil down chunks of raw data into actionable insights,
- Find hidden patterns and trends,
- Create predictive models to forecast key events such as customer churn,
- Automate analytical systems and reduce reliance on manual analysis, and
- Expand into the automation and Artificial Intelligence (AI) industry.
Since the key purpose is to discover and unearth hidden knowledge, data mining is also called Knowledge Discovery in Databases (KDD).
Consider Instagram – this social media giant tracks its users’ online activities to customise their feeds with content similar to what they like, save, and interact with. Every interaction on Instagram adds entries to the huge databases the organisation maintains, fuelling its Artificial Intelligence algorithms to predict user behaviour. This widely used strategy is the essence of data mining.
How Does Data Mining Work?
Data mining is not simply model creation – it involves a sequence of steps from defining the problem, gathering and preprocessing data, building and evaluating automated models, to the deployment of knowledge.
There is an old saying in Computer Science, “Garbage in, garbage out” or ‘GIGO’. It means that nonsensical or flawed input data produces nonsensical output – ‘garbage’.
When a business mines data, it has to ensure that the data goes through a series of well-defined stages to generate meaningful and actionable results.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Cross-Industry Standard Process for Data Mining (CRISP-DM) is a standard data mining model that systematically defines the key steps in any data mining project. The model involves the following steps:
- Business Understanding: The first phase in a data mining project starts with the definition of the problem statement. Once objectives are determined, the project team then assesses the project’s potential risks, costs, and technologies required. Finally, a complete project plan is developed that details the operations at each phase.
- Data Understanding: The team collects raw data and assesses its quality (whether data is clean or not).
- Data Preparation: Data is prepared for the model. Raw data is often unclean or poorly formatted and may contain errors that lead to faulty insights. For example, empty or null values in database entries can cause errors and must be removed or filled in. In this phase, data is cleaned and preprocessed for the core model.
- Modelling: The project team develops the model best suited to the preprocessed data (a minimal sketch follows this list). Modelling depends on multiple criteria, including the business problem, the data being fed, the algorithms’ efficiency, system requirements, etc. At its core, a model relies on data mining techniques like Classification, Clustering, Regression, etc.
- Evaluation: Next, the team evaluates the project against the defined set of goals and ascertains whether it is production-ready.
- Deployment: Finally, the model is deployed and made accessible to the customers. According to the CRISP-DM guide, “Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise”.
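To make the Data Preparation, Modelling, and Evaluation phases more concrete, here is a minimal Python sketch using pandas and scikit-learn. The customers.csv file and its churn column are hypothetical placeholders for illustration, not part of the CRISP-DM guide itself.

```python
# A minimal sketch of the Data Preparation, Modelling and Evaluation phases,
# assuming a hypothetical "customers.csv" file with a binary "churn" column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Data Preparation: load raw data, drop null entries, encode categorical columns
df = pd.read_csv("customers.csv")                 # hypothetical dataset
df = df.dropna()                                  # remove empty/null entries
y = df["churn"]                                   # hypothetical target (1 = churned)
X = pd.get_dummies(df.drop(columns=["churn"]), drop_first=True)

# Modelling: train a classifier on one portion of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluation: check the model against the business goal before deployment
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Hold-out accuracy: {accuracy:.2f}")
```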
Data Mining Techniques
Data mining techniques vary with business problems and goals, and every business adopts different techniques to solve different problems.
Take Amazon, for example. Its powerful recommendation system collects customer data to recommend products based on their interests, past purchases, etc. Simultaneously, Amazon monitors its customers’ purchase history and return activity to detect fraudulent purchases. This shows that, depending on the business problem (or use case), the same data can serve different purposes.
To address the specific needs of a business, several data mining techniques have been developed:
- Association Rule mining finds associations and relationships among data items, expressed as simple if/then statements – for example, “if a customer buys a mobile phone, they are 60% likely to buy a phone cover”. Retailers frequently use this technique in Market Basket Analysis to see whether certain items tend to be purchased together.
- Classification differentiates data into predefined classes. This technique works on the principle of ‘learning from history’; that is, a classification model first learns from already classified data (training phase) and classifies an unknown sample into a class (validation/testing phase). For example, determining customer churn is a classification problem with two possible classes – Churn/Not Churn.
- Clustering divides a huge data set into different groups (or clusters) based on similarities within each cluster. It is the unsupervised counterpart of classification (a supervised technique); that is, unlike classification, clustering has no training phase and works directly on unlabelled samples. For example, when target customer segments are not predefined, they can be discovered using clustering (see the sketch after this list).
- Regression finds relationships between variables (i.e., columns in a database). For example, a company’s HR department can use regression to determine the probability of an employee’s attrition (scored between 0 and 1).
- Prediction forecasts future values from historical patterns and trends. Netflix’s recommendation system, which customises each user’s feed, is a prime example of data mining’s predictive application.
- Outlier Detection finds distortions, anomalies or outliers in data. Outlier detection is used in fraud detection, fault detection, etc.
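As an illustration of clustering for customer segmentation, here is a minimal sketch using scikit-learn’s KMeans. The customer table and its annual_spend and visits_per_month columns are invented purely for the example.

```python
# A minimal clustering sketch for customer segmentation, assuming a hypothetical
# table of customers with "annual_spend" and "visits_per_month" columns.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "annual_spend":     [200, 250, 3000, 3200, 150, 2800],   # illustrative values
    "visits_per_month": [1,   2,   12,   15,   1,   10],
})

# Scale features so neither column dominates the distance calculation
scaled = StandardScaler().fit_transform(customers)

# No training labels are needed: the algorithm groups similar customers directly
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)
print(customers)
```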
What Skills Are Required For Data Mining?
A data mining project calls for an array of soft and hard skills for successful development and deployment.
The technical skills, or hard skills, ensure that the tools and technologies are used correctly. They include the following:
- Programming Languages: Languages used for data analysis and statistical computing, such as Python, R, and SQL.
- Business Intelligence Software: Special-purpose software designed to generate insights from data, typically used for data visualisation and descriptive analytics (deriving initial hunches from data) – for example, Tableau, Power BI, and Zoho Analytics.
- Machine Learning and Statistics: This represents the heart of data mining. Machine Learning is a subfield of Artificial Intelligence that defines any data mining model’s core functionality, be it classification, clustering, etc. Traditional statistical methods are often used in conjunction with Machine Learning to derive early insights and create final reports.
- Software Engineering: This skill is used in project planning and various system analyses (for example, assessing whether the technology being used will become outdated).
- Big Data Processing Frameworks: When data is huge (also called Big Data), traditional data mining and analytics tools cannot process it efficiently. This is where businesses opt for Big Data processing frameworks such as Hadoop, Spark, and Storm.
- Database Management Systems (DBMS): These include relational and non-relational database systems for storing and retrieving datasets. Examples include SQL systems (MySQL, Oracle) and NoSQL systems (MongoDB, Firebase, Cassandra). A small example of querying a database from Python follows this list.
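The sketch below shows how SQL and Python skills often work together: aggregate data in the database, then analyse the result in a DataFrame. It uses an in-memory SQLite database with a hypothetical orders table purely for demonstration.

```python
# A small illustration of combining SQL and Python: query a database and load
# the result into pandas for further analysis. The "orders" table is hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 20.0), (1, 35.5), (2, 12.0), (3, 99.9)])

# Aggregate in SQL, then continue the analysis in pandas
df = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS total_spend FROM orders GROUP BY customer_id",
    conn,
)
print(df.sort_values("total_spend", ascending=False))
conn.close()
```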
Among the non-technical skills required to develop a successful data mining project, the following are the most significant:
- Domain Knowledge: This includes the industry knowledge and experience that make an individual fit to work on specific types of projects.
- Communication and Presentation: Creating final reports, presenting findings in clear and concise terms, and communicating results to the stakeholders are necessary for a successful project.
Data Mining Applications
Data mining is applied across multiple sectors, functions, and industries. Following are the most common data mining applications:
- Fraud Detection: Financial institutions and credit-card companies are sensitive to fraudulent transactions such as false insurance claims. With data mining, a business can identify hidden patterns to isolate and reject fraudulent activity (see the outlier-detection sketch after this list).
- Customer Segmentation: Companies use data mining to divide their target customer base into different segments (or clusters).
- Retail Industry: Retailers use Market Basket Analysis to find associations among the items their customers purchase.
- Healthcare: Drug trials and biomedical research in fields like genetics heavily use automated data mining systems.
- Intrusion Detection: Intrusion Detection Systems (IDS) monitor network traffic and flag suspicious activity by analysing network data.
- Banking Systems: Leading banks like JP Morgan use data mining for credit scoring, fraud detection, predicting payment defaults, etc.
- Other Applications: Data mining is used in many engineering branches for anomaly detection (detecting abnormalities). It also finds applications in Lie Detection, Criminal Investigation, Counter-Terrorism, etc.
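As a taste of how outlier detection supports applications like fraud detection, here is a minimal sketch using scikit-learn’s IsolationForest. The transaction amounts are invented for illustration, and real systems would use many more features than a single amount.

```python
# A minimal outlier-detection sketch in the spirit of fraud detection, using
# scikit-learn's IsolationForest on hypothetical transaction amounts.
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly ordinary transaction amounts, with a few extreme values mixed in
amounts = np.array([[25], [30], [22], [28], [5000], [27], [31], [4800], [26]])

detector = IsolationForest(contamination=0.2, random_state=42)
labels = detector.fit_predict(amounts)   # -1 marks an outlier, 1 marks normal

for amount, label in zip(amounts.ravel(), labels):
    flag = "suspicious" if label == -1 else "ok"
    print(f"{amount:>6} -> {flag}")
```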
Privacy Concerns
In 2018, Facebook came into the spotlight with a massive data scandal in which the personal information of millions of its users was exposed to a British consulting firm, Cambridge Analytica. The scandal called into question the ethical practices of not only Facebook but other AI-driven companies as well.
The Facebook-Cambridge Analytica scandal is a prime example of unethical data mining called data harvesting.
Data mining is a powerful technique but must be practised within ethical constraints. Hence, a business must ensure that its privacy policies are defined in all its stakeholders’ interests, including customers.
Go On, Tell Us What You Think!
Did we miss something? Come on! Tell us what you think about our article on data mining in the comments section.