Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process.
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection (2) Pre-processing (3) Transformation (4) Data Mining (5) Interpretation/Evaluation.
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining
  - summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
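The stages above can be sketched as a chain of small functions. This is only a toy illustration: the records, field names, the 0.5 threshold, and every function name are invented for the example, and the "mining" step is a trivial summarization rather than a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records; all values are made up for illustration.
raw = [
    {"id": 1, "age": 25, "income": 30000, "city": "A"},
    {"id": 2, "age": None, "income": 42000, "city": "B"},
    {"id": 3, "age": 47, "income": 88000, "city": "A"},
    {"id": 4, "age": 52, "income": 91000, "city": "B"},
    {"id": 5, "age": 23, "income": 28000, "city": "A"},
]

def select(records, fields):
    """Selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Pre-processing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, fields):
    """Transformation: min-max scale each field to [0, 1].
    Assumes every field takes at least two distinct values."""
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}
    return [{f: (r[f] - lo[f]) / (hi[f] - lo[f]) for f in fields} for r in records]

def mine(records):
    """Data mining (summarization only): split on an arbitrary income threshold."""
    return {
        "low_income": [r for r in records if r["income"] < 0.5],
        "high_income": [r for r in records if r["income"] >= 0.5],
    }

def evaluate(groups):
    """Interpretation/evaluation: mean scaled age per non-empty group."""
    return {name: round(mean(r["age"] for r in g), 2) for name, g in groups.items() if g}

fields = ["age", "income"]
cleaned = clean(select(raw, fields))
scaled = transform(cleaned, fields)
summary = evaluate(mine(scaled))
print(summary)
```

Each function corresponds to one of the five KDD stages, so a stage can be swapped out (for example, replacing `mine` with a clustering step) without touching the others.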
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm. It clusters based on density (a local cluster criterion), such as density-connected points. Major features:
- Handle noise
- One scan
- Discover clusters of arbitrary shape
- Need density parameters as termination condition
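Density-reachable: A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_(i+1) is directly density-reachable from p_i. (A point p_(i+1) is directly density-reachable from p_i if it lies in the Eps-neighborhood of p_i and that neighborhood contains at least MinPts points.)
Density-connected: A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.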
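The two definitions can be checked directly on a small point set. The following is a minimal sketch, not an efficient implementation; it assumes Euclidean distance, counts a point as part of its own Eps-neighborhood, and treats a point as reachable from itself. All function names are hypothetical.

```python
from collections import deque
from math import dist

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def is_core(points, i, eps, min_pts):
    """A core point has at least min_pts points in its eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def reachable_set(points, q, eps, min_pts):
    """All indices density-reachable from q: breadth-first search that
    only expands through core points (each hop is 'directly reachable')."""
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        if not is_core(points, cur, eps, min_pts):
            continue  # non-core points cannot extend the chain
        for j in eps_neighborhood(points, cur, eps):
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def density_reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q."""
    return p in reachable_set(points, q, eps, min_pts)

def density_connected(points, p, q, eps, min_pts):
    """True if some point o density-reaches both p and q."""
    return any(
        {p, q} <= reachable_set(points, o, eps, min_pts)
        for o in range(len(points))
    )

# Four points on a line plus one far-away point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(density_reachable(pts, 3, 1, eps=1.0, min_pts=3))  # from core point 1 to border point 3
print(density_reachable(pts, 1, 0, eps=1.0, min_pts=3))  # from border point 0 to core point 1
print(density_connected(pts, 0, 3, eps=1.0, min_pts=3))  # two border points of the same cluster
```

The example is chosen to show that density-reachability is not symmetric (a border point can be reached from a core point but not the other way around), while density-connectivity is symmetric.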
Input: D: a data set containing n objects; ε: the radius parameter (Eps above); MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
mark all objects as unvisited;
do
    randomly select an unvisited object p;
    mark p as visited;
    if the ε-neighborhood of p has at least MinPts objects
        create a new cluster C, and add p to C;
        let N be the set of objects in the ε-neighborhood of p;
        for each point p' in N
            if p' is unvisited
                mark p' as visited;
                if the ε-neighborhood of p' has at least MinPts points, add those points to N;
            if p' is not yet a member of any cluster, add p' to C;
        end for
        output C;
    else mark p as noise;
until no object is unvisited;
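The pseudocode above translates fairly directly into Python. The sketch below assumes Euclidean distance, a brute-force neighborhood search (no spatial index), and that a point counts toward its own ε-neighborhood. The names `dbscan`, `region_query`, `NOISE`, and the `seed` parameter are choices made for this example.

```python
import random
from math import dist

NOISE = -1  # label used for noise points

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force, O(n))."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts, seed=0):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    visited = [False] * n
    labels = [None] * n          # None = not yet assigned to anything
    order = list(range(n))
    random.Random(seed).shuffle(order)   # "randomly select an unvisited object p"
    cluster_id = 0
    for p in order:
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE    # may later be absorbed as a border point
            continue
        labels[p] = cluster_id   # create a new cluster C and add p to it
        candidates = list(neighbors)     # the set N; it grows during the loop
        k = 0
        while k < len(candidates):
            q = candidates[k]
            k += 1
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    candidates.extend(q_neighbors)   # q is a core point: expand
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster_id   # q is not yet a member of any cluster
        cluster_id += 1
    return labels

# Two tight squares of points and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)
```

Cluster ids depend on the visiting order, so only the grouping is meaningful: the first four points share one label, the next four share another, and the isolated point is labeled as noise. With the brute-force `region_query` the running time is O(n²); an index structure for region queries would reduce that.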
Advantages
- DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
- DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
- DBSCAN has a notion of noise, and is robust to outliers.
- DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
- DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R*-tree.
Disadvantages
1. DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data is processed.
2. The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance.
3. DBSCAN cannot cluster data sets well with large differences in densities, since the MinPts-ε combination cannot then be chosen appropriately for all clusters.
4. If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
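The order dependence of border points can be reproduced on a tiny one-dimensional data set. The sketch below uses a compact DBSCAN variant that visits points in a caller-supplied order instead of a random one; the function name `dbscan_in_order`, the data, and the parameter values are all made up for this illustration, and a point is assumed to count toward its own neighborhood.

```python
from math import dist

NOISE = -1

def dbscan_in_order(points, eps, min_pts, order):
    """DBSCAN that picks seed points in the given index order."""
    def region_query(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)    # None = not yet visited
    cluster_id = 0
    for p in order:
        if labels[p] is not None:
            continue
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = NOISE
            continue
        labels[p] = cluster_id
        queue = list(neighbors)
        k = 0
        while k < len(queue):
            q = queue[k]
            k += 1
            if labels[q] == NOISE:
                labels[q] = cluster_id   # former noise becomes a border point
            if labels[q] is not None:
                continue                 # already assigned: first cluster wins
            labels[q] = cluster_id
            q_neighbors = region_query(q)
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)
        cluster_id += 1
    return labels

# x = 0.0 and x = 4.0 are the only core points; x = 2.0 lies within eps of both.
line = [(-1.0,), (-0.5,), (0.0,), (2.0,), (4.0,), (4.5,), (5.0,)]
BORDER, LEFT_CORE, RIGHT_CORE = 3, 2, 4

forward = dbscan_in_order(line, eps=2.0, min_pts=4, order=range(len(line)))
backward = dbscan_in_order(line, eps=2.0, min_pts=4, order=reversed(range(len(line))))

print(forward[BORDER] == forward[LEFT_CORE])     # border point joins the left cluster
print(backward[BORDER] == backward[RIGHT_CORE])  # border point joins the right cluster
```

Core points and noise points get the same assignment under any processing order; only border points that are shared between clusters can change sides.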