What are the major issues in Data Mining?



The major issues in data mining fall into four groups:

Mining methodology and user interaction:

• Mining different kinds of knowledge in databases.

 ▪ Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks.

 ▪ Examples include data characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

• Interactive mining of knowledge at multiple levels of abstraction.

 ▪ Interactive mining allows users to focus the search for patterns, providing and refining data mining requests based on returned results.

 ▪ Knowledge should be mined by drilling down, rolling up, and pivoting through the data space and knowledge space interactively.
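Rolling up, drilling down, and pivoting can be pictured as re-aggregating the same records at different levels of detail. The following is only a minimal sketch: the `SALES` records, the `FIELDS` names, and the `aggregate` helper are hypothetical and exist just to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sales records: (country, city, quarter, amount).
SALES = [
    ("India", "Mumbai", "Q1", 120),
    ("India", "Mumbai", "Q2", 90),
    ("India", "Pune", "Q1", 60),
    ("Japan", "Tokyo", "Q1", 200),
    ("Japan", "Osaka", "Q2", 80),
]

# Names of the dimensions, in the same order as the record fields.
FIELDS = ("country", "city", "quarter")


def aggregate(records, group_by):
    """Sum the amount column, grouped by the named dimensions."""
    positions = [FIELDS.index(name) for name in group_by]
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[i] for i in positions)
        totals[key] += record[3]
    return dict(totals)


# Roll up: view totals at the coarse "country" level.
by_country = aggregate(SALES, ["country"])

# Drill down: refine the same view to the "city" level.
by_city = aggregate(SALES, ["country", "city"])

# Pivot: look at the same data along a different dimension.
by_quarter = aggregate(SALES, ["quarter"])
```

An interactive system would let the user switch between such views on request instead of fixing them in advance.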

• Incorporation of background knowledge.

 ▪ Background knowledge may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms at different levels of abstraction.

 ▪ Domain knowledge can help focus and speed up a data mining process, or judge the interestingness of discovered patterns.

• Data mining query languages and ad-hoc data mining.

 ▪ Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval.

 ▪ High-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks.

 ▪ Such a language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.

• Presentation and visualization of data mining results.

 ▪ Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms so that it can be easily understood and directly used by humans.

 ▪ The system must adopt expressive knowledge representation techniques, such as trees, tables, rules, graphs, charts, crosstabs, matrices, or curves.

• Handling noise and incomplete data.

 ▪ The data stored in a database may contain noise, exceptional cases, or incomplete data objects.

 ▪ These objects may confuse the mining process. As a result, the accuracy of the discovered patterns can be poor.

 ▪ Data cleaning methods and data analysis methods that can handle noise are required.
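As a rough illustration of data cleaning, the sketch below fills missing values with the median and then drops values that lie far from it. The `clean` function, the `None` convention for missing values, and the threshold `k` are assumptions made for this example; real systems choose cleaning rules per dataset.

```python
from statistics import median


def clean(values, k=3.0):
    """Fill missing values (None) with the median, then drop values
    more than k median absolute deviations away from the median."""
    present = [v for v in values if v is not None]
    med = median(present)
    # Handle incomplete data: replace each missing value with the median.
    filled = [med if v is None else v for v in values]
    # Handle noise: measure the typical spread around the median.
    mad = median([abs(v - med) for v in filled])
    if mad == 0:
        return filled
    return [v for v in filled if abs(v - med) <= k * mad]


# Hypothetical sensor readings with one missing value and one outlier.
readings = [10, 12, None, 11, 500, 9]
cleaned = clean(readings)  # [10, 12, 11, 11, 9]
```

Dropping outliers is only one option. When exceptional cases are themselves the target (outlier analysis), they should be flagged and kept instead.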

• Pattern evaluation: the interestingness problem.

 ▪ A data mining system can uncover thousands of patterns. Many of them may be uninteresting to a given user, either because they represent common knowledge or because they lack novelty.
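One common way to cut down the number of reported patterns is to keep only association rules that pass objective thresholds such as minimum support and minimum confidence. The sketch below assumes a tiny hypothetical `TRANSACTIONS` list and only considers rules between single items. It does not measure novelty, which depends on what the user already knows.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def interesting_rules(transactions, min_support, min_confidence):
    """Return rules lhs -> rhs between single items that meet both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        both = support({a, b}, transactions)
        for lhs, rhs in ((a, b), (b, a)):
            lhs_support = support({lhs}, transactions)
            if lhs_support == 0:
                continue
            # Confidence: how often rhs appears when lhs appears.
            confidence = both / lhs_support
            if both >= min_support and confidence >= min_confidence:
                rules.append((lhs, rhs, both, confidence))
    return rules


rules = interesting_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.9)
# Only butter -> bread survives: support 0.5, confidence 1.0.
```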

Performance and scalability:

• Efficiency and scalability of data mining algorithms.

 ▪ To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.

 ▪ In other words, the running time of a data mining algorithm must be predictable and acceptable in large databases.

• Parallel, distributed and incremental mining methods.

 ▪ The huge size of many databases, the wide distribution of data, the high cost of some data mining processes and the computational complexity of some data mining methods are factors motivating the development of parallel and distributed data mining algorithms.

 ▪ Such algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged.

 ▪ Incremental mining methods incorporate database updates without having to mine the entire data again “from scratch.”
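The partition, process, and merge idea, together with incremental updating, can be sketched with simple item counting. The function names here (`count_items`, `split`, `mine_parallel`, `apply_update`) are hypothetical. A thread pool is used only to keep the example short; for CPU-bound work in CPython a process pool or a distributed framework would be the more realistic choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count how many transactions in one partition contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def split(transactions, n_parts):
    """Divide the data into n_parts roughly equal partitions."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def mine_parallel(transactions, n_parts=2):
    """Process the partitions concurrently, then merge the partial results."""
    parts = split(transactions, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together.
    return total


def apply_update(total, new_transactions):
    """Incremental step: fold new data into the existing counts
    instead of recounting the whole database."""
    total.update(count_items(new_transactions))
    return total
```

Counting merges cleanly because partial counts simply add up. Many mining tasks need a more careful merge step than this.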

Issues relating to the diversity of data types:

• Handling relational and complex types of data.

 ▪ It is unrealistic to expect one system to mine all kinds of data, given the diversity of data types and different goals of data mining.

 ▪ Specific data mining systems should be constructed for mining specific kinds of data.

 ▪ Therefore, one may expect to have different data mining systems for different kinds of data.

• Mining information from heterogeneous databases and global information systems (WWW).

 ▪ Data mining may help disclose high-level data regularities in multiple heterogeneous databases that are unlikely to be discovered by simple query systems and may improve information exchange and interoperability in heterogeneous databases.

 ▪ Web mining uncovers interesting knowledge about Web contents, Web structures, Web usage, and Web dynamics.

Issues related to applications and social impacts:

• Application of discovered knowledge

 ▪ Domain-specific data mining tools.

 ▪ Intelligent query answering.

 ▪ Process control and decision making.

• Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem.

• Protection of data security, integrity, and privacy.
