What is text mining? Explain the different approaches to text mining.


written 8.0 years ago by

modified 8.0 years ago by

ramnath • 100

Fig: Text Mining

Text mining is the process of synthesizing information by analyzing relations, patterns, and rules in textual data, whether semi-structured or unstructured.
It comprises text summarization, text categorization, and text clustering.
Text summarization automatically extracts a part of a text that reflects its content as a whole.
Text categorization assigns a text to one of a set of categories predefined by users.
Text clustering segments texts into several clusters according to how closely their content is related.

Text Mining Approaches:

1. Keyword-based Association Analysis:
- Collect sets of keywords or terms that frequently occur together, then find the association or correlation relationships among them.
- First, preprocess the text data by parsing, stemming, removing stop words, etc.
- Then invoke association mining algorithms:
  - Consider each document as a transaction.
  - View the set of keywords in the document as the set of items in the transaction.
- Term-level association mining:
  - No human effort is needed to tag documents.
  - The number of meaningless results and the execution time are greatly reduced.
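
The document-as-transaction idea can be sketched in a few lines of Python. This is only a minimal illustration, not a production miner: the stop-word list, the `naive_stem` helper, the thresholds, and the sample documents are all assumptions made for the example, and only keyword pairs are considered rather than larger keyword sets.

```python
from collections import Counter
from itertools import combinations

# Illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for", "on"}

def naive_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    if word.endswith("ing") and len(word) >= 7:
        return word[:-3]
    if word.endswith("s") and not word.endswith(("ss", "is", "us")) and len(word) >= 4:
        return word[:-1]
    return word

def to_transaction(document):
    """Preprocess one document into a set of keyword items (a transaction)."""
    tokens = (t.strip(".,;:!?()").lower() for t in document.split())
    return {naive_stem(t) for t in tokens if t and t not in STOP_WORDS}

def keyword_associations(documents, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between keyword pairs."""
    transactions = [to_transaction(d) for d in documents]
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for items in transactions:
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of documents containing both keywords
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

docs = [
    "Mining association rules in text databases",
    "Text mining and association analysis",
    "Clustering of text documents",
    "Association rules for market baskets",
]

for lhs, rhs, support, confidence in keyword_associations(docs):
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

Raising `min_support` or `min_confidence` prunes more candidate rules, which is one simple way to cut down on meaningless results.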
2. Document Classification Analysis:

Automatic document classification:
- Automatically classifies the tremendous number of online text documents (Web pages, emails, etc.).
- Text document classification differs from the classification of relational data because document databases are not structured according to attribute-value pairs.

Association-Based Document Classification:

Extract keywords and terms using information retrieval and simple association analysis techniques.
Obtain concept hierarchies of keywords and terms using available term classes, such as WordNet, or expert knowledge.
Classify the documents in the training set into class hierarchies.
Apply a term association mining method to discover sets of associated terms.
Use the terms to maximally distinguish one class of documents from the others.
Derive a set of association rules for each document class.
Order the classification rules by their occurrence frequency and discriminative power.
Use the rules to classify new documents.
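
The last few steps above (deriving rules per class, ordering them, and classifying new documents) might look roughly like the following Python sketch. It is a simplification under stated assumptions: keywords are taken as already extracted, the concept-hierarchy step is skipped, rule antecedents are limited to one or two terms, and confidence is used as a stand-in for discriminative power. The function names, thresholds, and training data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def mine_class_rules(training, min_support=2, min_confidence=0.8):
    """training: list of (set_of_terms, class_label) pairs.

    Returns rules (termset, label, confidence, hits), ordered by
    confidence first and occurrence frequency second.
    """
    termset_counts = Counter()
    termset_class_counts = defaultdict(Counter)
    for terms, label in training:
        ordered = sorted(terms)
        for size in (1, 2):  # single terms and term pairs only
            for termset in combinations(ordered, size):
                termset_counts[termset] += 1
                termset_class_counts[termset][label] += 1
    rules = []
    for termset, by_class in termset_class_counts.items():
        for label, hits in by_class.items():
            confidence = hits / termset_counts[termset]
            if hits >= min_support and confidence >= min_confidence:
                rules.append((frozenset(termset), label, confidence, hits))
    rules.sort(key=lambda r: (-r[2], -r[3], sorted(r[0]), r[1]))
    return rules

def classify(terms, rules, default=None):
    """Label a new document with the first (highest-ranked) matching rule."""
    for termset, label, _, _ in rules:
        if termset <= terms:
            return label
    return default

training = [
    ({"goal", "match", "team"}, "sports"),
    ({"team", "score", "match"}, "sports"),
    ({"election", "vote", "party"}, "politics"),
    ({"party", "vote", "team"}, "politics"),
]
rules = mine_class_rules(training)
print(classify({"vote", "budget"}, rules))
print(classify({"match", "referee"}, rules))
```

In this toy data the term `team` appears in both classes, so it does not reach the confidence threshold and yields no rule on its own; that is the sense in which the surviving rules distinguish one class from the others.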

3. Document Clustering Analysis:

Automatically group related documents based on their contents.
Requires no training sets or predetermined taxonomies; a taxonomy is generated at runtime.
Major steps:
Preprocessing: remove stop words, stem, and extract features.
Hierarchical clustering: compute similarities and apply clustering algorithms.
Slicing: fan-out controls; flatten the tree to a configurable number of levels.
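
As a rough illustration of these steps, the Python sketch below builds term-frequency vectors, compares them with cosine similarity, and merges the most similar clusters bottom-up. It makes several assumptions: stemming is omitted, average linkage is chosen arbitrarily, and the slicing step is approximated by simply stopping at a configurable number of clusters instead of flattening a full tree. All names and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for", "on"}

def term_vector(document):
    """Preprocessing: tokenize, drop stop words, count term frequencies."""
    tokens = (t.strip(".,;:!?()").lower() for t in document.split())
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def cluster_documents(documents, num_clusters):
    """Agglomerative clustering; returns clusters as lists of document indices."""
    vectors = [term_vector(d) for d in documents]
    clusters = [[i] for i in range(len(documents))]

    def average_similarity(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    # Merge the two most similar clusters until the target count is reached.
    while len(clusters) > max(num_clusters, 1):
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        x, y = max(pairs, key=lambda p: average_similarity(clusters[p[0]],
                                                           clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

docs = [
    "stock market prices fall",
    "market prices rise on stock news",
    "football team wins match",
    "team loses football match",
]
print(cluster_documents(docs, 2))
```

No labels or training set are supplied here; the grouping comes only from the similarity between document contents.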
