What is text mining? Explain the different approaches to text mining.
1 Answer


Fig: Text Mining

  • Text mining is the process of synthesizing information by analyzing relations, patterns, and rules in textual data, whether semi-structured or unstructured.
  • It includes text summarization, text categorization, and text clustering.
  • Text summarization automatically extracts a portion of a text that reflects its whole content.
  • Text categorization assigns a text to one of the categories predefined by users.
  • Text clustering segments texts into several clusters according to how closely related their content is.
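As a rough illustration of the summarization idea above, the sketch below picks the sentences whose words are most frequent in the whole text. The `summarize` function, the tiny `STOP_WORDS` set, and the sample text are all assumptions made for this example; this is one simple frequency-based heuristic, not a reference method, and it tends to favor longer sentences.

```python
from collections import Counter

# Hypothetical toy stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    def words(sentence):
        return [w.strip(",;:!?").lower() for w in sentence.split()]

    # Word frequencies over the whole text, ignoring stop words.
    freq = Counter(
        w for s in sentences for w in words(s) if w and w not in STOP_WORDS
    )

    def score(sentence):
        # Sum the text-wide frequency of each distinct word in the sentence.
        return sum(freq[w] for w in set(words(sentence)))

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]

text = (
    "Text mining finds patterns in text. "
    "The weather is nice. "
    "Mining needs preprocessing."
)
summary = summarize(text, num_sentences=1)
```

Here `summary` holds the first sentence, because "text" and "mining" are the most frequent non-stop words in the sample.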


Text mining draws on techniques from the following fields:

  • Data mining
  • Machine learning
  • Information retrieval
  • Statistics
  • Natural-language understanding
  • Case-based reasoning

Text Mining Approaches:

  1. Keyword-Based Association Analysis:

    • Collect sets of keywords or terms that frequently occur together, then find the association or correlation relationships among them.
    • First, preprocess the text data by parsing, stemming, removing stop words, etc.
    • Then invoke association mining algorithms:
      - Consider each document as a transaction.
      - View the set of keywords in the document as the set of items in the transaction.
    • Term-level association mining:
      - No human effort is needed to tag documents.
      - The number of meaningless results and the execution time are greatly reduced.
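
The keyword-based steps above can be sketched as follows: each document becomes a transaction of keywords, and keyword pairs that co-occur in enough documents are kept. The function names, the toy `STOP_WORDS` set, and the sample documents are assumptions for illustration; stemming is omitted, and a real system would likely use a full frequent-itemset algorithm such as Apriori rather than counting pairs only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}

def keywords(document):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?").lower() for tok in document.split())
    return {tok for tok in tokens if tok and tok not in STOP_WORDS}

def frequent_pairs(documents, min_support):
    """Count keyword pairs, treating each document as one transaction."""
    counts = Counter()
    for doc in documents:
        counts.update(combinations(sorted(keywords(doc)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

def confidence(documents, antecedent, consequent):
    """Fraction of documents containing `antecedent` that also contain `consequent`."""
    sets = [keywords(doc) for doc in documents]
    with_antecedent = [s for s in sets if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)

docs = [
    "Mining of text data",
    "Text mining and data mining",
    "Clustering of text",
]
pairs = frequent_pairs(docs, min_support=2)
```

With these three sample documents, `pairs` contains ("data", "mining"), ("data", "text"), and ("mining", "text"), each with support 2, and `confidence(docs, "mining", "text")` is 1.0 because every document mentioning "mining" also mentions "text".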
  2. Document Classification Analysis:

    Automatic document classification:

    • Automatically classify the tremendous number of online text documents (Web pages, emails, etc.).
    • Text document classification differs from the classification of relational data because document databases are not structured according to attribute-value pairs.

Association-Based Document Classification:

  • Extract keywords and terms by information retrieval and simple association analysis techniques.
  • Obtain concept hierarchies of keywords and terms using available term classes, such as WordNet, or expert knowledge.
  • Classify documents in the training set into class hierarchies.
  • Apply a term association mining method to discover sets of associated terms.
  • Use the terms that maximally distinguish one class of documents from the others.
  • Derive a set of association rules associated with each document class.
  • Order the classification rules based on their occurrence frequency and discriminative power.
  • Use the rules to classify new documents.
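
A minimal sketch of the rule-based part of this procedure is shown below. It mines term pairs from a labeled training set, turns each sufficiently frequent pair into a rule, orders the rules by confidence (standing in for discriminative power) and then by frequency, and classifies a new document with the first matching rule. The names `mine_rules` and `classify`, the thresholds, and the training data are assumptions for illustration; concept hierarchies (for example from WordNet) are left out entirely.

```python
from collections import Counter, defaultdict
from itertools import combinations

def terms(text):
    """Lowercased, punctuation-stripped set of terms in a document."""
    return {t.strip(".,;:!?").lower() for t in text.split()} - {""}

def mine_rules(training, min_support=2):
    """Return rules as (term_pair, label, confidence, support), best first."""
    pair_total = Counter()
    pair_by_class = defaultdict(Counter)
    for text, label in training:
        for pair in combinations(sorted(terms(text)), 2):
            pair_total[pair] += 1
            pair_by_class[label][pair] += 1
    rules = []
    for label, counts in pair_by_class.items():
        for pair, n in counts.items():
            if n >= min_support:
                # Confidence: share of documents with this pair that carry this label.
                rules.append((pair, label, n / pair_total[pair], n))
    # Highest confidence first, then highest support, then alphabetical.
    rules.sort(key=lambda r: (-r[2], -r[3], r[0]))
    return rules

def classify(text, rules, default=None):
    """Label a new document using the first rule whose terms all appear in it."""
    present = terms(text)
    for pair, label, _confidence, _support in rules:
        if set(pair) <= present:
            return label
    return default

training = [
    ("stock market prices fall", "finance"),
    ("stock prices rise", "finance"),
    ("football match result", "sport"),
    ("football match postponed", "sport"),
]
rules = mine_rules(training)
label = classify("stock prices steady", rules)
```

On this toy training set two rules survive, ("football", "match") for "sport" and ("prices", "stock") for "finance", so `label` is "finance". A document matching no rule gets the `default` value.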

3. Document Clustering Analysis:

  • Automatically group related documents based on their contents.
  • Requires no training sets or predetermined taxonomies; a taxonomy is generated at runtime.
  • Major steps:
    - Preprocessing: remove stop words, stem, and extract features.
    - Hierarchical clustering: compute similarities and apply clustering algorithms.
    - Slicing: apply fan-out controls and flatten the tree to a configurable number of levels.
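
The clustering steps above can be sketched with a small agglomerative procedure. The example below assumes Jaccard similarity over term sets and single-link merging, and it approximates "slicing" by stopping the merges once a requested number of clusters remains, rather than building and flattening a full tree. The function names and sample documents are hypothetical, and stop-word removal and stemming are omitted for brevity.

```python
def terms(text):
    """Lowercased, punctuation-stripped set of terms in a document."""
    return {t.strip(".,;:!?").lower() for t in text.split()} - {""}

def jaccard(a, b):
    """Similarity of two term sets: shared terms over all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(documents, num_clusters):
    """Single-link agglomerative clustering; returns lists of document indices."""
    term_sets = [terms(doc) for doc in documents]
    clusters = [[i] for i in range(len(documents))]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(term_sets[i], term_sets[j]) for i in c1 for j in c2)

    # Merge the two most similar clusters until the requested count is reached.
    while len(clusters) > max(num_clusters, 1):
        a, b = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: link(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

docs = [
    "stock market prices",
    "stock prices rise",
    "football match result",
    "football match postponed",
]
groups = cluster(docs, num_clusters=2)
```

For the four sample documents, `groups` is `[[0, 1], [2, 3]]`: the two finance texts end up together and the two football texts end up together, without any labels or predefined taxonomy.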