• Text mining approaches can be classified from different perspectives, chiefly by the inputs the text mining system takes and by the data mining tasks to be performed.
• In general, the major approaches, based on the kinds of data they take as input, are:
(1) the keyword-based approach, where the input is a set of keywords or terms in the documents,
(2) the tagging approach, where the input is a set of tags, and
(3) the information-extraction approach, which inputs semantic information, such as events, facts, or entities uncovered by information extraction.
• A simple keyword-based approach may only discover relationships at a relatively shallow level, such as the rediscovery of compound nouns (e.g., “database” and “systems”) or co-occurring patterns of little significance (e.g., “terrorist” and “explosion”).
• It may therefore contribute little to a deep understanding of the text.
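To make the shallowness of the keyword-based approach concrete, the following is a minimal sketch of keyword co-occurrence counting. The function name `cooccurrence_counts`, the whitespace tokenization, and the sample documents are assumptions made for illustration; they are not part of any particular text mining system.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, keywords):
    """Count, for each pair of keywords, how many documents contain both."""
    keyword_set = {k.lower() for k in keywords}
    counts = Counter()
    for text in documents:
        # Crude tokenization (an assumption): split on whitespace,
        # strip surrounding punctuation, and lowercase.
        tokens = {t.strip('.,;:!?()"').lower() for t in text.split()}
        present = sorted(tokens & keyword_set)
        # Every unordered pair of keywords present in this document co-occurs once.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Illustrative documents, invented for this sketch.
docs = [
    "Database systems store structured data.",
    "Modern database systems support transactions.",
    "The explosion was attributed to a terrorist group.",
]
pairs = cooccurrence_counts(docs, ["database", "systems", "terrorist", "explosion"])
for pair, n in pairs.most_common():
    print(pair, n)
```

The counts only say that two terms appear in the same documents. They carry no information about why the terms appear together, which is the sense in which the relationships found are shallow.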
• The tagging approach may rely on tags obtained by manual tagging (which is costly and infeasible for large collections of documents) or by an automated categorization algorithm (which may handle only a relatively small set of tags and requires the categories to be defined beforehand).
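The constraint that categories must be defined beforehand can be seen in a very simple automated tagger. The sketch below uses a hand-written term list per category; the category names, the term lists, and the function `tag_document` are hypothetical, and a real categorization algorithm would more likely be a trained classifier.

```python
# Hypothetical, predefined categories and their indicator terms.
CATEGORY_TERMS = {
    "databases": {"database", "query", "transaction"},
    "security": {"attack", "explosion", "terrorist"},
}


def tag_document(text, category_terms=CATEGORY_TERMS, min_hits=1):
    """Return the sorted list of predefined tags whose terms appear in the text."""
    # Crude tokenization (an assumption): whitespace split, punctuation stripped.
    tokens = {t.strip('.,;:!?()"').lower() for t in text.split()}
    return sorted(
        category
        for category, terms in category_terms.items()
        if len(tokens & terms) >= min_hits
    )


print(tag_document("The database rejected the query."))
print(tag_document("Nothing relevant here."))
```

A document about a topic outside `CATEGORY_TERMS` receives no tag at all, which illustrates why the tag set has to be designed in advance and tends to stay small.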
• The information-extraction approach is more advanced and may lead to the discovery of some deep knowledge, but it requires semantic analysis of text by natural language understanding and machine learning methods. This is a challenging knowledge discovery task.
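As a rough illustration of what "semantic information" looks like as input, the sketch below pulls one kind of fact (an acquisition event with two entities) out of text using a regular expression. This is a deliberately crude stand-in: it swaps pattern matching in for the natural language understanding and machine learning methods mentioned above, and the pattern, the function `extract_acquisitions`, and the example sentence are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a capitalized name, the verb "acquired", a capitalized name.
_NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"
ACQUISITION = re.compile(rf"(?P<buyer>{_NAME}) acquired (?P<target>{_NAME})")


def extract_acquisitions(text):
    """Return structured acquisition events found by the pattern above."""
    return [
        {"event": "acquisition", "buyer": m.group("buyer"), "target": m.group("target")}
        for m in ACQUISITION.finditer(text)
    ]


facts = extract_acquisitions("Acme Corp acquired Widget Labs in 2020.")
print(facts)
```

The output is a list of structured records rather than a bag of keywords, so later mining steps can reason about who did what to whom. The pattern breaks on passive voice, paraphrase, or a capitalized word preceding the buyer, which hints at why robust extraction is a challenging task.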
• Various text mining tasks can be performed on the extracted keywords, tags, or semantic information. These include document clustering, classification, information extraction, association analysis, and trend analysis.