Short note on: Stemming


written 7.2 years ago by

teamques10 ★ 70k

An uncontrolled vocabulary has another limitation: a given word may occur in several different forms. For example, process, processes, processing, processor, processors, processed, and other words are all formed from one basic word. It is not acceptable for a relevant word to be missed just because one particular form, say ‘processor’, has a lower frequency count in a document than the other forms. Moreover, these words are usually closely related and deal with a common concept.

Stemming is a process in which the word ending is stripped and the word is reduced to a common core, or stem. All forms of the word are thus reduced to one stem, which gives a higher frequency count and increases the significance of the term. For example, the stem of the word ‘processor’ could be ‘process’. When a query is processed, stemming ensures that the result is not affected by how often each individual form of a word occurs.
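The effect on frequency counts can be sketched in a few lines of Python. This is only an illustration: the STEMS table is a hand-written, hypothetical lookup standing in for a real stemming algorithm, and the sample document is made up.

```python
from collections import Counter

# Hypothetical hand-written stem table, used only to illustrate the idea.
# A real stemmer would compute the stem instead of looking it up.
STEMS = {
    "process": "process",
    "processes": "process",
    "processing": "process",
    "processor": "process",
    "processors": "process",
    "processed": "process",
}


def term_counts(tokens, stem=False):
    """Count terms, optionally mapping each token to its stem first."""
    if stem:
        tokens = [STEMS.get(token, token) for token in tokens]
    return Counter(tokens)


# Made-up sample document.
doc = "processor processes processing processed data processor".split()

raw = term_counts(doc)                 # each word form is counted separately
stemmed = term_counts(doc, stem=True)  # all forms are counted under one stem

print(raw["processor"], stemmed["process"])
```

Without stemming, the count for the concept is split across four word forms; with stemming, all five occurrences are credited to the single stem.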

The stemming process is realised with the help of stemming algorithms, many of which have been developed over the years. They work iteratively: for example, the word ‘processing’s’ is first stripped to ‘processing’ and then to ‘process’, so the word is finally reduced to a stem. This iterative approach needs multiple passes. In each pass the algorithm looks for the longest matching suffix first, and it does not need any prior knowledge of all the possible forms of the word. The efficiency of an algorithm depends on how it is coded and on how it identifies and strips suffixes.
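A minimal sketch of iterative, longest-suffix-first stripping is given below. It is not any particular published algorithm: the SUFFIXES list, the min_stem threshold and the rule that protects a final "ss" are all assumptions chosen so that the example from the text works.

```python
# Assumed toy suffix list; real stemmers use much larger rule sets.
SUFFIXES = ["'s", "ing", "ors", "or", "es", "ed", "s"]


def strip_once(word, min_stem=3):
    """Strip the longest matching suffix once, or return the word unchanged."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Assumed guard: do not turn "process" into "proces".
        if suffix == "s" and word.endswith("ss"):
            continue
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def iterative_stem(word):
    """Apply strip_once repeatedly and return every intermediate form."""
    passes = [word]
    while True:
        shorter = strip_once(passes[-1])
        if shorter == passes[-1]:
            return passes  # nothing more to strip: last entry is the stem
        passes.append(shorter)


print(iterative_stem("processing's"))
```

Each pass removes one suffix, so the returned list shows the word shrinking from "processing's" to "processing" to "process". The stripper itself needs no list of word forms, only the suffix list.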

Many words also carry prefixes, which are possible candidates for stripping. However, most stemming algorithms do not strip prefixes. Stripping prefixes is not easy, because it is difficult to decide whether a given combination of letters in a word is a prefix or not. For example, im appears at the start of impossible, impractical and important. In the first two words im is a prefix, whereas in the third it is part of the word itself. Stripping a prefix also changes the meaning and context of the word: impossible would become possible. Prefix stripping is therefore generally avoided, since it may or may not be beneficial.

Stemming has two problems. The first is the stripping of a suffix due to misinterpretation. For example, ing can be stripped from processing, but not from ring. One solution is to set a minimum acceptable stem length and to keep a list of exceptional words that are excluded from stripping. Rules can always be defined for special cases, such as a rule that a word cannot be stripped if it is in the exception list. The second problem is that the stem itself changes across the various forms of some words, for example life and lives. This problem can also be solved by defining special rules for such words, and the stemming algorithm should handle these changes as well.
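Both safeguards can be sketched together. In the code below the stem-length threshold is interpreted as a minimum length for what remains after stripping, which is an assumption, and the EXCEPTIONS and IRREGULAR tables are small hypothetical examples rather than complete lists.

```python
# Hypothetical exception list: words that end in "ing" but must not be stripped.
EXCEPTIONS = {"ring", "king", "thing"}

# Hypothetical special rules for words whose stem changes between forms.
IRREGULAR = {"lives": "life", "knives": "knife"}

# Assumed threshold: what remains after stripping must have at least 3 letters.
MIN_STEM_LENGTH = 3


def safe_stem(word):
    """Strip a trailing "ing" unless a special rule or safeguard forbids it."""
    if word in IRREGULAR:
        return IRREGULAR[word]  # changing stem, e.g. lives -> life
    if word in EXCEPTIONS:
        return word  # listed exception, never stripped
    if word.endswith("ing") and len(word) - 3 >= MIN_STEM_LENGTH:
        return word[:-3]
    return word  # remaining stem would be too short, or no suffix found


for w in ["processing", "ring", "sing", "lives"]:
    print(w, "->", safe_stem(w))
```

Note that "sing" is protected by the length threshold even though it is not in the exception list, whereas a longer word such as "thing" needs an explicit entry.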

Stemming a large document database is time-consuming, and it changes the original document representation only slightly, so searching for stems based on frequency counts is not much affected. Stemming only the queries and then matching them against the documents with a wildcard could be a better solution. Thus, if the word processing occurs in a query, the stem could be proce, and the search then looks for matches of proce*, where the asterisk is a wildcard. The stemming algorithms need to be somewhat different in such cases.
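Query-side stemming with a wildcard can be sketched as follows, using Python's standard fnmatch module for the pattern match. The function names and the tiny document collection are hypothetical, and the documents are stored unstemmed on purpose.

```python
from fnmatch import fnmatchcase


def query_pattern(stem):
    """Turn a query stem into a wildcard pattern, e.g. proce -> proce*."""
    return stem + "*"


def match_documents(stem, documents):
    """Return the ids of documents containing a word that matches stem*."""
    pattern = query_pattern(stem)
    return [
        doc_id
        for doc_id, words in documents.items()
        if any(fnmatchcase(word, pattern) for word in words)
    ]


# Made-up, unstemmed document collection.
documents = {
    "d1": ["processor", "design"],
    "d2": ["procedure", "manual"],
    "d3": ["network", "routing"],
}

print(match_documents("proce", documents))
print(match_documents("process", documents))
```

With the short stem proce the pattern also matches "procedure" in d2, while the longer stem process matches only d1. This is one plausible reason why a stemmer used for wildcard queries has to choose its stems differently from one used for indexing.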
