Differentiate between the Boolean-based and probabilistic matching processes.



In Boolean retrieval systems, queries are built from set operations and logical connectives. The response to a query for any given document is either true or false (yes or no): the response is positive when the document's terms satisfy the query's logical expression. In this view a query is not a point in the document space; it is a logical function of the given words. Because there is no structural similarity between a document and a query, the query is treated as a separate entity. As a result, retrieval with respect to a given query is regarded as a characteristic function defined on the document space.

A Boolean query applied to any document is either satisfied by the document or it is not, so there is no room for graded similarity measures. Hence, the characteristic function of the query (the mapping from each document to true or false that the query defines) divides the document space into two disjoint sets: the documents that satisfy the query and those that do not.
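The two-set split described above can be sketched in a few lines of Python. This is only an illustration: the toy documents, the nested-tuple query format and the function names are assumptions made for the example, not part of any particular retrieval system.

```python
# Toy collection; the document ids and texts are made up for illustration.
docs = {
    "d1": "boolean retrieval uses set operations",
    "d2": "probabilistic retrieval ranks documents",
    "d3": "boolean queries and probabilistic ranking",
}

def terms(text):
    """Return the set of lowercase terms in a text (whitespace tokenization)."""
    return set(text.lower().split())

def characteristic(query, doc_terms):
    """Evaluate a Boolean query on one document; the result is True or False.

    A query is a bare term string or a nested tuple (hypothetical format):
    ("AND", q1, q2, ...), ("OR", q1, q2, ...) or ("NOT", q).
    """
    if isinstance(query, str):
        return query in doc_terms
    op = query[0]
    if op == "AND":
        return all(characteristic(q, doc_terms) for q in query[1:])
    if op == "OR":
        return any(characteristic(q, doc_terms) for q in query[1:])
    if op == "NOT":
        return not characteristic(query[1], doc_terms)
    raise ValueError(f"unknown operator: {op}")

def partition(query, collection):
    """Split the collection into documents that satisfy the query and those that do not."""
    satisfied, rejected = [], []
    for doc_id, text in collection.items():
        if characteristic(query, terms(text)):
            satisfied.append(doc_id)
        else:
            rejected.append(doc_id)
    return satisfied, rejected

query = ("AND", "boolean", ("NOT", "probabilistic"))
print(partition(query, docs))  # (['d1'], ['d2', 'd3'])
```

Every document lands in exactly one of the two lists, and nothing in the result says how well a document matches.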

A Boolean query system can be modified to grade the retrieved set. Consider the query P OR Q OR R. Any document that contains at least one of these terms is retrieved, and the documents can then be graded by the number of query terms they contain: a document containing all three terms is graded higher than documents containing only one or two. However, the Boolean model records only the presence or absence of a term, not its frequency, so the retrieved set cannot be ordered by a similarity measure that depends on term frequency.
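A minimal sketch of this grading for an OR query follows. The function name and the toy documents are hypothetical; the point is that each query term counts once per document, however often it occurs.

```python
def grade_or_query(query_terms, collection):
    """Grade documents for the query t1 OR t2 OR ... by the number of distinct
    query terms each document contains. Documents with no matching term are
    not retrieved at all."""
    graded = []
    for doc_id, text in collection.items():
        doc_terms = set(text.lower().split())
        matched = sum(1 for term in query_terms if term in doc_terms)
        if matched > 0:
            graded.append((doc_id, matched))
    # Highest grade first; ties keep their original order (the sort is stable).
    graded.sort(key=lambda pair: pair[1], reverse=True)
    return graded

# Made-up documents: d2 repeats "p" four times but still matches only one term.
docs = {
    "d1": "p q r r r",
    "d2": "p p p p",
    "d3": "q r",
    "d4": "s t",
}
print(grade_or_query(["p", "q", "r"], docs))  # [('d1', 3), ('d3', 2), ('d2', 1)]
```

Document d2 is graded lowest even though it contains the most occurrences of a query term, which is the limitation noted above: term frequency plays no part.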

Both Boolean and vector-based matching are rigid. In Boolean matching, a document either meets the logical condition or it does not; in vector-based matching, a document is either above or below the similarity threshold that decides whether it belongs to the retrieved set. The drawback of both models is that neither addresses the uncertainty in text retrieval directly. Retrieval models such as the probabilistic model and the fuzzy model are more promising because they try to represent this uncertainty more directly.
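The threshold behaviour of vector-based matching can be illustrated as follows. This sketch assumes raw term-frequency vectors and cosine similarity, which is one common choice rather than the only one; the documents, query and threshold values are made up.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, collection, threshold):
    """Return the ids of documents whose similarity to the query reaches the threshold."""
    return [doc_id for doc_id, text in collection.items()
            if cosine(query, text) >= threshold]

docs = {
    "d1": "taj mahal agra",
    "d2": "taj hotel",
    "d3": "red fort delhi",
}
print(retrieve("taj mahal", docs, threshold=0.6))  # ['d1']
print(retrieve("taj mahal", docs, threshold=0.4))  # ['d1', 'd2']
```

Similarity is a continuous number, but set membership is still a hard cut: moving the threshold changes which documents are in the set without expressing any uncertainty about relevance.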

Probabilistic matching works on the principle that, for a given document and query, one can calculate the probability that the document is relevant to the query. Understanding it requires some basic probability theory. The basic assumptions are that a single query is considered and that the number of documents relevant to that query is known. All probabilities are therefore taken in the context of that query, and a document selected at random from the database has a certain probability of being relevant to it. If a database contains N documents, of which n are relevant, then the probability that a randomly selected document is relevant to the query is estimated as

P(rel) = n / N

Similarly, the probability that a document is not relevant to the query is given by

P(not rel) = (N - n) / N = 1 - n / N
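As a small worked example of these two estimates: with N documents of which n are relevant, the probability of relevance is n / N and the probability of non-relevance is (N - n) / N. The figures below (1000 documents, 25 relevant) are invented purely for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def relevance_prior(total_docs, relevant_docs):
    """Return (P(relevant), P(not relevant)) for a randomly selected document,
    given N = total_docs and n = relevant_docs."""
    if total_docs <= 0 or not 0 <= relevant_docs <= total_docs:
        raise ValueError("need N > 0 and 0 <= n <= N")
    p_rel = Fraction(relevant_docs, total_docs)
    return p_rel, 1 - p_rel

# Assumed numbers: N = 1000 documents, n = 25 relevant to the query.
p_rel, p_not_rel = relevance_prior(1000, 25)
print(p_rel, p_not_rel)  # 1/40 39/40
```

Exact fractions are used here only so that the two probabilities visibly sum to 1.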

Documents are not, however, selected at random; they are matched according to how well they fit the query. The matching process is based on an analysis of the terms (syntactic, semantic and other pragmatic cues) contained in both the document and the query. Thus, the relevance of a query's terms is assessed against the documents in which they occur.

The conditional probability of event X given that event Y has occurred is denoted by P(X | Y). The estimated probability of X can therefore depend on what is known about Y. For example, suppose a word is picked at random from some document; there is a certain probability, P(Mahal), that the word is Mahal. But if the word Taj has just been picked, there is a higher probability that the next word is Mahal. This can be checked by counting: the relative number of occurrences of Mahal in the document estimates P(Mahal), and the relative number of occurrences of Mahal immediately following Taj estimates P(Mahal | Taj).
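The counting argument can be sketched as below. The sample sentence is invented, the function name is hypothetical, and whitespace tokenization is an assumption; P(Mahal) is estimated from word counts and P(Mahal | Taj) from counts of the words that immediately follow Taj.

```python
from collections import Counter

def word_probabilities(text, word, previous):
    """Estimate P(word) and P(word | previous) by relative counts in one text.

    P(word) is the share of all tokens equal to `word`.
    P(word | previous) is the share of occurrences of `previous` (that have a
    following token) which are immediately followed by `word`.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0, 0.0
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_word = unigrams[word] / len(tokens)
    followers = sum(count for (first, _), count in bigrams.items()
                    if first == previous)
    p_word_given_previous = bigrams[(previous, word)] / followers if followers else 0.0
    return p_word, p_word_given_previous

# Made-up sample text, used only to show the counting.
text = ("the taj mahal is in agra and the taj mahal is white "
        "but the taj hotel is in mumbai")
p_mahal, p_mahal_given_taj = word_probabilities(text, "mahal", "taj")
print(p_mahal, p_mahal_given_taj)
```

In this sample, mahal accounts for 2 of the 19 words, but 2 of the 3 words that follow taj, so the conditional estimate is much higher than the unconditional one.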
