Calculating relevance - Fluid Topics

AFS Taruqa

Technical Notes

When Taruqa calculates a document's relevance score, it is all about probability. Taruqa's algorithm calculates the probability that the language model of a given document within a corpus will generate the user's query.

In other words: what is the likelihood that a document contains the query?

In this way, Taruqa treats a document's relevance score as a probability score.

The underlying model relies on a simplifying assumption that is not strictly true. Queries can contain multiple keywords, and Taruqa calculates a document's score for each of them separately. For example, if a query has two keywords, the probability that the first keyword occurs is treated as independent of the probability that the second one occurs. This is called a unigram model.

In summary, a document's relevance score is based on the probability that each keyword in the query will belong to the document, as illustrated in the following formula:

$$score(w) \approx \frac{freq(w,d)}{\mid d \mid}$$

In this formula, the probability that the keyword $w$ of the query will belong to the document $d$ is proportional to the number of times that the keyword appears in the document divided by the document's word count.
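As a minimal sketch of this formula, the unsmoothed score can be computed as shown below. The sketch assumes lowercase matching and whitespace tokenization, neither of which is specified here; Taruqa's actual text analysis is likely more elaborate. The function name `mle_score` is hypothetical.

```python
from collections import Counter


def mle_score(keyword: str, document: str) -> float:
    """Unsmoothed score: freq(w, d) / |d|.

    Assumption: words are split on whitespace and compared in lowercase.
    """
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document cannot generate any keyword
    return Counter(tokens)[keyword.lower()] / len(tokens)


doc = "information retrieval ranks documents by relevance to a query"
print(mle_score("retrieval", doc))  # one occurrence among nine words
print(mle_score("search", doc))     # keyword absent from the document
```

With this sample document, "retrieval" scores 1/9, while "search" scores exactly zero because it never occurs.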

If a keyword does not occur in the document, the result of the formula is a score of "zero," meaning that a typical search engine would exclude the document from the search results.

However, while a short document about "information retrieval" might not contain the keyword "search," this does not mean that the document is not at all relevant if the query was "text search."
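To make this pitfall concrete, the sketch below scores the query "text search" against a short document that never mentions "search". It assumes whitespace tokenization, and it assumes that per-keyword scores are combined by multiplication, as in a standard unigram query-likelihood model; this section does not state how Taruqa combines them. The name `unsmoothed_query_score` is hypothetical.

```python
import math


def unsmoothed_query_score(query: str, document: str) -> float:
    """Product of unsmoothed per-keyword scores freq(w, d) / |d|.

    Assumption: keywords are treated as independent (unigram model),
    so the query score is the product of the keyword scores.
    """
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return math.prod(
        tokens.count(word) / len(tokens) for word in query.lower().split()
    )


doc = "information retrieval finds relevant text in a large corpus"
print(unsmoothed_query_score("text", doc))         # nonzero: "text" occurs once
print(unsmoothed_query_score("text search", doc))  # zero: "search" is missing
```

A single missing keyword is enough to bring the whole query score down to zero, even though the document is clearly on topic.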

To avoid the pitfall of having a "zero" probability score, Taruqa uses a smoothing method. Smoothing methods for language models assign a nonzero probability to words that are "unseen" in the data in order to overcome the "sparse data" problem.

Because some probability mass is redistributed to words that do not appear in the document, the probability distribution becomes "smoother." Of the many methods available, Taruqa uses the Dirichlet smoothing method.

Taruqa's scores are based on the following formula:

$$score(w) \approx \frac{freq(w, d) + \left(\mu \cdot \frac{freq(w, C)}{C}\right)}{freq(d, C) + \mu}$$

where $\mu$ is the smoothing coefficient. More details about the formula are available in the next section.
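The following sketch illustrates how a Dirichlet-smoothed score of this shape behaves. It rests on several assumptions that are not confirmed by this section: the $C$ in the inner fraction is read as the total word count of the corpus, $freq(d, C)$ is read as the word count of the document, tokenization is a plain whitespace split, and the value of $\mu$ is chosen only for illustration. The function name `dirichlet_score` is hypothetical.

```python
def dirichlet_score(keyword: str, document: str, corpus: list[str], mu: float) -> float:
    """Dirichlet-smoothed score for one keyword.

    Assumptions: the corpus size is its total word count, and the
    denominator uses the document's word count plus mu.
    """
    doc_tokens = document.lower().split()
    corpus_tokens = [t for text in corpus for t in text.lower().split()]
    if not corpus_tokens:
        raise ValueError("corpus must contain at least one word")
    if mu <= 0:
        raise ValueError("mu must be positive")

    word = keyword.lower()
    freq_w_d = doc_tokens.count(word)                              # freq(w, d)
    p_w_corpus = corpus_tokens.count(word) / len(corpus_tokens)    # freq(w, C) / C
    return (freq_w_d + mu * p_w_corpus) / (len(doc_tokens) + mu)


corpus = [
    "information retrieval finds relevant text",
    "text search engines rank documents",
]
doc = corpus[0]
print(dirichlet_score("search", doc, corpus, mu=10.0))  # nonzero although absent from doc
print(dirichlet_score("text", doc, corpus, mu=10.0))    # higher: present in doc
```

In this toy corpus, "search" receives a small nonzero score for the first document because it occurs elsewhere in the corpus, while a word that appears nowhere in the corpus would still score zero under this sketch.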