Heuristic for score computation - Fluid Topics

AFS Taruqa

Category
Technical Notes
Audience
public

To improve response times, the freq(w, C, ctx) component of the score computation (the total number of matches in the context ctx for the whole corpus) is not computed precisely for windows. Instead, a heuristic is used.

For each field in the context ctx, the frequency of the rarest term in this field is used as the frequency of the window. The frequency of the window in the whole context is the sum of the values retained for each field.

For example, for the following window:

{"ow": {"query": "user guide", 
        "context": ["afs_title", "afs_abstract"]
       }
}

If:

  • user appears 10 times in afs_title and 1000 times in afs_abstract
  • guide appears 15 times in afs_title and 200 times in afs_abstract

Then, for afs_title, the rarest term is user with 10 occurrences. For afs_abstract, the rarest term is guide with 200 occurrences.

Thus, the value used as the frequency of the window in the whole context is 210.
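
The computation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the engine's actual implementation: the function name heuristic_window_frequency and the term_freqs dictionary layout are hypothetical, and the per-field term frequencies are assumed to be already available.

```python
def heuristic_window_frequency(term_freqs, terms, context):
    """Sum, over the fields of the context, the frequency of the rarest term.

    term_freqs: hypothetical mapping {field: {term: frequency in the corpus}}
    terms:      the terms of the window query
    context:    the fields listed in the window's "context"
    """
    total = 0
    for field in context:
        field_freqs = term_freqs.get(field, {})
        # A term absent from the field counts as 0 occurrences.
        total += min(field_freqs.get(term, 0) for term in terms)
    return total


# Figures taken from the example above.
term_freqs = {
    "afs_title": {"user": 10, "guide": 15},
    "afs_abstract": {"user": 1000, "guide": 200},
}
freq = heuristic_window_frequency(
    term_freqs, ["user", "guide"], ["afs_title", "afs_abstract"]
)
print(freq)  # 210 = 10 (afs_title) + 200 (afs_abstract)
```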

This heuristic value is an upper bound for the actual value, as each occurrence of the window must include one occurrence of its rarest term.

The actual frequency of the window is between 0 (when the window never appears) and the heuristic value (when every occurrence of the rarest term is part of a matching window).

Using this heuristic instead of the exact frequency only changes the baseline probability used for smoothing document scores, which is constant across documents.

The difference in score between documents is the same as if the actual window frequency were used.

Example

To query back office guide as an exact search (appearing in that exact order), the syntax is as follows:

afs:taruqa={"ow": {"query": "back office guide"}}

It is possible to widen the window size as follows:

afs:taruqa={"ow": {"query": "back office guide", "size": 1}}

In this case, the query will match back office guide as well as back office configuration guide.
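
Because the parameter value is JSON, generating it with a JSON serializer rather than by hand helps keep the braces balanced. The following is a minimal sketch: the helper name build_taruqa_param is hypothetical, and the result is not URL-encoded, which may be required depending on how the query is sent.

```python
import json


def build_taruqa_param(query, size=None, context=None):
    """Build an afs:taruqa parameter holding a single "ow" window (hypothetical helper)."""
    window = {"query": query}
    if size is not None:
        window["size"] = size
    if context is not None:
        window["context"] = list(context)
    # json.dumps guarantees well-formed JSON with matching braces.
    return "afs:taruqa=" + json.dumps({"ow": window})


print(build_taruqa_param("back office guide"))
print(build_taruqa_param("back office guide", size=1))
```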