You’ve probably heard about great search platforms like Elasticsearch or Solr - they have one thing in common: both are built on top of the powerful search engine Apache Lucene. To provide such powerful features, Apache Lucene implements solutions to a problem known as Text Retrieval.
What is text retrieval?
We can define it as a task where the system responds to a user query with relevant documents. Sounds easy, right? Well, not quite.
The main problem is that we can’t mathematically prove that one method is better than another. Our challenge is to design the most effective ranking function and evaluate it empirically - involving the users who are going to use our search engine.
Mathematics
Let’s define our problem:
Vocabulary: V = {w1,…,wN} - Set of words - might be multiple languages.
Query: q = q1,…,qm where qi ∈ V
Document: di = di1,…,dimi where dij ∈ V
Collection: C = {d1,…,dM}
Set of relevant documents: R(q) ⊆ C
Our task is to compute R’(q), an approximation of R(q).
How to compute R’(q)
We can use one of the following techniques.
Document selection

R’(q) = {d ∈ C | f(d,q) = 1}

where f(d,q) ∈ {0,1} is a binary classifier - the system decides whether a document is relevant or not.
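As a minimal sketch of document selection, here is a hypothetical binary classifier that marks a document relevant only if it contains every query word (whitespace tokenization and lowercasing are simplifying assumptions, not what Lucene actually does):

```python
def f(d, q):
    # binary classifier: 1 if the document contains every query word
    doc_words = set(d.lower().split())
    return 1 if all(w.lower() in doc_words for w in q.lower().split()) else 0

def select(C, q):
    # R'(q): every document the classifier marks as relevant
    return [d for d in C if f(d, q) == 1]

docs = ["best retrieval models", "cooking recipes", "retrieval models overview"]
print(select(docs, "retrieval models"))
# → ['best retrieval models', 'retrieval models overview']
```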
Document ranking

R’(q) = {d ∈ C | f(d,q) > θ}

where f(d,q) ∈ ℝ is a relevance measure function;
θ is a cutoff determined by the user.
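Document ranking can be sketched the same way - here with a hypothetical relevance function that simply counts query-word occurrences, and a user-chosen cutoff on the number of results (again, a toy stand-in for a real scoring function):

```python
def f(d, q):
    # relevance measure: total occurrences of query words in the document
    doc_words = d.lower().split()
    return sum(doc_words.count(w.lower()) for w in q.lower().split())

def rank(C, q, cutoff):
    # sort all documents by descending relevance, keep the top `cutoff`
    return sorted(C, key=lambda d: f(d, q), reverse=True)[:cutoff]

docs = ["cooking recipes", "retrieval models", "retrieval models and more models"]
print(rank(docs, "retrieval models", cutoff=2))
# → ['retrieval models and more models', 'retrieval models']
```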
Document selection vs ranking
The main problem with a document selection classifier is that it is very hard to make accurate. Most of the time it will be either over-constrained or under-constrained - returning no relevant documents, or returning too many.
Even if we made our classifier highly accurate, we must remember that documents are not equally relevant - they need prioritization.
Because of that, document ranking is generally preferred.
Retrieval models
As I said before, our goal is to design the most effective ranking function. We can tell that our ranking function is good when it ranks relevant documents above non-relevant ones.
OK, but - what does relevant mean?
How can we measure that document “d” is relevant to query “q”?
Retrieval models are our answer to formalizing relevance.
Common ideas
Let’s assume that we have a query like
q = “Best Retrieval Models”
So our ranking function will be
f(q=”Best Retrieval Models”, d)
The score of the query would be based on the score of each individual word:
- g(“Best”, d)
- g(“Retrieval”, d)
- g(“Models”, d)
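The decomposition above can be sketched in Python - f(q,d) is just the sum of per-word scores g(w,d). The raw-count g used here is only a placeholder assumption; the next sections refine it:

```python
def g(w, d):
    # per-word score: a placeholder raw count of w in document d
    return d.lower().split().count(w.lower())

def f(q, d):
    # query score decomposes into a sum over the query words
    return sum(g(w, d) for w in q.split())

d = "retrieval models are models for retrieval"
print(f("Best Retrieval Models", d))  # → 4  (0 + 2 + 2)
```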
How to calculate the score?
Term Frequency (TF) - How many times does a word occur in document “d”, e.g.
- How many times does “Best” occur in document “d”?
- How many times does “Retrieval” occur in document “d”?
- How many times does “Models” occur in document “d”?
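Term frequency is straightforward to compute - for instance with Python’s `collections.Counter` (whitespace tokenization is again a simplifying assumption):

```python
from collections import Counter

# term frequency: raw count of each word in document d
d = "best retrieval models best models"
tf = Counter(d.lower().split())
print(tf["best"])      # → 2
print(tf["retrieval"]) # → 1
```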
Document Length - How long is document “d”?
If a term occurs with equal frequency in two documents but one of them is shorter, the shorter one should score higher. The flip side: the longer a document is, the higher the probability that the term occurs in it just by chance.
Document Frequency - How often do we see a word in entire collection?
If a term is a common term, its contribution to the score should be smaller.
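The three signals can be combined into one simple scoring sketch - term frequency, divided by document length, weighted by an inverse document frequency. This is only an illustrative assumption-laden toy, not Lucene’s actual formula:

```python
import math

C = [
    "best retrieval models",
    "retrieval of the best models is best",
    "the cat sat on the mat",
]
N = len(C)

def df(w):
    # document frequency: in how many documents of C does w appear?
    return sum(1 for d in C if w in d.lower().split())

def score(q, d):
    words = d.lower().split()
    s = 0.0
    for w in q.lower().split():
        tf = words.count(w)
        if tf == 0:
            continue
        idf = math.log(N / df(w))     # common words contribute less
        s += (tf / len(words)) * idf  # dividing by length favors shorter docs
    return s

q = "best retrieval models"
print(score(q, C[0]) > score(q, C[1]) > score(q, C[2]))  # → True
```

Note how the third document scores zero (no query words at all), and the short first document beats the longer second one even though both contain all three query words.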
Summary
Now you know what text retrieval is, why document ranking is better than document selection, and what common ideas retrieval models rely on.
In the next part we will talk about the Vector Space Model as an example of a retrieval model.