2011-12-07

Practice Final Answers.

Originally Posted By: sctice
Problem 3

Question: Explain why it is reasonable to guess that Δ-values for posting lists follow a geometric distribution.

The geometric distribution is defined by Pr[X = k] = (1 - p)^{k - 1} * p.

Assume that the probability of finding term t in a random document is N_t / N, where N_t is the number of documents that contain t, and N is the total number of documents. Also, assume that documents are independent of each other.

The probability that a Δ-value equals k is the probability that, starting from one occurrence of the term t, the next k-1 documents do not contain t and the k-th document does. Under the assumptions above, this probability is

Pr[Δ = k] = (1 - (N_t / N))^{k-1} * (N_t / N)

which is just the geometric distribution with p = (N_t / N).
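
A quick sanity check of this argument is to simulate it. The sketch below (Python; the value p = 0.05 for N_t/N and the collection size are made-up numbers) scans a stream of random documents, records the gaps between consecutive documents containing t, and compares the empirical gap frequencies with the geometric prediction. The two columns should agree closely.

import random
from collections import Counter

def simulate_gaps(p, n_docs, seed=0):
    """Scan n_docs random documents; return the Delta-values (gaps in
    document id) between consecutive documents that contain the term."""
    rng = random.Random(seed)
    gaps, last = [], 0
    for doc_id in range(1, n_docs + 1):
        if rng.random() < p:        # document contains t with probability N_t/N
            gaps.append(doc_id - last)
            last = doc_id
    return gaps

p = 0.05                            # assumed N_t / N
gaps = simulate_gaps(p, 200_000)
counts = Counter(gaps)
for k in range(1, 6):
    empirical = counts[k] / len(gaps)
    predicted = (1 - p) ** (k - 1) * p   # geometric distribution
    print(k, round(empirical, 4), round(predicted, 4))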

-- Practice Final Answers
Originally Posted By: dhruvjalota
Question 9: Explain one method to estimate P1 and one method to estimate P2 in the divergence-from-randomness approach to coming up with a relevance measure.

Answer:
The basic score for a single term in DFR is (1 - P2) * (-log P1).
P1 is the probability that a random document d contains exactly f_t,d occurrences of t.
The second factor, (1 - P2), is related to eliteness: a document is said to be elite for the term t when it is "about" the topic associated with that term. Since -log P1 grows rapidly with f_t,d, the factor (1 - P2) compensates for this rapid growth.
To determine the relevance of a document to a query with this model we calculate the sum over t in q of q_t * (1 - P2,t) * (-log P1,t), where P1,t and P2,t are the P1 and P2 associated with the particular term t.
To estimate each, we count the number of ways to distribute the l_t occurrences of t among the N documents of the collection, which is the binomial coefficient C(N + l_t - 1, l_t) = (N + l_t - 1)! / ((N - 1)! * l_t!) (*)

To compute P1, suppose d is found to contain f_t,d occurrences of t. The number of ways to distribute the remaining l_t - f_t,d occurrences among the other N - 1 documents is
C((N - 1) + (l_t - f_t,d) - 1, l_t - f_t,d) = ((N - 1) + (l_t - f_t,d) - 1)! / ((N - 2)! * (l_t - f_t,d)!) (**)
We can estimate P1 as the ratio (**)/(*).

Thus, the estimates for P1 and -log P1 are:
P1 = (1 / (1 + l_t/N)) * ((l_t/N) / (1 + l_t/N))^(f_t,d)
and
-log P1 = log(1 + l_t/N) + f_t,d * log(1 + N/l_t)

For P2, using the law of succession, we estimate:
P2 = f_t,d / (f_t,d + 1)

Using this, since 1 - P2 = 1 / (f_t,d + 1), the estimate of (1 - P2) * (-log P1) is
log(1 + l_t/N) / (f_t,d + 1) + (f_t,d / (f_t,d + 1)) * log(1 + N/l_t)
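
As a concrete illustration, here is a small Python sketch of this final estimate. The collection statistics (N, l_t) and the within-document frequency f_t,d are made-up numbers, and log base 2 is an assumption (the base only rescales scores).

import math

def dfr_score(f_td, l_t, N):
    """(1 - P2) * (-log P1) with the estimates derived above:
    -log P1 = log(1 + l_t/N) + f_td * log(1 + N/l_t)
    1 - P2  = 1 / (f_td + 1)   (law of succession)"""
    neg_log_p1 = math.log2(1 + l_t / N) + f_td * math.log2(1 + N / l_t)
    return neg_log_p1 / (f_td + 1)

# hypothetical: t occurs l_t = 500 times across N = 10,000 documents,
# and 3 times in the document being scored
print(dfr_score(f_td=3, l_t=500, N=10_000))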

-- Practice Final Answers
Originally Posted By: sheetalg

Problem 5. What are some advantages and disadvantages of the IMMEDIATE MERGE versus NO MERGE index update strategies.

IMMEDIATE MERGE:
Advantages:
1. Works well when the size of the posting list to be added is not much smaller than the entire posting list for that term.
2. Better query processing because of contiguous on-disk posting lists.
Disadvantages:
1. Impractical for large indices because the total number of tokens transferred to/from disk is quadratic in the number of tokens in the collection (illustrated in the sketch after this answer).
2. Requires the search engine to read and rewrite the entire on-disk index every time it runs out of memory.

NO MERGE:
Advantages:
1. Practical for large indices
2. Since there is no merging, there is no need to read the entire index from disk.
Disadvantages:
1. Each term's postings end up scattered across many on-disk index partitions, so a large number of disk seeks is required to process a query.
2. As a result, query processing is slower and less efficient.
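
To see where the quadratic disk traffic of IMMEDIATE MERGE comes from, here is a back-of-the-envelope sketch (the collection and memory sizes are made-up numbers): each time memory fills, the whole on-disk index is read back and rewritten, while NO MERGE writes each memory load to disk exactly once.

def immediate_merge_traffic(total_tokens, mem_tokens):
    """Tokens read + written across all merges (constants ignored)."""
    traffic, on_disk, remaining = 0, 0, total_tokens
    while remaining > 0:
        batch = min(mem_tokens, remaining)
        traffic += on_disk + (on_disk + batch)  # read old index, write merged one
        on_disk += batch
        remaining -= batch
    return traffic

def no_merge_traffic(total_tokens, mem_tokens):
    return total_tokens            # every posting is written once, never re-read

T, M = 10**9, 10**7                # assumed: 1B-token collection, 10M-token buffer
print(immediate_merge_traffic(T, M))   # ~ T*T/M, quadratic in T
print(no_merge_traffic(T, M))          # T, linear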

-- Practice Final Answers
Originally Posted By: sheetalg

Problem 6. Describe how the Hybrid Index Maintenance system works.

From Nov 23rd, slide 4
IMMEDIATE MERGE works well when the size of the new postings we are adding is not too much smaller than the entire posting list for the term.
INPLACE works better for long posting lists, where appending in place keeps the number of disk accesses low.
The book describes a hybrid set-up where we maintain two indexes: a merge index and an in-place index.
Initially, all new terms are placed into the merge index.
When a list's size reaches a certain threshold (~0.5 MB for typical hard drives, roughly the point where sequential transfer time matches the cost of a disk seek), it is switched from the merge index to the in-place index.
Since most terms never reach this threshold, for the majority of terms we use IMMEDIATE MERGE, and this is fast because the posting lists in question are all short.
Since there are only a few longer posting lists, we don't need to perform many seeks, and we get the performance benefits of INPLACE for these longer lists.
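
A minimal sketch of the routing decision described above (the 0.5 MB threshold is from the slide; the in-memory dictionaries standing in for the two on-disk indexes, and the 8-byte posting size, are assumptions for illustration):

LONG_LIST_THRESHOLD = 512 * 1024      # ~0.5 MB

class HybridIndex:
    """Toy model: short lists live in the merge index; once a list
    crosses the threshold it is promoted to the in-place index."""
    def __init__(self):
        self.merge_index = {}         # term -> postings, re-merged on update
        self.inplace_index = {}       # term -> postings, appended in place

    def add_postings(self, term, postings, posting_bytes=8):
        if term in self.inplace_index:
            self.inplace_index[term].extend(postings)   # cheap in-place append
            return
        lst = self.merge_index.setdefault(term, [])
        lst.extend(postings)
        if len(lst) * posting_bytes >= LONG_LIST_THRESHOLD:
            # the list has grown long: move it to the in-place index for good
            self.inplace_index[term] = self.merge_index.pop(term)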
2011-12-10

-- Practice Final Answers
Originally Posted By: Amruta
Can others please post the remaining answers?

-- Practice Final Answers
Originally Posted By: jnewth

Problem 4: Briefly describe the REBUILD and REMERGE batch index operations.


This discussion is on pages 228-231 of the text and in the notes of November 21st.

The REBUILD and REMERGE strategies are BATCH update strategies where the index is periodically updated rather than continually updated.

REBUILD is:
1. The index is built for the corpus.
2. Time passes and the corpus changes.
3. A new index is built for the updated corpus.
4. The new index replaces the old index.

REMERGE is:
1. The index is built for the corpus.
2. Time passes and the corpus changes.
3. A partial index is built ONLY for the documents added to the corpus.
4. At some point the new partial index is merged into the original index.

If there are only additions, REMERGE is the clear winner. However, when there are deletions in the corpus, there is a tradeoff between the two strategies. The tradeoff is derived in the following manner:

REBUILD:

The documents in the new corpus are the old documents, minus any deletions, plus any additions:

dnew = dold - ddelete + dinsert

REBUILD_TIME = c * (dold - ddelete + dinsert), for some constant c

REMERGE:

The total time is the time to build a partial index for the inserted documents plus the time for the final merge, which reads the old index and the partial index and is estimated at roughly 25% of the cost of indexing the same documents from scratch:

REMERGE_TIME = c * dinsert + (c/4) * (dold + dinsert)

We set the two equal and solve for ddelete:

dold - ddelete + dinsert = dinsert + (1/4) * (dold + dinsert)

ddelete = (3/4) * dold - (1/4) * dinsert

Assuming that the number of insertions roughly equals the number of deletions (i.e., the corpus size changes very gradually), then

dinsert = ddelete

and

ddelete + (1/4) * ddelete = (3/4) * dold, or more simply

ddelete = (3/5) * dold

The point where the two strategies take about the same time is when we are deleting 60% of the existing corpus. This doesn't happen very often so REMERGE is almost always the better strategy.
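
Plugging numbers into the two formulas confirms the break-even point; the sketch below is a quick check with made-up values (c = 1, dold = 1000, and dinsert = ddelete as assumed above).

def rebuild_time(d_old, d_del, d_ins, c=1.0):
    return c * (d_old - d_del + d_ins)

def remerge_time(d_old, d_del, d_ins, c=1.0):
    return c * d_ins + (c / 4) * (d_old + d_ins)

d_old = 1000
for frac in (0.5, 0.6, 0.7):
    d = frac * d_old                  # dinsert = ddelete
    print(frac, rebuild_time(d_old, d, d), remerge_time(d_old, d, d))
# at frac = 0.6 the two times are equal (1000.0 each);
# below that REMERGE is cheaper, above it REBUILD wins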

-- Practice Final Answers
Originally Posted By: Rohit Kulkarni
For each of the following, give the distribution for which it is an optimal code: (a) unary code, (b) gamma code, (c) delta code.

Answers:

a) Optimal distribution for the unary code: Pr[X = k] = 2^(-k)

b) Optimal distribution for the gamma code: Pr[X = k] ≈ 1/(2k^2) (see the sketch below)

c) Optimal distribution for the delta code: Pr[X = k] ≈ 1/(2k (log k)^2)
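
For reference, a short Python sketch of the gamma code from part (b). The unary-selector convention used here (ones terminated by a zero) is one of several equivalent choices; the key property is that a codeword for k takes about 2*log2(k) + 1 bits, which is the optimal length when Pr[X = k] ≈ 1/(2k^2).

def gamma_encode(k):
    """Elias gamma code: unary code for the body length, then the
    binary digits of k with the leading 1 dropped."""
    assert k >= 1
    body = bin(k)[2:]                        # binary representation of k
    return '1' * (len(body) - 1) + '0' + body[1:]

def gamma_decode(bits):
    n = bits.index('0')                      # leading 1s give body length - 1
    return int('1' + bits[n + 1 : n + 1 + n], 2)

for k in (1, 2, 5, 9, 100):
    cw = gamma_encode(k)
    print(k, cw, gamma_decode(cw), len(cw))  # len(cw) = 2*floor(log2 k) + 1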
2011-12-11

-- Practice Final Answers
Originally Posted By: sayali
Problem 8. Give the equations for the LMJM and LMD relevance measures. Do one example calculation with each.

Ans:
i. Equation for LMJM and LMD
Link - http://www.cs.sjsu.edu/faculty/pollett/ ... ml#%289%29
where
q_t = frequency of term t in the query q = ⟨t1, ..., tn⟩
lambda = 0.5 (the smoothing parameter for LMJM)
f_t,d = frequency of term t in document d
l_d = length of document d
l_t = number of occurrences of t in the collection; l_C = total length of the collection
(each symbol's meaning should be described in the answer)


ii. Example calculation
For LMJM, see the example on page 295 of the book.
For LMD, the reduced form of the equation (equation 9.33 on page 295) is used for the calculation; see page 296 of the book.
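
Since the link above only points at the formulas, here is a sketch of the two smoothed language models in their basic (unreduced) forms: LMJM mixes the document and collection models with weight lambda, and LMD smooths with a Dirichlet prior of mass mu. All the statistics below (and mu = 1000) are made-up example values.

import math

def lmjm_score(query, doc, doc_len, coll_freq, coll_len, lam=0.5):
    """Jelinek-Mercer: p(t|d) = (1 - lam) * f_t,d / l_d + lam * l_t / l_C;
    score(d, q) = sum over query terms of q_t * log p(t|d).
    (With lam = 0.5 it does not matter which component lam weights.)"""
    score = 0.0
    for t, q_t in query.items():
        p = (1 - lam) * doc.get(t, 0) / doc_len + lam * coll_freq[t] / coll_len
        score += q_t * math.log(p)
    return score

def lmd_score(query, doc, doc_len, coll_freq, coll_len, mu=1000):
    """Dirichlet: p(t|d) = (f_t,d + mu * l_t / l_C) / (l_d + mu)."""
    score = 0.0
    for t, q_t in query.items():
        p = (doc.get(t, 0) + mu * coll_freq[t] / coll_len) / (doc_len + mu)
        score += q_t * math.log(p)
    return score

query = {"quarrel": 1, "sir": 1}            # q_t values
doc = {"quarrel": 2, "sir": 4}              # f_t,d values
coll = {"quarrel": 500, "sir": 3000}        # l_t values
print(lmjm_score(query, doc, doc_len=100, coll_freq=coll, coll_len=10**6))
print(lmd_score(query, doc, doc_len=100, coll_freq=coll, coll_len=10**6))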

-- Practice Final Answers
Originally Posted By: sandybb
Q7. Briefly explain the BM25F relevance measure. Briefly explain how pseudo-relevance feedback works.
Ans-
In the BM25F relevance measure, a document is split into different components. BM25 scores are computed for each component. The weighted sum of BM25 scores for the individual components is the BM25F score for the document.
For example, a document can be split into title and body, and BM25 scores calculated for each. If weights of 10 and 1 are assigned to the title and the body, the final weighted sum is the BM25F score for the document.

Pseudo-relevance feedback -
An initial query is performed and the top m results are retrieved. The retrieval system assumes these top m results are relevant, and uses a term-scoring function to find the terms that best characterize them.
These terms are added to the initial query, and results are then computed for this expanded query.
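
A minimal sketch of the field-weighting idea (the weights, k1, and b values are assumptions). One caveat worth noting: the usual BM25F formulation pools the weighted, length-normalized per-field term frequencies into a single pseudo-frequency and applies BM25's saturation once, rather than summing fully formed per-field BM25 scores; the sketch follows that pooled form.

import math

def bm25f_term_weight(tf_by_field, field_weights, field_lens, avg_field_lens,
                      N, N_t, k1=1.2, b=0.75):
    """One term's BM25F contribution: length-normalize and weight each
    field's term frequency, pool into a pseudo-frequency, then apply
    BM25 saturation and an IDF factor."""
    pseudo_tf = 0.0
    for f, tf in tf_by_field.items():
        norm = 1 - b + b * field_lens[f] / avg_field_lens[f]
        pseudo_tf += field_weights[f] * tf / norm
    idf = math.log(N / N_t)
    return idf * pseudo_tf * (k1 + 1) / (pseudo_tf + k1)

# hypothetical document: the term appears once in the title, twice in the body
print(bm25f_term_weight(
    tf_by_field={"title": 1, "body": 2},
    field_weights={"title": 10, "body": 1},   # the weights from the answer
    field_lens={"title": 8, "body": 400},
    avg_field_lens={"title": 10, "body": 500},
    N=100_000, N_t=1_000))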
2011-12-13

-- Practice Final Answers
Originally Posted By: Rohit Kulkarni
Problem 10: MapReduce algorithm for finding the most common word

Step 1: map(k, v) =
    split v into tokens
    for each token t do
        output(t, 1)
    return

Step 2: reduce(k, <v1, ..., vn>) =
    count = v1 + ... + vn
    output(k, count)

Step 3: in a second round, map every (word, count) pair to a single key; the final reduce keeps a running maximum and outputs the most common word:
    if count > max_count then
        max_count = count
        max_word = k
    output(max_word, max_count)
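
A runnable single-machine sketch of the same algorithm, with Python standing in for the MapReduce framework (the shuffle/grouping step between map and reduce is simulated with a dictionary):

from collections import defaultdict

def map_phase(doc_id, text):
    for token in text.split():
        yield token, 1                         # output(t, 1)

def shuffle(pairs):
    groups = defaultdict(list)                 # group values by key
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(term, values):
    return term, sum(values)                   # word count

docs = {1: "to be or not to be", 2: "to index or not to index"}
pairs = [kv for d, text in docs.items() for kv in map_phase(d, text)]
counts = [reduce_phase(k, vs) for k, vs in shuffle(pairs).items()]
# final round: send every pair to one key and keep the maximum
print(max(counts, key=lambda kv: kv[1]))       # -> ('to', 4)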