2011-11-02

Practice Midterm #2 Answers.

Originally Posted By: jnewth

Problem 4: With regard to hash-table chaining, what are the move-to-front and insert-at-back heuristics?


This is described in the book on page 121 and in the notes of October 17, 2011.

We are building a dictionary that maps terms to posting lists. These are heuristics for improving hash-chain lookup. With a hash table, multiple terms will likely hash to the same location; to resolve this, we chain the colliding entries together. To find a term's posting list, you first hash the term to find the right chain, then walk along the chain until you reach the entry for the term you are looking for. To speed this up, we want to place more frequently searched terms near the start of the chain.

Two methods of doing this are Insert-at-back and Move-to-front.

Insert-at-back relies on the observation that, when building the dictionary, a term seen early during construction is likely a frequent term in the corpus and therefore likely to be searched for frequently; a term not seen until late in construction probably is not. Since we want frequently searched terms near the front of the chain and infrequent ones at the back, we always insert new terms at the back.

Move-to-front suggests that, when a term is looked up (for example, we have a new posting for a term and want to add it to that term's posting list), we hash the term, find the chain, search along the chain until we find the term's entry, and then add the posting to that term's posting list. Then we move the entry we just found to the front of its chain. You can imagine that if this runs for a while, terms that are frequently searched for are constantly moved toward the front, and infrequently searched-for terms gradually drift to the back.
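As a rough sketch of both heuristics (the bucket count, names, and chain representation here are all made up for illustration):

```python
# Toy hash-table chains: each chain is a list of (term, posting_list) entries.

NUM_BUCKETS = 8  # tiny table, just for illustration

def make_table():
    return [[] for _ in range(NUM_BUCKETS)]

def bucket(term):
    return hash(term) % NUM_BUCKETS

def insert_at_back(table, term):
    """First occurrence of a term goes to the back of its chain."""
    chain = table[bucket(term)]
    for entry in chain:
        if entry[0] == term:
            return entry
    entry = (term, [])
    chain.append(entry)  # later (likely rarer) terms end up behind earlier ones
    return entry

def lookup_move_to_front(table, term):
    """Find the term's entry and move it to the front of its chain."""
    chain = table[bucket(term)]
    for i, entry in enumerate(chain):
        if entry[0] == term:
            chain.insert(0, chain.pop(i))  # frequent terms drift forward
            return entry
    return None
```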

-- Practice Midterm #2 Answers
Originally Posted By: jnewth

#10. What is a gamma code?


This is a prefix-free encoding of posting list gaps (as opposed to absolute locations or offsets from the start) described on page 193 of the book and in the notes of October 31, 2011.

The idea is that every integer k >= 1 is specified as a string of the form:

[selector] [body]

where the selector is a string of 0's followed by a single 1. The length of this string is the number of bits required to encode k. For example, if k = 5, in base 2 this is 101. We need 3 bits to encode this, so our selector is 001.

The body is the binary representation of k (with one minor change). If k = 5, in base 2 this is 101.

The book makes the observation that the leading bit of the body for every k is 1, so we can drop it from the encoding.

Therefore, k = 5 is written as 001 01.

The length of a gamma code formed in this manner is 2*floor( log2(k) ) + 1.
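A minimal sketch of gamma encoding and decoding, assuming bit strings are represented as Python strings of '0' and '1':

```python
def gamma_encode(k):
    """Elias gamma code of an integer k >= 1, as a bit string."""
    assert k >= 1
    body = bin(k)[2:]           # binary representation of k, e.g. 5 -> '101'
    n = len(body)               # number of bits needed to encode k
    selector = '0' * (n - 1) + '1'
    return selector + body[1:]  # the leading 1 of the body is implicit, so drop it

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    n = bits.index('1') + 1          # selector length = bit length of k
    body = '1' + bits[n:n + n - 1]   # restore the implicit leading 1
    return int(body, 2)
```

For k = 5 this produces '00101', and the total length is 2*floor(log2(k)) + 1 bits, matching the formula above.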

-- Practice Midterm #2 Answers
Originally Posted By: sayali
Q5. Give the formula for BM25. How can binary heaps be used to speed up document-at-a-time query processing?

1. Formula for BM25
Score_BM25(q, d) = Σ_{t ∈ q} IDF(t) · TF_BM25(t, d), where
IDF(t) = log(N / N_t), and
TF_BM25(t, d) = f_{t,d} · (k1 + 1) / (f_{t,d} + k1 · ((1 - b) + b · (l_d / l_avg)))
Formula from link - http://www.cs.sjsu.edu/faculty/pollett/ ... ml#%286%29
where q = t1, t2, ..., tn; N = total number of documents; N_t = number of documents containing t;
f_{t,d} = frequency of t in document d; k1 = 1.2 and b = 0.75;
l_d = length of document d; l_avg = average length of documents in the collection.
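A sketch of the scoring function under the definitions above (the function and parameter names are my own; doc_tf and df are assumed to be precomputed dictionaries):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, N, df, k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    doc_tf: dict term -> f_{t,d} for this document.
    df:     dict term -> N_t (document frequency of the term).
    """
    score = 0.0
    for t in query_terms:
        f = doc_tf.get(t, 0)
        if f == 0 or t not in df:
            continue  # term contributes nothing if absent from doc or corpus
        idf = math.log(N / df[t])
        tf = f * (k1 + 1) / (f + k1 * ((1 - b) + b * (doc_len / avg_len)))
        score += idf * tf
    return score
```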

2. In the document-at-a-time query processing algorithm, two binary heaps can be used as follows:
i. One heap manages the query terms and, for each term t, keeps track of the next document that contains t.
ii. The other maintains the set of the top k search results seen so far.
The heap-based algorithm has complexity O(N_q · log n + N_q · log k),
which is less than the O(m·n + m·log m) of the algorithm without heaps.
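A toy sketch of the two-heap idea (posting lists here are in-memory lists of (docid, score) pairs, so this only illustrates the control flow, not the disk layout; all names are my own):

```python
import heapq

def top_k_docs(postings, k):
    """Document-at-a-time sketch. postings: term -> sorted list of
    (docid, score_contribution). One heap orders the query terms by
    their next docid; the other keeps the k best results seen so far."""
    # heap 1: (next docid, term, position in that term's posting list)
    term_heap = [(plist[0][0], t, 0) for t, plist in postings.items() if plist]
    heapq.heapify(term_heap)
    results = []  # heap 2: min-heap of (score, docid), size <= k
    while term_heap:
        doc = term_heap[0][0]
        score = 0.0
        # pop every term whose next posting belongs to this document
        while term_heap and term_heap[0][0] == doc:
            _, t, i = heapq.heappop(term_heap)
            score += postings[t][i][1]
            if i + 1 < len(postings[t]):
                heapq.heappush(term_heap, (postings[t][i + 1][0], t, i + 1))
        if len(results) < k:
            heapq.heappush(results, (score, doc))
        elif score > results[0][0]:
            heapq.heapreplace(results, (score, doc))  # evict the current worst
    return sorted(results, reverse=True)  # best first
```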
2011-11-04

-- Practice Midterm #2 Answers
Originally Posted By: sheetalg

Q 6: What are some advantages of computing the disjunction of the query terms as opposed to the conjunction? What are some advantages of computing the conjunction over the disjunction?


Advantages of disjunctive over conjunctive
1. Better recall, because the fraction of relevant documents that appear in the result set is higher than in the conjunctive retrieval model. For example, if 2 out of 3 query terms appear in a relevant document, that document will appear in the search results of the disjunctive model, but the conjunctive model will never return it.
2. Longer queries do not worsen the search results.

Advantages of conjunctive over disjunctive
1. Faster query processing, because there are fewer documents to be scored and ranked. But this comes at the cost of lower recall.
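A tiny illustration of the two candidate sets, using made-up posting lists of document IDs:

```python
# Hypothetical three-term query and per-term posting lists (doc IDs).
postings = {
    "information": {1, 2, 3, 5},
    "retrieval":   {2, 3, 5, 8},
    "systems":     {3, 7},
}

disjunctive = set.union(*postings.values())         # any term matches
conjunctive = set.intersection(*postings.values())  # all terms must match

# A relevant document containing only 2 of the 3 terms (e.g. doc 2 or 5)
# is reachable under disjunction but never returned under conjunction,
# while conjunction leaves far fewer documents to score and rank.
```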

-- Practice Midterm #2 Answers
Originally Posted By: shailbenq

What is dictionary interleaving? What does the dictionary look like when dictionary interleaving is used?


-- It is a method for storing the inverted index in which all entries are stored on disk, with each dictionary entry placed directly before its posting list. In addition, some of the dictionary entries are kept in memory with pointers to their locations on disk.

In memory:
Align -> disk offset 409345; Alignment -> disk offset 450990; Aligns -> disk offset 560989

On disk:
Align {1205, 1234, ...} Aligned {920, 948, ...} Alignment {1102, 1690, ...} Aligns {2394, 6784, ...}
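A sketch of the in-memory lookup step, assuming the in-memory entries are a sorted list of (term, disk_offset) pairs (the offsets are hypothetical):

```python
import bisect

def find_on_disk(term, sparse):
    """sparse: sorted list of (term, disk_offset) for the dictionary
    entries kept in memory. Returns the disk offset at which to start
    scanning forward for `term`."""
    terms = [t for t, _ in sparse]
    i = bisect.bisect_right(terms, term) - 1  # last in-memory term <= `term`
    if i < 0:
        return None  # term sorts before everything in the dictionary
    return sparse[i][1]

sparse = [("Align", 409345), ("Alignment", 450990), ("Aligns", 560989)]
```

For a term like "Aligned" that has no in-memory entry, this returns the offset of "Align", and the search then scans forward through the interleaved entries on disk.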
2011-11-06

-- Practice Midterm #2 Answers
Originally Posted By: chhawe
Can the others please post the remaining answers?

Thanks
Chinmay

-- Practice Midterm #2 Answers
Originally Posted By: sctice

Problem 2: What is a per-term index? How does it improve posting list look-up speed over binary search?


A per-term index is a disk-based data structure used to locate postings. It is a subset of a term's postings that is small enough to fit in memory, and it's stored in the header of the term's posting list. To find a particular posting p, the per-term index is loaded into memory and searched for a candidate range that contains p. That range is then loaded from the full posting list and searched in memory to find p. This procedure improves look-up speed over a binary search on the full posting list by avoiding random accesses on disk, which are much more expensive than random accesses in memory. Using a per-term index, data can be read sequentially from disk, and it's only necessary to read two contiguous blocks from disk in order to find a particular posting: first the per-term index, and then the candidate range from the full posting list.
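A sketch of that look-up procedure, assuming the per-term index holds every B-th posting and read_block(i) performs one sequential read of the i-th block of B postings (both names are my own):

```python
import bisect

def find_posting(target, per_term_index, read_block):
    """Find the first posting >= target.

    per_term_index: in-memory list of every B-th posting of the term.
    read_block(i):  returns the i-th block of B postings from disk.
    """
    # search the in-memory index for the candidate block
    i = bisect.bisect_right(per_term_index, target) - 1
    if i < 0:
        i = 0
    block = read_block(i)      # one sequential disk read
    j = bisect.bisect_left(block, target)
    if j < len(block):
        return block[j]
    nxt = read_block(i + 1)    # target falls past this block's last posting
    return nxt[0] if nxt else None
```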
2011-11-07

-- Practice Midterm #2 Answers
Originally Posted By: Amruta

Q7. What is an accumulator in query processing? How does accumulator pruning work?


An accumulator is a data structure entry (e.g. an array element) that stores the score accumulated so far for a matching document. Its value is updated as each posting is processed.

In most cases, the full set of accumulators is too large to fit in memory, so we impose an upper limit on the number of accumulators that can be created. We then want to make sure that the documents most likely to make it into the top k results are the ones assigned an accumulator.
To do this, a threshold value is defined and recomputed after every u postings. If a posting's score exceeds the threshold, an accumulator is assigned to its document; if it is lower, no accumulator is assigned.
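A rough sketch of the pruning loop (all names and the particular threshold rule are made up for illustration; real systems choose the threshold more carefully):

```python
def prune_accumulators(postings_stream, max_acc, u=4):
    """postings_stream yields (docid, score_contribution) pairs; at most
    max_acc accumulators may exist. Every u postings the threshold is
    recomputed from the current accumulator scores."""
    acc = {}          # docid -> accumulated score
    threshold = 0.0
    for n, (doc, s) in enumerate(postings_stream, 1):
        if doc in acc:
            acc[doc] += s       # existing accumulators are always updated
        elif len(acc) < max_acc and s > threshold:
            acc[doc] = s        # new accumulator only above the threshold
        if n % u == 0 and acc:
            # raise the bar: half of the median accumulated score (toy rule)
            threshold = sorted(acc.values())[len(acc) // 2] * 0.5
    return acc
```

Note the tradeoff this makes visible: once the accumulator limit is reached, a document whose postings arrive late may never get an accumulator even if its score would have been high.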