2012-12-12

Practice final solution for Problem 7.

Originally Posted By: sushma
BM25 Formula :
BM25.png
From the above BM25 formula, the components like N, Nt, ld, lavg,f(t,d) need to be precomputed before query time.
lavg, N and Nt are the components which would benefit from being calculated using map reduce algorithm.

Map Reduce implementation to calculate Average Document Length :
The average document length is the total number of tokens divided by the total number of documents.

Let our initial key value pairs be (docid, document). The mapper will produce key value pairs of the form ("doc_count", 1) and ("token_count", 1) according to the following (we assume or arrange both keys go to same reducer).
Code: map(docid, document) {
  output("doc_count", 1);
  for each token t in document {
     output ("token_count", 1);
  }
  return;
}

The reducer then adds these pairs as follows:
Code: reduce(key, (v[1],... v[n])) { // remark v[i] will always be 1
    static total_docs = 0;
    static total_tokens = 0;
    for i = 1 to n {
        if(key == "doc_count") {
            total_docs += v[i];
        }
        if(key == "token_count") {
            total_tokens += v[i];
        }
    }
    if(total_docs > 0) {
       return total_tokens/total_docs;
    }
    return -1;
}
We assume if we see -1, we should wait for the second return value of the function


Group Members : Jiten Oswal, Sushma Bandekar, Geetha Ranjini Viswanathan
'''Originally Posted By: sushma''' BM25 Formula :<br>BM25.png<br>From the above BM25 formula, the components like N, Nt, ld, lavg,f(t,d) need to be precomputed before query time.<br>lavg, N and Nt are the components which would benefit from being calculated using map reduce algorithm.<br><br>Map Reduce implementation to calculate Average Document Length :<br>The average document length is the total number of tokens divided by the total number of documents. <br><br>Let our initial key value pairs be (docid, document). The mapper will produce key value pairs of the form (&quot;doc_count&quot;, 1) and (&quot;token_count&quot;, 1) according to the following (we assume or arrange both keys go to same reducer).<br>Code: map(docid, document) {<br>&nbsp; output(&quot;doc_count&quot;, 1);<br>&nbsp; for each token t in document {<br>&nbsp; &nbsp; &nbsp;output (&quot;token_count&quot;, 1);<br>&nbsp; }<br>&nbsp; return;<br>}<br><br>The reducer then adds these pairs as follows:<br>Code: reduce(key, (v[1],... v[n])) { // remark v[i] will always be 1<br>&nbsp; &nbsp; static total_docs = 0;<br>&nbsp; &nbsp; static total_tokens = 0;<br>&nbsp; &nbsp; for i = 1 to n {<br>&nbsp; &nbsp; &nbsp; &nbsp; if(key == &quot;doc_count&quot;) {<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; total_docs += v[i];<br>&nbsp; &nbsp; &nbsp; &nbsp; }<br>&nbsp; &nbsp; &nbsp; &nbsp; if(key == &quot;token_count&quot;) {<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; total_tokens += v[i];<br>&nbsp; &nbsp; &nbsp; &nbsp; }<br>&nbsp; &nbsp; }<br>&nbsp; &nbsp; if(total_docs &gt; 0) {<br>&nbsp; &nbsp; &nbsp; &nbsp;return total_tokens/total_docs;<br>&nbsp; &nbsp; }<br>&nbsp; &nbsp; return -1;<br>}<br>We assume if we see -1, we should wait for the second return value of the function<br><br><br>Group Members : Jiten Oswal, Sushma Bandekar, Geetha Ranjini Viswanathan
X