2012-12-10

Practice final solution for Problem 4.

Originally Posted By: fchan
4) What is document eliteness? How is it estimated in DFR? How is DFR modified to handle document-length normalization?

Document eliteness: a document is considered elite for a term if the document is about the topic associated with the term.

In divergence from randomness (DFR), the general form of the term weight is the following...

Code: (1 - P2)(-log P1), where P2 is the part associated with "eliteness".
To calculate P2, we can use a modified version of Laplace's law of succession

Code: m/(m+1), where we substitute m -> f(t,d)
to give us...

Code: P2 = f(t,d)/(f(t,d) + 1)
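
As a quick sanity check of the estimate above, here is a minimal Python sketch. The function names are our own (hypothetical), P1 is simply passed in as a number because its model is not spelled out in this answer, and the base-2 logarithm is an assumption since the formula only says "log".

```python
import math

def eliteness_p2(f_td):
    """P2 = f(t,d) / (f(t,d) + 1), the eliteness estimate from the post."""
    return f_td / (f_td + 1)

def dfr_weight(p1, f_td):
    """(1 - P2) * (-log P1). Base-2 log is an assumption, not given in the post."""
    return (1 - eliteness_p2(f_td)) * (-math.log2(p1))

print(eliteness_p2(3))      # 0.75
print(dfr_weight(0.25, 3))  # 0.5
```

Note that P2 grows toward 1 as f(t,d) grows, so the factor (1 - P2) shrinks the gain from each additional occurrence of the term.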
The formula above assumes all documents are the same length, so to account for documents of different lengths within our corpus we need to apply some form of normalization.

Therefore, we can modify the P2 formula by replacing f(t,d) with a length-normalized frequency f'(t,d), defined as follows...

Code: f'(t,d) = f(t,d) * log(1 + (lavg/ld))
where ld is the length of document d and lavg is the average document length in the corpus. This will accommodate fluctuations in document lengths.
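
To see what the normalization does, here is a small Python sketch under a few assumptions: the function name is hypothetical, the example lengths are made up, and we use a base-2 log (not stated in the formula), which has the convenient effect that a document of exactly average length keeps its raw frequency.

```python
import math

def normalized_tf(f_td, l_d, l_avg):
    """f'(t,d) = f(t,d) * log(1 + lavg/ld); base-2 log is an assumption."""
    return f_td * math.log2(1 + l_avg / l_d)

# Same raw frequency (3) in documents of different lengths (illustrative numbers).
avg_doc = normalized_tf(3, l_d=100, l_avg=100)    # 3.0, unchanged at average length
long_doc = normalized_tf(3, l_d=200, l_avg=100)   # reduced for a longer document
short_doc = normalized_tf(3, l_d=50, l_avg=100)   # boosted for a shorter document

# Plug the normalized frequency into the eliteness estimate P2 = f'/(f' + 1).
p2_long = long_doc / (long_doc + 1)
print(avg_doc, long_doc, short_doc, p2_long)
```

So three occurrences in a long document count for less than three occurrences in a short one, which is the behavior we want from the normalization.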

Team members:
Frank Chan
Ranjith Jidigam
Hardik Rana
Nitin Tenali