2012-12-10

Practice Final Solution for Question 6.

Originally Posted By: avinash anantharamu
Distinguish intra-query and inter-query parallelism in the context of information retrieval.

What bottlenecks exist for intra-query parallelism if partition-by-document is used?

[*]Inter-query parallelism runs different queries concurrently, each handled entirely by one node or replica; intra-query parallelism splits the work of a single query across several nodes, as partition-by-document does when every node searches its own slice of the collection.
[*]Document partitioning works best when the index data on the individual nodes can be stored in main memory or on SSD.
[*]To see what happens when the index data is instead stored on disk, suppose queries contain on average 3 words and we want the search engine to handle 100 queries per second.
[*]Due to queueing effects, we typically cannot run at more than about 50% utilization without latency jumping.
[*]So a query load of 100 qps translates to a required service rate of at least 200 qps. With 3-word queries, each needing one posting-list lookup per term, this means at least 600 random access operations per second on every node (see the sketch after this list).
[*]Assuming an average disk latency of 10 ms, a single hard disk drive cannot perform more than about 100 random access operations per second, one sixth of what we need on each of our nodes.
[*]Adding more machines doesn't help, because with document partitioning every node still sees every query, so the per-node seek rate stays the same; this is a bottleneck for the document partitioning approach.
[*]Term partitioning addresses this problem by splitting the index into sets of terms and assigning each set to a node.
[*]This resolves the problem above because a node only participates in queries that contain one of its terms, so no single node has to handle every query (see the second sketch below).
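The disk-seek arithmetic in the bullets above can be checked with a short back-of-envelope calculation. This is only a sketch of the reasoning; the constants (100 qps target, 50% utilization cap, 3 terms per query, 10 ms average seek time) are the assumptions stated in the bullets, not measured numbers.

    # Back-of-envelope check of the disk bottleneck described above.
    TARGET_QPS = 100           # query load the engine must sustain
    MAX_UTILIZATION = 0.5      # stay at <= 50% utilization to avoid latency jumps
    TERMS_PER_QUERY = 3        # average query length (assumption from the bullets)
    DISK_SEEK_SECONDS = 0.010  # average random-access latency of one HDD

    # Service rate required so the offered 100 qps is at most 50% of capacity.
    required_service_qps = TARGET_QPS / MAX_UTILIZATION              # 200 qps

    # With document partitioning, every node sees every query, and each query
    # costs one posting-list lookup (random access) per term on every node.
    seeks_needed_per_node = required_service_qps * TERMS_PER_QUERY   # 600 per second

    # One HDD delivers roughly 1 / 0.010 = 100 random accesses per second.
    seeks_per_disk = 1.0 / DISK_SEEK_SECONDS                         # 100 per second

    disks_needed_per_node = seeks_needed_per_node / seeks_per_disk

    print(f"required service rate: {required_service_qps:.0f} qps")
    print(f"random accesses needed per node: {seeks_needed_per_node:.0f}/s")
    print(f"one disk delivers: {seeks_per_disk:.0f} seeks/s")
    print(f"each node would need ~{disks_needed_per_node:.0f} disks (or RAM/SSD)")

Running this reproduces the numbers in the answer: 600 seeks per second needed against 100 per disk, i.e. about six disks per node, which is why in-memory or SSD indexes (or a different partitioning scheme) are preferred.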
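To make the contrast between the two partitioning schemes concrete, here is a minimal sketch of how a single query fans out under each one. The node count, the hash-based term-to-node assignment, and the function names are illustrative assumptions, not part of the original answer.

    import zlib

    NUM_NODES = 8  # illustrative cluster size

    def nodes_for_query_doc_partitioned(query_terms):
        # Each node indexes a subset of the documents, so every node must
        # evaluate every query and return its local top results.
        return set(range(NUM_NODES))

    def nodes_for_query_term_partitioned(query_terms):
        # Each node holds the full posting lists for a subset of the terms,
        # so a query only touches the nodes that own its terms. A CRC-based
        # hash assignment is used here purely for illustration.
        return {zlib.crc32(t.encode()) % NUM_NODES for t in query_terms}

    query = ["cheap", "flight", "tickets"]
    print("document partitioning contacts:", sorted(nodes_for_query_doc_partitioned(query)))
    print("term partitioning contacts:    ", sorted(nodes_for_query_term_partitioned(query)))

Under document partitioning the 3-term query hits all 8 nodes; under term partitioning it hits at most 3, so no node has to serve every query's seeks.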

Team Members
[*]Avinash Anantharamu
[*]Chetan Sharma
[*]Lok Kei Leong