2012-12-10

Practice Final Solution for Question 6.

Originally Posted By: avinash anantharamu
Distinguish intra-query and inter-query parallelism in the context of information retrieval.

What bottlenecks exist for intra-query parallelism if partition-by-document is used?

[*]Inter-query parallelism runs different queries concurrently, each handled entirely by one node or replica; intra-query parallelism splits the work of a single query across several nodes, as partition-by-document does when every node searches its own slice of the collection.
[*]Document partitioning works best when the index data on the individual nodes can be stored in main memory or on SSD.
[*]To see what happens when the index data is instead stored on disk, suppose queries contain on average 3 words and we want the search engine to handle 100 queries per second.
[*]Due to queueing effects, we typically cannot run at more than about 50% utilization without latency jumping.
[*]So a query load of 100 qps translates to a required service rate of at least 200 qps. With 3-word queries, each needing one posting-list lookup per term, this means at least 600 random access operations per second on every node (see the sketch after this list).
[*]Assuming an average disk latency of 10 ms, a single hard disk drive cannot perform more than about 100 random access operations per second, one sixth of what we need on each of our nodes.
[*]Adding more machines doesn't help, because with document partitioning every node still sees every query, so the per-node seek rate stays the same; this is a bottleneck for the document partitioning approach.
[*]Term partitioning addresses this problem by splitting the index into sets of terms and assigning each set to a node.
[*]This resolves the problem above because a node only participates in queries that contain one of its terms, so no single node has to handle every query (see the second sketch below).
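The disk-seek arithmetic in the bullets above can be checked with a short back-of-envelope calculation. This is only a sketch of the reasoning; the constants (100 qps target, 50% utilization cap, 3 terms per query, 10 ms average seek time) are the assumptions stated in the bullets, not measured numbers.

    # Back-of-envelope check of the disk bottleneck described above.
    TARGET_QPS = 100           # query load the engine must sustain
    MAX_UTILIZATION = 0.5      # stay at <= 50% utilization to avoid latency jumps
    TERMS_PER_QUERY = 3        # average query length (assumption from the bullets)
    DISK_SEEK_SECONDS = 0.010  # average random-access latency of one HDD

    # Service rate required so the offered 100 qps is at most 50% of capacity.
    required_service_qps = TARGET_QPS / MAX_UTILIZATION              # 200 qps

    # With document partitioning, every node sees every query, and each query
    # costs one posting-list lookup (random access) per term on every node.
    seeks_needed_per_node = required_service_qps * TERMS_PER_QUERY   # 600 per second

    # One HDD delivers roughly 1 / 0.010 = 100 random accesses per second.
    seeks_per_disk = 1.0 / DISK_SEEK_SECONDS                         # 100 per second

    disks_needed_per_node = seeks_needed_per_node / seeks_per_disk

    print(f"required service rate: {required_service_qps:.0f} qps")
    print(f"random accesses needed per node: {seeks_needed_per_node:.0f}/s")
    print(f"one disk delivers: {seeks_per_disk:.0f} seeks/s")
    print(f"each node would need ~{disks_needed_per_node:.0f} disks (or RAM/SSD)")

Running this reproduces the numbers in the answer: 600 seeks per second needed against 100 per disk, i.e. about six disks per node, which is why in-memory or SSD indexes (or a different partitioning scheme) are preferred.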
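To make the contrast between the two partitioning schemes concrete, here is a minimal sketch of how a single query fans out under each one. The node count, the hash-based term-to-node assignment, and the function names are illustrative assumptions, not part of the original answer.

    import zlib

    NUM_NODES = 8  # illustrative cluster size

    def nodes_for_query_doc_partitioned(query_terms):
        # Each node indexes a subset of the documents, so every node must
        # evaluate every query and return its local top results.
        return set(range(NUM_NODES))

    def nodes_for_query_term_partitioned(query_terms):
        # Each node holds the full posting lists for a subset of the terms,
        # so a query only touches the nodes that own its terms. A CRC-based
        # hash assignment is used here purely for illustration.
        return {zlib.crc32(t.encode()) % NUM_NODES for t in query_terms}

    query = ["cheap", "flight", "tickets"]
    print("document partitioning contacts:", sorted(nodes_for_query_doc_partitioned(query)))
    print("term partitioning contacts:    ", sorted(nodes_for_query_term_partitioned(query)))

Under document partitioning the 3-term query hits all 8 nodes; under term partitioning it hits at most 3, so no node has to serve every query's seeks.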

Team Members
[*]Avinash Anantharamu
[*]Chetan Sharma
[*]Lok Kei Leong