2016-11-03

Clarification re: Queue Servers.

Hi there....
I wanted to clarify the role of the queue servers. As I understand it, the queue server stores the URLs and tells the fetchers to retrieve them. When working with multiple queue servers, how do the queue servers coordinate or synchronize amongst themselves? I saw in the docs that RSS needs to be enabled for this to function. If I have four queue servers and I add my list of URLs to one of them, does it somehow replicate across the rest? I'm assuming that you do not need to split the URLs manually on a per-queue-server basis?
Sorry for all the questions - just wanted to understand better.
Thank you!

-- Clarification re: Queue Servers
Queue Servers manage both URLs and portions of indexes. The name server keeps track of which queue servers are present for a given crawl. To keep a crawl consistent (i.e., working), you should not change the number of queue servers after the crawl starts, because the hash partitioning of URLs and index portions depends on that number. You can change the number of fetchers, though. Fetchers learn which queue servers exist by talking to the name server. When a fetcher discovers a new URL, it computes the URL's host, hashes it, and sends the URL to a queue server based on that hash, with different queue servers being responsible for different hash ranges. Since the URLs to download for a given host are always handled by the same queue server, robots.txt files can be obeyed, including directives like crawl-delay. A sketch of this routing appears below.
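
To make the host-hash routing concrete, here is a minimal sketch in Python. The server list, the helper name assign_queue_server, and the use of MD5 are illustrative assumptions, not the actual implementation; in practice the fetcher gets the server list from the name server.

import hashlib
from urllib.parse import urlparse

# Hypothetical queue server endpoints; a real fetcher would obtain
# this list from the name server when the crawl starts.
QUEUE_SERVERS = [
    "http://queue0.example.com",
    "http://queue1.example.com",
    "http://queue2.example.com",
    "http://queue3.example.com",
]

def assign_queue_server(url):
    """Route a URL to a queue server by hashing its host."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # The first 8 bytes of the digest, taken modulo the number of
    # servers, partition the hash space so each server owns one slice.
    index = int.from_bytes(digest[:8], "big") % len(QUEUE_SERVERS)
    return QUEUE_SERVERS[index]

# Both URLs share the host example.org, so they land on the same
# queue server, which can then enforce that host's crawl-delay.
print(assign_queue_server("http://example.org/page1"))
print(assign_queue_server("http://example.org/page2"))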