-- Performance Increases? (crawling)
I am guessing this is for the Canadian crawl you mentioned and that it is being done on a single machine. The numbers look like the ones I got for findcan.ca crawls where I had 1 queue server and 1 fetcher. I am not sure what your hardware is.
findcan is this machine:
[[https://www.amazon.com/gp/product/B0195XYOV8/ref=oh_aui_detailpage_o06_s00?ie=UTF8&psc=1|https://www.amazon.com/gp/product/B0195XYOV8/ref=oh_aui_detailpage_o06_s00?ie=UTF8&psc=1]]
It has 16GB of RAM and a 2TB SSD, of which less than half is used. In my experience, crawler speed went up when I moved to 3-4 fetchers; the fastest speeds I saw were 80,000 pages/hour. After crawling for a week, though, and not allowing re-visits of pages, my speeds dropped to about 20,000 pages/hour. My suspicion is that restricting which pages could be crawled, together with not allowing recrawls, is what caused the drop-off.

Past a certain amount of RAM, crawl speed depends more on the number of CPU cores than on memory. Most of the time it is evaluating regexes in the fetchers that eats up the CPU, although creating new schedules and adding index data in a queue server is also computationally intensive, just less frequent. So I would suggest not running more fetchers than the number of cores on your machines, and trying to get away with as few queue servers as possible. A single queue server should be able to handle 100,000-200,000 page downloads/hour without much problem, with variation in that range depending on how old your machines are (my machines for yioop.com are from 2011, with 2 cores and 8GB of RAM).
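As a rough sizing check along the lines above (this sketch is mine, not from the original post, and PLANNED_FETCHERS is just a made-up example value), you can compare your planned fetcher count against the machine's core count:

```shell
# Rule of thumb from the advice above: run at most one fetcher per core.
# PLANNED_FETCHERS is an illustrative value; set it to your own plan.
PLANNED_FETCHERS=4
CORES=$(nproc)
if [ "$PLANNED_FETCHERS" -gt "$CORES" ]; then
    echo "Warning: $PLANNED_FETCHERS fetchers on $CORES cores; consider fewer."
else
    echo "OK: $PLANNED_FETCHERS fetchers on $CORES cores."
fi
```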
One important issue to keep an eye on is CPU temperature. I installed the sensors command line utility to monitor this, and noticed some of my machines would shut down from overheating. If a machine is running too hot, you can try increasing FETCHER_PROCESS_DELAY, either in a LocalConfig.php file or directly in Config.php.
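For reference, a LocalConfig.php override might look like the sketch below. The 0.5 second value is only a placeholder to illustrate the idea, not a recommended setting, and depending on your Yioop version the constant may need to be declared with the same define helper that Config.php itself uses, so check how Config.php sets the default.

```php
<?php
// LocalConfig.php -- local overrides that take precedence over Config.php.
// Lengthening the delay between fetcher processing loops lowers CPU load
// (and so heat) at the cost of crawl speed. 0.5 seconds is only an example.
define('FETCHER_PROCESS_DELAY', 0.5);
```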