2016-11-03

Performance Increases? (crawling).

Hi again.... thanks Chris for all your suggestions/help!
I've been running a crawl now for about 18 hours as a test .. one name server/queue server and three crawlers (each with 4 spiders):
Indexer Peak Memory: 1869911864
Scheduler Peak Memory: 724183248
Fetcher Peak Memory: 301186760
Web App Peak Memory: 54172176
Visited Urls/Hour: 22629.14
Visited Urls Count: 433239
Total Urls Seen: 9028098
Does this seem average, and do you have any suggestions for increasing the number of URLs per hour?
Thanks! Paul

-- Performance Increases? (crawling)
I am guessing this is for the Canadian crawl you mentioned and that this is being done on a single machine. The numbers seem like the numbers I have gotten for findcan.ca crawls where I had 1 queue server and 1 fetcher. I am not sure about your hardware. findcan is [[https://www.amazon.com/gp/product/B0195XYOV8/ref=oh_aui_detailpage_o06_s00?ie=UTF8&psc=1|https://www.amazon.com/gp/product/B0195XYOV8/ref=oh_aui_detailpage_o06_s00?ie=UTF8&psc=1]] with 16GB of RAM and a 2TB SSD, of which less than half is used. In my experience, I was able to get the crawl speed to go up by going to 3-4 fetchers, and the fastest speeds I saw were 80,000 pages/hour. After crawling for a week, though, and not allowing re-visits of pages, my speeds dropped to about 20,000. My suspicion is that restricting the pages that could be crawled, together with not allowing recrawls, is what caused the drop-off.
After a certain amount of RAM, the speed of crawling is more contingent on the number of cores your CPU has than on the memory. Most of the time, it is evaluating regexes in the fetchers that sucks up the CPU, although creating new schedules and adding index data in a queue server is also computationally intensive, but less frequent. So I would suggest not having more fetchers than the number of cores on your machines and trying to get away with as few queue servers as possible. A single queue server should be able to handle 100,000-200,000 page downloads/hour without much problem, with variation in that range based on how old your machines are (my machines for yioop.com are from 2011, and have 2 cores and 8GB of RAM).
One important issue to keep an eye out for is CPU temperature. I installed the sensors command line utility to monitor this, and noticed some of my machines would shut down from overheating. If a machine is running too hot, you can try adjusting FETCHER_PROCESS_DELAY in either a LocalConfig.php file or directly in Config.php.
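If you go the LocalConfig.php route, the override looks roughly like the following. This is a minimal sketch rather than the exact syntax for every Yioop version -- copy the define style (plain define vs. any namespaced helper) that your own Config.php uses for FETCHER_PROCESS_DELAY, and treat the 500000 value as purely illustrative:

    <?php
    // LocalConfig.php (sketch) -- constants set here are meant to take
    // precedence over the defaults in Config.php. Check your Config.php for
    // where it expects this file to live and how it defines its constants.
    //
    // Roughly speaking, FETCHER_PROCESS_DELAY adds a delay to the fetcher's
    // work loop, so a larger value trades crawl speed for a cooler CPU.
    // The units and default are in Config.php; 500000 is only an example.
    define("FETCHER_PROCESS_DELAY", 500000);

Bump it up, watch sensors for a while, and back it off once the temperatures stay in a safe range.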