2019-04-05

Crawl Delay.

Hi Chris .. I know that Yioop will honor crawl delay if present in robots.txt. However, I'm looking to make sure I can enforce a certain level of politeness for those websites that don't have a crawl delay entry in their robots.txt.
As I understand, in config.php there are these parameters by default:
    /**
     * Delay in microseconds between processing pages to try to avoid
     * CPU overheating. On some systems, you can set this to 0.
     */
    nsconddefine('FETCHER_PROCESS_DELAY', 10000);
    /** number of multi curl page requests in one go */
    nsconddefine('NUM_MULTI_CURL_PAGES', 100);
I just wanted to confirm that these are used by default, and that this would translate to 10 seconds before the fetcher contacts the next URL from the same domain? I believe, according to the docs, that the same fetcher will always be assigned to the same domain, so I don't need to worry about multiple fetchers, but rather about the delay towards the same website?
Hope my question makes sense .. I just don't want to hammer any websites and get banned :)

-- Crawl Delay
Those two constants have nothing to do with crawl delay. NUM_MULTI_CURL_PAGES refers to the number of URLs a fetcher will try to download at the same time (using threads). Suppose it is 100. Then Yioop might download a batch of 100 pages in one go, then process those pages, then download the next batch of 100 pages, and so on. When it processes pages, extracting content, etc., it does so serially in one thread. Between pages it sleeps FETCHER_PROCESS_DELAY microseconds. So if FETCHER_PROCESS_DELAY is 10000, then between processing pages it sleeps for 1/100 of a second, not 10 seconds. This prevented overheating on one of my older Linux boxes.
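In rough pseudo-PHP, the cycle looks something like this (downloadBatch() and processPage() are placeholder names for illustration, not actual Yioop functions):
    // Sketch of the fetch/process cycle described above, not the real
    // Fetcher.php code. Constants stand in for the config.php values.
    while (!empty($url_queue)) {
        // take up to NUM_MULTI_CURL_PAGES urls and download them in
        // parallel using multi curl
        $batch = array_splice($url_queue, 0, NUM_MULTI_CURL_PAGES);
        $pages = downloadBatch($batch); // placeholder for the multi curl step
        // process the downloaded pages one at a time in a single thread
        foreach ($pages as $page) {
            processPage($page); // placeholder: extract text, links, etc.
            // pause between pages; 10000 microseconds = 1/100 second
            usleep(FETCHER_PROCESS_DELAY);
        }
    }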
The following lines in Fetcher.php try to detect hosts that are getting swamped:
    if (($response_code >= 400 && $response_code != 404) ||
        $response_code < 100) {
        // < 100 will capture failures to connect which are returned
        // as strings
        $was_error = true;
        $this->hosts_with_errors[$host]++;
    }
If a host has more than DOWNLOAD_ERROR_THRESHOLD errors, then it is treated as if it had a crawl delay of ERROR_CRAWL_DELAY.
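Roughly, that rule amounts to something like the following sketch (a paraphrase of the idea, not the exact Fetcher.php source):
    // Paraphrase only: once a host has accumulated too many download
    // errors, schedule its urls as if robots.txt had given a crawl delay.
    $num_errors = isset($this->hosts_with_errors[$host]) ?
        $this->hosts_with_errors[$host] : 0;
    if ($num_errors > DOWNLOAD_ERROR_THRESHOLD) {
        $crawl_delay = max($crawl_delay, ERROR_CRAWL_DELAY);
    }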
Best,
Chris
(Edited: 2019-04-05)