-- What is the robot program (bot) responsible for downloading web pages with cURL, and how can Yioop be modified so that it crawls the web continuously instead of sporadically?
The main program that downloads web pages is:
src/executables/Fetcher.php
the code that actually does the downloading is in:
src/library/FetchUrl.php (getPages method and getPage method).
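To illustrate the underlying technique, here is a minimal sketch of batch downloading with PHP's curl_multi API, which is the kind of concurrent cURL approach a method like getPages() relies on. This is not Yioop's actual code; the function name, user-agent string, and option choices are illustrative assumptions.

```php
<?php
// Hypothetical sketch: download several URLs concurrently with curl_multi.
// Not Yioop's implementation -- just the general cURL batching pattern.
function fetchPages(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,  // return body rather than printing it
            CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
            CURLOPT_TIMEOUT => 10,           // per-transfer timeout, in seconds
            CURLOPT_USERAGENT => 'ExampleBot/0.1', // hypothetical agent string
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi); // wait for socket activity, avoid busy-looping
        }
    } while ($active && $status === CURLM_OK);
    // Collect each page body, keyed by its URL.
    $pages = [];
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $pages;
}
```

The advantage of curl_multi over looping with curl_exec() is that all transfers proceed in parallel, so a batch of slow hosts does not serialize the whole fetch.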
Yioop, as it is written, starts crawling when you start a crawl under Manage Crawls and stops crawling when you click stop or it has exhausted all the urls it could find that match your search criteria. Up until 2015, I did progressively longer and longer crawls, trying to improve the crawling code. By the end of 2015, I had a billion page crawl. I wanted to then both improve the code to make the indexing faster and also wait until SSD prices came down, so that I could hold a billion pages all on SSD and the search results would be usably fast. Since then, I have been doing smaller scale crawls on findcan.ca to test improvements to how the software scrapes data from individual pages. As SSD prices are not dropping as fast as I'd like, my next goal for a future version of Yioop is to improve index compression. I agree that a continuous crawling setup could be cool.
Best,
Chris