2017-07-18

Very big crawl.ini.

Hello, Chris! I got all the domains in the .ru zone and put all of them into the crawl.ini file in the work_directory folder. The problem is that crawl.ini is now more than 300 MB and Yioop cannot read such a huge file. How did you solve this problem? The documentation states that you have an index of about one billion pages.

-- Very big crawl.ini
Yes, a billion page crawl was done. However, that crawl used only a few hundred seed sites. In the way I've been using Yioop for web crawls, it usually discovers new urls and crawls them as it goes. Page discovery order also plays a role in how pages are ranked in Yioop. If you want Yioop to crawl a large number of urls (say millions) in a fixed order, you could probably write a sequence of "At....txt" files in work_directory/schedules/ScheduleDataTIMESTAMP, where TIMESTAMP is the timestamp of the crawl you want the data to go to. These files are typically created with data from fetchers as they discover new urls, so they would be tricky to create by hand, but doable with a short script.
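For concreteness, here is a rough sketch of what such a script might look like in PHP. The file and folder naming mimics what fetchers actually produce (At{TIME}From{IP}WithHash{HASH}.txt inside a day-number subfolder of the ScheduleDataTIMESTAMP folder), but the payload encoding below is only a placeholder guess; before relying on it, compare the output against a real fetcher-produced schedule file.

<?php
// Sketch only: writes a sequence of At....txt schedule files for an
// existing crawl. ASSUMPTIONS: the chunk size, input file name, and
// especially the per-line payload encoding (base64 of a serialized url
// record) are guesses to be verified against a real fetcher-produced file.

$crawl_timestamp = 1500661345; // timestamp of the crawl the data should go to
$schedule_dir = "work_directory/schedules/ScheduleData$crawl_timestamp";
$urls_per_file = 5000;         // arbitrary chunk size
$all_urls = file("ru_domains.txt",
    FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach (array_chunk($all_urls, $urls_per_file) as $i => $chunk) {
    $time = time() + $i;       // keep file times distinct and ordered
    $day = (int)($time / 86400); // schedule files live in day-number subfolders
    $folder = "$schedule_dir/$day";
    if (!file_exists($folder)) {
        mkdir($folder, 0777, true);
    }
    // The hash component of the name only needs to make it unique here;
    // strtr keeps the base64 output file-name safe.
    $hash = substr(strtr(base64_encode(md5(serialize($chunk), true)),
        "/+", "_-"), 0, 11);
    $name = "$folder/At{$time}From127-0-0-1WithHash{$hash}.txt";
    // GUESSED payload format: one base64-encoded serialized record per line.
    $out = "";
    foreach ($chunk as $url) {
        $out .= base64_encode(serialize([$url])) . "\n";
    }
    file_put_contents($name, $out);
}

The idea is that the queue server would then pick these files up in timestamp order and feed the crawl, just as it does with fetcher-produced ones.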
(Edited: 2017-07-18)
2017-07-21

-- Very big crawl.ini
Thanks, Chris, for your reply! But I don't clearly understand what should go in these "At....txt" files. I looked at one at work_directory/schedules/ScheduleData1500661345/17368/At1500661360From127-0-0-1WithHasho9vV9YUvZOQ.txt and its content (https://pastebin.com/tij6mWVP) is quite cryptic. Could you please provide more details about the content of these files?