-- Archive Crawl Question
Yes, to both questions. Each crawl (including archive crawls) has a timestamp. If it is a single machine crawl, then all index data for the crawl will be stored in a folder:
work_directory/cache/IndexDataTHE_TIMESTAMP.
If you crawl on one instance of Yioop, you can copy the resulting folder to another instance of Yioop, and that other instance can then serve search results for the crawl. This is useful if you want to serve search results from a shared hosting setup.
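As a rough sketch of that copy step in Python: the function name copy_index_folder is made up for illustration, and I am assuming the folder goes into the same cache subfolder of the target instance's work directory. Adjust the paths to your own installs.

```python
import shutil
from pathlib import Path


def copy_index_folder(src_work_dir, dst_work_dir, timestamp):
    """Copy cache/IndexData<timestamp> from one work directory to another.

    Hypothetical helper, not part of Yioop. Assumes the target instance
    expects the folder under its own work_directory/cache.
    """
    name = f"IndexData{timestamp}"
    src = Path(src_work_dir) / "cache" / name
    dst = Path(dst_work_dir) / "cache" / name
    if not src.is_dir():
        raise FileNotFoundError(f"no index folder at {src}")
    # copytree creates missing parent directories and raises FileExistsError
    # if dst already exists, so an existing index is never overwritten.
    shutil.copytree(src, dst)
    return dst
```

For example, copy_index_folder("/var/yioop_a/work_directory", "/var/yioop_b/work_directory", 1550000000), where both paths and the timestamp are placeholders.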
All queue data (what to crawl next) is kept in
work_directory/cache/QueueBundleTHE_TIMESTAMP.
Fetchers, the processes that actually download pages, keep their processing info in files of the form
work_directory/cache/FETCHER_NUM-ArchiveTHE_TIMESTAMP.
In the dev version of Yioop, one can have repeating crawls and multiple crawls running at the same time, so there is also a CHANNEL_NUM in front of the -Archive name, as well as a new kind of index folder, DoubleIndexData.
Data sent from fetchers but not yet processed by queue servers is held in
work_directory/schedules/ScheduleDataTHE_TIMESTAMP
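
To see everything that belongs to one crawl, the locations above can be collected with a small Python sketch. The helper crawl_paths is hypothetical and only follows the release naming described in this post; the glob for fetcher files is my guess at a pattern that matches any prefix before -Archive, and it does not look for DoubleIndexData folders.

```python
from pathlib import Path


def crawl_paths(work_dir, timestamp):
    """Return the on-disk locations associated with one crawl timestamp.

    Hypothetical helper, not part of Yioop. The paths are built from the
    naming scheme only; they are not checked for existence.
    """
    work = Path(work_dir)
    cache = work / "cache"
    return {
        "index": cache / f"IndexData{timestamp}",
        "queue": cache / f"QueueBundle{timestamp}",
        # One entry per fetcher, e.g. 0-Archive<timestamp>, 1-Archive<timestamp>.
        "fetcher_archives": sorted(cache.glob(f"*-Archive{timestamp}")),
        "schedule": work / "schedules" / f"ScheduleData{timestamp}",
    }
```

Only the fetcher entries come from what is actually on disk; the other three are the expected paths for that timestamp.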
Best,
Chris
(Edited: 2019-02-16)