Yioop - PHP Search Engine

2019-06-29

-- Help us answer some very important questions to fix some bugs in Yioop.

There are a lot of questions there. I will try to answer a few of them.

(1), (2), (4), (7), (13), (14) Images are indexed by the crawler using one of the processors in src/library/processors. For example, JPGs would be processed by JpgProcessor.php. Each of these extends the base class ImageProcessor.php, which has a createThumb method in it. This method uses the constant THUMB_DIM defined in src/configs/Config.php to determine the thumb size. Other parameters you were interested in are also set in this file. The THUMB_DIM value could be overridden to whatever you want by defining it in a file src/configs/LocalConfig.php.

Image, Video, and News all appear as links at the top of the web interface. Which links appear there is controlled by the Search Sources activity in the admin interface (i.e., if you log in as root) under the Add/Edit Subsearches form. Whether a web page counts as a page with a video is determined by a scraper called Video Site in the Web Scrapers activity. News is updated hourly by the src/executables/MediaUpdater.php process, which runs src/library/media_jobs/FeedsUpdateJob.php. This does the downloading and indexing of the news feed sites specified in the Search Sources activity. There, the language of the source can be selected. On Yioop.com, I have added feed sources for a variety of languages.

The main page that does the drawing of search results is src/views/SearchView.php. It makes use of a variety of helpers. For example, src/views/helpers/ImagesHelper.php is used to draw images in search results.

(3) If you see a location in the search result, it is because while crawling Yioop tried to download the given URL and got a redirect. It doesn't immediately download the redirect because it needs to make sure the redirected-to URL doesn't violate any robots.txt directives. So you'll see a result like that in the search results (assuming the site was deemed important enough that you see it in the rankings) until the site the redirect points to is downloaded.

(5) The file src/controllers/AdminController.php handles the logic behind all the admin functionality of Yioop. It is a composite object which has on it several components responsible for groups of activities. For example, the logic for the Crawls group of activities is provided by src/controllers/components/CrawlComponent.php. The drawing of the admin page is done by src/views/AdminView.php, making use of an element for the activity currently being displayed, for example, src/views/elements/ManageCrawls.php.

(6) If you check the unknown file extension box, Yioop will try to download and index a file even if it doesn't know the file extension/mimetype. To add a new extension/mimetype that Yioop can handle, you need to add a new page processor for that type. You can look at some of the existing processors in src/library/page_processors to see how that is done. In particular, it is via the constructor of such a file and the file work_directory/crawl.ini that the list of extensions is set up.

(8) If you read https://www.seekquarry.com/p/Ranking it actually lists the files that are responsible for different parts of the rankings.

(9) The logic for Security is handled in src/controllers/AdminController.php and src/controllers/components/SystemComponent.php (the security() method). The activity is drawn by src/views/elements/SecurityElement.php.

(10) Yioop doesn't support Wikipedia API crawls. You can get Yioop to index a whole data dump of Wikipedia using the Archive Crawl mechanism (do CTRL-F on the Yioop documentation to search for Archive crawl).

(11) You can restrict a search to a particular file type using the filetype: meta word. Using this you could add a Subsearch for PDF or Doc.

(12) I am not sure what this question is asking. The search APIs are described in the Yioop documentation under Building Sites with Yioop.

Anyway, that was a lot of questions. Hope some of the responses help.

Best,

Chris
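
To illustrate the redirect behaviour described in (3), here is a minimal Python sketch of deciding what to do with a redirect target depending on whether the robots.txt rules for its host are known yet. This is not Yioop's actual code (Yioop is written in PHP). The function name redirect_decision, the ExampleBot user agent, and the example.com URLs are hypothetical and only for illustration. The only real API used is Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser


def redirect_decision(robots_lines, user_agent, target_url):
    """Decide what to do with a redirect target (hypothetical helper).

    robots_lines is the target host's robots.txt as a list of lines,
    or None when that robots.txt has not been downloaded yet.
    """
    if robots_lines is None:
        # Rules unknown: do not follow the redirect yet. The crawler would
        # keep the placeholder result and schedule the target for later.
        return "defer"
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if parser.can_fetch(user_agent, target_url):
        return "fetch"
    return "skip"


rules = ["User-agent: *", "Disallow: /private/"]
print(redirect_decision(None, "ExampleBot", "https://example.com/new"))
print(redirect_decision(rules, "ExampleBot", "https://example.com/new"))
print(redirect_decision(rules, "ExampleBot", "https://example.com/private/x"))
```

The "defer" case corresponds to the period in which the placeholder location result can show up in the search results: the redirect is known, but its target has not been cleared against robots.txt and downloaded yet.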