Exclude invalid or connection-affected links (HTTP 404) from crawling

A quick, recommended solution is to open the GrabMap for the site whose invalid page links you want to exclude, set those pages to “No Index, No Follow”, and then run the crawler and publish.

For example, if you want to exclude the invalid web pages for the crawled Host “www.example.com”, proceed with the steps below:

1. In the InSite administrative site, go to Crawler Setup > Hosts > GrabMap.
2. Select the host for which you want to remove the invalid pages (in this case “www.example.com”).
3. Notice the checkboxes that have red labels and select only the “Connection” checkbox. The number of invalid pages is shown in parentheses next to the “Connection” label.
4. Click the “Update” button and then the “Show Pages” button at the bottom of the GrabMap page.
5. Follow the folder path to the invalid pages by clicking the yellow folder icons. Each folder that has invalid pages will have a number value in the count field.
6. At the page level (pages can be identified by the folded yellow icon on the right side and the underlined link/URL), the invalid pages show the following red text in their “Last Grab” field: “HTTP 404 returned – Not found”.
For each of these pages, select the “No Index No Follow” option from the drop-down list and click the “Save” button at the bottom of the page. This ensures that the crawler ignores these pages the next time it runs.
7. Run the crawler and publish the content.
After the new publish, the invalid pages should no longer appear in the search results.
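
Before marking pages in the GrabMap, it can help to confirm outside the crawler that they really return HTTP 404 or do not respond at all. The following is a minimal sketch in Python using only the standard library. It is not part of InSite or MondoSearch, and the function names `http_status` and `find_invalid` are made up for this illustration. It assumes the server answers HEAD requests, which some servers do not (they may return 405 instead).

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    # HEAD asks for the headers only, so the page body is not downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def find_invalid(urls, get_status=http_status):
    """Return (url, status) pairs for URLs that answer 404 or do not answer."""
    invalid = []
    for url in urls:
        status = get_status(url)
        if status is None or status == 404:
            invalid.append((url, status))
    return invalid
```

Calling `find_invalid` with the URLs listed under the “Connection” filter returns only those that still fail, with a status of `None` where no response was received. The `get_status` parameter exists so that the check can be replaced, for example with a lookup table when trying the function without network access.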

An alternative is to remove the databases from the *SearchHost*\Data folder and generate new databases for MondoSearch. This requires setting up the search engine again and recrawling the content you want in the database.
