Web Robot

Any piece of code that automatically sends HTTP requests.

Reasons to have one:

  • grab contents of non-local pages to populate a Search Engine

  • grab pages for Off-Line browsing

  • scan a list of pages to find which ones have changed (to then alert the robot's user)

  • does an RssAggregator count? It probably should, especially if it also grabs non-RSS URLs based on RSS contents.

There are Game Rule-s for robots to follow (Robot Exclusion):

  • http://www.robotstxt.org/wc/faq.html

  • Robots Txt file

    • the problem with this is that it defines the protected URL-space hierarchically, by path prefix. That can be sufficient for content-type URLs, but is less likely to work for more dynamic pages (that's a rough distinction, but I won't refine it for now). (A robots.txt-checking sketch appears after this list.)

    • another case where it's a problem - maybe you want to block most agents (like RssAggregator-s) from grabbing your full content, but let them take your RSS file.

  • RobotsMetaTags (No Index, No Follow)

    • the problem is that by the time the robot reads the tag, it has already hit that given page, which is bad if such a page is processor-intensive (a tag-checking sketch appears after this list)
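
Here is a minimal sketch of honoring Robots Txt from the crawler side, using Python's standard-library urllib.robotparser; the agent name and URL are placeholders, and a real robot would cache the parsed robots.txt per host instead of re-fetching it for every URL.

```python
from urllib.parse import urlsplit
import urllib.robotparser

USER_AGENT = "ExampleRobot/0.1 (+http://example.com/robot)"  # placeholder agent name

def allowed_to_fetch(url: str) -> bool:
    """Check the target site's robots.txt before fetching a URL."""
    parts = urlsplit(url)
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(USER_AGENT, url)

print(allowed_to_fetch("http://example.com/some/page"))
```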

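And a rough sketch of respecting RobotsMetaTags after a page has already been fetched (exactly the weakness noted above), using Python's standard html.parser; the sample page string is made up for the example.

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

# usage: decide whether to index the page or follow its links
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
scanner = RobotsMetaScanner()
scanner.feed(page)
may_index = "noindex" not in scanner.directives
may_follow = "nofollow" not in scanner.directives
```
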
Some additional rules I think are appropriate:

  • a single required substring ("robot"?) to include somewhere in the User Agent HTTP header, so that a dynamic server can take that into account without having to maintain a list of individual robot agent names (a server-side check is sketched after this list)

  • an HREF No Follow attribute similar to the RobotsMetaTags, so that a given page could include some HREFs a robot is welcome to follow while excluding others (a link-filtering sketch appears after this list)

  • a Site List Txt URL to list the URLs on a site in reverse-mod-date order (most recently changed at top), with a last-mod-date-time attribute included. This allows a robot to avoid grabbing pages which haven't changed since its last visit. (A detail still to handle: dead URLs, so they can be removed from a Search Engine.) (A parsing sketch appears after this list.)

  • some way for a robot to recognize that multiple hosts/domains might be served by the same piece of hardware, so it can cap its request-rate at the hardware level (a per-address throttling sketch appears after this list). (Dave Winer has historically complained about this.)

    • in general, some way for a host to request limits on how often its content gets grabbed by any given robot - this is particularly an issue with RssAggregator-s, where some people want to grab every 5 minutes
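
On the server side, the required-substring idea might look something like the following sketch; the "robot" marker and the header dict are assumptions, not an established convention.

```python
ROBOT_MARKER = "robot"  # the single required substring proposed above (an assumption)

def is_robot_request(headers: dict) -> bool:
    """True if the User-Agent header identifies the client as a robot."""
    user_agent = headers.get("User-Agent", "")
    return ROBOT_MARKER in user_agent.lower()

# a dynamic server could then serve a cheaper, cached rendering to robots
if is_robot_request({"User-Agent": "ExampleRobot/0.1 (+http://example.com/robot)"}):
    print("serve static snapshot instead of hitting the database")
```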

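A robot could honor a per-link no-follow marker roughly like this; rel="nofollow" is used here as an assumed spelling of the attribute, again with the standard html.parser.

```python
from html.parser import HTMLParser

class FollowableLinkCollector(HTMLParser):
    """Collect HREFs, skipping links the page marks as not-to-be-followed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if href and "nofollow" not in rel:
            self.links.append(href)

collector = FollowableLinkCollector()
collector.feed('<a href="/ok">fine</a> <a rel="nofollow" href="/skip">skip me</a>')
print(collector.links)  # ['/ok']
```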

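The Site List Txt idea might look something like this; the newest-first, tab-separated "timestamp<TAB>URL" format is invented for the sketch, and handling dead URLs is left out.

```python
from datetime import datetime, timezone

def changed_since(sitelist_text: str, last_visit: datetime):
    """Yield URLs changed since the robot's last visit.

    Assumed format, one entry per line, most recently changed first:
        2003-06-01T12:30:00Z<TAB>http://example.com/some/page
    Because entries are newest-first, we can stop at the first stale one.
    """
    for line in sitelist_text.splitlines():
        if not line.strip():
            continue
        stamp, url = line.split("\t", 1)
        modified = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if modified <= last_visit:
            break  # everything below this point is older than the last visit
        yield url

sample = ("2003-06-01T12:30:00Z\thttp://example.com/new\n"
          "2003-05-01T08:00:00Z\thttp://example.com/old")
print(list(changed_since(sample, datetime(2003, 5, 15, tzinfo=timezone.utc))))  # new only
```
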
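Finally, a rough sketch of capping the request rate at the hardware level: resolve each hostname to an IP address and throttle per address rather than per domain, so virtual hosts sharing one box aren't hammered in parallel. The 30-second interval is arbitrary.

```python
import socket
import time
from urllib.parse import urlsplit

MIN_INTERVAL = 30.0  # seconds between requests to the same IP address (arbitrary)
_last_hit = {}       # IP address -> time of the most recent request

def wait_for_turn(url: str) -> None:
    """Sleep long enough that requests to one physical host stay spaced out."""
    host = urlsplit(url).hostname
    ip = socket.gethostbyname(host)  # many domains may resolve to the same box
    elapsed = time.time() - _last_hit.get(ip, 0.0)
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_hit[ip] = time.time()

# call wait_for_turn(url) immediately before each fetch
```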