allow_url_revisit allows multiple downloads of the same URL.
allowed_domains is a domain whitelist. Leave it blank to allow any domain to be visited.
cache_dir specifies a location where GET requests are cached as files. When it's not defined, caching is disabled.
check_head performs a HEAD request before every GET to pre-validate the response.
detect_charset enables character encoding detection for non-UTF-8 response bodies that lack an explicit charset declaration.
disallowed_domains is a domain blacklist.
disallowed_url_filters is a list of regular expressions which restricts visiting URLs. If any of the rules matches a URL, the request will be stopped. disallowed_url_filters will be evaluated before url_filters (see the sketch after this list). Leave it blank to allow any URL to be visited.
id is the unique identifier of a crawler.
ignore_robots_txt allows the Crawler to ignore any restrictions set by the target host's robots.txt file. See http://www.robotstxt.org/ for more information.
max_body_size is the limit on the size of the retrieved response body, in bytes. 0 means unlimited. The default value for max_body_size is 10MB (10 * 1024 * 1024 bytes).
max_depth limits the recursion depth of visited URLs. Set it to 0 for infinite recursion (default).
parse_http_error_response allows parsing HTTP responses with non-2xx status codes. By default, only successful HTTP responses are parsed; set parse_http_error_response to true to parse error responses as well.
url_filters is a list of regular expressions which restricts visiting URLs. If any of the rules matches a URL, the request won't be stopped. disallowed_url_filters will be evaluated before url_filters. Leave it blank to allow any URL to be visited.
user_agent is the User-Agent string used by HTTP requests.
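The following sketch illustrates the evaluation order documented above for disallowed_url_filters and url_filters. It is an illustrative Go re-implementation under assumed semantics (a disallowed rule stops the request on any match; a non-empty url_filters list must match at least once), not the crawler's actual code; the function name allowedByFilters and the sample patterns are invented for this example.

    package main

    import (
        "fmt"
        "regexp"
    )

    // allowedByFilters sketches the documented evaluation order:
    // disallowed_url_filters are checked first and stop the request on any
    // match; url_filters are checked next and, when non-empty, must match
    // at least once. Empty lists allow any URL.
    func allowedByFilters(url string, disallowed, allowed []*regexp.Regexp) bool {
        for _, re := range disallowed {
            if re.MatchString(url) {
                return false // a disallowed rule matched: the request is stopped
            }
        }
        if len(allowed) == 0 {
            return true // no url_filters configured: any URL may be visited
        }
        for _, re := range allowed {
            if re.MatchString(url) {
                return true // a url_filters rule matched: the request is not stopped
            }
        }
        return false
    }

    func main() {
        disallowed := []*regexp.Regexp{regexp.MustCompile(`\.pdf$`)}
        allowed := []*regexp.Regexp{regexp.MustCompile(`^https://example\.com/docs/`)}
        fmt.Println(allowedByFilters("https://example.com/docs/intro.html", disallowed, allowed)) // true
        fmt.Println(allowedByFilters("https://example.com/docs/manual.pdf", disallowed, allowed)) // false
    }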
The options above modify the Crawler's behavior.
Example
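A minimal sketch of how the documented options might be collected into a single configuration value. The CrawlerOptions struct, its field names, and the sample values are hypothetical, chosen only to mirror the option names above; the project's real configuration type may differ.

    package main

    import (
        "fmt"
        "regexp"
    )

    // CrawlerOptions is a hypothetical struct that mirrors the options
    // documented above; the real configuration type and field names may differ.
    type CrawlerOptions struct {
        AllowURLRevisit        bool
        AllowedDomains         []string
        CacheDir               string
        CheckHead              bool
        DetectCharset          bool
        DisallowedDomains      []string
        DisallowedURLFilters   []*regexp.Regexp
        ID                     string
        IgnoreRobotsTxt        bool
        MaxBodySize            int
        MaxDepth               int
        ParseHTTPErrorResponse bool
        URLFilters             []*regexp.Regexp
        UserAgent              string
    }

    func main() {
        opts := CrawlerOptions{
            AllowedDomains: []string{"example.com"}, // whitelist a single domain
            CacheDir:       "./cache",               // cache GET responses on disk
            DetectCharset:  true,                    // sniff encodings for non-UTF-8 bodies
            DisallowedURLFilters: []*regexp.Regexp{
                regexp.MustCompile(`/login`), // never visit login pages
            },
            MaxBodySize: 10 * 1024 * 1024,       // 10MB, the documented default
            MaxDepth:    2,                      // follow links at most two levels deep
            UserAgent:   "example-crawler/1.0",  // hypothetical User-Agent string
        }
        fmt.Printf("%+v\n", opts)
    }

Depending on how the crawler is actually configured (command-line flags, a configuration file, or a programmatic API), the same values would be expressed in that form instead.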