Interface Options

Options that modify the Crawler's behavior.

const c = new Crawler({ max_depth: 2, parse_http_error_response: true });

interface Options {
    allow_url_revisit: boolean;
    allowed_domains: string[];
    cache_dir: string;
    check_head: boolean;
    detect_charset: boolean;
    disallowed_domains: string[];
    disallowed_url_filters: string[];
    id: number;
    ignore_robots_txt: boolean;
    max_body_size: number;
    max_depth: number;
    parse_http_error_response: boolean;
    url_filters: string[];
    user_agent: string;
}

Properties

allow_url_revisit: boolean

allow_url_revisit allows multiple downloads of the same URL.

allowed_domains: string[]

allowed_domains is a domain whitelist. Leave it empty to allow any domain to be visited.
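
For example, to confine the crawl to a couple of hosts (the hostnames below are placeholders):

const c = new Crawler({ allowed_domains: ["example.com", "blog.example.com"] });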

cache_dir: string

cache_dir specifies a location where GET requests are cached as files. When it's not defined, caching is disabled.
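
A minimal sketch, using an arbitrary local path; any writable directory should work:

const c = new Crawler({ cache_dir: "./request-cache" });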

check_head: boolean

check_head performs a HEAD request before every GET to pre-validate the response.

detect_charset: boolean

detect_charset enables character encoding detection for non-UTF-8 response bodies that lack an explicit charset declaration.

disallowed_domains: string[]

disallowed_domains is a domain blacklist.
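
For example, to blacklist specific hosts while leaving allowed_domains empty (the hostnames are placeholders):

const c = new Crawler({ disallowed_domains: ["ads.example.com", "tracker.example.net"] });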

disallowed_url_filters: string[]

disallowed_url_filters is a list of regular expressions that restrict visiting URLs. If any of the rules matches a URL, the request is stopped. disallowed_url_filters is evaluated before url_filters. Leave it empty to allow any URL to be visited.
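
Since the interface declares the filters as string[], a sketch can pass regular-expression patterns as strings; the patterns below are illustrative:

const c = new Crawler({ disallowed_url_filters: ["\\.pdf$", "/login"] });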

id: number

id is the unique identifier of a crawler.

ignore_robots_txt: boolean

ignore_robots_txt allows the Crawler to ignore any restrictions set by the target host's robots.txt file. See http://www.robotstxt.org/ for more information.

max_body_size: number

max_body_size limits the size of the retrieved response body in bytes. 0 means unlimited. The default value is 10MB (10 * 1024 * 1024 bytes).
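
For example, to lower the limit from the default 10MB to 2MB:

const c = new Crawler({ max_body_size: 2 * 1024 * 1024 }); // 2,097,152 bytes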

max_depth: number

max_depth limits the recursion depth of visited URLs. The default is 0, which means infinite recursion.

parse_http_error_response: boolean

parse_http_error_response allows parsing HTTP responses with non-2xx status codes. By default, only successful HTTP responses are parsed; set parse_http_error_response to true to also parse error responses.

url_filters: string[]

url_filters is a list of regular expressions that restrict visiting URLs. If any of the rules matches a URL, the request is allowed to proceed; when the list is non-empty, URLs that match none of the rules are not visited. disallowed_url_filters is evaluated before url_filters. Leave it empty to allow any URL to be visited.
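
A sketch combining both filter lists; because disallowed_url_filters is evaluated first, the /private/ rule blocks those URLs even though they also match the url_filters pattern (both patterns are illustrative):

const c = new Crawler({
    url_filters: ["^https://example\\.com/docs/"],
    disallowed_url_filters: ["/private/"],
});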

user_agent: string

user_agent is the User-Agent string used by HTTP requests.
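
A sketch pulling several of the options above together; the User-Agent value and hostname are placeholders:

const c = new Crawler({
    user_agent: "MyCrawler/1.0 (+https://example.com/bot)",
    allowed_domains: ["example.com"],
    max_depth: 3,
    max_body_size: 2 * 1024 * 1024,
});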