Download Command
The wpextract download command downloads the content of a site using the REST API.
Command Usage
TARGET- The base path of the WordPress installation, e.g. "https://example.org/"
OUT_JSON- Directory to output the downloaded JSON to. It must be an existing empty directory or a non-existent directory which will be created.
optional arguments
--media-dest- Path to a directory to download media files to, skipped if not supplied
--json-prefix JSON_PREFIX- Prefix to add to output file names, e.g. supplying 20240101-example will output posts to out_dir/20240101-example-posts.json
skip data
--skip-type [categories|media|pages|posts|tags|users]- Don't download the given types; all other types will still be downloaded. By default, every type is downloaded.
authentication
--proxy PROXY- Proxy server for requests
--auth AUTH- HTTP Basic credentials for requests (format username:password)
--cookies COOKIES- Cookies for requests (format cookie1=foo; cookie2=bar)
request behaviour
--timeout TIMEOUT- Timeout for request in seconds (default: 30)
--wait WAIT- Time to wait between requests in seconds. Does not affect retries. (default: 0)
--random-wait- Randomly varies the time between requests to between 0.5 and 1.5 times the number of seconds set by --wait
--max-retries MAX_RETRIES- Maximum number of retries before giving up (default: 10)
--backoff-factor BACKOFF_FACTOR- Factor applied when delaying retries. With the default value, the command will sleep for 0.0, 0.2, 0.4, 0.8,… seconds between successive retries (default: 0.1)
--max-redirects MAX_REDIRECTS- Maximum number of redirects before giving up (default: 20)
--user-agent USER_AGENT- User agent to use for requests. Default is a recent version of Chrome on Linux (see requestsession.DEFAULT_UA)
logging
--log FILE,-l FILE- File to log to. Supplying this suppresses output to stdout.
--verbose,-v- Increase log level to include debug logs
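As an illustration of how the arguments above combine, the following sketch assembles an invocation from Python. It is only a sketch: it assumes the positional arguments follow the subcommand in the order TARGET, OUT_JSON and that options may precede them. The helper name `build_download_command` and the example values are hypothetical.

```python
import shlex


def build_download_command(target, out_json, wait=None, random_wait=False,
                           json_prefix=None, max_retries=None):
    """Assemble an argv list for `wpextract download` (hypothetical helper)."""
    argv = ["wpextract", "download"]
    if wait is not None:
        argv += ["--wait", str(wait)]
    if random_wait:
        argv.append("--random-wait")
    if json_prefix is not None:
        argv += ["--json-prefix", json_prefix]
    if max_retries is not None:
        argv += ["--max-retries", str(max_retries)]
    # Assumption: positionals come last, TARGET before OUT_JSON.
    argv += [target, out_json]
    return argv


argv = build_download_command("https://example.org/", "out_json",
                              wait=2, random_wait=True)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting problems with values such as cookies.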
Download Process
For each enabled data type (categories, media, pages, posts, tags, users; all by default), the command will use the REST API to download the data. The API is paginated and the command will show a progress bar for each page of data.
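The pagination described above can be sketched as follows. This is not wpextract's own implementation. It is a minimal illustration that assumes the standard WordPress REST API conventions of `page` and `per_page` query parameters and an `X-WP-TotalPages` response header. The `fetch_page` callable and `fake_fetch` are hypothetical stand-ins for an HTTP request so that the example runs without network access.

```python
from typing import Callable, Dict, List, Tuple

# fetch_page(page, per_page) -> (items, response headers).
# Hypothetical stand-in for an HTTP GET against a list endpoint.
FetchPage = Callable[[int, int], Tuple[List[dict], Dict[str, str]]]


def download_all_pages(fetch_page: FetchPage, per_page: int = 100) -> List[dict]:
    """Collect every item from a paginated list endpoint."""
    items, headers = fetch_page(1, per_page)
    # The first response tells us how many pages exist in total.
    total_pages = int(headers.get("X-WP-TotalPages", "1"))
    collected = list(items)
    for page in range(2, total_pages + 1):
        items, _ = fetch_page(page, per_page)
        collected.extend(items)
    return collected


def fake_fetch(page: int, per_page: int):
    """Simulate an endpoint that holds five records."""
    records = [{"id": i} for i in range(1, 6)]
    start = (page - 1) * per_page
    total_pages = -(-len(records) // per_page)  # ceiling division
    return records[start:start + per_page], {"X-WP-TotalPages": str(total_pages)}


posts = download_all_pages(fake_fetch, per_page=2)
print(len(posts))
```

Knowing the page count after the first request is also what makes a per-type progress bar possible.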
Endpoints
To produce each file, the following list endpoints are used:
| File Name | Endpoint |
|---|---|
| categories.json | /wp/v2/categories |
| comments.json | /wp/v2/comments |
| media.json | /wp/v2/media |
| pages.json | /wp/v2/pages |
| posts.json | /wp/v2/posts |
| tags.json | /wp/v2/tags |
| users.json | /wp/v2/users |
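To make the mapping concrete, the sketch below derives a request URL and an output file path for a data type. It assumes the site exposes the REST API under the default `/wp-json` root, and that the `--json-prefix` value is joined to the file name with a hyphen, as in the `20240101-example-posts.json` example. The function names are hypothetical and are not part of wpextract.

```python
from pathlib import PurePosixPath

# Endpoint for each data type, taken from the table above.
ENDPOINTS = {
    "categories": "/wp/v2/categories",
    "comments": "/wp/v2/comments",
    "media": "/wp/v2/media",
    "pages": "/wp/v2/pages",
    "posts": "/wp/v2/posts",
    "tags": "/wp/v2/tags",
    "users": "/wp/v2/users",
}


def endpoint_url(target, data_type, api_root="wp-json"):
    """Build the list endpoint URL (assumes the default /wp-json root)."""
    return target.rstrip("/") + "/" + api_root + ENDPOINTS[data_type]


def output_file(out_dir, data_type, json_prefix=None):
    """Build the output path, applying the optional file name prefix."""
    name = f"{data_type}.json"
    if json_prefix:
        name = f"{json_prefix}-{name}"
    return str(PurePosixPath(out_dir) / name)


print(endpoint_url("https://example.org/", "posts"))
print(output_file("out_dir", "posts", "20240101-example"))
```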
Bot Protection and Considerate Scraping
It's unlikely this will trigger bot protection mechanisms for the following reasons:
- it is accessing intended API endpoints, which are likely to have lower levels of bot protection
- it has been configured to use a browser user agent
The following measures are taken to be considerate to the server:
- a backoff factor is applied to retries
We would also suggest enabling the following options, with consideration for how they will affect the download speed:
- --wait to space out requests
- --random-wait to vary the time between requests to avoid patterns
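The effect of these two options on request spacing can be illustrated with a short sketch. It follows the documented behaviour (a random delay between 0.5 and 1.5 times the `--wait` value) but is not wpextract's own code. The function name `next_delay` is hypothetical.

```python
import random


def next_delay(wait, random_wait=False, rng=random):
    """Return the pause in seconds before the next request."""
    if random_wait:
        # Uniformly vary the delay between 0.5x and 1.5x the base wait.
        return rng.uniform(0.5 * wait, 1.5 * wait)
    return wait


rng = random.Random(42)  # seeded only to make the illustration repeatable
delays = [next_delay(2, random_wait=True, rng=rng) for _ in range(5)]
print([round(d, 2) for d in delays])
```

With `--wait 2`, each pause falls between 1 and 3 seconds, so the average download time is about the same as with a fixed wait while the request timing is less regular.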
You may also wish to consider:
- The reputation of the IP used to make requests. IPs in ranges belonging to common VPS providers, e.g. DigitalOcean or AWS, may be more likely to be rate limited.
- --user-agent to set a custom user agent. The default is a recent version of Chrome on Linux, but this may become outdated. If using authentication, this may need to match the user agent of the browser used to log in.
Error Handling
If an HTTP error occurs, the command will retry the request up to --max-retries times, with the backoff set by --backoff-factor. If the maximum number of retries is reached, the command will output the error, stop collecting the given data type, and start collecting the following data type. This is because it's presumed that if a given page is non-functional, the following one will be too.
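The retry-then-skip behaviour can be sketched as below. This is a simplified illustration rather than the tool's implementation: it fetches each data type in one call instead of page by page, and it assumes a delay schedule with no pause before the first retry and doubling thereafter, which reproduces the 0.0, 0.2, 0.4, 0.8 sequence given for `--backoff-factor`. The names `backoff_delay`, `collect` and `fetch` are hypothetical.

```python
import time


def backoff_delay(retry_number, backoff_factor=0.1):
    """Delay before the given retry (1-based), under the assumed schedule."""
    if retry_number <= 1:
        return 0.0
    return backoff_factor * 2 ** (retry_number - 1)


def collect(data_types, fetch, max_retries=10, backoff_factor=0.1,
            sleep=time.sleep):
    """Fetch each data type; once retries are exhausted, record the error
    and move on to the next type."""
    results, errors = {}, {}
    for data_type in data_types:
        last_error = None
        for attempt in range(max_retries + 1):  # first attempt plus retries
            if attempt > 0:
                sleep(backoff_delay(attempt, backoff_factor))
            try:
                results[data_type] = fetch(data_type)
                break
            except IOError as exc:
                last_error = exc
        else:
            # Loop finished without a successful fetch: skip this type.
            errors[data_type] = last_error
    return results, errors


print([round(backoff_delay(n), 3) for n in range(1, 5)])
```

Returning the errors alongside the results mirrors the advice to review what failed after the run instead of assuming the output is complete.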
To ensure the integrity of the scrape, it is suggested to check the logs for errors afterwards.
There is currently no mechanism to resume interrupted downloads.