December 20th, 2008
Wget is a good way to collect interesting websites or HTML e-books for off-line reading. The program is capable of pretty much anything, but it took me some trial and error to get reliable results. Some hosts have unfavorable robots.txt files or send different results depending on the user-agent. I've even seen some hosts configured to deny the wget user-agent. After some iteration I have found a set of options that works reliably.
wget -rSNpk -np --execute robots=off -U "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101"
The above line combines the flags -rSNpk -np, which break down as follows; the rest of it just tells wget to ignore robots.txt and to send a bogus user-agent. An equivalent long-option spelling follows the list.
- -r recursive fetching
- -S show server response headers (it is -N, below, that actually keeps the server's modified date on downloaded files)
- -N download only if file on server is newer than local copy
- -p fetch page requisites (images, sounds, stylesheets)
- -k convert links for local viewing
- -np no parent, never ascends above the starting directory, so only siblings and children of the URL are fetched
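For readability, here is the same invocation with the long option names spelled out; the example.com URL is just a placeholder:
wget --recursive --server-response --timestamping --page-requisites --convert-links --no-parent --execute robots=off -U "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101" http://example.com/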
The --follow-ftp option is also useful for sites that keep some assets on FTP, which is common for PDFs and videos; a quick example follows.
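For example, tacked onto the same flag set (the URL is hypothetical):
wget -rSNpk -np --follow-ftp --execute robots=off -U "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101" http://example.com/lectures/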
If you don't want to swamp the host there are the following options; a polite example follows the list.
- -w n (--wait=n) wait n seconds between fetches
- --random-wait modify wait time by 50% to 150%
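Put together, a polite mirror of a hypothetical site with two-second randomized pauses might look like this:
wget -rSNpk -np --wait=2 --random-wait --execute robots=off -U "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101" http://example.com/ebook/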
The following line is in my ~/.bashrc (an alias simply appends its arguments, so no $@ is needed):
alias wgetr="wget -rSNpk -np --execute robots=off -U \"Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101\""
Usage: wgetr http://example.com/ebook-with-no-gzipped-download-link
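If you prefer to pass extra per-site flags, a shell function does the same job as the alias; this is just a sketch of an alternative for ~/.bashrc:
wgetr() {
    # same flags as the alias; "$@" forwards the URL plus any extra options
    wget -rSNpk -np --execute robots=off \
         -U "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101" \
         "$@"
}
With the function, something like wgetr --wait=2 http://example.com/ebook-with-no-gzipped-download-link adds politeness flags per run.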
And that is how I use wget. If it's worth reading it's worth saving.