wget: And that which wget begot

In my mind, wget ranks up there with top and dmesg as an underappreciated tool on a *nix system, and worthy of better attention. Everybody knows it’s there, but no one really takes the time to learn it, because of all the other gizmos that are available.

In its simplest form, wget is terrifically easy.

kmandla@6m47421: ~/downloads$ wget http://old-releases.ubuntu.com/releases/lucid/ubuntu-10.04-alternate-i386.iso
--2014-06-22 14:01:09--  http://old-releases.ubuntu.com/releases/lucid/ubuntu-10.04-alternate-i386.iso
Resolving old-releases.ubuntu.com (old-releases.ubuntu.com)... 91.189.88.17
Connecting to old-releases.ubuntu.com (old-releases.ubuntu.com)|91.189.88.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 722683904 (689M) [application/x-iso9660-image]
Saving to: ‘ubuntu-10.04-alternate-i386.iso’

 6% [===>                                                                 ] 50,015,901  1.59MB/s  eta 6m 49s

And if that’s all you ever see of it, you’re doing all right. But wget has a few fillips that are worth knowing.

Here’s the mirror flag, -m or --mirror.

kmandla@6m47421: ~/downloads$ wget -m http://old-releases.ubuntu.com/releases/lucid/
--2014-06-22 14:05:00--  http://old-releases.ubuntu.com/releases/lucid/
Resolving old-releases.ubuntu.com (old-releases.ubuntu.com)... 91.189.88.17
Connecting to old-releases.ubuntu.com (old-releases.ubuntu.com)|91.189.88.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘old-releases.ubuntu.com/releases/lucid/index.html’

    [                                                                  ] 56,915       189KB/s   in 0.3s   

Last-modified header missing -- time-stamps turned off.
2014-06-22 14:05:01 (189 KB/s) - ‘old-releases.ubuntu.com/releases/lucid/index.html’ saved [56915]

Loading robots.txt; please ignore errors.
--2014-06-22 14:05:01--  http://old-releases.ubuntu.com/robots.txt
Reusing existing connection to old-releases.ubuntu.com:80.
HTTP request sent, awaiting response... 404 Not Found
2014-06-22 14:05:01 ERROR 404: Not Found.

--2014-06-22 14:05:01--  http://old-releases.ubuntu.com/releases/lucid/ubuntu-10.04.4-desktop-i386.iso
Reusing existing connection to old-releases.ubuntu.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 728150016 (694M) [application/x-iso9660-image]
Saving to: ‘old-releases.ubuntu.com/releases/lucid/ubuntu-10.04.4-desktop-i386.iso’

 0% [                                                                     ] 6,561,822   1.68MB/s  eta 7m 26s

Notice that I didn’t specify a file there, but gave it a directory instead: I want everything in that folder. wget would have (if I had let it) mirrored the contents of the remote “lucid” directory into my “downloads” folder, under old-releases.ubuntu.com/releases/lucid/. No need to specify each individual file. Cool.

I didn’t need to tack on the --recursive flag (or -r) either. Mirror mode is shorthand for recursion with no depth limit, plus timestamping (-N), so wget would have dived into every nested subdirectory and mirrored those too, building the same substructure on my local machine.

Surprise: wget has just transformed from a one-shot downloader tool to a webcrawler/mirror utility, and all it took was five keystrokes.

Had I already downloaded some of those files, then I might have been wasting bandwidth. That timestamping takes care of it: wget skips any file that is no newer on the server than my local copy. With a plain -r, --no-clobber or -nc does a similar job by refusing to overwrite existing files, but it can’t ride along with -m; wget won’t timestamp and not clobber at the same time. And of course, if you know about the robots.txt file up there, you’ll know -e robots=off allows wget to ignore the contents of that file. It’s bad manners, but hey. The Internet is the wild, wild West of our time. …

So here’s what we have so far.

wget -m -e robots=off http://old-releases.ubuntu.com/releases/lucid/

What else can we do? Well, there’s no need to waste time and energy on all those .torrent, .jigdo and other non-ISO files. All we’re really after is the ISOs themselves. We can tell wget specifically what types of files we want with an --accept flag.

List the extensions you want after --accept and separate each with a comma (no spaces). Or conversely, list the extensions you don’t want after --reject.

wget -m -e robots=off --accept iso http://old-releases.ubuntu.com/releases/lucid/

And there we have it: One command to mirror an entire directory and its subfolders, skipping over material that was already downloaded and only plucking out the ISO files. There are a few more flags for levels of recursion and timestamping, but I’ll leave it to you to investigate those.
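
For a head start on those, here is a rough sketch rather than a recipe. The helper name mirror_isos and the depth of 2 are picked purely for illustration; -l (--level), -np (--no-parent) and -N (--timestamping) are standard wget flags.

```shell
# mirror_isos is a hypothetical helper name; the depth of 2 is an arbitrary example.
mirror_isos() {
    # -r -l 2      : recurse, but only two levels down from the starting URL
    # -np          : never climb up into the parent directory
    # -N           : timestamping; skip files whose local copy is already up to date
    # --accept iso : keep only files ending in .iso
    wget -r -l 2 -np -N --accept iso "$1"
}
```

Something like mirror_isos http://old-releases.ubuntu.com/releases/lucid/ would then fetch the ISOs without wandering off into the rest of the server.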

wget alone is a masterful piece of work, but there are a few shortcomings.

For one, it doesn’t handle queues in the same way you might imagine most download managers doing. There is an -i flag to download the URLs listed in a file, but on its own wget can’t manage a queue or follow a changing list of targets. This is the same issue I ran into years ago, when I was whining about the lack of a proper download manager for the console. (What a bellyacher.)
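
As a small sketch of that -i flag: the list file name and the fetch_list helper below are made up for illustration, and the two URLs are the ISOs from the examples above. As far as I can tell, wget reads the list once when it starts, so anything added to the file afterward is not picked up.

```shell
# Illustrative file name and helper; neither comes from wget itself.
list=${TMPDIR:-/tmp}/iso-urls.txt

# One URL per line.
printf '%s\n' \
    'http://old-releases.ubuntu.com/releases/lucid/ubuntu-10.04-alternate-i386.iso' \
    'http://old-releases.ubuntu.com/releases/lucid/ubuntu-10.04.4-desktop-i386.iso' \
    > "$list"

fetch_list() {
    # -c resumes partial downloads; -i reads the URLs from the given file.
    wget -c -i "$1"
}
```

Running fetch_list "$list" would then work through both ISOs in order.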

To that end, you could devise some sort of loop to list the links on a server, but that sounds terrifically obtuse. What would be better would be a master download file, and a script that downloads each one, then clips it off the top of the list when it’s finished.

Yeah, that’d be great. …

I’ve seen more than one wget-queue.sh script out in the wild, so I’ll never really know if this is the right one or not. As it is, it seems to work — the script reads out one URL at a time, feeds it to wget, and when it has been completed, it snips the link away and moves on to the next.

The beauty of that particular system is that you can also feed flags directly to wget, just by adding them to the same line as the URL. So you could tell it to grab a file and output it to another, or mirror an entire subfolder like we did above. The script doesn’t need to worry about that.

And you can add or move targets within .wget-list, and wget is generally smart enough to make sure it doesn’t pollute the end product. It’s not failsafe, but it is more or less reliable. 😉
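
To make the idea concrete, here is a minimal sketch of that kind of queue. It is not the wget-queue.sh script itself, just one way the same behavior could be written in plain sh; the function name process_queue and its variables are hypothetical. It takes the top line of the list, hands it to wget, and snips that line away only after wget succeeds.

```shell
#!/bin/sh
# Hypothetical queue runner: a sketch of the idea, not the original wget-queue.sh.

process_queue() {
    queue_file=$1
    while [ -s "$queue_file" ]; do
        # Always take whatever is on top right now, so edits to the list are honored.
        queue_line=$(head -n 1 "$queue_file")
        if [ -n "$queue_line" ]; then
            # Unquoted on purpose: a line may carry wget flags as well as a URL.
            # set -f keeps "?" and "*" in URLs from being expanded as globs.
            # On failure, stop and leave the entry in the list.
            ( set -f; wget $queue_line ) || return 1
        fi
        # Snip the first line that matches the finished entry,
        # even if the list was reordered during the download.
        QUEUE_DONE=$queue_line awk '
            !removed && $0 == ENVIRON["QUEUE_DONE"] { removed = 1; next }
            { print }
        ' "$queue_file" > "$queue_file.tmp" &&
            mv "$queue_file.tmp" "$queue_file" || return 1
    done
}
```

Called as process_queue ~/.wget-list, it keeps going until the list is empty or a download fails. Matching on the line's text rather than its position is a design choice here, so that moving targets around mid-download doesn't clip the wrong one.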

There is a similar project in Craig Maloney’s wget-queue.pl, which should achieve the same results, but is written in Perl. Craig’s project is dated 2004, but says it’s inspired by a script called wget-queue.sh. That also makes me wonder if the link I have to wget-queue.sh is correct, since that page is dated 2007. Or … time travelers. 😯

Craig’s rendition adds a few much-needed features to the bash version I linked to: designated download directories, logging, file completion lists and so forth. It also has a degree of error trapping and reporting, which might be important at times.

I didn’t playtest Craig’s attempt as thoroughly as wget-queue.sh, so be prepared. Actually, be prepared with anything you read on this blog. … O_o

Both wget-queues lend the original wget a dimension it doesn’t quite have otherwise. Mix them together and wget is likely to demote one more graphical tool to your electronic dustbin. … 😈
