Tag Archives: webcrawler

httrack: The website copier

I could have used httrack about four months ago, when I wanted to mirror a fairly large website for my offline perusal, and lacked a proper tool. I tried bew and another graphical webcrawler, and even fell back on wget, but nothing was 100 percent successful. I ended up mass-downloading most of what I needed, and it wasn’t a pretty sight.

httrack might have saved me the trouble, and probably would have done a much better job.

[Screenshot: httrack]

httrack is more than capable of patiently stepping through the architecture of a website, and bringing you a copy of everything there.

But on top of that, httrack, like a lot of good network-based software, has so many options that it can be a bit bewildering. If you run it with the --help flag, be prepared: the output is a couple hundred lines long at least.

For example, there are flags to save files in a cache, to skip files that are available locally, four options for logging, flags to create an index, screen for particular types of files (e.g., HTML only), set directions for following directories (only up or only down), disable bandwidth abuse limits, cap the number of links, continue a broken-off mirror attempt, enter an interactive mode, confine the search to a single site, and dozens upon dozens more.

Most of those are far beyond anything I would ever need, let alone understand. If you know what they mean, you might find them quite useful. And maybe best of all, httrack has about a dozen shortcuts for common flag combinations, meaning you can ask for just --spider, instead of typing out -p0C0I0t.

The first time you use it, though, I’d recommend running just httrack, since by itself the command steps you through a simple wizard, letting you pick options menu-style. If you’ve never used httrack before, it’s a good introduction, and it will finish with the command line needed to recall the same options you set. Very helpful, if you’re like me and you learn by example. 🙂

Once you get the hang of it, try things like httrack http://example.com -W%v2, which will give you a nice fullscreen progress display and prompt you if it finds any eccentricities. Quite useful.
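If you want to look a command like that over before it starts crawling, a small dry-run wrapper can help. What follows is only a sketch: mirror_cmd is a helper name I made up, ./example-mirror is a placeholder directory, and the function prints the command rather than running it. The -O flag (output path) is httrack's own; -W%v2 is the same pair of options as in the command above.

```shell
#!/bin/sh
# mirror_cmd is a hypothetical dry-run helper: it prints the httrack
# command instead of running it, so nothing touches the network.
mirror_cmd() {
    url=$1    # site to mirror (placeholder URL below)
    dest=$2   # directory for the local copy (assumed name)
    # -O sets the output path; -W%v2 asks for the semi-automatic mirror
    # with the fullscreen progress display.
    printf 'httrack %s -O %s -W%%v2\n' "$url" "$dest"
}

mirror_cmd http://example.com ./example-mirror
```

Once the printed line looks right, paste it into a terminal to start the real mirror.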

I’m going to go back now and re-mirror the site I mangled back in July, and hope I can get a cleaner, more complete copy. 😉

linkchecker: Relax your mouse clicker finger

I seem to be on an Internet-based kick these days. It started yesterday with httpry and html-xml-utils; now I’m on to linkchecker, which cascades through pages or sites and checks that the links are … linking.

[Screenshot: linkchecker]

linkchecker came at a good time, since I got an e-mail a week or so ago, mentioning (not really complaining, just mentioning) that most of the software I touch on seems to have been around for quite a while. The suggestion was, “When are you going to show us some fresh stuff?”

Well, linkchecker has updates within the past few days, and I’d bet wummel is working on it even as you’re reading this. How’s that for fresh? 😈

What linkchecker does and how is probably obvious just from the name, without looking at the screenshot. And of course, you should probably be careful where you aim linkchecker, because as you can see, it had stacked up several thousand links to check within only a minute or two of looking at this page.

Perhaps it would be better kept in-house first, before turning it loose in the wild. 😐

linkchecker has a long, long list of options for you to look over, in case you want to check external URLs (gasp!), use customized configuration files or filter out URLs by regex. Or a lot of other things.
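As a rough sketch of how two of those options fit together, the helper below prints a linkchecker command that checks external URLs too and filters out anything matching a regex. The name check_cmd, the example URL and the '^mailto:' pattern are my own placeholders; --check-extern and --ignore-url are the flags as I understand them from linkchecker's help, so double-check them against your installed version.

```shell
#!/bin/sh
# check_cmd is a hypothetical dry-run helper: it prints the linkchecker
# command instead of running it.
check_cmd() {
    url=$1    # page or site to start from (placeholder URL below)
    skip=$2   # regex for URLs to filter out (assumed example pattern)
    # --check-extern extends checking to external URLs;
    # --ignore-url filters out URLs that match the regex.
    printf "linkchecker --check-extern --ignore-url='%s' %s\n" "$skip" "$url"
}

check_cmd http://example.com '^mailto:'
```

Printing first is deliberate: given how quickly the link queue grows, it is worth reading the command before letting it loose.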

Perhaps the greatest part about using linkchecker though, is that it allows you to relax your mouse clicker finger, and do other things. Interns everywhere will rejoice. 😉

And just for the record, without any snarky undertones, I do have a tendency to pull in a lot of old or outdated software. If it still works, I’m still willing to use it. I hope you feel the same. 😕

bew: A primitive, but effective, webcrawler

bew comes from the same mind that created album, which you might remember from early this year.

[Screenshot: bew]

bew is tiny, but packs a punch. It does an admirable job following links through a site and pulling data down.

Webcrawlers are like hex editors to me though: It’s exceedingly rare that I need one, and usually by the time I get one and get it working, whatever I wanted is no longer really necessary.

Regardless, to find a lightweight, small and effective one like bew … well, it’s always good to have around.