dehtml: Another scraping tool

A long time ago I talked about vilistextum, and in passing noted both dehtml and html2text as alternatives.


Today is dehtml. Tomorrow … well, let’s pretend it’s a surprise. 🙂

I suppose there’s nothing particularly unique in pulling text out of HTML-coded documents.
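
Just to show how small the basic idea is, here’s a rough sketch in Python using the standard library’s html.parser module. This is not how dehtml (or either of the other two) works internally; it’s only my own illustration, and the names TextExtractor and strip_html are made up for the example. It keeps the text between tags, skips anything inside script and style blocks, and squeezes the leftover whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; into & and so on
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

A real page is a lot messier than that one-liner, which is probably why dedicated tools still earn their keep.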

So it probably shouldn’t surprise you that there are three tools vying for the job.

Which one you choose will depend on how you like each one’s approach to the task.

dehtml tends to be my favorite, only because it seems to handle the job cleanly and without too many leftovers.

(For what it’s worth, all three tools tend to leave some markup behind, depending on how complex the page is.)

And now, just to be fair, here’s the obligatory ultra-minimalist web browser screenshot.


But I don’t recommend surfing that way. 😉