Conversion scripts are always good tools to know about, even if I don’t need them frequently enough to keep them installed. wikipedia2text is one that, in spite of its age, still seems sharp.
Technically, the script’s name was just “wiki,” and the source link listed on the home page is dead. Late in 2005, though, it made its way into Debian, and it is still available there as a source tarball. So it seems that it is possible to achieve immortality: all you need to do is somehow find your way into Debian. 😉
The script works fine outside of Debian; just decompress the tarball and go. You’ll need to install perl-uri first if you’re using Arch. But if you’re in something Debian-ish, installing the package should pull in liburi-perl as a dependency.
One thing that’s not mentioned outright in the blog post but does appear in the help flag: wikipedia2text will need one of about a half-dozen text-based browsers to do the actual fetching of the page. I used lynx because … well, just because. Which leads me to this second screenshot.
At this point I’m wondering if wikipedia2text is an improvement over what a text-based browser can show. After all, lynx is showing multiple colors, uses the full terminal width, and I have the option of following links.
What’s more, wikipedia2text — strangely — offers a flag to display its results in a browser, and in my case it was possible to send the output back into lynx. So if you’re keeping track, I ran a script that called a browser to retrieve a page, then rerouted that page back into the browser for my perusal.😕
In the absence of any other instruction, wikipedia2text will default to your $PAGER, which I like because mine is set to most, and I prefer that over almost anything else. Perhaps oddly though, if I ask specifically for pager output, wikipedia2text will arbitrarily commandeer less, with no option to change that. So with no pager instruction the output goes to $PAGER, but with one it jumps to less? That’s also a little confusing.
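To pin down what I mean, here is a rough Python model of the pager behavior as I observed it, next to what I would have expected. This is not the script's actual code; the function names and the fallback to less when $PAGER is unset are my own assumptions.

```python
import os


def observed_pager(explicit_pager_request, environ=None):
    """Model of the behavior described above (hypothetical, not the real script)."""
    env = os.environ if environ is None else environ
    if explicit_pager_request:
        # Asking for a pager explicitly lands on less, whatever $PAGER says.
        return "less"
    # No instruction at all: the output goes to $PAGER (None if it is unset).
    return env.get("PAGER")


def expected_pager(explicit_pager_request, environ=None):
    """What I would have expected: honor $PAGER in both cases."""
    env = os.environ if environ is None else environ
    # Assumption: fall back to less only when $PAGER is unset or empty.
    return env.get("PAGER") or "less"
```

With PAGER set to most, the first function gives most without the request and less with it, while the second gives most either way.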
Furthermore, I couldn’t get the options for color output to work. And I don’t see a flag or an option to expand the text width beyond what you see in the screenshot, which I believe to be around 80 columns. That alone is almost a dealbreaker for me.
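For comparison, lynx on its own can dump a page as plain text at whatever width you ask for, using its -dump, -nolist and -width options. The Python wrapper below is only a sketch: the function names, the default width of 120 and the URL pattern for English Wikipedia are my assumptions, and it needs lynx installed plus a network connection to actually fetch anything.

```python
import subprocess
from urllib.parse import quote


def lynx_dump_command(title, width=120, lang="en"):
    """Build a lynx command line that dumps a Wikipedia article as plain text."""
    # Assumption: articles live at https://<lang>.wikipedia.org/wiki/<Title>
    page = quote(title.replace(" ", "_"))
    url = f"https://{lang}.wikipedia.org/wiki/{page}"
    # -dump prints the rendered page to stdout, -nolist drops the link list,
    # -width sets the number of columns used to format the dump.
    return ["lynx", "-dump", "-nolist", f"-width={width}", url]


def fetch_article_text(title, width=120, lang="en"):
    """Run lynx and return the dumped text (requires lynx and network access)."""
    result = subprocess.run(
        lynx_dump_command(title, width, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling fetch_article_text("Debian", width=140) would hand back the article formatted for a 140-column terminal, which is the one thing I could not coax out of wikipedia2text.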
I suppose if I were just looking for a pure text extraction of a page, wikipedia2text has a niche. And it’s definitely worth mentioning that wikipedia2text has a text filtering option with color, which makes for a grep-like effect.
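The grep-like effect is easy enough to picture with a few lines of Python. This is a minimal sketch of the idea, not how wikipedia2text does it; the function name, the bold-red ANSI color, the case-insensitive match and the choice to drop non-matching lines are all my assumptions.

```python
import re

# ANSI escape sequences: bold red on, then reset to the terminal default.
HIGHLIGHT = "\033[1;31m"
RESET = "\033[0m"


def highlight_matches(text, pattern):
    """Keep only the lines that match pattern, with each match colored."""
    regex = re.compile(pattern, re.IGNORECASE)
    matching_lines = []
    for line in text.splitlines():
        if regex.search(line):
            # Wrap every match on the line in the highlight color.
            colored = regex.sub(
                lambda m: f"{HIGHLIGHT}{m.group(0)}{RESET}", line
            )
            matching_lines.append(colored)
    return matching_lines
```

Feed it plain article text and a search term, then print the returned lines to a color-capable terminal to get roughly the same look.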
So all in all, wikipedia2text may have a slim focus that you find useful. I might pass it by as an artifact from almost 10 years ago — mostly on the grounds that it has some odd default behavior, and I fail to see a benefit of using this over lynx (or another text-based browser) by itself.😐