Tag Archives: page

html2wikipedia: Converting back and forth

A long time ago I mentioned wikipedia2text, and not long after we ran past wikicurses as an alternative. In both of those cases, the goal was to show Wikipedia pages in the console, without so much congealed dreck. wikicurses in particular seemed like a good option.

But considering that much of Wikipedia is put together in a markdown-ish fashion, wouldn’t it make sense to have some sort of conversion between HTML and Wikipedia format? You could conceivably take a dull .html file and send it straight through, coded and set.

Never fear, true believer.

[Screenshot: html2wikipedia]

html2wikipedia is a free-ranging program that does very much that. In this case, I grabbed kernel.org, pumped it through html2wikipedia, and got something very close to usable wiki markup.
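For the record, the round trip amounted to something like this, with the caveat that I'm assuming html2wikipedia takes the HTML file as an argument and writes the converted markup to standard output; check its own usage message if that doesn't hold:

    curl -s https://www.kernel.org/ > kernel.html     # grab the raw page
    html2wikipedia kernel.html > kernel.wiki          # convert it to wiki markup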

I should mention that it’s not perfect; I wouldn’t blithely slap the results of html2wikipedia straight into a Wikipedia page, mostly because I think the formatting would be off kilter here or there.

But at first glance, it’s certainly in a workable state. The author suggests it should work on Windows too, so if you’re an avid Wiki-gnome (I am not), this might save you time and work in the future.

Like I mentioned, I don’t see html2wikipedia in either Arch or Debian, but I don’t take the time to go through every distro out there. 😯 Whether it is or isn’t, this is one of those times where it might be quicker and easier to download the source code and build it manually than to download all the other packaging materials that accompany a 59KB executable. 🙄
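Assuming a conventional little source tarball with a Makefile (the filenames below are guesses, so adjust to whatever the archive actually unpacks to), the manual route is about as painless as it gets:

    tar xvf html2wikipedia.tar.gz    # or however the source arrives
    cd html2wikipedia
    make
    cp html2wikipedia ~/bin/         # or anywhere else on your $PATH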

wiki-stream: Less than six degrees of separation

I didn’t intend for there to be two Wikipedia-ish tools on the same day, but one good wiki-related utility deserves another. Or in this case, deserves a gimmick.

Josh Hartigan’s wiki-stream (executable as wikistream) tells you what you probably already know about Wikipedia: that the longer you spend daydreaming on the site, the more likely you are to find yourself traveling to oddball locations.

[Screenshot: wiki-stream]

You might not think it possible to travel from “Linux” to “physiology” in such a brief adventure, but apparently there are some tangential relationships that will lead you there.

I don’t think Josh would mind if I said out loud that wiki-stream has no real function other than to show the links that link between links, and how they spread out over the web of knowledge. Best I can tell, it takes no flags, doesn’t have much in the way of error trapping, and can blunder into logical circles at times.

But it’s kind of fun to watch.

wiki-stream is in neither Arch nor AUR nor Debian, most likely because it’s only about a month old. You can install it with npm, which might be slightly bewildering, since on my Arch system npm placed a symlink to the executable at ~/node_modules/.bin. I’m sure you can correct that if you know much about nodejs.
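For what it's worth, the npm route plus the PATH adjustment looks roughly like this; the package name is my guess at matching the executable, and the assumption is that you run npm from your home directory, which is presumably how the symlink landed under ~/node_modules in the first place:

    npm install wikistream                        # package name is a guess; it may be wiki-stream
    export PATH="$HOME/node_modules/.bin:$PATH"   # so the shell can find the wikistream symlink
    wikistream                                    # no flags needed, as mentioned above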

Now the trick is to somehow jam wiki-stream into wikicurses, and create the ultimate text-based toy for time-wasting. … :\

wikicurses: Information, in brief

If you remember back to wikipedia2text from a couple of months ago, you might have seen where ids1024 left a note about wikicurses, which intends to do something similar.

[Screenshot: wikicurses]

Ordinarily I use most as my $PAGER, and it might look like most is working there, but it’s not. That’s the "bundled" pager, with the title of the Wikipedia page at the top, and the body text formatted down the space of the terminal.

wikicurses has a few features that I like in particular. Color, of course, and the screen layout are good. I like that the title of the page is placed at the topmost point, and in a fixed position. Score points for all that.

Further, wikicurses can access (to the best of my knowledge) just about any MediaWiki site, and has hotkeys to show a table of contents, or to bookmark pages. Most navigation is vi-style, but you can use arrow keys and page up/down rather than the HJKL-etc. keys.

Pressing "o" gives you a popup search box, and pressing tab while in that search box will complete a term — which is a very nice touch. There are a few other commands, accessible mostly by typing a colon followed by a command, much like you’d see in vi. Press "q" to exit.

From the command line you can feed wikicurses a search term or a link. You can also jump straight to a particular feed — like Picture of the Day or whatever the site offers. If you hit a disambiguation page, you have the option to select a target and move to that page, sort of like you see here.

[Screenshot: wikicurses disambiguation page]

That’s a very nice way to solve the issue.
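In practical terms, the command-line options mentioned above boil down to handing it either a term or a link; the feed business has its own flag, which I'd rather you pull from the help output than have me misquote it here:

    wikicurses "Linux kernel"                          # open a search term
    wikicurses https://en.wikipedia.org/wiki/Linux     # or feed it a link directly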

There are a couple of things that wikicurses might seem to lack. First, short of re-searching a term, there’s no real way to navigate forward or back through pages. Perhaps that is by design, since adding that might make wikicurses more of an Internet browser than just a data-access tool.

It does make things a little clumsy, particularly if you’ve “navigated” to the wrong page and just want to work back to correct your mistake.

In the same way, pulling a page from Wikipedia and displaying it in wikicurses removes any links that were otherwise available. So if you’re tracking family histories or tracing the relationships between evil corporate entities, you’ll have to search, read, then search again, then read again, then search again, then. …

But again, if you’re after a tool to navigate the site, you should probably look into something different. As best I can tell, wikicurses is intended as a one-shot page reader, and not a full-fledged browser, so limiting its scope might be the best idea.

There are a couple of other minor points I would suggest. wikicurses might offer the option to use your $PAGER, rather than its built-in format. I say that mostly because there are minor fillips that a pager might offer — like, for example, page counts or text searching — that wikicurses doesn’t approach.

But wikicurses is a definite step up from wikipedia2text. And since wikicurses seems to know its focus and wisely doesn’t step too far beyond it, it’s worth keeping around for one-shot searches or for specialized wikis that don’t warrant full-scale browser searches. Or for times like nowadays, when half of Wikipedia’s display is commandeered by a plea for contributions. … 🙄 😑

pup: Playing fetch with HTML

Every month I export the posts from this site, grind away at the XML file, pluck out titles and links, and rearrange them to form an index page. Don’t say thank you; I do it for me as much as anyone else. I can’t remember everything I’ve covered in the past two years, and that index has saved me more than once. :\

Point being, it takes a small measure of grep, plus some rather tedious vim footwork to get everything arranged in the proper order and working.

You know what would be nice? If some tool could skim through that XML file, extract just the link and title fields, and prettify them to make my task a bit easier.

pup can do that.

[Screenshot: pup, example 1]

Oh, that is so wonderful. … 🙄

In that very rudimentary example, pup took the file and the field I wanted, sifted through for all the matching tags, and dumped the results into the index file.
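The shape of that command was something like the one below; the selector and filenames are stand-ins for whatever your export actually contains, and since pup treats the XML as HTML, it's worth eyeballing what it makes of the other RSS fields:

    pup 'item > title text{}' < export.xml > index-titles.txt   # just the post titles, one per line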

pup will also colorize and format HTML for the sake of easy viewing, and the effect is, again, oh-so-wonderful.

[Screenshot: pup, example 2]

That might remind you of tidyhtml, the savior of sloppy HTML coders everywhere, and you could conceivably use it that way. pup can do a lot more than that, though.

You can parse for multiple tags with pup, filter out specific IDs nestled in <span> tags, print from selected nodes and pluck out selectors. And a lot more that I don’t quite understand fully. 😳
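A few hedged illustrations of what that looks like in practice; the selectors and filenames here are invented, and pup's own documentation covers the full grammar:

    pup 'h1, h2 text{}' < page.html    # more than one tag at a time
    pup 'span#sidebar'  < page.html    # a specific ID nestled in a span tag
    pup 'a attr{href}'  < page.html    # pluck the href out of every link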

It is possible that you could do some of what pup does with a crafty combination of things like sed or grep. Then again, pup seems confident in its HTML expertise, and the way it is designed is easy to figure out.

And for those of you who won’t deal with software more than a few months old, I can see that at the time of this writing, pup had been updated within the week. So it’s quite fresh. Try pup without fear of poisoning your system with year-old programs. 😉

wikipedia2text: Looking well-preserved, thanks to Debian

Conversion scripts are always good tools to know about, even if I don’t need them frequently enough to keep them installed. wikipedia2text is one that, in spite of its age, still seems sharp.

[Screenshot: wikipedia2text]

Technically, the script’s name was just "wiki," and technically the source link listed on the home page is dead. Late in 2005 though, it made its way into Debian, and is still in a source tarball there. So it seems that it is possible to achieve immortality — all you need to do is somehow find your way into Debian. 😉

The script works fine outside of Debian; just decompress it and go. You’ll need to install perl-uri if you’re using Arch. But if you’re in something Debian-ish, it should pull in liburi-perl as a dependency when you install it.
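In command form, and hedging a little on the exact package names still being current:

    sudo pacman -S perl-uri                  # Arch: the one extra Perl module it wants
    sudo apt-get install wikipedia2text      # Debian-ish: pulls in liburi-perl on its own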

One thing that’s not mentioned outright in the blog post but does appear in the help output: wikipedia2text will need one of about a half-dozen text-based browsers to do the actual fetching of the page. I used lynx because … well, just because. Which leads me to this second screenshot.

[Screenshot: lynx]

At this point I’m wondering if wikipedia2text is an improvement over what a text-based browser can show. After all, lynx is showing multiple colors, uses the full terminal width, and I have the option of following links.

What’s more, wikipedia2text — strangely — offers a flag to display its results in a browser, and in my case it was possible to send the output back into lynx. So if you’re keeping track, I ran a script that called a browser to retrieve a page, then rerouted that page back into the browser for my perusal. 😕 :\

In the absence of any other instruction, wikipedia2text will default to your $PAGER, which I like because mine is set to most, and I prefer that over almost anything else. Perhaps oddly though, if I ask specifically for pager output, wikipedia2text arbitrarily commandeers less, with no option to change that. No instruction gets $PAGER, but an explicit request for a pager jumps to less? That’s a little confusing. …

Furthermore, I couldn’t get the options for color output to work. And I don’t see a flag or an option to expand the text width beyond what you see in the screenshot, which I believe to be around 80 columns. That alone is almost a dealbreaker for me.

I suppose if I were just looking for a pure text extraction of a page, wikipedia2text has a niche. And it’s definitely worth mentioning that wikipedia2text has a text filtering option with color, which makes for a grep-like effect.

So all in all, wikipedia2text may have a slim focus that you find useful. I might pass it by as an artifact from almost 10 years ago — mostly on the grounds that it has some odd default behavior, and I fail to see a benefit of using this over lynx (or another text-based browser) by itself. 😐

linkchecker: Relax your mouse clicker finger

I seem to be on an Internet-based kick these days. It started yesterday with httpry and html-xml-utils; now I’m on to linkchecker, which cascades through pages or sites and checks that the links are … linking.

[Screenshot: linkchecker]

linkchecker came at a good time, since I got an e-mail a week or so ago, mentioning (not really complaining, just mentioning) that most of the software I touch on seems to have been around for quite a while. The suggestion was, “When are you going to show us some fresh stuff?”

Well, linkchecker has updates within the past few days, and I’d bet wummel is working on it even as you’re reading this. How’s that for fresh? 😈

What linkchecker does, and how, is probably obvious from the name alone, without even looking at the screenshot. And of course, you should probably be careful where you aim linkchecker, because as you can see, it had stacked up several thousand links to check within only a minute or two of looking at this page.

Perhaps it would be better kept in-house first, before turning it loose in the wild. 😐

linkchecker has a long, long list of options for you to look over, in case you want to check external URLs (gasp!), use customized configuration files or filter out URLs by regex. Or a lot of other things.
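As a rough sketch, with the flag names quoted from memory and therefore worth double-checking against the man page, the basics look like this:

    linkchecker http://example.com/                          # crawl a site and test its links
    linkchecker --check-extern http://example.com/           # follow external URLs too (gasp!)
    linkchecker --ignore-url='\.pdf$' http://example.com/    # skip anything matching a regex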

Perhaps the greatest part about using linkchecker, though, is that it allows you to relax your mouse clicker finger and do other things. Interns everywhere will rejoice. 😉

And just for the record, without any snarky undertones, I do have a tendency to pull in a lot of old or outdated software. If it still works, I’m still willing to use it. I hope you feel the same. 😕

diffh: Make your diff easier to see

This one is similar to dailystrips, in that it generates an HTML page as its main output. But this time, it’s working in tandem with diff, to make things a little easier on the eyes.

A picture is definitely worth a thousand words here.

[Screenshot: diffh]

Not that there’s a lot to point out there, but with the -u flag in diff, piped through diffh, you can come up with a visually clear representation of what diff is trying to tell you.
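In other words, the whole operation is a one-liner; I'm assuming here that diffh writes its HTML to standard output, so redirect it wherever you like:

    diff -u report-old.txt report-new.txt | diffh > changes.html   # then open changes.html in a browser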

I’m not a programmer so maybe there are better, more obvious ways to show diff visually, but I can see where a couple of large files would be easier to understand this way.

And that’s all I can think of to say. 😐

dailystrips: Gasping for air

Just for the record, I don’t expect a program 10 years out of development to sing along without a care in the world.

On the other hand, I do make use of software that pushes possibly as far back as the 1980s, not counting core programs that have been around since the dawn of technology.

Point being, 10 years without attention is not too far gone.

For dailystrips though, it’s not the underlying software that changed; it’s the sites it targets.

[Screenshot: dailystrips]

I’m being cryptic, and I apologize. See, dailystrips is a great idea — a simple perl script that seeks out the day’s editions of the comic strips you like, downloads the images, and lumps them all together on a simple HTML page.

The more you think about it, the more brilliant it is: Rather than wander from site to site loading up all the garbage that comes with those comics, dailystrips peels out the image you actually want to see, and puts it on a vanilla page that loads in seconds. Fractions of seconds.
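If you want to try your luck, the invocation is roughly along these lines; the flags and the strip name are from my hazy memory of a decade-old script, so treat them as a sketch and consult its own documentation:

    dailystrips --list             # show the strips it (thinks it) knows about
    dailystrips --local dilbert    # fetch one and build a local HTML page around it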

The problem is, as time has gone on, those sites have either changed or rearranged their content. And like I hinted, dailystrips has gone without updates since (apparently) 2003.

Long and short, for every four or five comics I tried to use with dailystrips, I got one, maybe two that still worked. You can see in the screenshot that two out of four there were working, at best.

It depends on the host and probably the comic too. If you’ve got time on your hands I suppose you could pick through and see which ones don’t work, but the home page brags that dailystrips — in its prime — supported more than 550 comics.

You’d really have your work cut out for you.

Personally I’m a fan of any application or program that does the work of yanking actual content out of the swirling pool of muck that obscures the Internet.

The fact that this one is gasping for air makes the state of affairs all the more … disheartening. 😦

P.S.: To get this rolling, you’ll need your distro’s version of Perl’s LWP::Protocol::https. In Arch, that’s perl-lwp-protocol-https, and in Debian it should be liblwp-protocol-https-perl.
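In other words, one of these, depending on where you live:

    sudo pacman -S perl-lwp-protocol-https              # Arch
    sudo apt-get install liblwp-protocol-https-perl     # Debian and friends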