pup: Playing fetch with HTML

Every month I export the posts from this site, grind away at the XML file, pluck out titles and links, and rearrange them to form an index page. Don’t say thank you; I do it for me as much as anyone else. I can’t remember everything I’ve covered in the past two years, and that index has saved me more than once. :\

Point being, it takes a small measure of grep, plus some rather tedious vim footwork to get everything arranged in the proper order and working.

You know what would be nice? If some tool could skim through that XML file, extract just the link and title fields, and prettify them to make my task a bit easier.

pup can do that.

[Screenshot: 2014-11-08-2sjx281-pup-01]

Oh, that is so wonderful. … 🙄

In that very rudimentary example, pup took the file and the field I wanted, sifted through for all the matching tags, and dumped them into the index file.
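
For anyone without pup on hand, the same kind of extraction can be roughed out in Python. This is only a sketch, and it is not how pup works internally. It assumes the export is RSS-style XML where each post is an `<item>` with `<title>` and `<link>` children; the function name `extract_index` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a real export file.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example site</title>
    <link>http://example.com</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def extract_index(xml_text):
    """Return (title, link) pairs for every <item> in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        pairs.append((title, link))
    return pairs


if __name__ == "__main__":
    for title, link in extract_index(SAMPLE):
        print(f"{title}: {link}")
```

Only `<item>` elements are visited, so the channel's own title and link stay out of the index.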

pup will also colorize and format HTML for the sake of easy viewing, and the effect is again, oh-so wonderful.

[Screenshot: 2014-11-08-2sjx281-pup-02]

That might remind you of tidyhtml, the savior of sloppy HTML coders everywhere, and you could conceivably use it that way. pup can do a lot more than that, though.

You can parse for multiple tags with pup, filter out specific IDs nestled in <span> tags, print from selected nodes and pluck out selectors. And a lot more that I don’t quite understand fully. 😳
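
To give a feel for what "filtering IDs nestled in `<span>` tags" involves, here is a small Python sketch using only the standard library. It is an illustration of the idea, not pup's own method. The class `SpanById` and the helper `span_text` are hypothetical names, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class SpanById(HTMLParser):
    """Collect the text inside <span> elements whose id matches."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.depth = 0    # greater than 0 while inside a matching span
        self.found = []   # one text entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A nested span inside a match: track it so the
            # matching close tag is counted correctly.
            self.depth += 1
        elif dict(attrs).get("id") == self.wanted_id:
            self.depth = 1
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def span_text(html, wanted_id):
    """Return the text of each <span id=wanted_id> found in html."""
    parser = SpanById(wanted_id)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.found
```

Calling `span_text(page, "intro")` on a page's HTML would return a list with the text of every span carrying that id, nested markup flattened to plain text.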

It is possible that you could do some of what pup does with a crafty combination of things like sed or grep. Then again, pup seems confident in its HTML expertise, and the way it is designed is easy to figure out.

And for those of you who won’t deal with software more than a few months old, I can see that at the time of this writing, pup had been updated within the week. So it’s quite fresh. Try pup without fear of poisoning your system with year-old programs. 😉

2 thoughts on “pup: Playing fetch with HTML”

  1. darkstarsword

    The lack of a decent index is a ridiculously widespread problem on blogs – it seems like the Blogger and WordPress designers never considered that anyone might want to access a post old enough to have dropped off the front page.

    For a blog I’ve been posting to on Blogger, I solved this by writing an index page in JavaScript using the Blogger API so that it will always be up to date (this has some site-specific features, like grouping posts for the same game and using post labels to group guide and misc posts separately, but it could be adapted for other Blogger sites with relative ease):
    http://helixmod.blogspot.com/2013/10/game-list-automatically-updated.html
    Source code: https://github.com/DarkStarSword/3d-fixes/tree/master/__game_list__

    If WordPress exposes a similar API, perhaps something similar could be created for it?

    1. K.Mandla Post author

      Thanks for that, I’ll see if that helps me at all with the code WordPress will export. If not, it’s not a huge deal. It’s 20-30 minutes out of a month, and not as huge a hassle as I make it sound. 😉
