Tag Archives: parse

pup: Playing fetch with HTML

Every month I export the posts from this site, grind away at the XML file, pluck out titles and links, and rearrange them to form an index page. Don’t say thank you; I do it for me as much as anyone else. I can’t remember everything I’ve covered in the past two years, and that index has saved me more than once. :\

Point being, it takes a small measure of grep, plus some rather tedious vim footwork to get everything arranged in the proper order and working.

You know what would be nice? If some tool could skim through that XML file, extract just the link and title fields, and prettify them to make my task a bit easier.

pup can do that.

2014-11-08-2sjx281-pup-01

Oh, that is so wonderful. … 🙄

In that very rudimentary example, pup took the file, the field I wanted, and sifted through for all the matching tags before dumping it into the index file.

pup will also colorize and format HTML for the sake of easy viewing, and the effect is again, oh-so wonderful.

2014-11-08-2sjx281-pup-02

That might remind you of tidyhtml, the savior of sloppy HTML coders everywhere, and you could conceivably use it that way. pup can do a lot more than that, though.

You can parse for multiple tags with pup, filter out specific IDs nestled in <span> tags, print from selected nodes and pluck out selectors. And a lot more that I don’t quite understand fully. 😳

It is possible that you could do some of what pup does with a crafty combination of things like sed or grep. Then again, pup seems confident in its HTML expertise, and the way it is designed is easy to figure out.

And for those of you who won’t deal with software more than a few months old, I can see that at the time of this writing, pup had been updated within the week. So it’s quite fresh. Try pup without fear of poisoning your system with year-old programs. 😉