I’m in favor of any tool that can strip away the manure that masquerades as XML files. I have no earthly idea why anyone would use that style or arrangement voluntarily, especially when simpler and cleaner arrangements are so much … cleaner and simpler to work with. 
So if you hand me a suite of 10 or 12 tools that scrape away at XML and HTML files, I’m like a kid on Christmas Day. Here’s html-xml-utils, which is just a toy box full of goodies. Which unfortunately means I can only show one or two.
hxnormalize, I imagine, improves readability for pages with frequent links. Go from this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>Simple page</title>
</head>
<body>
<h1>A simple HTML page</h1>
<p>This is a very simple HTML page, made from scratch for the purpose of testing some <a href="http://www.w3.org/Tools/HTML-XML-utils/man1/" target="_blank">tools</a> in the <a href="http://www.w3.org/Tools/HTML-XML-utils/" target="_blank">html-xml-utils</a> package.
</body>
</html>
to this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
<html>
<head>
<title>Simple page</title>
</head>
<body>
<h1>A simple HTML page</h1>
<p>This is a very simple HTML page, made from scratch for the
purpose of testing some <a
href="http://www.w3.org/Tools/HTML-XML-utils/man1/"
target="_blank">tools</a> in the <a
href="http://www.w3.org/Tools/HTML-XML-utils/"
target="_blank">html-xml-utils</a> package.</p>
</body>
</html>
Not only does every line break at a link, which makes them easy to spot, but some closing tags have been corrected, because I gave hxnormalize the -x flag.
I can re-use my example with hxprintlinks, which will number every link in the document, and add a reference list at the bottom of the page.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>Simple page</title>
</head>
<body>
<h1>A simple HTML page</h1>
<p>This is a very simple HTML page, made from scratch for the purpose of testing some <a href="http://www.w3.org/Tools/HTML-XML-utils/man1/" target="_blank">[1]tools</a> in the <a href="http://www.w3.org/Tools/HTML-XML-utils/" target="_blank">[2]html-xml-utils</a> package.
<ol>
<li>http://www.w3.org/Tools/HTML-XML-utils/man1/</li>
<li>http://www.w3.org/Tools/HTML-XML-utils/</li>
</ol>
</body>
</html>
Of course, pipe hxnormalize into hxprintlinks, and some of that will be cleaned up a little.
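If you're curious what the link-gathering half of that job looks like, here is a rough imitation in Python. To be clear, this is my own sketch using the standard library's html.parser, not how hxprintlinks is actually written; the LinkCollector class name is made up for the example, and the page text is just the paragraph from my sample file.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us lowercase tag names and a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# The paragraph from the sample page above.
page = (
    '<p>This is a very simple HTML page, made from scratch for the purpose '
    'of testing some <a href="http://www.w3.org/Tools/HTML-XML-utils/man1/" '
    'target="_blank">tools</a> in the '
    '<a href="http://www.w3.org/Tools/HTML-XML-utils/" '
    'target="_blank">html-xml-utils</a> package.'
)

collector = LinkCollector()
collector.feed(page)
collector.close()

# Print a numbered reference list, one link per line.
for number, href in enumerate(collector.links, start=1):
    print(f"[{number}] {href}")
```

It only gathers and numbers the links; it does not rewrite the page the way the real tool does.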
If you remember xidel or xmlstarlet, you might remember how it’s possible to pull single elements out of an XML file, for further editing. hxextract can do that, and here are the results of hxextract command .config/openbox/rc.xml on my system:
kmandla@6m47421: ~/downloads$ hxextract command rc.xml
<command>gmrun</command><command>urxvtc -e alpine -d 0</command><command>urxvtc -e wicd-curses</command><command>urxvtc -g 142x60 -e /home/kmandla/.scripts/mc.sh</command><command>/home/kmandla/.scripts/cleanup.sh</command><command>urxvtc -e htop</command><command>urxvtc -e alsamixer</command><command>/home/kmandla/.scripts/volume.sh</command><command>urxvtc -e alsamixer -D equal</command><command>urxvtc -g 142x60 -e elinks</command><command>/home/kmandla/.scripts/browser.sh</command><command>urxvtc -g 35x9 -e tty-clock -x -t -B</command><command>urxvtc -g 24x12 -e clockywock</command><command>urxvtc -e vim</command><command>urxvtc -e sc</command><command>urxvtc -e wyrd</command><command>urxvtc -e tudu</command><command>urxvtc -e mocp</command><command>pidgin</command><command>urxvtc -g 80x24 -title rhapsody -e /home/kmandla/.scripts/chatnews.sh</command><command>urxvtc</command>
Not pretty, but a step forward in terms of finding miscreant keyboard commands in my rc.xml file.
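If the single-line output bothers you, it is easy enough to split afterward. Here is a minimal sketch in Python, assuming the output really is a flat run of command elements like the one above; the split_commands name is mine, and the sample string is just a few entries copied from that output.

```python
import re

# A few entries copied from the hxextract output above, still on one line.
raw = (
    "<command>gmrun</command>"
    "<command>urxvtc -e htop</command>"
    "<command>urxvtc -e alsamixer</command>"
    "<command>pidgin</command>"
)


def split_commands(text):
    """Return the text inside each <command>...</command> pair, in order."""
    # Non-greedy match, so each pair is captured on its own.
    return re.findall(r"<command>(.*?)</command>", text)


# One command per line, which is much easier to scan.
for cmd in split_commands(raw):
    print(cmd)
```

Swapping raw for sys.stdin.read() should let you pipe the real hxextract output straight into it, although I have only tried it against the sample string here.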
There is a lot more, a lot more, available in html-xml-utils that I just don’t have the time and resources to touch on. Look for tools that will convert from XML to asc files, tools that will build tables of contents and bibliographies for entire trees of files, and even a few that transpose tables or just pull out links. That one, hxwls, is mighty clever. …
I leave it to you to explore the rest of that suite. If you’re like me and can only scratch your head at the ascent of XML as a data format, this will be fun for you to play with.
Oh, and I almost forgot: Theodore gets credit for mentioning this one. Thanks, Theodore.