Tag Archives: extract

patool: Multilingual

I ran into some time-consuming real-world issues yesterday, so I have to apologize for missing a post. I’ll make up for it today.

As today’s tool, or perhaps as yesterday’s tool with another to come, here’s patool.

I don’t use multiarchive tools much. Part of that is just that I rely on tar most of the time, unless I get a different format from another source. But usually, the things I compress are simply tar’ed up. That might make me one of the few people on the planet who knows the proper command sequence to un-tar something.

Regardless, patool has a few points that are worth discussion.

Most of patool seems to work as command-action-target format, so extracting a file — just about any compressed file, I might add — is as simple as patool extract file. The extension of the file appears to be irrelevant to patool — if I rename a file to show a different extension, it manages to extract it anyway.

Of course that might be the flexibility of the underlying compression tools in working with other formats. It’s hard to tell.
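
One way a tool can ignore the extension is to look at the first few bytes of the file, since most compression formats start with a fixed signature. The Python sketch below shows that idea only; it is not a claim about how patool itself decides. The function name sniff_format, the signature table and the demo file name are all made up for the example.

```python
import gzip
import os
import tempfile

# Leading "magic" bytes for a few common formats (well-known signatures).
MAGIC = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
]


def sniff_format(path):
    """Guess a compression format from file content, ignoring the name."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for signature, name in MAGIC:
        if head.startswith(signature):
            return name
    return None


# Demo: write gzip data into a file with a misleading .zip extension.
demo_dir = tempfile.mkdtemp()
mislabeled = os.path.join(demo_dir, "notes.zip")
with gzip.open(mislabeled, "wb") as handle:
    handle.write(b"hello")

print(sniff_format(mislabeled))  # prints "gzip" despite the .zip name
```

A real tool would need a longer table and a fallback for formats without a signature, so treat this as the bare idea.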

patool does a couple of things that you might like. patool can directly repack an archive to switch formats, which could save you a few steps if you’re converting all your old 7zip files into something more modern.
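
To show what repacking amounts to, here is a small Python sketch that copies the contents of a zip archive into a gzip-compressed tar. It uses only the standard library, so it cannot read 7zip files, and it is an illustration of the concept rather than patool's own method. The function name repack_zip_to_tar and the file names are invented for the demo.

```python
import io
import os
import tarfile
import tempfile
import zipfile


def repack_zip_to_tar(zip_path, tar_path, mode="w:gz"):
    """Copy every regular file in a zip archive into a new tar archive."""
    with zipfile.ZipFile(zip_path) as source, tarfile.open(tar_path, mode) as target:
        for info in source.infolist():
            if info.is_dir():
                continue  # directories are implied by the member paths
            data = source.read(info)
            member = tarfile.TarInfo(name=info.filename)
            member.size = len(data)
            target.addfile(member, io.BytesIO(data))


# Demo with a throwaway archive in a temporary directory.
work = tempfile.mkdtemp()
old_archive = os.path.join(work, "old.zip")
new_archive = os.path.join(work, "new.tar.gz")
with zipfile.ZipFile(old_archive, "w") as archive:
    archive.writestr("docs/readme.txt", "hello")

repack_zip_to_tar(old_archive, new_archive)
with tarfile.open(new_archive) as archive:
    print(archive.getnames())
```

Note that this sketch drops timestamps and permissions, which a proper repacking tool would be expected to carry over.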

And patool seems smart enough not to overwrite a file that exists already, and will instead create a folder and drop the target in it. Very convenient.

Like a lot of multiarchive tools, patool seems only as multilingual, in terms of archive formats, as what you have installed on your machine. So I’m guessing if you want the ability to decompress .ace files, you’ll need to install unace first. So from a technical standpoint, patool doesn’t really save you any disk space.

patool is Python-based, and in both AUR and Debian. If you’re interested in how it compares to multiarchive standbys like atool, unp or dtrx … give it a try and report back to us. 😀

html-xml-utils: A sweet suite

I’m in favor of any tool that can strip away the manure that masquerades as XML files. I have no earthly idea why anyone would use that style or arrangement voluntarily, especially when simpler and cleaner arrangements are so much … cleaner and simpler to work with. :\

So if you hand me a suite of 10 or 12 tools that scrape away at XML and HTML files, I’m like a kid on Christmas Day. Here’s html-xml-utils, which is just a toy box full of goodies. Which unfortunately means I can only show one or two.

hxnormalize, I imagine, improves readability for pages with frequent links. Go from this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
  <title>Simple page</title>
</head>

<body>

<h1>A simple HTML page</h1>

<p>This is a very simple HTML page, made from scratch for the purpose of testing some <a href="http://www.w3.org/Tools/HTML-XML-utils/man1/" target="_blank">tools</a> in the <a href="http://www.w3.org/Tools/HTML-XML-utils/" target="_blank">html-xml-utils</a> package.

</body>
</html>

to this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "">

<html>
  <head>
    <title>Simple page</title>
  </head>

  <body>
    <h1>A simple HTML page</h1>

    <p>This is a very simple HTML page, made from scratch for the
      purpose of testing some <a
      href="http://www.w3.org/Tools/HTML-XML-utils/man1/"
      target="_blank">tools</a> in the <a
      href="http://www.w3.org/Tools/HTML-XML-utils/"
      target="_blank">html-xml-utils</a> package.</p>
  </body>
</html>

Not only does every line break at a link, which makes them easy to spot, but some closing tags have been corrected, because I gave hxnormalize the -x flag.

I can re-use my example with hxprintlinks, which will number every link in the document, and add a reference list at the bottom of the page.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
  <title>Simple page</title>
</head>
<body>
<h1>A simple HTML page</h1>
<p>This is a very simple HTML page, made from scratch for the purpose of testing some <a href="http://www.w3.org/Tools/HTML-XML-utils/man1/" target="_blank">[1]tools</a> in the <a href="http://www.w3.org/Tools/HTML-XML-utils/" target="_blank">[2]html-xml-utils</a> package.

<ol>
<li>http://www.w3.org/Tools/HTML-XML-utils/man1/</li>
<li>http://www.w3.org/Tools/HTML-XML-utils/</li>
</ol>
</body>
</html>

Of course, pipe hxnormalize into hxprintlinks, and some of that will be cleaned up a little. 😉
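
The numbering idea behind hxprintlinks can be sketched in Python with the standard library's html.parser. This is only an illustration of collecting links in document order, not the tool's implementation; the LinkCollector class and the shortened page string are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


page = """<p>Testing some <a href="http://www.w3.org/Tools/HTML-XML-utils/man1/">tools</a>
in the <a href="http://www.w3.org/Tools/HTML-XML-utils/">html-xml-utils</a> package."""

collector = LinkCollector()
collector.feed(page)

# Print a numbered reference list, one URL per line.
for number, url in enumerate(collector.links, start=1):
    print(f"{number}. {url}")
```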

If you remember xidel or xmlstarlet, you might remember how it’s possible to pull single elements out of an XML file, for further editing. hxextract can do that, and here are the results of hxextract command .config/openbox/rc.xml on my system:

kmandla@6m47421: ~/downloads$ hxextract command rc.xml 
<command>gmrun</command><command>urxvtc -e alpine -d 0</command><command>urxvtc -e wicd-curses</command><command>urxvtc -g 142x60 -e /home/kmandla/.scripts/mc.sh</command><command>/home/kmandla/.scripts/cleanup.sh</command><command>urxvtc -e htop</command><command>urxvtc -e alsamixer</command><command>/home/kmandla/.scripts/volume.sh</command><command>urxvtc -e alsamixer -D equal</command><command>urxvtc -g 142x60 -e elinks</command><command>/home/kmandla/.scripts/browser.sh</command><command>urxvtc -g 35x9 -e tty-clock -x -t -B</command><command>urxvtc -g 24x12 -e clockywock</command><command>urxvtc -e vim</command><command>urxvtc -e sc</command><command>urxvtc -e wyrd</command><command>urxvtc -e tudu</command><command>urxvtc -e mocp</command><command>pidgin</command><command>urxvtc -g 80x24 -title rhapsody -e /home/kmandla/.scripts/chatnews.sh</command><command>urxvtc</command>

Not pretty, but a step forward in terms of finding miscreant keyboard commands in my rc.xml file.
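
For a one-per-line version of the same extraction, a few lines of Python with xml.etree.ElementTree will do. This is a sketch, not a replacement for hxextract: the SAMPLE string is a cut-down stand-in for a real rc.xml, the namespace in it is my assumption about what Openbox writes, and the helper names are invented.

```python
import xml.etree.ElementTree as ET

# A cut-down stand-in for an Openbox rc.xml; the real file is much longer.
SAMPLE = """<openbox_config xmlns="http://openbox.org/3.4/rc">
  <keyboard>
    <keybind key="W-r"><action name="Execute"><command>gmrun</command></action></keybind>
    <keybind key="W-h"><action name="Execute"><command>urxvtc -e htop</command></action></keybind>
  </keyboard>
</openbox_config>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_text, wanted):
    """Return the text of every element whose local name is `wanted`."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.iter() if local_name(el.tag) == wanted]


commands = extract_text(SAMPLE, "command")
print("\n".join(commands))
```

Matching on the local name sidesteps the namespace, which is convenient for a quick look but too loose if two namespaces share an element name.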

There is a lot more — a lot more — available in html-xml-utils that I just don’t have the time and resources to touch on. Look for tools that will convert from XML to asc files, tools that will build tables of contents and bibliographies for entire trees of files, and even a few that transpose tables or just pull out links. That one, hxwls, is mighty clever. …

I leave it to you to explore the rest of that suite. If you’re like me and can only scratch your head at the ascent of XML as a data format, this will be fun for you to play with.

Oh, and I almost forgot: Theodore gets credit for mentioning this one. Thanks, Theodore. 😉

xmlstarlet: A superstar for XML

xidel was kind to me, reducing much of my boiling invective for XML configuration files to a rolling simmer aimed at the inconvenience. xmlstarlet has the potential to cool that simmer to a lukewarm distaste.

Now neither tool alone will ever quench my hatred, and together I doubt very much they’d be able to do more than keep my day from turning black. But K.Mandla is definitely mellowing out.

xmlstarlet is a collection of tools for formatting, polling, editing and transforming XML files. That alone is only a sliver of what it can do, and in the right hands it would no doubt be quite a weapon.

For us mere mortals, it’s nice to be able to reformat files and get their components nested cleanly, or to see the breakdown between elements, or just validate them to make sure they’re not broken. (That’s the worst, a broken XML configuration file. I writhe just to think about it.)

But xmlstarlet can apparently also count the number of elements matching an expression, trickle through an XML document and total up specific elements and output to a table, and if you’re lucky, even make a list of links embedded in an XHTML file. Take a look at the documentation if you don’t believe me.
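
Two of those jobs, checking that a file is not broken and counting elements, are easy to sketch in Python. This is not xmlstarlet's syntax or behavior, just the same ideas with the standard library. The SAMPLE menu and the is_well_formed helper are invented for the example, and the check covers well-formedness only, not validation against a schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<menu>
  <item label="Terminal"><action name="Execute"><execute>urxvtc</execute></action></item>
  <item label="Editor"><action name="Execute"><execute>urxvtc -e vim</execute></action></item>
  <separator/>
</menu>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(SAMPLE))
print(is_well_formed("<menu><item></menu>"))  # mismatched tags

# Tally every element by tag name and print a small two-column table.
tally = Counter(element.tag for element in ET.fromstring(SAMPLE).iter())
for tag, count in tally.most_common():
    print(f"{tag:<10} {count}")
```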

Little people like me, who never have an occasion to work with XML other than my .config/openbox/rc.xml file, will find the most basic xmlstarlet tricks to be sufficient reason to keep it around. “It can clean up my menu.xml file? Please, please, please!”

On the other hand, that’s just a tiny taste of what xmlstarlet can do, and a brief spin past the documentation will make that abundantly clear. Make sure you take a close look at this one before you move to the next tool du jour. You’ll be missing out otherwise. 😉

xidel: Taking away the pain of XML

I avoid XML like the plague. I am not a programmer, so configuration files and software that use XML are anathema to me. And where I have to use it, like in Openbox’s rc.xml and menu.xml files, I look for just about any way out of it.

xidel describes itself as a tool that will “download and extract data from HTML/XML pages.” The home page supplies quite a few examples of that.

Yes, xidel can retrieve web pages, and yes, xidel can extract the data that’s embedded in them, so you don’t have to pick through it to find what you need.

But it can also sift through configuration files and pull out, for example, the programs executed in an Openbox menu.xml file.
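
As a rough picture of that kind of extraction, here is a Python sketch that pairs each menu label with the program it runs. It does not use xidel or its query syntax. The MENU fragment is invented, and both its layout and its namespace are my assumptions about what an Openbox menu.xml looks like.

```python
import xml.etree.ElementTree as ET

# An invented fragment in the style of an Openbox menu.xml.
MENU = """<openbox_menu xmlns="http://openbox.org/3.4/menu">
  <menu id="root-menu" label="Openbox 3">
    <item label="Terminal"><action name="Execute"><execute>urxvtc</execute></action></item>
    <item label="Mail"><action name="Execute"><execute>urxvtc -e alpine</execute></action></item>
    <separator/>
    <item label="Exit"><action name="Exit"/></item>
  </menu>
</openbox_menu>"""

# Prefix-to-namespace map so the paths below can use the short "ob:" prefix.
NS = {"ob": "http://openbox.org/3.4/menu"}

root = ET.fromstring(MENU)
programs = []
for item in root.iterfind(".//ob:item", NS):
    command = item.findtext("ob:action/ob:execute", default=None, namespaces=NS)
    if command is not None:  # skip items that do not execute anything
        programs.append((item.get("label"), command.strip()))

for label, command in programs:
    print(f"{label}: {command}")
```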

For someone like me, who considers XML to be cruel and unusual punishment, that is a very nifty trick. The next time I need to switch window managers and want to convert the list of keybindings I know, xidel will be there to expedite it.

At this point you might ask, “What’s the benefit of this over an HTML stripper, perhaps like dehtml?”

Mostly in its flexibility, I would answer. dehtml yanks the core text out of an HTML page, but xidel allows you to filter or search through a file, and control the output.

I’m definitely no expert, but it only took me about 20 minutes with a few examples to get xidel working how I wanted. If you need to wrangle XML pages on a regular basis (and I feel bad for you if you do), I’m sure you can get xidel to work on your project in a matter of minutes.

Spend a little time with the parser documentation, and you’ll see how you can send extracted data into variables, loop through documents for specific tags, and otherwise make your life sooo much easier.

I like it when a program makes my life easier. 😀

tarman: A fullscreen archive navigator

For some reason, tar has a reputation for being cryptic or difficult to handle.

That’s mystifying to me, probably because I use it on a weekly basis as a bland, uncompressed file bundler. For me, tar cvvf package.tar file1 file2 file3 is certainly no challenge to remember. I can think of far more complex and unintuitive software in the Linux landscape.
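
For comparison, the same bundling job can be done from Python's tarfile module. This is a sketch of an equivalent, not something tar does internally; the temporary directory and the three small files are created only so the example runs on its own.

```python
import os
import tarfile
import tempfile

# Create three small files to bundle.
work = tempfile.mkdtemp()
names = ["file1", "file2", "file3"]
for name in names:
    with open(os.path.join(work, name), "w") as handle:
        handle.write(f"contents of {name}\n")

package = os.path.join(work, "package.tar")

# Mode "w" writes a plain, uncompressed tar, like the c and f flags together.
with tarfile.open(package, "w") as archive:
    for name in names:
        # arcname keeps the temporary directory out of the stored path.
        archive.add(os.path.join(work, name), arcname=name)

# Listing the result stands in for tar's verbose output.
with tarfile.open(package) as archive:
    members = archive.getnames()
print(members)
```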

For those who can’t handle the challenge of remembering c and v and f and tarname and filename, they may want to look into tarman.

It’s been a while since our last fullscreen archive manager — 2a, if I remember right. tarman pulls the same stunt as 2a, but does it in a cleaner fashion, I believe.

tarman works a lot like Xarchiver or File Roller, in that you can navigate your directory tree and the archived files within it. Select a file or several files, press a to archive them. Ta-da!

Or alternatively, enter an archive (tarman seems to be able to handle bzip2-compressed tar archives; it may know others too), select a file, press e and it will be extracted. Ta-da! Again!

Press F1 or ? for in-your-face help cues. Press q to quit.

As for faults, I can only mention a flickering effect on each keypress. I think tarman is trying to refresh the display at every keystroke, and the redraw is flashing as a result. A little irritating, but minor, and probably something that can be corrected.

That’s about it. Not a lot to it, and it does the job well.

And you avoid all the stress of trying to remember three letters and a couple of names. 🙄 👿

ps*: The splat meaning, “Whatever you want”

I probably should have just dumped ps2ascii and ps2pdf in with all the other ps-entitled tools that I have in my list, like I did with pdf-entitled tools.

Truth be told though, it seems the vast majority of Postscript-related tools have already been lumped into one megakit — the aptly named psutils. To include …

  • psbook: Rearranges pages.
  • psmerge: Merges multiple PS files.
  • psnup: Puts several PS pages on a sheet of paper.
  • psresize: Changes document size.
  • psselect: Splits out pages from PS files, if I remember right.
  • pstops: More page rearrangement.

As well as a healthy rasher of scripts to further manhandle your PS file collection.

It looks like the bulk of the applications in the psutils package are around 20 years old. That’s either a turnoff for you if you think old software sucks, or a moot point if you think Postscript files haven’t evolved much even in that amount of time. 🙄

There’s more than just what’s in psutils though. Here’s …

  • pslib: A C library for creating PostScript files.
  • pspresent: A fullscreen Postscript presentation tool, more intended for X than just the console.
  • psrip: Yanks images from PS files.
  • pstoedit: Translates PostScript and PDF graphics into vector formats.
  • ps2eps: For making the short leap from PS to EPS.
  • pstotext: Extracts text from PS files.

And as always, my list is not comprehensive. I am sure there are dozens more hiding out there. 😉