uniq: The unique solution

Everybody knows about sort, but nobody seems to know about uniq. More’s the pity, since uniq takes sort’s product and does still cooler things with it.

Consider: A list of 10,000 supposedly random words. They’re scrambled and it’s difficult to see where words are repeated. How can we find out which words are duplicated, and how many times each?

Easier done than said, if you have uniq. uniq works best with sorted lists. Actually, uniq doesn’t work very well at all without them, because it only compares each line with its neighbor. So let’s sort our list first.

sort test.txt > sorted.txt
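If you want to see why the sort step matters, here is a minimal sketch. The three-word sample file is made up for illustration (it is not the 10,000-word list), and it lives in a throwaway temp directory.

```shell
# Hypothetical sample data: "apple" appears twice, but not on adjacent lines.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\napple\n' > "$tmpdir/test.txt"

# Unsorted: the two "apple" lines are not neighbors, so uniq keeps both.
unsorted_lines=$(uniq "$tmpdir/test.txt" | wc -l | tr -d ' ')

# Sorted: the duplicates sit next to each other and collapse into one line.
sorted_lines=$(LC_ALL=C sort "$tmpdir/test.txt" | uniq | wc -l | tr -d ' ')

echo "unsorted: $unsorted_lines lines, sorted: $sorted_lines lines"
rm -rf "$tmpdir"
```

The unsorted run comes back with every line intact, while the sorted run loses the repeat. That is the whole reason for sorting first.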

Next, we fire up uniq. We want to know how many of each word, and it would be nice if we could see the highest numbers first, rather than at the end of the list. And actually, just the first 20 would be enough to satisfy our curiosity. Ergo,

uniq -d -c sorted.txt | sort -rn | head -n 20

Results?

[Image: 2014-06-10-6m47421-uniq]

The -d flag plucks out repeated words, rather than just listing everything (its opposite is -u, which shows only singletons). -c prefixes each line with the number of times it occurred. We pipe it back through sort -rn so the counts are compared as numbers and the biggest come first, and head just cuts off the list after the first 20. Simple, huh? 🙂
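To try the pipeline without a 10,000-word list on hand, something like the following should do. The seven-word sample is invented for the example, and the variable names are just placeholders.

```shell
# Hypothetical sample: apple x3, pear x2, fig and kiwi once each.
tmpdir=$(mktemp -d)
printf 'pear\napple\nfig\napple\npear\napple\nkiwi\n' > "$tmpdir/test.txt"
LC_ALL=C sort "$tmpdir/test.txt" > "$tmpdir/sorted.txt"

# Repeated words only (-d), each prefixed with its count (-c),
# then ordered by count, largest first, and capped at 20 lines.
repeated=$(uniq -d -c "$tmpdir/sorted.txt" | sort -rn | head -n 20)

# The opposite view: words that occur exactly once (-u).
singletons=$(uniq -u "$tmpdir/sorted.txt")

printf 'repeated:\n%s\nsingletons:\n%s\n' "$repeated" "$singletons"
rm -rf "$tmpdir"
```

With this sample, apple and pear land in the repeated list with their counts, and fig and kiwi turn up as the singletons.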

“That’s not so special,” you say. “Why not just use sort -u, K.Mandla. Duh.”

Because, duh, sort -u doesn’t show duplicated lines. It sorts the output and removes duplicate lines. So not only don’t you get the output you want, but if you write that result back over your data file, you’ve hopelessly trashed it, because the duplicated entries have vanished. Duh. 😯
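The difference is easy to see side by side. This is only a sketch on a made-up four-line file, with both results captured in variables rather than written over anything.

```shell
# Hypothetical sample: "b" appears twice, "a" and "c" once each.
tmpdir=$(mktemp -d)
printf 'b\na\nb\nc\n' > "$tmpdir/test.txt"

# sort -u: every distinct line exactly once, so you can no longer
# tell which lines were repeated in the first place.
deduped=$(LC_ALL=C sort -u "$tmpdir/test.txt")

# sort | uniq -d: only the lines that occurred more than once.
dupes=$(LC_ALL=C sort "$tmpdir/test.txt" | uniq -d)

printf 'sort -u gives:\n%s\nuniq -d gives:\n%s\n' "$deduped" "$dupes"
rm -rf "$tmpdir"
```

sort -u hands back a, b and c, with no hint that b was the repeat. uniq -d hands back b alone.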

uniq has a few other options that will help you get the results you want, such as -i to ignore case and -f or -s to skip leading fields or characters before comparing. It’s particularly useful for finding similar names in lists, or comparing the contents of different directories. List them both, sort the mixed results, and pull out the uniqs.
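The directory comparison might look roughly like this. The directory and file names are hypothetical, and the sketch assumes plain file names with no embedded newlines.

```shell
# Two hypothetical directories that share one file name.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/dir_a" "$tmpdir/dir_b"
touch "$tmpdir/dir_a/notes.txt" "$tmpdir/dir_a/todo.txt"
touch "$tmpdir/dir_b/notes.txt" "$tmpdir/dir_b/draft.txt"

# List both, sort the mixed results, then split them with uniq.
# A name that shows up twice must exist in both directories.
in_both=$( { ls "$tmpdir/dir_a"; ls "$tmpdir/dir_b"; } | LC_ALL=C sort | uniq -d )

# A name that shows up once exists in only one of them.
in_one=$( { ls "$tmpdir/dir_a"; ls "$tmpdir/dir_b"; } | LC_ALL=C sort | uniq -u )

printf 'in both:\n%s\nin only one:\n%s\n' "$in_both" "$in_one"
rm -rf "$tmpdir"
```

Note that uniq -u tells you a name is missing from one side, but not which side. For that you would still have to go and look.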

And where, pray tell, might one find this marvel of modern programming? In coreutils, of course. 😉
