Everybody knows about sort, but nobody seems to know about uniq. More’s the pity, since uniq takes sort’s product and does still cooler things with it.
Consider: a list of 10,000 supposedly random words. They’re scrambled, and it’s difficult to see where words are repeated. How can we find out which words are duplicated, and how many times each appears?
Easier done than said, if you have uniq. uniq works best with sorted lists — actually, uniq doesn’t work very well at all without sorted lists — so let’s sort our list first.
sort test.txt > sorted.txt
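Why does sorting matter so much? Because uniq only collapses *adjacent* duplicate lines. A quick demonstration, with a few throwaway words fed straight in via printf:

```shell
# Unsorted: the two "apple"s never sit next to each other, so uniq misses one
printf 'apple\npear\napple\n' | uniq
# apple
# pear
# apple

# Sorted first: the duplicates meet, and uniq collapses them
printf 'apple\npear\napple\n' | sort | uniq
# apple
# pear
```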
Next, we fire up uniq. We want to know how many of each word there are, and it would be nice to see the highest counts first, rather than at the end of the list. And actually, just the first 20 would be enough to satisfy our curiosity. Ergo,
uniq -d -c sorted.txt | sort -r | head -20
The -d flag plucks out repeated words, rather than just listing everything (its opposite is -u, which shows only singletons). The -c flag prefixes each line with a count of how many times it appeared. We pipe that back through sort -r to reverse the output, so the biggest counts float to the top, and head just cuts off the list after the first 20. (Strictly speaking, sort -rn does a numeric reverse sort, but uniq -c pads its counts into a fixed-width column, so plain -r behaves here.) Simple, huh? 🙂
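To see the whole pipeline in miniature — the word list here is invented just for the demo, standing in for sorted.txt:

```shell
# A tiny stand-in: "cat" three times, "dog" twice, "bird" once
printf 'cat\ndog\ncat\nbird\ncat\ndog\n' | sort > sample.txt

uniq -d -c sample.txt | sort -r | head -20
# "cat" (count 3) tops the list, "dog" (count 2) follows;
# "bird" is a singleton, so -d drops it entirely
```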
“That’s not so special,” you say. “Why not just use sort -u, K.Mandla? Duh.” Because sort -u doesn’t show duplicated lines at all: it sorts the output and removes them. So not only do you not get the output you want, but if you write that output back over your original, you’ve hopelessly trashed your data file, because the duplicated entries have vanished. Duh. 😯
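The difference is easy to see side by side (again with a throwaway three-word list):

```shell
# sort -u silently merges the duplicates -- no counts, no hint anything repeated
printf 'cat\ndog\ncat\n' | sort -u
# cat
# dog

# sort | uniq -d shows ONLY the repeated word, which is what we were after
printf 'cat\ndog\ncat\n' | sort | uniq -d
# cat
```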
uniq has a few other options that will help you get the results you want. It’s particularly useful for finding similar names in lists, or comparing the contents of different directories. List them both, sort the mixed results, and pull out the duplicates with -d (or the odd ones out with -u).
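A sketch of that directory trick — dir1 and dir2 are made-up names, created here just so the example runs:

```shell
# Two hypothetical directories with one filename in common
mkdir -p dir1 dir2
touch dir1/a.txt dir1/b.txt dir2/b.txt dir2/c.txt

# Filenames present in BOTH directories
{ ls dir1; ls dir2; } | sort | uniq -d
# b.txt

# Filenames present in only ONE of them
{ ls dir1; ls dir2; } | sort | uniq -u
# a.txt
# c.txt
```

(When piped, ls prints one name per line, which is exactly the format sort and uniq want.)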
And where, pray tell, might one find this marvel of modern programming? In coreutils, of course. 😉