sort
The sort program sorts the contents of standard input, or one or more files specified on the command line, and sends the results to standard output. Using the same technique that we used with cat, we can demonstrate processing of standard input directly from the keyboard:
[me@linuxbox ~]$ sort > foo.txt
c
b
a
[me@linuxbox ~]$ cat foo.txt
a
b
c
After entering the command, we type the letters “c”, “b”, and “a”, followed once again by Ctrl-d to indicate end-of-file. We then view the resulting file and see that the lines now appear in sorted order.
Since sort can accept multiple files on the command line as arguments, it is possible to merge multiple files into a single sorted whole. For example, if we had three text files and wanted to combine them into a single sorted file, we could do something like this:
sort file1.txt file2.txt file3.txt > final_sorted_list.txt
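As a minimal sketch of the command above, the following creates three small files in a scratch directory and combines them. The file contents and the tmpdir variable are made up for illustration, and LC_ALL=C is set only to make the ordering reproducible across systems.

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf 'pear\napple\n'  > "$tmpdir/file1.txt"
printf 'fig\n'          > "$tmpdir/file2.txt"
printf 'banana\nkiwi\n' > "$tmpdir/file3.txt"

# All three inputs are read and sorted together as a single list.
LC_ALL=C sort "$tmpdir/file1.txt" "$tmpdir/file2.txt" "$tmpdir/file3.txt" \
    > "$tmpdir/final_sorted_list.txt"

merged=$(cat "$tmpdir/final_sorted_list.txt")
printf '%s\n' "$merged"
rm -r "$tmpdir"
```

The input files do not need to be sorted beforehand; sort orders the combined contents.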
sort has several interesting options. Here is a partial list:
Table 20-1: Common sort Options
Option | Long Option | Description |
-b | --ignore-leading-blanks | By default, sorting is performed on the entire line, starting with the first character in the line. This option causes sort to ignore leading spaces in lines and calculates sorting based on the first non-whitespace character on the line. |
-f | --ignore-case | Makes sorting case-insensitive. |
-n | --numeric-sort | Performs sorting based on the numeric evaluation of a string. Using this option allows sorting to be performed on numeric values rather than alphabetic values. |
-r | --reverse | Sort in reverse order. Results are in descending rather than ascending order. |
-k | --key=field1[,field2] | Sort based on a key field located from field1 to field2 rather than the entire line. See discussion below. |
-m | --merge | Treat each argument as the name of a presorted file. Merge multiple files into a single sorted result without performing any additional sorting. |
-o | --output=file | Send sorted output to file rather than standard output. |
-t | --field-separator=char | Define the field-separator character. By default fields are separated by spaces or tabs. |
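To illustrate two of the options in the table, here is a short sketch that combines -m and -o. The file names and contents are hypothetical, and both inputs are deliberately written in sorted order, since -m assumes its inputs are presorted.

```shell
tmpdir=$(mktemp -d)
printf 'a\nc\ne\n' > "$tmpdir/odd.txt"    # already in sorted order
printf 'b\nd\n'    > "$tmpdir/even.txt"   # already in sorted order

# -m interleaves the presorted inputs; -o names the output file
# instead of relying on shell redirection.
LC_ALL=C sort -m -o "$tmpdir/merged.txt" "$tmpdir/odd.txt" "$tmpdir/even.txt"

merged_result=$(cat "$tmpdir/merged.txt")
printf '%s\n' "$merged_result"
rm -r "$tmpdir"
```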
Although most of the options above are pretty self-explanatory, some are not. First, let’s look at the -n option, used for numeric sorting. With this option, it is possible to sort values based on numeric values. We can demonstrate this by sorting the results of the du command to determine the largest users of disk space. Normally, the du command lists the results of a summary in pathname order:
[me@linuxbox ~]$ du -s /usr/share/* | head
252 /usr/share/aclocal
96 /usr/share/acpi-support
8 /usr/share/adduser
196 /usr/share/alacarte
344 /usr/share/alsa
8 /usr/share/alsa-base
12488 /usr/share/anthy
8 /usr/share/apmd
21440 /usr/share/app-install
48 /usr/share/application-registry
In this example, we pipe the results into head to limit the results to the first ten lines. We can produce a numerically sorted list to show the ten largest consumers of space this way:
[me@linuxbox ~]$ du -s /usr/share/* | sort -nr | head
509940 /usr/share/locale-langpack
242660 /usr/share/doc
197560 /usr/share/fonts
179144 /usr/share/gnome
146764 /usr/share/myspell
144304 /usr/share/gimp
135880 /usr/share/dict
76508 /usr/share/icons
68072 /usr/share/apps
62844 /usr/share/foomatic
By using the -nr options, we produce a reverse numerical sort, with the largest values appearing first in the results. This sort works because the numerical values occur at the beginning of each line. But what if we want to sort a list based on some value found within the line? For example, the results of an ls -l:
[me@linuxbox ~]$ ls -l /usr/bin | head
total 152948
-rwxr-xr-x 1 root root  34824 2016-04-04 02:42 [
-rwxr-xr-x 1 root root 101556 2007-11-27 06:08 a2p
-rwxr-xr-x 1 root root  13036 2016-02-27 08:22 aconnect
-rwxr-xr-x 1 root root  10552 2007-08-15 10:34 acpi
-rwxr-xr-x 1 root root   3800 2016-04-14 03:51 acpi_fakekey
-rwxr-xr-x 1 root root   7536 2016-04-19 00:19 acpi_listen
-rwxr-xr-x 1 root root   3576 2016-04-29 07:57 addpart
-rwxr-xr-x 1 root root  20808 2016-01-03 18:02 addr2line
-rwxr-xr-x 1 root root 489704 2016-10-09 17:02 adept_batch
Ignoring, for the moment, that ls can sort its results by size, we could use sort to sort this list by file size, as well:
[me@linuxbox ~]$ ls -l /usr/bin | sort -nr -k 5 | head
-rwxr-xr-x 1 root root 8234216 2016-04-07 17:42 inkscape
-rwxr-xr-x 1 root root 8222692 2016-04-07 17:42 inkview
-rwxr-xr-x 1 root root 3746508 2016-03-07 23:45 gimp-2.4
-rwxr-xr-x 1 root root 3654020 2016-08-26 16:16 quanta
-rwxr-xr-x 1 root root 2928760 2016-09-10 14:31 gdbtui
-rwxr-xr-x 1 root root 2928756 2016-09-10 14:31 gdb
-rwxr-xr-x 1 root root 2602236 2016-10-10 12:56 net
-rwxr-xr-x 1 root root 2304684 2016-10-10 12:56 rpcclient
-rwxr-xr-x 1 root root 2241832 2016-04-04 05:56 aptitude
-rwxr-xr-x 1 root root 2202476 2016-10-10 12:56 smbcacls
Many uses of sort involve the processing of tabular data, such as the results of the ls command above. If we apply database terminology to the table above, we would say that each row is a record and that each record consists of multiple fields, such as the file attributes, link count, filename, file size and so on. sort is able to process individual fields. In database terms, we are able to specify one or more key fields to use as sort keys. In the example above, we specify the n and r options to perform a reverse numerical sort and specify -k 5 to make sort use the fifth field as the key for sorting.
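The effect of -n on a key field can be seen on a tiny, made-up table. This is only a sketch: the three records below (name, owner, size) are hypothetical, and LC_ALL=C is set so the text comparison behaves the same everywhere.

```shell
# Hypothetical three-column records: name, owner, size in bytes.
records='notes.txt me 900
video.mkv me 120000
todo.txt me 45'

# Without -n, field 3 is compared character by character as text,
# so 120000 comes first, then 45, then 900.
text_order=$(printf '%s\n' "$records" | LC_ALL=C sort -k 3)

# With -n and -r, field 3 is compared as a number, largest first.
size_order=$(printf '%s\n' "$records" | LC_ALL=C sort -nr -k 3)

printf 'text order:\n%s\n\nnumeric, reversed:\n%s\n' "$text_order" "$size_order"
```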
The k option is very interesting and has many features, but first we need to talk about how sort defines fields. Let’s consider a very simple text file consisting of a single line containing the author’s name:
William Shotts
By default, sort sees this line as having two fields. The first field contains the characters:
“William”
and the second field contains the characters:
“ Shotts”
meaning that whitespace characters (spaces and tabs) are used as delimiters between fields and that the delimiters are included in the field when sorting is performed.
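Because the delimiters belong to the field, the number of spaces in front of a field can change the result. The sketch below uses two invented records to show this, along with the b modifier that tells sort to skip those leading blanks. LC_ALL=C is assumed so that a space compares lower than any letter.

```shell
# Two records whose second field is preceded by a different number of spaces.
lines='x  b
x a'

# The key starts at field 2, and its leading blanks are part of the key,
# so the record with two spaces sorts first.
with_blanks=$(printf '%s\n' "$lines" | LC_ALL=C sort -k 2)

# The b modifier skips the leading blanks, so "a" and "b" are compared.
without_blanks=$(printf '%s\n' "$lines" | LC_ALL=C sort -k 2b)

printf 'blanks included:\n%s\n\nblanks ignored:\n%s\n' "$with_blanks" "$without_blanks"
```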
Looking again at a line from our ls output, we can see that a line contains eight fields and that the fifth field is the file size:
-rwxr-xr-x 1 root root 8234216 2016-04-07 17:42 inkscape
For our next series of experiments, let’s consider the following file containing the history of three popular Linux distributions released from 2006 to 2008. Each line in the file has three fields: the distribution name, version number, and date of release in MM/DD/YYYY format:
SUSE        10.2    12/07/2006
Fedora      10      11/25/2008
SUSE        11.0    06/19/2008
Ubuntu      8.04    04/24/2008
Fedora      8       11/08/2007
SUSE        10.3    10/04/2007
Ubuntu      6.10    10/26/2006
Fedora      7       05/31/2007
Ubuntu      7.10    10/18/2007
Ubuntu      7.04    04/19/2007
SUSE        10.1    05/11/2006
Fedora      6       10/24/2006
Fedora      9       05/13/2008
Ubuntu      6.06    06/01/2006
Ubuntu      8.10    10/30/2008
Fedora      5       03/20/2006
Using a text editor (perhaps vim), we’ll enter this data and name the resulting file distros.txt.
Next, we’ll try sorting the file and observe the results:
[me@linuxbox ~]$ sort distros.txt
Fedora      10      11/25/2008
Fedora      5       03/20/2006
Fedora      6       10/24/2006
Fedora      7       05/31/2007
Fedora      8       11/08/2007
Fedora      9       05/13/2008
SUSE        10.1    05/11/2006
SUSE        10.2    12/07/2006
SUSE        10.3    10/04/2007
SUSE        11.0    06/19/2008
Ubuntu      6.06    06/01/2006
Ubuntu      6.10    10/26/2006
Ubuntu      7.04    04/19/2007
Ubuntu      7.10    10/18/2007
Ubuntu      8.04    04/24/2008
Ubuntu      8.10    10/30/2008
Well, it mostly worked. The problem occurs in the sorting of the Fedora version numbers. Since a “1” comes before a “5” in the character set, version “10” ends up at the top while version “9” falls to the bottom.
To fix this problem we are going to have to sort on multiple keys. We want to perform an alphabetic sort on the first field and then a numeric sort on the second field. sort allows multiple instances of the -k option so that multiple sort keys can be specified. In fact, a key may include a range of fields. If no range is specified (as has been the case with our previous examples), sort uses a key that begins with the specified field and extends to the end of the line. Here is the syntax for our multi-key sort:
[me@linuxbox ~]$ sort --key=1,1 --key=2n distros.txt
Fedora      5       03/20/2006
Fedora      6       10/24/2006
Fedora      7       05/31/2007
Fedora      8       11/08/2007
Fedora      9       05/13/2008
Fedora      10      11/25/2008
SUSE        10.1    05/11/2006
SUSE        10.2    12/07/2006
SUSE        10.3    10/04/2007
SUSE        11.0    06/19/2008
Ubuntu      6.06    06/01/2006
Ubuntu      6.10    10/26/2006
Ubuntu      7.04    04/19/2007
Ubuntu      7.10    10/18/2007
Ubuntu      8.04    04/24/2008
Ubuntu      8.10    10/30/2008
Though we used the long form of the option for clarity, -k 1,1 -k 2n would be exactly equivalent. In the first instance of the key option, we specified a range of fields to include in the first key. Since we wanted to limit the sort to just the first field, we specified 1,1 which means “start at field one and end at field one.” In the second instance, we specified 2n, which means that field 2 is the sort key and that the sort should be numeric. An option letter may be included at the end of a key specifier to indicate the type of sort to be performed. These option letters are the same as the global options for the sort program: b (ignore leading blanks), n (numeric sort), r (reverse sort), and so on.
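The difference between a plain sort and the two-key sort can be reproduced without a file by piping a few records into sort. This is a minimal sketch: the four records are a reduced, two-column stand-in for the distribution list, and LC_ALL=C is assumed for a predictable character order.

```shell
# A reduced stand-in for the distribution list: name and version only.
versions='Fedora 10
Fedora 9
SUSE 10.2
Fedora 5'

# Whole-line text comparison: "10" lands before "5" and "9".
plain=$(printf '%s\n' "$versions" | LC_ALL=C sort)

# Key 1 is field 1 only (text); key 2 starts at field 2 and is numeric.
multi_key=$(printf '%s\n' "$versions" | LC_ALL=C sort -k 1,1 -k 2n)

printf 'plain:\n%s\n\nmulti-key:\n%s\n' "$plain" "$multi_key"
```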
The third field in our list contains a date in an inconvenient format for sorting. On computers, dates are usually formatted in YYYY-MM-DD order to make chronological sorting easy, but ours are in the American format of MM/DD/YYYY. How can we sort this list in chronological order?
Fortunately, sort provides a way. The key option allows specification of offsets within fields, so we can define keys within fields:
[me@linuxbox ~]$ sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt
Fedora      10      11/25/2008
Ubuntu      8.10    10/30/2008
SUSE        11.0    06/19/2008
Fedora      9       05/13/2008
Ubuntu      8.04    04/24/2008
Fedora      8       11/08/2007
Ubuntu      7.10    10/18/2007
SUSE        10.3    10/04/2007
Fedora      7       05/31/2007
Ubuntu      7.04    04/19/2007
SUSE        10.2    12/07/2006
Ubuntu      6.10    10/26/2006
Fedora      6       10/24/2006
Ubuntu      6.06    06/01/2006
SUSE        10.1    05/11/2006
Fedora      5       03/20/2006
By specifying -k 3.7 we instruct sort to use a sort key that begins at the seventh character within the third field, which corresponds to the start of the year. Likewise, we specify -k 3.1 and -k 3.4 to isolate the month and day portions of the date. We also add the n and r options to achieve a reverse numeric sort. The b option is included to suppress the leading spaces (whose numbers vary from line to line, thereby affecting the outcome of the sort) in the date field.
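The same three keys can be tried on a handful of records. The sketch below assumes the records are tab-separated and uses only four of the lines from the list; the sample and newest_first variable names are made up for the example.

```shell
# Four records in name / version / MM/DD/YYYY layout, separated by tabs.
# printf reuses its format string for each group of three arguments.
sample=$(printf '%s\t%s\t%s\n' \
    SUSE   10.2 12/07/2006 \
    Fedora 10   11/25/2008 \
    Ubuntu 8.04 04/24/2008 \
    Fedora 9    05/13/2008)

# Year (character 7 of field 3), then month (character 1), then day
# (character 4), each compared numerically and in reverse.
newest_first=$(printf '%s\n' "$sample" |
    LC_ALL=C sort -k 3.7nbr -k 3.1nbr -k 3.4nbr)

printf '%s\n' "$newest_first"
```

Each key runs from its starting character to the end of the line, but a numeric comparison reads only the leading digits, so the slash after the month or day ends the number.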
Some files don’t use tabs and spaces as field delimiters; for example, the /etc/passwd file:
[me@linuxbox ~]$ head /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
lp:x:7:7:lp:/var/spool/lpd:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
The fields in this file are delimited with colons (:), so how would we sort this file using a key field? sort provides the -t option to define the field separator character. To sort the passwd file on the seventh field (the account’s default shell), we could do this:
[me@linuxbox ~]$ sort -t ':' -k 7 /etc/passwd | head
me:x:1001:1001:Myself,,,:/home/me:/bin/bash
root:x:0:0:root:/root:/bin/bash
dhcp:x:101:102::/nonexistent:/bin/false
gdm:x:106:114:Gnome Display Manager:/var/lib/gdm:/bin/false
hplip:x:104:7:HPLIP system user,,,:/var/run/hplip:/bin/false
klog:x:103:104::/home/klog:/bin/false
messagebus:x:108:119::/var/run/dbus:/bin/false
polkituser:x:110:122:PolicyKit,,,:/var/run/PolicyKit:/bin/false
pulse:x:107:116:PulseAudio daemon,,,:/var/run/pulse:/bin/false
By specifying the colon character as the field separator, we can sort on the seventh field.
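The -t option combines with the key features shown earlier. As a sketch, the following sorts a few passwd-style lines numerically on the third field (the user ID). The lines are supplied inline rather than read from the real /etc/passwd so that the result does not depend on the system, and the variable names are made up for the example.

```shell
# A few passwd-style records, deliberately out of order.
accounts='games:x:5:60:games:/usr/games:/bin/sh
root:x:0:0:root:/root:/bin/bash
me:x:1001:1001:Myself,,,:/home/me:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Fields are split on colons; the key is field 3 only, compared as a number.
by_uid=$(printf '%s\n' "$accounts" | LC_ALL=C sort -t ':' -k 3,3n)

printf '%s\n' "$by_uid"
```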