< Previous | Contents | Next >
join
In some ways, join is like paste in that it adds columns to a file, but it uses a unique way to do it. A join is an operation usually associated with relational databases where data from multiple tables with a shared key field is combined to form a desired result.
The join program performs the same operation. It joins data from multiple files based on a shared key field.
To see how a join operation is used in a relational database, let’s imagine a very small database consisting of two tables, each containing a single record. The first table, called CUSTOMERS, has three fields: a customer number (CUSTNUM), the customer’s first name (FNAME), and the customer’s last name (LNAME):
CUSTNUM | FNAME | LNAME |
======== | ===== | ====== |
4681934 | John | Smith |
The second table is called ORDERS and contains four fields: an order number (ORDER- NUM), the customer number (CUSTNUM), the quantity (QUAN), and the item ordered (ITEM).
ORDERNUM | CUSTNUM | QUAN | ITEM |
======== | ======= | ==== | ==== |
3014953305 | 4681934 | 1 | Blue Widget |
Note that both tables share the field CUSTNUM. This is important, as it allows a relation- ship between the tables.
Performing a join operation would allow us to combine the fields in the two tables to achieve a useful result, such as preparing an invoice. Using the matching values in the CUSTNUM fields of both tables, a join operation could produce the following:
FNAME | LNAME | QUAN | ITEM |
===== | ===== | ==== | ==== |
John | Smith | 1 | Blue Widget |
To demonstrate the join program, we’ll need to make a couple of files with a shared key. To do this, we will use our distros-by-date.txt file. From this file, we will construct two additional files, one containing the release dates (which will be our shared key for this demonstration) and the release names:
[me@linuxbox ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt [me@linuxbox ~]$ paste distros-dates.txt distros-names.txt > distros- key-names.txt
[me@linuxbox ~]$ head distros-key-names.txt
11/25/2008 Fedora 10/30/2008 Ubuntu 06/19/2008 SUSE
05/13/2008 Fedora 04/24/2008 Ubuntu 11/08/2007 Fedora 10/18/2007 Ubuntu
[me@linuxbox ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt [me@linuxbox ~]$ paste distros-dates.txt distros-names.txt > distros- key-names.txt
[me@linuxbox ~]$ head distros-key-names.txt
11/25/2008 Fedora 10/30/2008 Ubuntu 06/19/2008 SUSE
05/13/2008 Fedora 04/24/2008 Ubuntu 11/08/2007 Fedora 10/18/2007 Ubuntu
10/04/2007 SUSE
05/31/2007 Fedora 04/19/2007 Ubuntu
10/04/2007 SUSE
05/31/2007 Fedora 04/19/2007 Ubuntu
and the second file, which contains the release dates and the version numbers:
[me@linuxbox ~]$ cut -f 2,2 distros-by-date.txt > distros-vernums.txt [me@linuxbox ~]$ paste distros-dates.txt distros-vernums.txt > distro s-key-vernums.txt
[me@linuxbox | ~]$ head distros-key-vernums.txt |
11/25/2008 | 10 |
10/30/2008 | 8.10 |
06/19/2008 | 11.0 |
05/13/2008 | 9 |
04/24/2008 | 8.04 |
11/08/2007 | 8 |
10/18/2007 | 7.10 |
10/04/2007 | 10.3 |
05/31/2007 | 7 |
04/19/2007 | 7.04 |
We now have two files with a shared key (the “release date” field). It is important to point out that the files must be sorted on the key field for join to work properly.
[me@linuxbox ~]$ join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
[me@linuxbox ~]$ join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
Note also that, by default, join uses whitespace as the input field delimiter and a single space as the output field delimiter. This behavior can be modified by specifying options. See the join man page for details.