join from the Linux command line by OnWorks

Skip to content

join

In some ways, join is like paste in that it adds columns to a file, but it uses a unique way to do it. A join is an operation usually associated with relational databases where data from multiple tables with a shared key field is combined to form a desired result.

The join program performs the same operation. It joins data from multiple files based on a shared key field.

To see how a join operation is used in a relational database, let’s imagine a very small database consisting of two tables, each containing a single record. The first table, called CUSTOMERS, has three fields: a customer number (CUSTNUM), the customer’s first name (FNAME), and the customer’s last name (LNAME):

CUSTNUM	FNAME	LNAME
========	=====	======
4681934	John	Smith

The second table is called ORDERS and contains four fields: an order number (ORDER- NUM), the customer number (CUSTNUM), the quantity (QUAN), and the item ordered (ITEM).

ORDERNUM	CUSTNUM	QUAN	ITEM
========	=======	====	====
3014953305	4681934	1	Blue Widget

Note that both tables share the field CUSTNUM. This is important, as it allows a relation- ship between the tables.

Performing a join operation would allow us to combine the fields in the two tables to achieve a useful result, such as preparing an invoice. Using the matching values in the CUSTNUM fields of both tables, a join operation could produce the following:

FNAME	LNAME	QUAN	ITEM
=====	=====	====	====
John	Smith	1	Blue Widget

To demonstrate the join program, we’ll need to make a couple of files with a shared key. To do this, we will use our distros-by-date.txt file. From this file, we will construct two additional files, one containing the release dates (which will be our shared key for this demonstration) and the release names:

[me@linuxbox ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt [me@linuxbox ~]$ paste distros-dates.txt distros-names.txt > distros- key-names.txt

[me@linuxbox ~]$ head distros-key-names.txt

11/25/2008 Fedora 10/30/2008 Ubuntu 06/19/2008 SUSE

05/13/2008 Fedora 04/24/2008 Ubuntu 11/08/2007 Fedora 10/18/2007 Ubuntu

[me@linuxbox ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt [me@linuxbox ~]$ paste distros-dates.txt distros-names.txt > distros- key-names.txt

[me@linuxbox ~]$ head distros-key-names.txt

11/25/2008 Fedora 10/30/2008 Ubuntu 06/19/2008 SUSE

05/13/2008 Fedora 04/24/2008 Ubuntu 11/08/2007 Fedora 10/18/2007 Ubuntu

10/04/2007 SUSE

05/31/2007 Fedora 04/19/2007 Ubuntu

10/04/2007 SUSE

05/31/2007 Fedora 04/19/2007 Ubuntu

and the second file, which contains the release dates and the version numbers:

[me@linuxbox ~]$ cut -f 2,2 distros-by-date.txt > distros-vernums.txt [me@linuxbox ~]$ paste distros-dates.txt distros-vernums.txt > distro s-key-vernums.txt

[me@linuxbox	~]$ head distros-key-vernums.txt
11/25/2008	10
10/30/2008	8.10
06/19/2008	11.0
05/13/2008	9
04/24/2008	8.04
11/08/2007	8
10/18/2007	7.10
10/04/2007	10.3
05/31/2007	7
04/19/2007	7.04

We now have two files with a shared key (the “release date” field). It is important to point out that the files must be sorted on the key field for join to work properly.

[me@linuxbox ~]$ join distros-key-names.txt distros-key-vernums.txt | head

11/25/2008 Fedora 10

10/30/2008 Ubuntu 8.10

06/19/2008 SUSE 11.0

05/13/2008 Fedora 9

04/24/2008 Ubuntu 8.04

11/08/2007 Fedora 8

10/18/2007 Ubuntu 7.10

10/04/2007 SUSE 10.3

05/31/2007 Fedora 7

04/19/2007 Ubuntu 7.04

[me@linuxbox ~]$ join distros-key-names.txt distros-key-vernums.txt | head

11/25/2008 Fedora 10

10/30/2008 Ubuntu 8.10

06/19/2008 SUSE 11.0

05/13/2008 Fedora 9

04/24/2008 Ubuntu 8.04

11/08/2007 Fedora 8

10/18/2007 Ubuntu 7.10

10/04/2007 SUSE 10.3

05/31/2007 Fedora 7

04/19/2007 Ubuntu 7.04

Note also that, by default, join uses whitespace as the input field delimiter and a single space as the output field delimiter. This behavior can be modified by specifying options. See the join man page for details.

< Previous | Contents | Next >

Free Hosting Online for WorkStations

Menu

Top OS Cloud Computing at OnWorks: