External Sorting



An external sort is one in which the data to be sorted do not all fit into internal memory at the same time. Therefore, only a portion of the data can be in internal memory at any given time, while the remainder stays on the disk. Because of the large amount of disk I/O required, an external sorting algorithm is much slower than an internal sorting algorithm and should be used only when there is no alternative. The number of comparisons is not a particularly significant performance measure for an external sort because the time required for I/O dominates.


Most external sorting algorithms fall into the category of Sort-Merge algorithms. This means that the algorithm has two stages: the first stage sorts portions of the data, and the second stage performs a series of merges of the sorted portions.


Let’s consider a basic file merge to begin with. Assume that I have two files, F0 and F1. These files contain integers, and each file is individually sorted. To merge the two files, we open a third file, F2, and create a two-place buffer in main memory. One place in the buffer is for an integer from F0 and the other place is for an integer from F1. The merge algorithm is as follows:

Let the buffer be called B, where B[0] is for F0 and B[1] is for F1.
 
Load B[0] with the first element from F0 and B[1] with the first element from F1.
 
While neither file is empty do 
	If B[0] < B[1] 
		write B[0] to F2 and replace B[0] with the next value from F0; 
	Else
		write B[1] to F2 and replace B[1] with the next value from F1. 

When one file becomes empty, write the remaining buffered value and then the remainder of the other file to F2.
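
A minimal sketch of this merge in Python (the one-integer-per-line file format and the function name are assumptions made here for illustration, not part of the original):

def merge_files(path0, path1, out_path):
    # Merge two sorted files of integers (one integer per line) into a third file.
    with open(path0) as f0, open(path1) as f1, open(out_path, "w") as f2:

        def read_next(f):
            line = f.readline()
            return int(line) if line.strip() else None    # None signals an exhausted file

        b = [read_next(f0), read_next(f1)]                # the two-place buffer B
        while b[0] is not None and b[1] is not None:
            if b[0] < b[1]:
                f2.write(f"{b[0]}\n")
                b[0] = read_next(f0)                      # replace B[0] from F0
            else:
                f2.write(f"{b[1]}\n")
                b[1] = read_next(f1)                      # replace B[1] from F1
        # one file is now empty: flush the buffered value and the rest of the other file
        rest, f_rest = (b[0], f0) if b[0] is not None else (b[1], f1)
        if rest is not None:
            f2.write(f"{rest}\n")
            for line in f_rest:
                if line.strip():
                    f2.write(f"{int(line)}\n")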
 


As an example, consider the merge of the following:


(...need tables or diagram here...)


Simple Sort-Merge

This is about the simplest of the external sorting algorithms. It is not, strictly speaking, a sort-merge algorithm, but it does require sorting and merging. Suppose I have a very large file called source that needs to be sorted. In addition, I have two temporary files called F1 and F2 that will be used for the process, and an internal array called A of size k. Normally, this array is chosen to be as large as possible while still fitting in memory.


The first step is to read into the array A a block of size k from the file source. This block is sorted by an internal sorting algorithm and written to file F1. This ends the setup step. The second step is to read into the array a block of size k from the file source, sort the array using an internal sorting algorithm, and then merge the array A with file F1, writing the result to file F2. This step is repeated until the file source has been completely read. However, with each repetition, we alternate between merging from F1 to F2 and from F2 to F1.


As we read the file source in blocks of k, we will eventually have read the entire file. When this happens, the last file that was the destination of the merge contains the sorted version. As an example of a simple sort-merge, we will use the following data with a block size of 4: 81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15.
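
A sketch of the simple sort-merge in Python, using lists in place of the disk files so the alternation between F1 and F2 is easy to see (the function name and the list-based "files" are illustrative choices, not from the original):

def simple_sort_merge(source, k):
    # Setup step: sort the first block of size k; this is the initial "F1".
    f_old = sorted(source[:k])
    pos = k
    while pos < len(source):
        a = sorted(source[pos:pos + k])      # read the next block into A and sort it
        pos += k
        # merge A with the old file, writing the result to the new file
        f_new, i, j = [], 0, 0
        while i < len(a) and j < len(f_old):
            if a[i] < f_old[j]:
                f_new.append(a[i])
                i += 1
            else:
                f_new.append(f_old[j])
                j += 1
        f_new.extend(a[i:])
        f_new.extend(f_old[j:])
        f_old = f_new                        # the roles of F1 and F2 alternate
    return f_old

data = [81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15]
print(simple_sort_merge(data, 4))
# the setup step leaves 11 81 94 96; each later step merges one more sorted
# block of (at most) 4 values, ending with all 14 values in order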

Balanced m-Way Sort-Merge

A second external sort is the Balanced m-Way Sort-Merge. For this algorithm, we have a large unsorted file called source. We also have 2m auxiliary files called F1, F2, …, F2m. Finally, we have a block size called k.


The basic algorithm is in two phases. In the sort phase, the source file is read into memory in blocks of k, and each block is sorted by an internal sorting algorithm. As before, k is chosen to be a large value, but not so large that memory cannot hold an array of k elements from source. After a block is sorted, it is written to one of the first m auxiliary files. The blocks are written in a round-robin fashion. That is, block 1 is written to F1, block 2 is written to F2, … block m is written to Fm, block m + 1 is written to F1, block m + 2 is written to F2, and so on. This process continues until the source file has been read completely. Note that the last block is likely to be smaller than size k.


Once the source file has been completely distributed over the first m auxiliary files, the second phase, the merge phase, begins. In this phase, the first block of each file F1 to Fm is merged to a block on Fm+1. Then the second block of each auxiliary file F1 to Fm is merged to a block on Fm+2. This pattern continues until all the blocks from F1 to Fm have been merged to blocks on Fm+1 to F2m. Note that, in general, the blocks on Fm+1 to F2m are of size m*k. The merging then continues in the other direction, taking a block from each file Fm+1 to F2m and merging them to blocks on F1 to Fm. With each cycle, the block sizes grow by a factor of m. Eventually, as the blocks get larger, we end up with one block. When this happens, that one block is the sorted source file.
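
A sketch of both phases in Python, again modelling each auxiliary file as a list of sorted blocks (the function name and the use of heapq.merge for the m-way block merge are choices made here, not part of the original):

from heapq import merge as kway_merge

def balanced_sort_merge(source, k, m):
    # Sort phase: sort blocks of size k and deal them round-robin onto F1..Fm.
    files = [[] for _ in range(m)]
    for b, pos in enumerate(range(0, len(source), k)):
        files[b % m].append(sorted(source[pos:pos + k]))
    # Merge phase: merge one block from each file onto the other set of m files,
    # again round-robin, until only a single block remains.
    while sum(len(f) for f in files) > 1:
        dest = [[] for _ in range(m)]        # Fm+1..F2m (or F1..Fm on alternate cycles)
        for r in range(max(len(f) for f in files)):
            group = [f[r] for f in files if r < len(f)]
            dest[r % m].append(list(kway_merge(*group)))
        files = dest                         # the destination files become the new sources
    remaining = [blk for f in files for blk in f]
    return remaining[0] if remaining else []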


As an example, we will use a 2-way balanced sort-merge with a block size of 4 to sort the following: 81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15.
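
Tracing the sketch above on this data: the sort phase puts the blocks 11 81 94 96 and 14 28 39 58 on F1, and 12 17 35 95 and 15 75 on F2. The first merge pass produces 11 12 17 35 81 94 95 96 on F3 and 14 15 28 39 58 75 on F4. The second merge pass combines those two blocks into the single sorted block 11 12 14 15 17 28 35 39 58 75 81 94 95 96, so the sort finishes after two merge passes.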


Polyphase Sort-Merge

The problem with a Balanced m-Way Sort-Merge is that the entire file to be sorted must pass through the computer’s memory on each pass. The Polyphase Sort-Merge reduces the total number of bytes that need to be transferred by using a unique distribution scheme based on a kth order Fibonacci sequence.


Example: Suppose I have a source file that divides into 25 blocks, where each block can be sorted by an internal sorting algorithm. In addition, I have 5 auxiliary files called F1, …, F5 that are used as temporary files. Now, in the sort phase of the algorithm, I read my source file one block at a time, sort the block, and write the block to one of the files F1 to F4. Note that I am leaving F5 empty as it will be the destination of the merge phase for the first pass. However, unlike a Balanced m-Way Sort-Merge, I do not distribute the blocks equally across the auxiliary files. The blocks are distributed in the following manner.


(...need table here...)


At the start of each pass, one of the auxiliary files is empty. This file becomes the destination for the merged blocks during this pass. While the pass is in progress, a block is taken from each of the initially non-empty files and merged to a large block that is placed in the destination file. Since the non-empty files do not contain the same number of blocks, eventually one of the auxiliary files empties out during the pass, which causes the pass to end. For example, in the first pass above, 4 sets of blocks are removed from files F1 to F4 and merged into 4 larger blocks on F5. At this point, F4 becomes empty and pass 1 ends. On the second pass, F4 is the destination for the merged blocks. Note that the blocks coming from F5 in this pass are larger than the blocks coming from F1 to F3. This second pass merges 2 sets of blocks, but ends when F3 becomes empty.


The merge passes create fewer but larger blocks. Eventually we reach a pass, pass 3 in the example above, that ends with one auxiliary file empty and the other files containing one block each. This is the next-to-last pass, as the final pass results in only one block. But this one block is the entire source file, now in sorted order.


What is important about this sort is that some blocks go through the I/O system on every pass, while other blocks go through the I/O system only once. Does this make a difference? Suppose I had a source file that divided into 25 blocks, where each block was 100KB in size. If I sorted the file using a balanced 2-way sort-merge, every block would go through the I/O system on each merge pass. There are 5 such passes, which results in a total of 12,500KB (about 12MB) that must pass through the I/O system. Using a Polyphase Sort-Merge with the same initial blocks results in a total of 6,800KB passing through the I/O system. This reduces the amount of I/O by almost 50%.
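
The arithmetic can be checked with a short simulation. The run counts used below, 8, 7, 6, and 4 blocks on F1 to F4 with F5 empty, are an assumption (they are the usual "perfect" polyphase distribution of 25 runs over four data files; the distribution table itself is not shown above):

def polyphase_io(files_kb):
    # files_kb: one list of run sizes (in KB) per auxiliary file; exactly one list
    # is empty at the start of each pass. Returns the total KB written while merging.
    files = [list(runs) for runs in files_kb]
    total_kb = 0
    while sum(len(f) for f in files) > 1:
        dest = next(i for i, f in enumerate(files) if not f)     # the empty file
        sources = [i for i in range(len(files)) if i != dest]
        for _ in range(min(len(files[i]) for i in sources)):     # until one source empties
            merged = sum(files[i].pop(0) for i in sources)
            files[dest].append(merged)
            total_kb += merged
    return total_kb

print(polyphase_io([[100] * 8, [100] * 7, [100] * 6, [100] * 4, []]))   # 6800
# balanced 2-way comparison: 5 merge passes x 2500KB per pass = 12500KB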


When given a file to sort using the Polyphase Sort-Merge, how do we know the initial distribution of the blocks? It clearly depends on the number of auxiliary files that are available. In addition, we need to calculate two tables: the first is called the distribution table, and the second, which is calculated from the first, is called the difference table. It is the difference table that is used to distribute blocks from the source file across the auxiliary files. For 5 auxiliary files, the first several rows of the two tables are the following.


(...need tables/diagram here...)


Note that if we have 5 auxiliary files, we only calculate columns for F1 to F4, since one file must always be empty. The amazing fact is that, in a real program, I only need a three-row array to calculate these two tables for any number of blocks!
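
A small sketch of how such tables can be built row by row. This follows one standard recurrence for polyphase "perfect" distributions; the exact layout of the tables referred to above may differ:

def polyphase_tables(data_files, levels):
    # Distribution table (runs per file at each level) and difference table for a
    # polyphase merge that uses `data_files` input files plus one empty output file.
    dist = [[1] * data_files]                 # level 1: one run on each data file
    for _ in range(levels - 1):
        prev = dist[-1]
        # next level: file j gets prev[0] + prev[j+1] runs; the last file gets prev[0]
        dist.append([prev[0] + prev[j + 1] for j in range(data_files - 1)] + [prev[0]])
    # difference table: how many runs must be added to step up to the next level
    diff = [[hi - lo for hi, lo in zip(dist[i + 1], dist[i])] for i in range(len(dist) - 1)]
    return dist, diff

dist, diff = polyphase_tables(4, 4)
print(dist)   # [[1, 1, 1, 1], [2, 2, 2, 1], [4, 4, 3, 2], [8, 7, 6, 4]]
print(diff)   # [[1, 1, 1, 0], [2, 2, 1, 1], [4, 3, 3, 2]]

Note that the last distribution row sums to 25, which matches the 25-block example above.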




