External sort compo?
#51
Quote:OK. I have written an explanation, I hope it's useful! It explains the process I used to merge two files into one, first in verbose pseudocode, then with some generic BASIC code. This omits any buffering code and just attempts to explain the actual merging.
.....
Thanks yetifoot, great merge tutorial. I already have experience in merging multiple files, so that's not my problem.

My problem is deciding:
* How many records should I sort in memory during the first pass?
* Should I then write the sorted records onto an output file as single records, one per write, or should I write all the sorted records to the output file as one block? There's a big difference in the I/O time for single records versus blocked records.
* How do I determine in advance how many of these output files I should have?
* Merge Passes: What are the mechanics of reading those output files, merging them onto additional output files, until I end up with one output file which has all the merged records?
*****
#52
I was going to attempt this problem until I saw the layout of the data in the file... And did anyone else get an original 381MB output file?
Screwing with your reality since 1998.
#53
Moneo Asked:
Quote:* How many records should I sort in memory during the first pass?

My idea was that I should sort as many as I can fit into the allowed memory. My reasoning for this is that operations in memory are faster than operations on disk.

Quote:* Should I then write the sorted records onto an output file as single records, one per write, or should I write all the sorted records to the output file as one block? There's a big difference in the I/O time for single records versus blocked records.

Again, because of the speed of writing the full block, I chose to write each sorted block in one write. Since the file created will be the same with either method (looping over records, or writing all at once), it seems sensible to choose the faster method.
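
For illustration, here is a minimal sketch of that first pass, assuming the records are raw 32-bit integers: one blocked Get, an in-memory sort, then one blocked Put. The identifiers (ShellSort, CHUNK_ELEMS, the filenames) are illustrative only, not taken from anyone's actual entry.

Code:
' First-pass sketch: blocked read, in-memory sort, blocked write.
' Assumes raw 32-bit integer records; all names here are illustrative.
Const CHUNK_ELEMS = 1000000             ' records per chunk, sized to fit in RAM

' Any in-memory sort will do; a short shell sort keeps the sketch self-contained
Sub ShellSort(a() As Long, ByVal n As Integer)
    Dim As Integer gap = n \ 2, i, j
    Dim As Long tmp
    While gap > 0
        For i = gap To n - 1
            tmp = a(i)
            j = i
            While j >= gap
                If a(j - gap) <= tmp Then Exit While
                a(j) = a(j - gap)
                j -= gap
            Wend
            a(j) = tmp
        Next i
        gap \= 2
    Wend
End Sub

Dim Shared As Long chunk(0 To CHUNK_ELEMS - 1)
Dim As Integer n

Open "input.dat" For Binary As #1
Open "chunk1" For Binary As #2

n = LOF(1) \ 4                          ' records in the file...
If n > CHUNK_ELEMS Then n = CHUNK_ELEMS ' ...capped at one chunk for this sketch

Get #1, , chunk(0), n                   ' one blocked read of n records
ShellSort chunk(), n
Put #2, , chunk(0), n                   ' one blocked write of the whole sorted chunk
' ...as opposed to the much slower record-at-a-time alternative:
' For i = 0 To n - 1 : Put #2, , chunk(i) : Next i

Close #2
Close #1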

Quote:* How do I determine in advance how many of these output files I should have?

For my calculations it was done like this:

numblocks = (size of input data) \ (allowed memory size)
If (size of input data) MOD (allowed memory size) <> 0 Then numblocks = numblocks + 1

This is the simplest way I can think of to express it.
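
In FB the same calculation can be written directly from the file size, for example (a sketch; the 100MB constant matches the compo's memory limit, and the filename is a placeholder):

Code:
' How many sorted chunks (workfiles) the first pass will produce.
Const MEM_LIMIT = 100 * 1024 * 1024     ' allowed memory, per the compo rules

Dim As Integer f = FreeFile
Open "input.dat" For Binary As #f
Dim As LongInt datasize = LOF(f)        ' size of input data, in bytes
Close #f

Dim As Integer numblocks = datasize \ MEM_LIMIT
If (datasize Mod MEM_LIMIT) <> 0 Then numblocks += 1
' Equivalently: numblocks = (datasize + MEM_LIMIT - 1) \ MEM_LIMIT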

Quote:* Merge Passes: What are the mechanics of reading those output files, merging them onto additional output files, until I end up with one output file which has all the merged records?

This is quite tricky.

Suppose we have determined that there are five blocks; we have sorted them and written them out as five workfiles named 'chunk1', 'chunk2', 'chunk3', 'chunk4', 'chunk5'.

Now we integer-divide the number of workfiles by two; this determines how many merges need to take place. In our case, 5 \ 2 = 2.

So we merge 'chunk1' with 'chunk2' to create 'm1',
and we merge 'chunk3' with 'chunk4' to create 'm2'.

Now we also check whether there is a remaining block ((5 Mod 2) <> 0). If there is, we could simply rename it, i.e. 'chunk5' becomes 'm3' (my actual method was to merge the loose chunk into 'm1').

Now we rename all the m? files back to chunk? and start again.

--------------------------------------

Here follows the appropriate section of my code. Hopefully, using the notes above, you can see how I did it.

Code:
Do
    num_merges = num_chunks \ 2
    If num_merges = 0 Then
        ' We are at a position where there is only one chunk, this means we're done
        Name "chunk1", OutFile_Path
        Exit Do
    End If
    ' Merge chunks (1-2,3-4,5-6) etc
    For i = 0 To num_merges - 1
        Files_Merge_int32("chunk" & (i * 2) + 1, "chunk" & (i * 2) + 2, "m" & i + 1, buffer, chunk_size)
        Kill "chunk" & (i * 2) + 1
        Kill "chunk" & (i * 2) + 2
    Next i
    ' If there were an odd number of chunks then merge last chunk into first output
    If num_chunks Mod 2 <> 0 Then
        Files_Merge_int32("m1", "chunk" & num_chunks, "a1", buffer, chunk_size)
        Kill "chunk" & num_chunks
        Kill "m1"
        Name "a1", "m1"
    End If
    ' Rename the m outputs back to chunk, in order to restart the loop
    For i = 0 To num_merges - 1
        Name "m" & i + 1, "chunk" & i + 1
    Next i
    ' Update how many chunks we have
    num_chunks = num_merges
Loop
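
Files_Merge_int32 itself is not shown above. A minimal, unbuffered two-way merge along the same lines might look like the following sketch (the routine and its names are illustrative, not the actual compo code, and real code would buffer its reads and writes as discussed earlier):

Code:
' Hypothetical, unbuffered two-way merge of two sorted files of 32-bit
' integers into one; a real implementation would buffer the I/O.
Sub Merge_Two_Files(src1 As String, src2 As String, dst As String)
    Dim As Integer f1 = FreeFile : Open src1 For Binary As #f1
    Dim As Integer f2 = FreeFile : Open src2 For Binary As #f2
    Dim As Integer fo = FreeFile : Open dst For Binary As #fo

    Dim As Long v1, v2
    Dim As Integer have1 = 0, have2 = 0

    ' Prime one record from each input, if available
    If Not EOF(f1) Then Get #f1, , v1 : have1 = -1
    If Not EOF(f2) Then Get #f2, , v2 : have2 = -1

    ' Repeatedly emit the smaller head record, refilling from its source
    While have1 And have2
        If v1 <= v2 Then
            Put #fo, , v1
            If EOF(f1) Then have1 = 0 Else Get #f1, , v1
        Else
            Put #fo, , v2
            If EOF(f2) Then have2 = 0 Else Get #f2, , v2
        End If
    Wend

    ' Drain whichever input still has records left
    While have1
        Put #fo, , v1
        If EOF(f1) Then have1 = 0 Else Get #f1, , v1
    Wend
    While have2
        Put #fo, , v2
        If EOF(f2) Then have2 = 0 Else Get #f2, , v2
    Wend

    Close #fo : Close #f2 : Close #f1
End Sub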
EVEN MEN OF STEEL RUST.
#54
Moneo:
Nothing magic about radix sort. As you said, it was the method used for punched-card sorting.
If you sort a block of data by the least significant byte, you get 256 output packets. If you save these 256 separate packets to disk, you can pile the results of sorting the remaining input blocks onto them, without merging at all. Then you simply daisy-chain all the packets in increasing order (no merge) to form the input file for the second pass. Since radix sort is a stable sort, further passes will not lose the previous ordering.

The most complicated part of the sort is maintaining a set of indexes into the output array, so as to keep 256 separate packets of output. I extended that idea to the output files: instead of using 256 output files and daisy-chaining them at the end of each pass, I build the 256 packets at different offsets within a single file, so I end up with one sorted file. All that requires tallying the data accurately for each pass in order to precalculate the offsets.
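
That tallying pass can be sketched like this for 32-bit records (illustrative code, not Antoni's entry): count how many records fall into each of the 256 packets for the byte being sorted on, then prefix-sum the counts to get each packet's starting offset in the output file.

Code:
' One tallying pass of an LSD radix sort over 32-bit records (a sketch).
Dim As Long counts(0 To 255), offsets(0 To 255)
Dim As Long v
Dim As Integer b, shift = 0             ' shift = 0/8/16/24 picks this pass's byte

Dim As Integer f = FreeFile
Open "input.dat" For Binary As #f
While Not EOF(f)
    Get #f, , v
    b = (v Shr shift) And &hFF          ' the byte this pass sorts on
    counts(b) += 1
Wend
Close #f

' Prefix sums: packet b starts at record offset offsets(b) in the output
offsets(0) = 0
For b = 1 To 255
    offsets(b) = offsets(b - 1) + counts(b - 1)
Next b
' A second pass over the data can now Put each record directly at byte
' position offsets(b) * 4 + 1 of a single output file (Put positions are
' 1-based), incrementing offsets(b) as it goes, with no merging needed.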

The external radix sort has a chapter to itself in the 3rd volume of Knuth, so it should not be difficult to find. I have not read that chapter yet, only noticed that it exists.
Antoni
#55
YETIFOOT: To really appreciate all your good input, I would have to actually implement such a sort. I'll try to find the time. Thanks.

ANTONI: Let me read Knuth before trying to understand your suggestions. Thanks.
*****
#56
Quote:I was going to attempt this problem until I saw the layout of the data in the file... And did anyone else get an original 381MB output file?
Yes, that's the idea of the problem: sort a 381MB file, too big to fit in memory. The file is made of positive integers in binary form.
Antoni
#57
Antoni, I read Knuth's section on external sorting, with special attention to radix sorting. As with many chapters of Knuth's books, I hardly understood what he was talking about.

I've been lucky sometimes and have extracted some algorithms that I was able to program. But, not with this subject.

Thanks for your suggestions.
*****
#58
Antoni,

Anarky said: "I was going to attempt this problem until I saw the layout of the data in the file..."

Well, I generated the file too. It's not exactly a text file, as earlier mentioned. Why did you choose a binary file instead of a simple text file? And why such a large file for QB? My God, 381MB. A much smaller file would still serve the purposes of the challenge.

I am not experienced in the handling of binary data files. So this, plus the file size, will inhibit me from entering the challenge, besides my problems understanding the mechanics of merge passes.

Mac made his own file and program, as some of us could do, but that puts him into a category all by himself with no one to judge.
*****
#59
Well, reading binary data is not that difficult: just use GET with an integer variable, and the data is read directly into it.
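
For example (a minimal sketch; the filename is a placeholder, and Long is FB's 32-bit integer type):

Code:
' Reading the binary file record by record with Get.
Dim As Long v                           ' one 32-bit record
Dim As Integer f = FreeFile
Open "input.dat" For Binary As #f
While Not EOF(f)
    Get #f, , v                         ' the next 4 bytes land directly in v
    ' ... use v ...
Wend
Close #f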

The size of the file is there to prevent the use of a simple in-memory sort algorithm. The QB45 file is 4MB in size; the 300MB one is the FB file (because FB can use a lot more memory than QB45).
Antoni
#60
Speaking in terms of FB, sorting a file on keys of 1, 2, or 3 bytes is very easy and requires only one pass over all the data.
However, from 4 bytes and up, that cheap (but fast and efficient) algorithm no longer works, due to the memory limitation of 100MB.
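
One way to read that one-pass approach is as a counting sort: if every value fits in 3 bytes, a full table of 2^24 counters (64MB of Longs) stays under the 100MB limit, so one read pass plus one emit loop produces the sorted file. A sketch, assuming 32-bit records whose values fit in 3 bytes (names and filenames are illustrative):

Code:
' Counting sort for records whose values fit in 3 bytes: the count table
' (2^24 Longs = 64MB) stays under the 100MB memory limit.
Const MAXVAL = (1 Shl 24) - 1

Dim As Long Ptr counts = Callocate(MAXVAL + 1, SizeOf(Long))
Dim As Long v

Dim As Integer fi = FreeFile
Open "input.dat" For Binary As #fi
While Not EOF(fi)
    Get #fi, , v
    counts[v] += 1                      ' tally each value (all <= MAXVAL)
Wend
Close #fi

Dim As Integer fo = FreeFile
Open "sorted.dat" For Binary As #fo
For v = 0 To MAXVAL
    While counts[v] > 0                 ' emit each value as often as it occurred
        Put #fo, , v
        counts[v] -= 1
    Wend
Next v
Close #fo
Deallocate counts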

I also agree with Antoni that a raw data file in binary format is very easy to extract the required data from.

I can easily make a program that sorts by 1, 2, or 3 bytes, but 4 bytes will take too much time for me at the moment, considering my extremely poor programming skills and my newbiness.

Yours,

Neo.

