Moneo Asked:
Quote:* How many records should I sort in memory during the first pass?
My idea was to sort as many records as will fit into the allowed memory. My reasoning is that operations in memory are faster than operations on disk.
Quote:* Should I then write the sorted records onto an output file as single records, one per write, or should I write all the sorted records to the output file as one block? There's a big difference in the I/O time for reading single records versus blocked records.
Again, because writing a full block is faster, I chose to write each sorted block in a single write. If the resulting file is the same with either method (writing records in a loop, or writing them all at once), then it seems sensible to choose the faster one.
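A minimal sketch of the two ways of writing a chunk, assuming FreeBASIC-style binary file I/O and 32-bit integer records. The array name, record count and file name "chunk_demo" are made up for illustration:

```freebasic
Const RECORDS = 1000

Dim As Long sorted(0 To RECORDS - 1)
For i As Integer = 0 To RECORDS - 1
    sorted(i) = i            ' stand-in for a chunk that has already been sorted
Next

Dim As Integer f = FreeFile
Open "chunk_demo" For Binary As #f

' One call writes RECORDS elements, starting at element 0 of the array
Put #f, , sorted(0), RECORDS

' The record-at-a-time alternative would produce the same bytes:
' For i As Integer = 0 To RECORDS - 1
'     Put #f, , sorted(i)
' Next

Close #f
```

Either way the file ends up holding the same 4000 bytes; the single Put just makes one call instead of a thousand.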
Quote:* How do I determine in advance how many of these output files I should have?
I calculated it like this:
numblocks = (size of input data) \ (allowed memory size)
If (size of input data) MOD (allowed memory size) <> 0 Then numblocks = numblocks + 1
This is the simplest way I can think of to express it.
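The same calculation as a small function, as a sketch only. The name CountChunks and the byte sizes in the two calls are hypothetical:

```freebasic
' Hypothetical helper: number of sorted chunks the first pass will produce.
' Both sizes are in bytes; LongInt avoids overflow for inputs larger than 2 GB.
Function CountChunks(ByVal dataSize As LongInt, ByVal memSize As LongInt) As LongInt
    Dim As LongInt n = dataSize \ memSize
    If dataSize Mod memSize <> 0 Then n += 1   ' a partial final chunk still needs its own file
    Return n
End Function

Print CountChunks(1000, 256)   ' three full chunks plus one partial chunk
Print CountChunks(1024, 256)   ' divides evenly, so no extra chunk
```

Both calls give four chunks: in the first case the fourth chunk holds only the leftover 232 bytes.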
Quote:* Merge Passes: What are the mechanics of reading those output files, merging them onto additional output files, until I end up with one output file which has all the merged records?
This is quite tricky.
Suppose we have determined that there are five blocks, sorted them, and written them to disk as five work files named 'chunk1', 'chunk2', 'chunk3', 'chunk4' and 'chunk5'.
Now we integer-divide the number of work files by two. This gives the number of pairwise merges to perform in this pass.
In our case 5 \ 2 = 2,
so we merge 'chunk1' with 'chunk2' to create 'm1',
and we merge 'chunk3' with 'chunk4' to create 'm2'.
Next we check whether there is a leftover chunk ((5 Mod 2) <> 0). If there is, we could simply rename it, i.e. 'chunk5' becomes 'm3' (my actual method was to merge the loose chunk into 'm1').
Now we rename all the m? files to chunk? and start again, repeating until only one file is left.
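To see how the number of files shrinks from pass to pass, here is a small trace that only does the counting and touches no files. It assumes the variant where the leftover chunk is merged into 'm1', so each pass leaves num_chunks \ 2 files; the variable pass_no is just for the printout:

```freebasic
Dim As Integer num_chunks = 5
Dim As Integer pass_no = 1

Do While num_chunks > 1
    Dim As Integer num_merges = num_chunks \ 2
    Print "pass " & pass_no & ": " & num_chunks & " chunks, " & num_merges & " pairwise merge(s)";
    If num_chunks Mod 2 <> 0 Then Print ", leftover chunk" & num_chunks & " merged into m1";
    Print
    num_chunks = num_merges     ' the m? files become the next pass's chunk? files
    pass_no += 1
Loop
```

Starting from five chunks this reports two passes (5 files, then 2 files), after which a single file remains.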
--------------------------------------
Here is the relevant section of my code. Hopefully, using the notes above, you can see how I did it.
Code:
Do
    num_merges = num_chunks \ 2
    If num_merges = 0 Then
        ' We are at a position where there is only one chunk, this means we're done
        Name "chunk1", OutFile_Path
        Exit Do
    End If

    ' Merge chunks in pairs (1-2, 3-4, 5-6, etc.)
    For i = 0 To num_merges - 1
        Files_Merge_int32("chunk" & (i * 2 + 1), "chunk" & (i * 2 + 2), "m" & (i + 1), buffer, chunk_size)
        Kill "chunk" & (i * 2 + 1)
        Kill "chunk" & (i * 2 + 2)
    Next i

    ' If there was an odd number of chunks, merge the last chunk into the first output
    If num_chunks Mod 2 <> 0 Then
        Files_Merge_int32("m1", "chunk" & num_chunks, "a1", buffer, chunk_size)
        Kill "chunk" & num_chunks
        Kill "m1"
        Name "a1", "m1"
    End If

    ' Rename the m outputs back to chunk, in order to restart the loop
    For i = 0 To num_merges - 1
        Name "m" & (i + 1), "chunk" & (i + 1)
    Next i

    ' Update how many chunks we have
    num_chunks = num_merges
Loop
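
Files_Merge_int32 itself is not shown above. As a rough idea of what a two-way merge of sorted files involves, here is a simplified sketch, assuming each file holds raw 32-bit integers in ascending order. The name MergeSortedInt32 is hypothetical, and unlike the real routine it takes no buffer or chunk_size and reads one record at a time:

```freebasic
' Hypothetical stand-in for Files_Merge_int32: merges two sorted files of
' 32-bit integers into one sorted file. Unbuffered, for clarity only.
' Assumes nameOut does not exist yet (Binary mode does not truncate).
Sub MergeSortedInt32(ByRef nameA As String, ByRef nameB As String, ByRef nameOut As String)
    Dim As Integer fa = FreeFile
    Open nameA For Binary Access Read As #fa
    Dim As Integer fb = FreeFile
    Open nameB For Binary Access Read As #fb
    Dim As Integer fo = FreeFile
    Open nameOut For Binary Access Write As #fo

    ' Records still to be consumed from each input (4 bytes per record)
    Dim As LongInt leftA = Lof(fa) \ 4
    Dim As LongInt leftB = Lof(fb) \ 4
    Dim As Long a, b

    ' Load the first record of each input, if it has one
    If leftA > 0 Then Get #fa, , a
    If leftB > 0 Then Get #fb, , b

    ' Repeatedly emit the smaller head record and refill from that file
    Do While leftA > 0 AndAlso leftB > 0
        If a <= b Then
            Put #fo, , a
            leftA -= 1
            If leftA > 0 Then Get #fa, , a
        Else
            Put #fo, , b
            leftB -= 1
            If leftB > 0 Then Get #fb, , b
        End If
    Loop

    ' One input is exhausted; copy whatever remains of the other
    Do While leftA > 0
        Put #fo, , a
        leftA -= 1
        If leftA > 0 Then Get #fa, , a
    Loop
    Do While leftB > 0
        Put #fo, , b
        leftB -= 1
        If leftB > 0 Then Get #fb, , b
    Loop

    Close #fa
    Close #fb
    Close #fo
End Sub
```

A call such as MergeSortedInt32("chunk1", "chunk2", "m1") would correspond to one merge in the loop above. A practical version would read and write in blocks, for the same reason the chunks are written in one block during the first pass.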