quicksort undo help.
#11
Were you expecting me to ask how to get the Fibonacci sequence or how to do p*p scrolling? Big Grin
Peace cannot be obtained without war. Why? If there is already peace, war is unnecessary. If there is no peace, there is already war.

Visit www.neobasic.net to see rubbish in all its finest.
#12
Quote:Nononono, you don't understand. You can't have such large differences. The whole point is that a sorted array can be compressed much more than a non-sorted array.

Sorry, you can't make assumptions like that. Large differences in a set of data can, and /do/, happen. This is one of the reasons the Quicksort algorithm is very rarely used in its pure form: if your "unsorted" set is actually data sorted in reverse order, pure Quicksort degrades to O(n^2), which is clearly unacceptable in real-world situations. Likewise, you don't want a compression algorithm that works well on most data sets but performs /very/ poorly on specific ones.
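A minimal sketch of the point (my code, not from the thread): a quicksort that always picks the first element as pivot, with a comparison counter, shows the quadratic blow-up on already-sorted input compared to random input.

```python
# Sketch: first-element-pivot quicksort, counting comparisons to show
# the O(n^2) degradation on already-sorted (or reverse-sorted) input.
import random

def quicksort(a, counter):
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    counter[0] += len(rest)                      # one comparison per element
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return quicksort(left, counter) + [pivot] + quicksort(right, counter)

for label, data in (("sorted", list(range(200))),
                    ("random", random.sample(range(200), 200))):
    counter = [0]
    quicksort(data, counter)
    print(label, counter[0])   # sorted: 19900 comparisons, i.e. n(n-1)/2
```

On sorted input every partition is maximally lopsided, so the comparison count hits n(n-1)/2, while the random case stays near n log n.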

Quote:...and then you just record by how much the next number changed:

0 1 1 0 0 0 ... 1

...and also record how many bits the change will cost us, based on the maximum change, which will take 8 bits as well.

I don't understand what you mean by the cost of change. Say you have a large set of data whose elements range from 0-255, i.e. one byte each. Suppose that once sorted, the largest distance between two consecutive numbers is small enough, say under 16, to be encoded in four bits. The distance between every pair of elements then has to be encoded in four bits, because if you allowed variable-length encoding you would need some way of storing the length of each distance, which would defeat the purpose of your compression. Encoding each distance in half the bits we used to store the sorted data gives a compression ratio of 1/2. However, you still need to store the steps to undo the sort back to the unsorted set. Assuming you have a very large set of data, say a million items, each swap requires 40 bits (two 20-bit indices); do the calculations and you will see that you aren't compressing anything.
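The arithmetic behind that claim can be checked in a few lines (my numbers follow the post's figures; the per-element permutation cost is my framing of the undo information):

```python
# Back-of-the-envelope check: delta-coding a sorted array saves bits,
# but the undo information eats the savings.  An index into a million
# elements needs 20 bits, so a swap (two indices) costs 40 bits, and
# storing one original index per element costs 20 bits each.
import math

n = 1_000_000                      # one-byte elements
original_bits = n * 8

delta_bits = n * 4                 # the post's assumption: 4 bits/difference
index_bits = math.ceil(math.log2(n))           # 20 bits per index
perm_bits = n * index_bits         # one original index per element

print(index_bits * 2)                          # 40 bits per swap
print(delta_bits + perm_bits > original_bits)  # True: no net compression
```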

Using a conventional compression routine such as bzip2 would yield fairly good results for most large data sets. More complex compression routines do some amount of data sorting anyway to group similar data items; bzip2's Burrows-Wheeler transform is, in effect, a sorting step.
Jesus saves.... Passes to Moses, shoots, he scores!
#13
No, read it again.

The undo wouldn't work with swaps. The direct method is better.

And you virtually can't have a large distance if you have a large file, and you only need to store the maximum bit width for the distances plus the very first number in the array, so that gets rid of your objections there.
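The thread never spells out the "direct method"; one plausible reading (all names and code here are mine) is to sort while remembering each element's original position, then undo by scattering each sorted element back to its recorded index:

```python
# A guess at the "direct method": keep an argsort-style permutation as
# the undo record, then invert it to restore the original order.
def sort_with_undo(data):
    order = sorted(range(len(data)), key=lambda i: data[i])  # argsort
    return [data[i] for i in order], order

def undo(sorted_data, order):
    original = [None] * len(sorted_data)
    for pos, i in enumerate(order):
        original[i] = sorted_data[pos]   # element at pos came from index i
    return original

data = [255, 0, 0, 255, 255, 255, 0, 0, 255, 0]
s, order = sort_with_undo(data)
print(undo(s, order) == data)            # True
```

Note that this makes the earlier objection concrete: `order` is exactly the undo information, and it costs about log2(n) bits per element to store.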
#14
Quote:And you virtually can't have a large distance if you have a large file, and you only need to store the maximum bit width for the distances plus the very first number in the array, so that gets rid of your objections there.

Again, let's assume that our data contains elements ranging from 0-255, and for the purpose of the exercise we have only 10 elements in the set. Now, suppose the unsorted data looks like this:
Code:
255, 0, 0, 255, 255, 255, 0, 0, 255, 0

After sorting the data looks like this:
Code:
0, 0, 0, 0, 0, 255, 255, 255, 255, 255

We can now encode the first number, 0, using one byte (we can't use any less, because that is the size of our elements). Next we store the distance between each pair of numbers. Between any pair of consecutive 0's or 255's we don't even need to store a difference, but where we change from 0 to 255 we need to store a difference of 255, which requires one byte to encode. Since we can't use variable-length differences, it follows that we need one byte for the difference between every pair of numbers.

It may appear that you can solve the above problem by having a single bit which indicates whether any difference exists; however, that doesn't work for the following data set:
Code:
1, 2, 3, 4, 5, 251, 252, 253, 254, 255

This still requires one byte per difference, and data like this isn't unreasonable to expect. Remember: as long as the maximum difference is at least half of the element range, you need to encode each difference with the same number of bits as the full element range.
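Both example sets can be checked mechanically; a short sketch (my code) shows that the width of a fixed-length delta code is set by the single largest gap in the sorted data:

```python
# The fixed width of a delta code is dictated by the largest gap between
# consecutive sorted elements, so both example sets still need 8 bits.
def delta_bits_needed(data):
    s = sorted(data)
    max_delta = max((b - a for a, b in zip(s, s[1:])), default=0)
    return max(1, max_delta.bit_length())

print(delta_bits_needed([255, 0, 0, 255, 255, 255, 0, 0, 255, 0]))   # 8
print(delta_bits_needed([1, 2, 3, 4, 5, 251, 252, 253, 254, 255]))   # 8
print(delta_bits_needed(list(range(100))))                           # 1
```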
#15
You could use four bits..

Another idea for the first case: have a flag bit that lets you store only the distinct differences that actually occur, rather than every difference in full. So 255 and 0 could be encoded as a two-entry table (16 bits) plus one extra bit per difference for indexing. Then you'd only need 1 bit per difference..
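One way to read the "unique differences" idea (this is my interpretation, and all names here are mine): store the distinct deltas once in a table, then emit a small index per position instead of a full-width difference.

```python
# Sketch: dictionary-of-deltas encoding.  The table of distinct deltas
# is stored once; each position then needs only ceil(log2(table size))
# bits to index into it.
import math

def dict_delta_encode(sorted_data):
    deltas = [b - a for a, b in zip(sorted_data, sorted_data[1:])]
    table = sorted(set(deltas))
    index_bits = max(1, math.ceil(math.log2(len(table))))
    return table, deltas, index_bits

data = sorted([255, 0, 0, 255, 255, 255, 0, 0, 255, 0])
table, deltas, bits = dict_delta_encode(data)
print(table, bits)   # [0, 255] 1 -> 2-byte table, then 1 bit per delta
```

The catch, echoing the earlier objection, is that the table itself must be stored, and on adversarial data every delta can be distinct, which pushes the index width back toward the full element width.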
#16
But then again, you're right.

