yEnc v1 uses a fixed offset of 42. This is meant to move the byte values that tend to occur
least often onto the critical values, so that fewer critical values need to be escaped.
(Each time a critical value occurs in the output it must be escaped, which adds an extra
character and increases the file size.)
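As a rough illustration of that cost, the sketch below applies an offset to each byte and escapes critical results, so the number of added characters can be counted. It assumes the critical values named on this page (0, 10, 13) plus the escape character '=' (61), which yEnc also has to escape. The function names are illustrative, and line wrapping, headers and the CRC are left out.

```python
ESCAPE = 0x3D  # '=' introduces an escaped byte in yEnc
# Critical output values: NUL, LF, CR, plus the escape character itself.
CRITICAL = frozenset({0, 10, 13, ESCAPE})


def encode_bytes(data: bytes, offset: int = 42) -> bytes:
    """Add `offset` to each byte modulo 256 and escape critical results.

    Illustrative only: no line wrapping, no =ybegin/=yend lines, no CRC.
    """
    out = bytearray()
    for b in data:
        e = (b + offset) % 256
        if e in CRITICAL:
            # An escaped byte is written as '=' followed by (value + 64) mod 256.
            out.append(ESCAPE)
            out.append((e + 64) % 256)
        else:
            out.append(e)
    return bytes(out)


def escape_count(data: bytes, offset: int = 42) -> int:
    """Number of extra characters the escaping adds for this offset."""
    return len(encode_bytes(data, offset)) - len(data)
```

With the default offset of 42, the input byte 214 lands on 0 and costs one extra character, while an offset of 0 leaves that same byte unescaped.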
While a fixed offset makes the encoding simple to understand and implement, this
technique can only hope to perform a little better than average at best, and no better
at all at worst.
A variable offset would mean that no more escaping is done than is necessary for
each file, always. This may not mean much improvement on average, but it all but
eliminates the possibility of the worst-case scenario.
A variable offset would be easy to implement: just add an "offset=xxx" parameter
to the =ybegin line. Calculating the best offset is very easy too.
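A minimal sketch of that calculation, assuming "best" means the offset for which the fewest input bytes land on a critical value. The critical set defaults to 0, 10 and 13, the values used in the scans on this page; pass a different set to include others such as '=' (61). The function names are illustrative. The `offset=` field is the parameter proposed here, not part of yEnc v1, and its position in the header line is an assumption.

```python
from collections import Counter


def best_offset(data: bytes, critical=(0, 10, 13)) -> int:
    """Return the offset (0..255) that maps the fewest bytes onto critical values.

    Ties are resolved in favour of the lowest offset.
    """
    counts = Counter(data)  # histogram: byte value -> number of occurrences

    def cost(offset: int) -> int:
        # The input byte (c - offset) mod 256 is the one that ends up on c.
        return sum(counts[(c - offset) % 256] for c in critical)

    return min(range(256), key=cost)


def ybegin_line(name: str, size: int, offset: int, line: int = 128) -> str:
    """Format a header line carrying the proposed (hypothetical) offset field."""
    return f"=ybegin line={line} size={size} offset={offset} name={name}"
```

Offsets are taken modulo 256, so a negative offset and its positive counterpart are the same thing: `-34 % 256` is 222, which is why the results on this page list both forms.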
Here are some examples of file scans used to calculate the best offset. The bar graph
displays how often each byte value (0...255) occurs.
Note: The red line is the value 0; the two purple lines are the values 10 and 13. The
light green and two darker green lines are the values calculated to be the best ones
to map onto the aforementioned critical values.
Note: the scale is different in each picture!
JPEG
32MB of JPEG files have been scanned.
The value that occurs the most is 0, at 1.4%. The value that occurs the least is 47.
The best offset is -34 (= +222).
CAB
52MB of CAB files have been scanned.
Note that all the values occur almost equally often, a side effect of compression.
The value that occurs the most is 0, at 0.5%. The value that occurs the least is 16.
The best offset is -6 (= +250).
ALL
165MB of various files have been scanned.
Note the small peak in the middle: this is due to the (only) 11MB of WAV files that were
among the files scanned!
The value that occurs the most is 0, at 5.9%. The value that occurs the least is 33.
The best offset is -193 (= +63).
WAV
11MB of WAV files have been scanned.
Note the characteristic graph, which is typical of WAV files.
The value that occurs the most is 128, at 6.5%. The value that occurs the least is 233.
The best offset is -47 (= +209).