LHA, LZH, LZS (LHArc compressed files)


*** LHA, LZH, LZS (LHArc compressed files)
*** Document revision: 1.3
*** Last updated: March 11, 2004
*** Contributors/samples: Joe Forster/STA, net documents

  These files are created with LHA on the C64 (or C128),  and  can  present
special problems to the typical PC user. The compression used  is  LH1,  an
old method used on LZH 1.xx (pre-version 2), so any version of LHA  on  the
PC can  uncompress  them.  However,  LHA  allows  filenames  of  up  to  18
characters long, and DOS doesn't know how to handle them (Windows 95  unLHA
utilities will extract the full  filename).  Usually,  some  of  the  files
already  uncompressed  will  be  overwritten  by  other  files  just  being
uncompressed because the name seems the same to DOS. To  LHA  however,  the
filenames are quite different.

  LHA archives always have a string two bytes into the file ("-L??-") which
describe the type of compression used. Over the  development  life  of  LHA
there have been several different compression algorithms used. The "??"  in
the "-L??-" can be one of several possibilites, but on the C64 it is likely
limited to "H0" (no compression) and "H1". Newer versions  of  LHA/LZH  use
other combinations like "H2", "H3", "H4", "H5", "ZS", "Z5", and  "Z4".  The
letters typically used in the compression string come from a combination of
the creators initials of the LZ algorithm, Lempel/Ziv, and  the  author  of
the LHA program, Haruyasu Yoshizaki.

The following is a sample of an LHA header. Note the string to  search  for
at byte $0002:

      00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F        ASCII
      -----------------------------------------------   ----------------
0000: 24 93 2D 6C 68 31 2D 39 02 00 00 16 04 00 00 00   ..-lh1-.........
0010: 08 4C 14 00 00 0E 73 79 73 2E 48 6F 75 73 65 20   ................
0020: 4D 34 00 53 DE 06 11 1C 12 C4 C8 FA 3A 5B DC CE   ................
0030: B2 FA 38 1E 46 B0 B6 9E 9B 75 7A 49 71 72 B3 53   ................
0040: 6E 4E B4 A0 BF 5E 95 B3 05 8A 75 D5 6C E3 03 4A   ................
0050: 2C 54 F4 AF 05 18 59 E2 F4 34 4A 0A 28 D4 33 E2   ................
0060: C4 9D 04 D7 C7 8B 91 66 0E E5 DE 98 3C 92 CC B5   ................

  The header layout is fairly basic. The header for each file starts  *two*
bytes before the "-lh?-" string. The above example has already been trimmed
down to start at these two bytes. Each header has the same layout, only the
length varies due to the length of the filename. Here is a breakdown of the
above example.

    Bytes: $0000: 24 - Length of header (known as "LEN", not including this
                       and the next byte). If it is zero, we are at the end
                       of the file.
            0001: 93                   - Header checksum
            0002: 2D 6C 68 31 2D       - LHA compression type "-LH1-"
            0007: 39 02 00 00          - Compressed file size ($00000239)
            000B: 16 04 00 00          - Uncompressed file size ($00000416)
            000F: 00 08 4C 14          - Time/date stamp
            0013: 00                   - File attribute
            0014: 00                   - Header level
                                              00 = non-extended header
                                          01, 02 = extended header
            0015: 0E                   - Length of the following filename
            0016: 73 79 73 2E 48 6F 75 - Filename, with a zero and filetype
                  73 65 20 4D 34 00 53   appended ("SYS.HOUSE M4?S"). The
                                         name can be up to 18 characters in
                                         length. Note the length *includes*
                                         the zero and filetype, making the
                                         actual filename length 2 bytes
                                         shorter.
            0024: DE 06                - File data checksum (starts at LEN)
            0026: 11 1C 12 C4 C8 FA... - File data (starts at LEN+2)

  The header checksum at byte $0001 is calculated by adding  the  bytes  in
the header from $0002 (LHA compression type) to LEN+1 (File data checksum),
without carry.

  The time/date stamp (bytes $000F-$0012), is broken down as follows:

      Bytes:$000F-0010: Time of last modification:
                        BITS  0- 4: Seconds divided by 2
                                    (0-58, only even numbers)
                        BITS  5-10: Minutes (0-59)
                        BITS 11-15: Hours (0-23, no AM or PM)
      Bytes:$0011-0012: Date of last modification:
                        BITS  0- 4: Day (1-31)
                        BITS  5- 9: Month (1-12)
                        BITS 10-15: Year minus 1980

  The format of the compressed data is much too complex to get  into  here.
Understanding the layout would require  knowledge  of  Huffman  coding  and
sliding dictionaries, and  is  nowhere  near  as  simple  as  ZipCode!  The
description given in the LHA source  code  for  the  different  compression
modes are as follows:

      -lh0- no compression, file stored
      -lh1- 4k sliding dictionary (max 60 bytes) + dynamic Huffman +  fixed
            encoding of position
      -lh2- 8k sliding dictionary (max 256 bytes) + dynamic Huffman
      -lh3- 8k sliding dictionary (max 256 bytes) + static Huffman
      -lh4- 4k sliding dictionary  (max  256  bytes)  +  static  Huffman  +
            improved encoding of position and trees
      -lh5- 8k sliding dictionary  (max  256  bytes)  +  static  Huffman  +
            improved encoding of position and trees
      -lzs- 2k sliding dictionary (max 17 bytes)
      -lz4- no compression, file stored
      -lz5- 4k sliding dictionary (max 17 bytes)

  There are several utilities that you can use to decompress  these  files,
like the already-mentioned LHA on the PC, or Star  LHA,  one  of  the  many
excellent utilities contained in the Star Commander  distribution  package.
If you use Star LHA, keep in mind it needs the program LHA v2.14 (or newer)
to extract. If an older version of LHA is used (such as the common  version
2.13), then the files being extracted will be corrupt. It will extract  the
files directly into a D64 image, so the long  C64  filenames  will  not  be
lost.

  To an emulator user there is no use to these files, as  their  only  real
usage on a C64 was for storage  and  transmission  benefits.  The  standard
compression program on the PC is PKZIP (or ZIP compatibles), so unless  you
have some need to send *compressed* files back the C64, there is no use  in
using LHA.