########
             ##################
         ######            ######
      #####
    #####  ####  ####      ##      #####   ####  ####  ####  ####  ####   #####
  #####    ##    ##      ####    ##   ##   ##  ###     ##    ####  ##   ##   ##
 #####    ########     ##  ##   ##        #####       ##    ## ## ##   ##
#####    ##    ##    ########  ##   ##   ##  ###     ##    ##  ####   ##   ##
#####  ####  ####  ####  ####  #####   ####  ####  ####  ####  ####   ######
#####                                                                    ##
 ######            ######           Issue #19
   ##################             May 29, 2000
       ########			 (Memorial Day)


...............................................................................

			    Seek, and ye shall find.

		           Ask, and it shall be given.

...............................................................................

BSOUT

	C'mon, it's only, what, 9 months late?

	Many of you, I am sure, have been wondering, "Is C=Hacking still
alive?  Has he lost interest?"  The respective answers are yes, and no.
				- BUT -
	Although I have not lost interest in the 64, I have lost a lot of
free time I once had, and I am now able to pursue a lot of other interests!
So the total time allocated to the 64, and hence to C=Hacking, has
decreased considerably.  Work on this issue actually began last summer,
around August or September.  But work on jpx began about the same time,
followed by work on Sirius, and I devoted my C64 time to them instead of
C=Hacking.  Then work intensified at work, and work began on a garage, and
a plane, and... well, you get the idea.  Poor issue #19 just got worked on
in little dribbles every few weeks.
	The main reason I share this sad tale is that, the way I see it,
C=Hacking could use a little help, if it is to come out more frequently.
If nobody volunteers it will still come out, but in exactly the way it
does right now -- a little less frequently than it ought to.  Some of the
more time-consuming tasks are: finding articles, reviewing (actually
refereeing) articles, and collecting the latest news and tips.  Finding
articles means finding people who are doing some nifty Commodore project,
or talking someone into doing some nifty Commodore project.  Refereeing an
article means reading the article carefully, making sure everything is
technically correct, making suggestions for improvement, and so on.  And
collecting news means being plugged into the system.
	I have a few people I rely on for some of these things, but I
could use more, and if you'd like to help out (especially finding new
articles, or keeping up to date on the latest C64 news) please drop me
an email.

	With that out of the way, brother Judd would like to preach on
a malaise that afflicts the C64 world and which has been getting worse:
Not Finishing The Job.  I just think about all the promising projects
I've heard about over the last few years -- off the top of my head I remember
a SCPU game, a SCPU monitor, several demos, multiple utilities, a VDC code
library, several OSes... -- which were Almost Done.  And where are they
now?  Presumably, still Almost Done.  So if you have a project which is
Almost Done, but has been sitting around for the last few months/years...
please, please finish up that last 10% and release it.
	We, the technical community, are a community.  We draw strength
from each other, we get ideas and motivation from each other, and we
push each other to do great things.  It's a big feedback loop, where
activity stimulates more activity, and decreased activity begets yet less
activity.  I suppose C=Hacking serves as a prime example of this.
	I'm not saying we're on the verge of a big programming renaissance,
but I am concerned that we are drying up.  Maybe if people finish up those
programs lying around it will reverse the trend.  (I mean, hey, doesn't
this finally finished-up issue make you want to go out and do cool stuff?)

	In other news, The Wave seems to be testing out wonderfully and
is totally cool.  In case you've been under a rock these past few months
The Wave is an integrated TCP/IP suite for Wheels -- telnet, graphical
web browser, PPP, the works.  Lots of people have been beta-testing it
for several months now and it is solid.  Outstanding.
	I was asked lo these many months ago to put in a plug for

	http://www.6502.org

which is run by Mike Naberezny (mnaberez@nyx.net).  He is looking for
comments, suggestions, and maybe even contributions, so drop him a line
and tell him what you think.
	The ever-resourceful Pasi Ojala has several new thingies on his
web site.  This is probably ancient history by now but it's in my "latest
news" file, sooo...

1) a voice-only copy of the Amiga Expo 1988 presentation by R.J.Mical
   about the early years of Amiga is available in four parts as .mp3
   from http://www.cs.tut.fi/~albert/Dev/
        (24kbit/s, 16kHz, mono, ~20MB total, over 100 minutes)
   Includes facts and fiction and funny stories about the making of
   the Amiga. The files may change location in the future but you
   will find links to them from my page. Enjoy!

2) Some VIC20 graphics are also available at
        http://www.cs.tut.fi/~albert/Dev/VicPic/
   There is one picture which can be viewed with unexpanded VIC20
   (with 154x/7x or 1581 drive) and others for 8k-expanded
   machine. Both PAL and NTSC versions are available.
   There are also gif versions of the pictures on the page.

	Myke Carter (mykec@delphi.com) has developed a filter program
that allows C=Hacking to be converted to geoWrite format.  Thus, if
you'd like a geoWrite version of C=Hacking, send him some email!

Finally, this is Memorial Day here in the States, and I'd just like to
suggest folks take a little time to think about the purpose of this holiday
and why we have it.  

Okay then, enough with the jabber, and on to hacking excellence.

.......
....
..
.                                    C=H 19

::::::::::::::::::::::::::::::::::: Contents ::::::::::::::::::::::::::::::::::

BSOUT
	o Voluminous ruminations from your unfettered editor.


Jiffies
	o Things.  And stuff.


Side Hacking

	o "Burst Fastloader for the C64", by Pasi Ojala <albert@cs.tut.fi>.
	  The 128 can burst-load from devices such as the 1571 and 1581.
	  With a small hardware modification, the C64 can too -- as it was
	  originally designed for.  This article discusses the modification
	  along with example burstload code.

	o "8000's User Port & Centronics Printers", by Ken Ross 
	  <petlibrary@bigfoot.com>.  This article describes the user port
	  on the PET 8000, including a demonstration BASIC program for
	  sending data to e.g. a centronics printer via the user port.


Main Articles

	o "Sex, lies, and microkernal-based 65816 native OSes, part 1",
	  by Jolse Maginnis <jmaginni@postoffice.utas.edu.au>.  It's time
	  to learn about OS design and design philosophy.  This article
	  starts with OS basics and ends with JOS innards.  (JOS, in case
	  you've been under a rock the past few months, is a rather cool
	  multitasking 65816 OS which can do some rather cool things).

	o "VIC-20 Kernel ROM Disassembly Project", by Richard Cini
	  <rcini@email.msn.com>

	  And on we go to article three in the series.  This article continues
	  the investigation of the IRQ and NMI routines -- specifically,
	  the routines called by those routines (UDTIM, SCNKEY, etc.).

	o "JPEG: Decoding and Rendering on a C64", by S. Judd <sjudd@ffd2.com>
	  and Adrian Gonzalez <adrianglz@globalpc.net>.  Actually it's
	  two articles:

	  "Decoding JPEGs".  This article covers the basics and details of
	  JPEG encoding and decoding, with special attention to the IDCT,
	  and some related C64 issues.

	  "Bringing 'true color' images to the 64".  This article discusses
	  Floyd-Steinberg dithering, and how the IFLI graphics in jpz are
	  rendered.

	  
.................................. Credits ...................................

Editor, The Big Kahuna, The Car'a'carn..... Stephen L. Judd
C=Hacking logo by.......................... Mark Lawrence

Special thanks to the folks who have helped out with reviewing and such,
and to the article authors for being patient!

Legal disclaimer:
	1) If you screw it up it's your own fault!  
	2) If you use someone's stuff without permission you're a dork!

About the authors:

Jolse Maginnis is a 20 year old programmer and web page designer,
currently taking a break from CS studies.  He first came into contact
with the C64 at just five or six years of age, when his parents brought
home their "work" computer.  He started out playing games, then moved on
to BASIC, and then on to ML.  He always wanted to be a demo coder, and in
1994 met up with a coder at a user's group meeting, and has since worked
on a variety of projects from NTSC fixing to writing demo pages and intros
and even a music collection.  JOS is taking up all his C64 time and he
is otherwise playing/watching sports, out with his girlfriend, or at a
movie or concert somewhere.  He'd just like to say that "everyone MUST
buy a SuperCPU, it's the way of the future" and that if he can afford
one, anyone can!

Richard Cini is a 31 year old vice president of Congress Financial
Corporation, and first became involved with Commodore 8-bits in 1981, when
his parents bought him a VIC-20 as a birthday present.  Mostly he used it
for general BASIC programming, with some ML later on, for projects such as
controlling the lawn sprinkler system, and for a text-to-speech synthesizer.
All his CBM stuff is packed up right now, along with his other "classic" 
computers, including a PDP11/34 and a KIM-1.  In addition to collecting
old computers Richard enjoys gardening, golf, and recently has gotten
interested in robotics.  As to the C= community, he feels that it
is unique in being fiercely loyal without being evangelical, unlike
some other communities, while being extremely creative in making the 
best use out of the 64.

Adrian Gonzalez is a 26 year old system/network administrator for an ISP
serving Laredo, TX and Nuevo Laredo, Mexico.  He and his brother convinced
their parents to buy them a C64 in 1984, and whereas his brother moved on
to PCs he stuck with the 64 and later bought an Amiga.  He learned BASIC
programming in sixth grade and wrote a few BASIC programs for the family
business; since then Adrian has put several demos and utilities under his
belt.  In addition to fancy graphics and music, Adrian has an interest
in copy protection schemes (and playing the occasional game, of course).
When he's not coding, he's either playing basketball, playing piano,
editing videos, or going out to movies/parties.  You can visit his web
page at http://starbase.globalpc.net/c64/main.html for more info.

For information on the mailing list, ftp and web sites, send some email
to chacking-info@jbrain.com.

While http://www.ffd2.com/fridge/chacking is the main C=Hacking homepage,
C=Hacking is available many other places including

	http://www.funet.fi/pub/cbm/magazines/c=hacking/
	http://metalab.unc.edu/pub/micro/commodore/magazines/c=hacking/


................................... Jiffies ..................................

$FFC6

I actually have a little Jiffy that I 'discovered' recently.  It's one of
those things that is so obvious and simple that it took me several tries
before I stumbled onto it.  It also highlights a rather powerful feature
of the lowly C64 kernal.

Not long ago, I was asked to write a slideshow program for jpz.  Ideally,
a slideshow program should be a "plug-in" for the regular viewer, which can
load pictures from some list in a file.  But I didn't see a decent way to do
this, especially for jpz which has maybe 200 bytes free total.  Then the
thunderclap finally occurred.

Everyone has used CMD4 to redirect a file to the printer.  But just as the
kernal can redirect _output_ to different devices, it can redirect the
_input_ to be from different devices, using CHKIN.  So all the slideshow
program has to do is open a list of filenames, redirect input to that file,
and execute the normal jpz.  jpz just uses JSR CHRIN to get data -- normally
that data comes from the keyboard, but with CHKIN it comes from the file
instead, akin to "a.out < input" in unix.  Since jpz doesn't close the file,
calling jpz repetitively will keep reading from the input file.
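
In kernal terms the pattern is just the following (a generic sketch, not part
of the actual slideshow code below; device 8 and logical file 2 are arbitrary,
and here the bytes are simply echoed to the screen until EOF):

         lda #fend-fname
         ldx #<fname
         ldy #>fname
         jsr $ffbd       ;SETNAM
         lda #2          ;logical file number
         ldx #8          ;device number
         ldy #2          ;secondary address
         jsr $ffba       ;SETLFS
         jsr $ffc0       ;OPEN
         ldx #2
         jsr $ffc6       ;CHKIN: input now comes from the file
loop     jsr $ffcf       ;CHRIN reads from the file, not the keyboard
         jsr $ffd2       ;echo the byte to the screen
         lda $90         ;check ST for EOF
         and #$40
         beq loop
         lda #2
         jsr $ffc3       ;CLOSE
         jsr $ffcc       ;CLRCHN: input comes from the keyboard again
         rts

fname    txt 'somefile'
fend
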
	The result is a simple and effective slideshow program, and a trick
which ought to be useful in other situations.  Here is the entire slideshow
code, located at $02ae to be autobooting.  The main loop is seven lines long:

*      
* Simple slideshow -- slj 4/2000
*

         org $02ae

name     txt 'ssw.files'

start    
         lda #start-name
         ldx #<name
         ldy #>name
         jsr $ffbd
         lda #3
         tay
         ldx $ba
         jsr $ffba
         jsr $ffc0

         ldx #<main		;Modify JPZ to jump to main instead
         ldy #>main		;of exiting
         lda $10fb		;Check if jpy or jpz is in memory
         cmp #$4c
         bne :jpy
         stx $10fc
         sty $10fd
         beq main
:jpy     stx $10ed
         sty $10ee

main     
         ldx #3
         jsr $ffc6
         jsr $ffe4
         lda $90		;loop until EOF reached
         and #$40
         bne :done
         jmp $1000		;call jpz
:done    
         lda #3
         jsr $ffc3
         jsr $ffcc
         jmp $a474

         da start
         da start


................................ Side Hacking ................................

Burst Fastloader for C64 by Pasi Ojala, albert@cs.tut.fi
------------------------
   
   Commodore disk drives 1570/71 and 1581 implemented a new fast serial
   protocol to be used with the C128 computer. This synchronous serial
   protocol speeds up data transfer between the computer and the drive
   ten-fold. The amazing thing is that this kind of serial protocol was
   supposed to be used in the VIC-20 and the 1540 drive until it was
   discovered that a hardware bug in the 6522 VIA (versatile interface
   adapter) chip prevented the use of the chip's synchronous serial
   interface.
   
   The synchronous serial port would've allowed whole bytes to be sent in
   both directions without processor intervention with the maximum speed
   of one bit per two clock cycles. Without a bug-free synchronous serial
   port the transfer had to be slowed down considerably so that the
   receiver has a chance to detect all changes in the serial bus lines.
   This became the dead slow software-driven Commodore serial protocol.
   
  Synchronous Serial
  
   The complex interface adapter (6526 CIA) chips used in the Commodore 64
   and later in the Commodore 128 have bug-free synchronous serial
   interfaces: serial data and serial clock inputs/outputs. In input
   mode, each time a rising edge is detected in the serial clock pin
   (CNT), the state of the serial data (SP) is shifted into a register.
   When 8 bits are received the accumulated bits are moved into the
   serial data register and a bit is set in the interrupt status register
   to reflect this. If the corresponding interrupt is enabled, an
   interrupt is generated.
   
   In output mode the serial clock line is controlled by Timer A. The
   serial clock is derived from the timer underflow pulses. When a byte
   is written to the serial data register, the value is clocked out
   through the serial data pin (SP) and the corresponding clock signal
   appears on the serial clock pin (CNT). After all 8 bits are sent, the
   serial interrupt bit is set in the interrupt status register.
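
   To make that concrete, here is a rough, untested sketch of pushing one
   byte out of CIA1's synchronous serial port in output mode; the register
   values are simply the ones the loader later in this article uses when it
   announces itself to the drive (where the data byte itself does not
   matter). Note that CIA1 Timer A also drives the normal system IRQ, which
   is why the timer is restored to the standard PAL value afterwards.

        sei             ; we poll the interrupt status register
        lda #1          ; Timer A period sets the synchronous clock rate
        sta $dc04       ;  (1 = fastest possible)
        lda #0
        sta $dc05
        lda #$c1        ; start Timer A, serial port output, TOD 50Hz
        sta $dc0e
        bit $dc0d       ; reading the ICR clears any pending flags
        lda #$ff        ; the byte to transmit
        sta $dc0c       ; writing the serial data register starts the shift
        lda #8          ; ICR bit 3 = serial port
wait    bit $dc0d       ; set once all 8 bits have been clocked out
        beq wait
        lda #$25        ; restore Timer A to the normal PAL IRQ rate,
        sta $dc04       ;  as the loader below also does
        lda #$40
        sta $dc05
        cli
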
   
   The synchronous serial bus is used in the C128/157x/1581 fast serial
   protocol. An obsolete signal in the peripheral serial bus (SRQ) was taken
   into service as the new fast (synchronous) serial clock line. The old
   serial data line doubles as the slow and fast serial data line, and the
   old serial clock line doubles as the slow serial clock line and the fast
   serial (byte) acknowledge line.
   
   The fast serial protocol is basically very simple. The side sending
   data configures its synchronous serial port into output mode, the
   other side uses input mode. The old peripheral serial bus clock line
   is controlled by the receiving side and is used as an acknowledge:
   when the receiver is ready for data, it toggles the state of the clock
   line. The actual data is transferred using the synchronous serial
   ports. The sender writes the data to be sent into the serial data
   register and waits for the transfer to complete. The receiver waits
   for a byte to arrive into its serial data register. The actual
   transfer is automatically handled by the hardware.
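
   On the receiving side (which is what the C64 is during a load), one round
   of the protocol looks roughly like this, assuming the serial port has
   already been switched to input mode as the loader below does when it
   starts the actual load; this is essentially the loader's GetByte routine:

recv    lda #8          ; ICR bit 3 = serial port
0$      bit $dc0d       ; wait until a whole byte has shifted in
        beq 0$
        lda $dd00       ; toggle the old serial clock output (bit 4),
        eor #$10        ;  i.e. acknowledge, so the drive starts
        sta $dd00       ;  sending the next byte straight away
        lda $dc0c       ; fetch the received byte from the data register
        rts
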
   
   Both the drive and the computer must detect whether the other side can
   handle fast serial transfers. This is accomplished by sending a byte
   using the synchronous serial port while doing handshaking. The drive
   sends a fast serial byte when the computer sends a secondary address
   (SECOND, which is called by e.g. CHKOUT), the computer can in practice
   send the fast serial byte anytime after the drive is reset and before
   the drive would send fast serial bytes.
   
  Modification to C64
  
   To use the burst fastloader with the C64 we need to connect the CIA
   synchronous serial port to the synchronous serial lines of the
   Commodore peripheral serial bus. Two wires are needed: one to connect
   the serial bus data line to the synchronous serial port data line and
   one to connect the serial bus SRQ (the obsolete line for service
   request, now fast serial clock) to the synchronous serial port clock
   line. Select the right connections depending on whether you want to
   use CIA1 or CIA2.
   
        1570/1,1581                             C64

Pin1    SRQ     Fast serial bus clk             CNT1/2  User port 4/6
Pin5    DATA    Data - slow&fast bus            SP1/2   User port 5/7


Top view - old c64, CIA1
User port       Cass port       Serial connector

||||||||||||    ||||||           HHHHH          behind:
||||||||||||    ||||||         .-1 3 5-.
       ||______________________|  2 4  |          / \
       |        CNT1               6   |         // \\
       |_______________________________|         |||||
                SP1                             1 264 5


Top view - old c64, CIA2
User port       Cass port       Serial connector

||||||||||||    ||||||           HHHHH          behind:
||||||||||||    ||||||         .-1 3 5-.
     ||________________________|  2 4  |          / \
     |  CNT2                       6   |         // \\
     |_________________________________|         |||||
                SP2                             1 264 5

   Solder the wires either to the resistor pack or directly to the user
   port connector, but remember to leave the outer half of the connector
   free so that you can still plug in your user port devices.
   
   Then solder the other ends to the serial connector. Those left- and
   rightmost pins are 1 and 5, respectively, so it is fairly easy to do
   the soldering. You can also build a cable which connects those lines
   externally.
   
  Software for C64
  
   Of course the C64 only uses the standard slow serial routines and we
   need a separate fastloader routine to take advantage of the fast
   serial connection we just soldered into our machine. The following
   load routine is located in the unused area $2a7-$2ff and in the
   cassette buffer $334-$3ff. Just load and run the "burster" program. It
   installs the loader and replaces the default load routine with our
   routine. The old load routine is used if
     * a verify operation is requested
     * a directory load operation is requested (filename starts with '$')
     * the filename starts with a colon (':')
       
   So, it is possible to use the old load routine by prepending a colon
   (':') to the filename. This is needed if you need to use both fast and
   slow serial devices at the same time. Unfortunately detecting
   fast-serial-capable devices is not feasible, because a lot of ROM code
   would have to be duplicated and then the loader would become too
   large. Because of this it becomes the responsibility of the user to
   prepend the colon (':') if a slow serial device is accessed.
   
   A fastloader version is available for both CIA1 and CIA2 (asm and exe);
   uuencoded versions of both are attached to this article.
   Only the CIA1 version is discussed here.
   
; DASM V2.12.04 source
;
; Burst loader routine, minimal version to allow loading of programs upto 63k
; in length ($400-$ffff). Directory is loaded with the normal load routine.
;
; (c)1987-98 Pasi Ojala, Use where you want, but please give me some credit
;
; This program needs SRQ to be connected to CNT1 and DATA to SP1 (CIA1).
; Cassette drive won't work with those wires connected if the disk drive
; is turned on. (SRQ is connected to cassette read line.)
;
; SRQ = Bidirectional fast clock line for fast serial bus
; DATA= Slow/Fast serial data (software clocked in slow mode)
;
; In C128D (64-mode) you should use CIA2, because it has special hardware
; which inhibits the use of CIA1 (or so I'm told).
;
; A short description of the burst protocol and commands can be found
; from the "1581 Disk Drive User's Guide".

        processor 6502

        ORG $0801
        DC.B $b,8,$ef,0 ; '239 SYS2061'
        DC.B $9e,$32,$30,$36,$31
        DC.B 0,0,0

install:
        ; copy first block to $2a7..$2ff
        ldx #block1_end-block1-1        ; Max $58
0$      lda block1,x
        sta _block1,x
        dex
        bpl 0$
        ; copy second block to $334..$3ff
        ldx #block2_end-block2          ; Max $cc
1$      lda block2-1,x
        sta _block2-1,x
        dex
        bne 1$

        lda $0330       ; load vector
        ldx $0331
        cmp #MyLoad
        beq 3$
2$      sta OldVrfy+1   ; chain the old load vector
        stx OldVrfy+2
        lda #MyLoad
        sta $0331
3$      rts

block1
#rorg $02a7
_block1
OldLoad lda #0
OldVrfy jmp $f4a5       ; The 'normal' load.

MyLoad: ;sta $93
        cmp #0          ; Is it a prg-load-operation ?
        bne OldVrfy     ; If not, use the normal routine
        stx $ae         ; Store the load address
        sty $af
        tay             ; ldy #0
        lda ($bb),y     ; Get the first char from filename
        ldy $af
        cmp #$24        ; Do we want a directory ($) ?
        beq OldLoad     ; Use the old routine if directory
        cmp #58         ; ':'
        beq OldLoad

        ; Activate Burst, the drive then knows we can handle it
        sei             ; We are polling the serial reg. intr. bit
        ldy #1          ; Set the clock rate to the fastest possible
        sty $dc04
        dey             ; = ldy #0
        sty $dc05
        lda #$c1
        sta $dc0e       ; Start TimerA, Serial Out, TOD 50Hz
        bit $dc0d       ; Clear interrupt register
        lda #8          ; Data to be sent, and interrupt mask
        sta $dc0c       ; (actually we just wake up the other end,
0$      bit $dc0d       ;  so that it believes that we can do
                        ;  burst transfers, data can be anything)
        beq 0$          ; Then we poll the serial (data sent)
        ; Clears the interrupt status

        ; This program assumes you don't try to use it on a 1541
        ; If you try anyway, your machine will probably lock up..

        lda #$25        ; Set the normal (PAL) frequence to TimerA
        sta $dc04       ; Change if you want to preserve NTSC-rate
        lda #$40
        sta $dc05
        lda #$81
        jmp LoadFile

GetByte lda #8          ; Interrupt mask for Serial Port
0$      bit $dc0d       ; Wait for a byte
        beq 0$          ;  (Serial port int. bit changes, hopefully)
        ;ldy $dc0c      ; Get the byte from Serial Port Register
ToggleClk:
        lda $dd00       ; Toggle the old serial clock (=send Ack)
        eor #$10        ;  so that the disk  drive will start
        sta $dd00       ;  sending the next byte immediately
        ;tya            ; return the value in Accumulator, update flags
        lda $dc0c       ; Get the byte from Serial Port Register
        rts
#rend
block1_end


block2
#rorg $0334
_block2

LoadFile:
        sta $dc0e       ; Start TimerA, Serial IN, TOD 50Hz (PAL)
        ;cli

        jsr $f5af       ; searching for ..

        lda $b7         ; Preserve the filename length
        pha
        lda $b9         ; Do the same with secondary address
        sta $a5         ; We store it to cassette sync countdown..
                        ;  No cassette routines are used anyway, as
        lda #0          ;  this prg is in cassette buffer..
        sta $b7         ; No filename for command channel
        lda #15
        sta $b9         ; Secondary address 15 == command channel
        lda #239
        sta $b8         ; Logical file number (15 might be in use?)
        jsr $ffc0       ; OPEN
        sta ErrNo+1
        pla
        sta $b7         ; Restore filename length
        bcs ErrNo       ; "device not present",
                        ; "too many open files" or "file already open"
        ; Send Burst command for Fastload
        ldx #239
        jsr $ffc9       ; CHKOUT Set command channel as output
        sta ErrNo+1
        bcs NoDev       ; "device not present" or other errors

        ; Bummer, the interrupt status register bit indicating fast serial
        ; will be cleared when we get here..

        ldy #3
3$      lda BCMD-1,y    ; Burst Fastload command
        jsr $ffd2
        dey
        bne 3$
        ; ldy #0
1$      lda ($bb),y
        jsr $ffd2       ; Send the filename byte by byte
        iny
        cpy $b7         ; Length of filename
        bne 1$
        jsr $ffcc       ; Clear channels

        sei
        jsr $ee85       ; Set serial clock on == clk line low
        bit $dc0d       ; Clear intr. register
        jsr ToggleClk   ; Toggle clk

        jsr HandleStat  ; Get Initial status
        pha             ; Store the Status

        ;jsr $f5d2      ; loading/verifying
        ; (uses CHROUT, which does CLI, so we can't use it)

; We could add a check here..
; if we don't have at least two bytes, we cannot read load address..

; It seems that for files shorter than 252 bytes the 1581 does not count
; the loading address into the block size.

        jsr GetByte     ; Get the load address (low) - We assume
                        ; that every file is at least 2 bytes long
        tax
        jsr GetByte     ; Get the load address (high)
        tay             ; already in Y
        lda $a5         ; The secondary address - do we use load
                        ;  address in the file or the one given to
        bne Our         ;  us by the caller ?
        stx $ae         ; We use file's load addr. -> store it.
        sty $af
Our     ldx #252        ; We have 252 bytes left in this block
        pla             ; Restore the Status
        bne Last        ; If not OK, it has to be bytes left
Loop    jsr GetAndStore ; Get X bytes and save them
        jsr HandleStat  ; Handle status byte
        beq Loop        ; If all was OK, loop..
Last    tax             ; Otherwise it is bytes left. Do the last..
        jsr GetAndStore ; Get X number of bytes and save them
        jsr $ee85       ; Serial clock on (the normal value)
        lda #239
        jsr $ffc3       ; Close the command channel
        clc             ; carry clear -> no error indicator
        bcc End

FileNotFound:
        pla             ; Pop the return address
        pla
        jsr $ee85       ; Serial clock on (the normal value)
        lda #4          ; File not found
        sta ErrNo+1
NoDev   lda #239
        jsr $ffc3       ; Close the command channel
ErrNo   lda #5          ; Device not present
        sec             ; carry set -> error indicator
End     ldx $ae         ; Loader returns the end address,
        ldy $af         ;  so get it into regs..
        cli
        rts             ; Return from the loader

HandleStat:
        jsr GetByte     ; Get a byte (and toggle clk to start the
                        ;  transfer for next byte)
        cmp #$1f        ; EOI ?
        bne 0$
        jmp GetByte     ; Get the number of bytes to follow and RTS
0$      cmp #2          ; File Not Found ?
        bcs FileNotFound        ; file not found or read error
        ; code 0 or 1 -> OK
        ldx #254        ; So, the whole block is coming
        lda #0          ; No error -> Z set
        rts

GetAndStore:
        jsr GetByte     ; Get a byte & toggle clk
        ;sta $d020
        ldy #$34
        sty 1           ; ROMs/IO off (hopefully no NMI:s occur..)
        ldy #0
        sta ($ae),y     ; Store the byte
        ldy #$37
        sty 1           ; Restore ROMs/IO (Should preserve the
                        ;  state, but here it doesn't..)
        inc $ae         ; Increase the address
        bne 0$
        inc $af
0$      dex             ; X= number of bytes to receive
        bne GetAndStore
        rts

BCMD:   dc.b $1f, $30, $55      ; 'U0',$1F == Burst Fastload command
                                ; If $9F, Doesn't have to be a prg-file
#rend
block2_end

   Now that was it. I'll just hold back and wait until someone
   implements this for the VIC-20's buggy 6522 chips so that I don't have
   to.. :-)
                                      
begin 644 burster-cia1
M`0@+".\`GC(P-C$```"B5[U"")VG`LH0]Z+'O9D(G3,#RM#WK3`#KC$#R:S0[
M!.`"\!"-J@*.JP*IK(TP`ZD"C3$#8*D`3*7TR0#0^8:NA*^HL;NDK\DD\.K)Y
M.O#F>*`!C`3QNR#2_\C$M]#V(,S_R
M>""%[BP-W"#S`B#,`T@@[`*J(.P"J*6ET`2&KH2OHOQHT`@@WP,@S`/P^*H@@
MWP,@A>ZI[R##_QB0$FAH((7NJ02-Q`.I[R##_ZD%.*:NI*]88"#L`LD?T`-,A
G[`+)`K#:HOZI`&`@[`*@-(0!H`"1KJ`WA`'FKM`"YJ_*T.A@'S!5/
``
end
size 354

begin 644 burster-cia2
M`0@+".\`GC(P-C$```"B2[U"")VG`LH0]Z+)O8T(G3,#RM#WK3`#KC$#R:S0E
M!.`"\!"-J@*.JP*IK(TP`ZD"C3$#8*D`3*7TR0#0^8:NA*^HL;NDK\DD\.K)Y
M.O#F>*`!C`3=B(P%W:G!C0[=+`W=J0B-#-TL#=WP^TPT`ZD(+`W=\/NM`-U)T
M$(T`W:T,W6"I@(T.W2"O]:6W2*6YA:6I`(6WJ0^%N:GOA;@@P/^-Q@-HA;>PZ
M:Z+O(,G_CZI!(W&`ZGO(,/_J04XIJZDKUA@(.`"R1_0`TS@`LD"L-JB_JD`H
=8"#@`J`TA`&@`)&NH#>$`>:NT`+FK\K0Z&`?,%4"Y
``
end
size 344

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

8000's USER PORT & CENTRONICS PRINTERS

by Ken Ross
petlibrary@bigfoot.com
http://members.tripod.com/~petlibrary

A recent query had me digging out an old item dealing with the user port on
the CBM/PETs.  The main use I've put it to in the past has been to drive a
parallel printer with just the addition of a home brew cable (a Panasonic
Daisy Wheel printer salvaged before the bin men got it!).  The user port is
the edge connection between the IEEE edge and the cassette #1 port.  The top
side is mostly diagnostic; the underside is the easy to use area.  It's an I/O
(Input/Output) system that you can control with a few PEEKs and POKEs.
Reading from left to right (as you look at the back of the beastie):

A _ ground
B _ input to 6522 VIA, CA1
C D E F H J K L _ are I/O lines (8 of them), PA0-7 [ data lines ]
M _ CB2 line from VIA can be I/O
N _ ground

A text file to be printed out can be read a character at a time with
MID$(etc) and handed to this PRG to deal with, and quite high speeds can be
reached even without having to compile it.

(This is actually a section of listing just printed out from my 8096 -
hence untidy numbers )

3010 POKE 59459, 255:REM make PA0-7 into outputs
3020 POKE 59467,PEEK(59467) AND 227 :REM disable shift register
3022 RETURN :REM finished with this sub
     [this enables the user port for this purpose]
3023 REM this sub puts the data into output
3024 if D<32 then goto 3080 :REM line does biz for LF & CR
3026 if D=>65 and D<=90 then D=D+32 : goto 3029
     [petscii lower case is chr$(65-90) but ascii uses 97-122]
3027 if D=>193 and D<=218 then D=D-128 :goto 3029
     [petscii upper case is chr$(193-218) which has to be shifted to
      ascii 65-90]
     [ascii uses up to 127 but petscii uses up to 255 for chars]
3029 REM line below sets strobe low to inform printer new data character on way
3031 POKE 59468, PEEK(59468) AND 31 OR 192
3035 REM below sets strobe high as data arrives
3045 POKE 59468,PEEK(59468) AND 31 OR 224
3050 POKE 59471, D:REM at last data is POKE'd !!!
     [the data numbers from above]
3060 POKE 59468,PEEK(59468) AND 31 OR 224 :REM strobe high still
3065 REM handshake sub
3066  POKE 59467, PEEK(59467) OR 1
3067 WAIT 59469,2
3068 K=PEEK(59457)
3069 REM end of handshake sub
     [well it works for me!!]
3070 RETURN :REM back to main area for next data
3080 REM bit for LF & CR sub & return
     [this depends on the printer and the same procedure for paper eject
      if needed]

The cable connections are
CBM	- CENTRONICS
CB2     - DATA STROBE   #1
PA0~7   - DATA1-8       #2-9
CA1     - ACKNOWLEDGE   #10 ( or BUSY #11 depending on printer ! )
GND	- grounds #14, 16, 24, 33, chassis gnd 17

More modern printers will also need additional commands to enable things.
The commands needed for Epson printers (with a list of the exceptional
Epsons that don't use them!) are on my website at:

	http://members.tripod.com/~petlibrary/printesc.htm

If any more info turns up it'll be there in time.
.......
....
..
.                                    C=H 19

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

				 Main Articles

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


	  ------------------------------------------------------------
         | Sex, lies and microkernel based 65816 native OSes. - Part 1|
	  ------------------------------------------------------------
			      By Jolse Maginnis
 
Some readers may have read my article in GO64 issue 8/1999, which was a bit of
an introduction to JOS and some Operating System concepts, but it wasn't very
technical, and didn't really get into the nitty-gritty. Getting down and dirty
with the bits and bytes is what C=Hacking is all about, so that's what this
series of articles will try to do wherever possible.

I'll try to go into detail about modern OS designs, paying particular attention
to what is relevant to the C64/SuperCPU and what we can do without. I'll also
try and make comparisons to the kind of coding most of us are used to, e.g. just
using the kernel to access hardware, or just skipping the kernel altogether.
Most of the article will be in reference to the SuperCPU, specifically its
65816 CPU, and the OS I'm making for it, called JOS. If you haven't got a
SuperCPU yet, hopefully you'll want one by the end! (Remember it won't stop you
running stock programs!)

	      -------------------------------------------------
	     | OK, So what do you plan to do.. And why bother? |
	      -------------------------------------------------

When I first heard about the SuperCPU, I got pretty excited. "20MHz! That's 20
times faster! 16MB! That's 256 times more RAM! I can only imagine what it's
capable of!" Well, I didn't actually say those things, but I at least thought
them! At the time I had already started making an OS for the C64, though I
didn't know much at all about making an OS; all I knew about was multitasking,
and how to do it on the C64. After that day, I decided I'd wait until I managed
to get myself a SuperCPU and make an OS on that, and to my surprise, at that
time, there didn't seem to be anyone else developing an OS for the SuperCPU.

Only when the SCPU arrived and I had started coding for it, did I realise how
powerful it was. Yeah it's 20 times faster in clock speed, but it's also a 16
bit processor, which might not seem like a great step up, but once you start
coding in 16 bits, it's hard to see how you did without it!

The 65816 has some great advantages over the 6502:
Its stack pointer is not limited to 256 bytes.
The Zero Page isn't stuck in the zero page! (It's now called the Direct Page).
There are a few more ways to put values on the stack.
Long addressing allows up to 16MB of directly accessible memory.
Plenty more..

The top three things in particular, together with the 16 bit wide registers,
mean it's very well suited to programming in a high level language like C,
particularly when compared to the code that has to be produced for the 6502.
Higher level languages can actually use the real CPU stack rather than having
to simulate it, as on the 6502. Also, by moving the Direct Page register, local
variables can be accessed like zero page variables, so performance isn't hurt
too much.
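
As a rough illustration (this is only a sketch of the idea, not actual JOS or
compiler output), a function can claim space for its locals on the stack and
then point the Direct Page register at that space, so the locals can be reached
with cheap "zero page" style addressing. The compiled example later in this
article does exactly this in its prologue and epilogue:

	       phd		; save the caller's Direct Page
	       tsc		; claim, say, 4 bytes of locals on the stack
	       sec
	       sbc #4
	       tcs
	       tcd		; DP now sits just below the locals
	       lda #123
	       sta 1		; a stack local, addressed like zero page
	       tsc		; release the locals again
	       clc
	       adc #4
	       tcs
	       pld		; restore the caller's Direct Page
	       rtl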

All this would be good even at a lower speed like 1 or 2MHz, but it's at 20!
The SuperCPU adds some real power to your old C64, but it's all hidden away
because we're running a ~20 year old "OS". It's just crying out for a new one! 

The C64 has many limitations, most of which come from the kernel and the
CBM serial bus. Here's a list of the main limits:

Single Tasking - Running two separate programs at the same time is impossible.

Some devices aren't catered for - Some devices don't have a chance at running
with old programs that were designed before their time.

Old sequential filesystem - It's not designed for random access files; random
access is possible, it's just slower. All C64 programs have to be written so
that files are read from the beginning to the end, which is a little bit
limiting. Also it's the drives that dictate the filesystem, so we aren't just
stuck with the kernel's limits, we're stuck with the drives' as well. Having
several files open on many drives, while reading and writing to all of them,
just isn't a possibility. Why would you want to do that? If you were
multitasking several programs, that might just be what happens!

It became pretty clear that the C64's kernel was of no use to JOS, since it had
too many limitations. So everything had to be re-written from scratch, with the
limits removed. 

Along with re-doing the filesystem and adding multitasking, I had some other
plans for JOS:

Networking - Everything is internet, internet, internet these days, and why not,
the internet is great! So TCP/IP and SLIP/PPP were high on the list of TODO's.

GUI - The SuperCPU is ideal for a nice, flexible, easy to program GUI.

Console - I wanted the console to be as close as possible to one of the standard
terminals (vt100,ansi etc..) thus making it easy to get by without needing a
terminal emulation program.

Shared libraries & shared code, relocatable binary format - Sharing as much code
as possible really saves memory and loading time. The binary format means that
you don't have to worry about where in memory your program will be.

Modular and scalable - It's nice to be able to choose exactly what your OS
needs, rather than getting lumped with it all. E.g. do you really need TCP/IP
loaded if you're not going to use the internet? If I'm running a webserver, do I
really need the console driver loaded?

Device independence - Applications should not have to worry at all about what
devices they are using, which means that they'll be compatible with any device,
including new ones. This is particularly useful when it comes to disk drives and
filesystems.

Porting and writing C programs - Wouldn't it be great if our C64's could take
advantage of the Open Source movement that's sweeping the world, and compile
some of these open source programs?

OK, so why am I bothering? At first I just wanted to see what I could do with
it, but now that it's come so far, it's not only of interest to me, as it's
become a very powerful OS.

	      -------------------------------------------------
	     |             Bloat: My layers theory             |
	      -------------------------------------------------

Unless you've been living on a remote desert island for the last 5 years, you'll
know about the terrible trend in personal computing these days; buy a new PC now
and in 6 months or less it's outdated. As CBM users, we successfully avoid all
this. Sure, CMD have tonnes of upgrades available, but they're all "once in a
lifetime" upgrades; I'm pretty sure I won't be upgrading my SuperCPU!

Have you ever thought about why PC's become outdated so quickly? It's very
popular to blame Microsoft (and I will!), since they are the main proponent of
bloat with their ever expanding OSes and applications, but it's just generally
accepted now that it's ok to leave things unoptimized, and just add more and
more "layers". I run Linux on my 486 PC, with 10mb of RAM, and it's unbelievable
how much time is spent "chunking" or "thrashing", due to programs and their
components taking up so much RAM. For me, it's all about layers. It's what
separates C64's from the bloated world of the PC. Here's my comparisons...


CPU Type
--------
PC           - 32 bit processors   
C64/SuperCPU - 8/16 bit Processor

This is quite arguable, but when most of your code doesn't deal with numbers
over 32768, 32 bits can be a bit wasteful; of course, if you need to do 32
bit arithmetic on an 8 or 16 bit processor, that too is wasteful. For me a 16
bit processor is the ideal size, particularly after doing lots of 8 bit coding.

Language used
-------------
PC           - Mainly C, C++
C64/SuperCPU - Just about everything in Assembler

C can be a thin layer or a thick layer, depending on the processor. On 6502 it's
quite a thick layer, which is why most things for C64 were written in ASM. On
65816, that layer isn't so thick, so it's a much more viable alternative.
Although, when you write in a higher level language, you tend to forget about
the actual code it produces, and don't bother optimizing it. C++ adds another
layer onto C, not only because of the code it produces, but because of the style
of programming. Good object oriented programming practice adds extra bloat,
because there is more emphasis on doing function calls to do things that
ordinarily are done by directly accessing the data. The real bloat of Object
Orientation isn't actually the code that you write yourself, since you can still
write optimized code in an OO language; the bloat is in the libraries of objects
that you use when writing your application. Take a look at Java's huge object
libraries, for example.

OS type
-------

PC           - Multitasking OS
C64/SuperCPU - Kernel, or no OS at all.

A multitasking OS adds some layers by default, since it has to switch between
processes. The OS isn't just the task switcher however, it's everything that's
needed to run applications, such as device drivers and shared libraries. In my
opinion, as little of the OS as possible (ideally none) should be written in
a high level language, since it's going to be used by every application, and you
want frequently used things to be as optimized as possible. Most definitely the
most useful service an OS can provide is doing all the Disk I/O. Unfortunately
for us, the C64's kernel and CBM's serial bus are nowhere near fast enough, so
coders made their own DOS routines.

User Interface
--------------

PC          - Windows, X Windows
C64         - BASIC, GEOS

Windows and X are the most popular GUIs going around. X doesn't impose any
standards on applications; they are free to use whatever widget toolkits they
want, and usually do! When you have a few different applications running, each
with its own GUI toolkit, you soon run out of memory, particularly if they're
big bloated C++ toolkits. Windows isn't quite the same: you at least have a
consistent look and feel, which also adds up to less memory wastage because most
apps use the same code. GEOS is nice looking but isn't very flexible at all,
though this does mean that it's a very thin layer. My hope is to achieve a balance
between the two.

So why'd I bother with all that? Well, I just want to highlight that JOS will be
taking all those things into account, and I want to minimize the number and size
of the layers being added to our beloved C64's.


	      -------------------------------------------------
	     | Monolithic or Micro? How do we want our kernel? |
	      -------------------------------------------------

There are two main styles of OSes doing the rounds at the moment, both with
their own good and bad points.

Monolithic kernels
------------------

These, as the name suggests, are one large monolith of code, usually containing
driver code for all devices. You would definitely consider the C64's
kernel as a monolithic kernel. Multitasking kernels sometimes allow
modularization, which is basically very similar to what a microkernel does, by
allowing parts of the OS to be dynamically loaded. Linux is a very popular
example of this. It's a monolithic kernel which allows kernel modules to be
loaded dynamically. Last time I checked Lunix Next Generation worked along these
lines. 

Good    - Generally a little faster than Microkernels, particularly if the time
	  taken to switch processes is slow.
	  
Bad     - Not as scalable as a Microkernel. You get everything in a big chunk,
	  whether you need it or not.
	  
How     - Generally applications need to make calls to a jump table, which
	  usually will point to routines for Opening, Closing, Reading and
	  Writing devices.
	  
          e.g.
	  	lda #'a'
	  	jsr $ffd2
	  Prints the 'a' character to the current file/device.
	  

Microkernel
-----------

Microkernels truly are micro in size, if they're done correctly. Rather than
lump all the device driver and API code in together, Microkernels only provide
very simple services for setting up processes and allowing them to communicate
with each other. All the device drivers and file-systems are then supplied by
optional programs that are loaded dynamically at run time. This allows maximum
scalability, as you simply don't have to load parts of the OS that you don't
need. The best example other than JOS would be QNX (http://www.qnx.com), a UNIX
based Microkernel OS, which is extremely scalable and very small in code size.
On 6502/C64, OS/A65 is another Microkernel OS.

Microkernel OSes rely heavily on fast Inter Process Communication (IPC). Luckily
this is quite easy to achieve on 65816, and is basically a matter of passing
pointers between processes.

Good    - Extremely scalable. Nicely split up into easily managed parts. Easier
          to debug. I chose a Microkernel for JOS for these reasons.

Bad     - Can be slower if too much time is spent switching between processes.

How     - A jump table is still used, but to actually do any I/O you need to
          communicate with the server process via IPC.
	  
	  To do this in JOS, you set up a message somewhere in memory
	  and then call the S_send system call to send it to the server
	  process. Usually the message will be put on the stack and then popped
	  off when returned, much like a C function call.
	  
	  e.g. to open the file "hello.txt" for reading
	        
		pea O_READ         ; flags
	  	pea ^hellostr      ; high byte
		pea !hellostr      ; low word
                pea IO_OPEN        ; Message code
		tsc
		inc
		tax                ; Low word of Message = Stack+1
		ldy #0		   ; Stack is in Bank 0
		lda #Channel	   ; Channel where "hello.txt" is.
		jsr @S_send
		tsc
		clc
		adc #8
		tcs
		
	 hellostr .asc "hello.txt",0

         note: These are 65816 instructions, so if you don't know what they do
	 you better look them up! The '@' symbol is used to force long
	 addressing, '^' is used for the high 8 bits of a 24bit address, and '!'
	 is used as the bottom 16 bits.  Note that pea is a 16-bit instruction,
	 so pea ^hellostr will add an extra 00 byte.
	 
	 The first 4 pea's prepare an 8 byte filesystem message, containing:
	 Message code for an Open:	IO_OPEN
	 24 bit Pointer to Filename:	hellostr
	 Open flags for reading:	O_READ
	 
	 This message is passed to the filesystem using one of JOS's Inter
	 Process Communication (IPC) system calls, S_send. This call takes the
	 24 bit address of the message in X/Y, and the IPC channel to send the
	 message to, in the A register. Every system call in JOS
	 assumes 16 bit A/X/Y registers, as there really isn't anything to be
	 gained by switching to 8 bits for things that only need 8 bits. Adding
	 8 to the stack pointer at the end "pops" the message back off the
	 stack.
	 
	 This all looks a bit complicated, doesn't it? This is where shared
	 libraries help out. The standard C library for JOS allows you to do I/O
	 and such without actually worrying about the system calls. Yes it is a
	 "layer", but it's a very thin one, since the library is written in ASM.
	 
	       pea O_READ	; same as the c code: open("hello.txt",O_READ);
	       pea ^hellostr
	       pea !hellostr
	       jsr @_open
	       pla
	       pla
	       pla
	       
	 Much simpler right?
	 
	 Compare that with the C64 kernel equivalent of:
	 
	       lda #namelen
	       ldx #<hellostr
	       ldy #>hellostr
	       jsr $ffbd	; SETNAM
	       lda #1
	       ldx #8
	       ldy #1
	       jsr $ffba	; SETLFS
	       jsr $ffc0	; OPEN
	       
	 Notice that the JOS version doesn't worry about device numbers or
	 anything.. I'll get to that later...
	 
	         ---------------------------------------------
	        |       C isn't just the letter after B       |
	         ---------------------------------------------

Before I get into juicy OS details, I should say something about C and the
standard C library, as I'll be mentioning them quite a bit.

C is a very powerful language that was created by the same people who created
UNIX, so the two really go hand in hand. The majority of applications written
for UNIX type OSes are written in C; in fact, rather than give you executable
files, they are normally distributed as C source code that you have to compile
yourself. Why is it used so much? Well if the only high level language you've
seen is BASIC, then you'd wonder how any high level language could be used for
good quality programs. C is different because it's just about as close as
you can get to programming in assembly without actually doing it, particularly
on newer processors. It isn't quite so pretty on 6502, but it's quite good on
the 65816. 

In BASIC you're used to having "built in" commands that will print to the
screen, and commands for opening files and reading input, and any other I/O
you can think of. But C, on the other hand, has nothing "built in"; it doesn't
even have much of a notion of strings! Strings are just pointers to null
terminated arrays of characters in C. So how do you actually get C to do
anything useful? i.e. do some I/O?

This is where the C standard library comes in. This library contains functions
that deal with the underlying OS, and in particular opening/closing &
reading/writing files. It also has code for dealing with strings, allocating
memory, reading directories and various other useful functions. The standard
library also contains more UNIX orientated functions, for dealing with OS
features such as IPC and process control (more on processes later). 

JOS implements a large section of the standard C library, in particular the
section that most command line applications will use. It does implement some of
the UNIX specific functions, but not in a compatible way, and programs that use
these functions are likely to be system applications that aren't useful for any
other system anyway.

Although it's called the standard 'C' library, that doesn't mean it can't be
used in assembly language; in fact it's quite a bit easier to call the C
functions than to deal directly with the OS, and there is no speed penalty in
using the C library because it's been hand coded in assembly language anyway.

Would you like to see what it's like to code using the standard C library? I've
been talking about functions, and if you're familiar with C64 BASIC's functions,
it's quite similar to that, except that you can pass more than one value to the
function. It's basically the same as writing subroutines in assembly, where we
usually pass values using the A,X & Y registers or a ZP value etc. The only
difference is that ALL values are passed using the CPU stack, which is easily
accessible with the 65816. OK, let's take a look at the previous open file
example:

C code:        file = open("hello.txt",O_READ);

65816 assembly (16 bit regs):

	       pea O_READ
	       pea ^hellostr
	       pea !hellostr
	       jsr @_open	; C functions get "_" prepended to their names
	       pla		; so you don't get them mixed with assembly ones
	       pla
	       pla
	       stx file		; store the result in file
	       sty file+2

Notice that the values are placed onto the stack in reverse order, so they come
out in the correct order when the function accesses them. The calls are also
long jsr's because the library functions aren't likely to be in the same bank
as the calling program.

You might think that having to pop the values back off the stack is cumbersome,
and you're right. Why can't _open pop them off? Well it could; it'd need to do
some messing around with the stack at the end, but it'd make things look nicer.
The reason it can't is that C functions don't always know how much data will be
on the stack (printf, for instance, takes a variable number of arguments), so
they might pop the wrong amount off. It may look ugly, but you get used to it.

Now I'll give you a bigger example of what C code looks like after it's been
compiled, to prove that half decent code can be produced for the 65816. This
will probably only make sense if you've done C programming before, so if
you're not interested in this kind of thing skip this section..

Here's a minimal version of the standard unix util 'cat', which concatenates
files together and sends them to the screen or whatever the stdout file is, as
it can be redirected in UNIX.

#include <stdio.h>

int main(int argc, char *argv[]) {
	FILE *fp;
	int ch=0;
   	int upto=1;
	
	if (argc<2) {
		fprintf(stderr,"Usage: cat FILE ...\n");
		exit(1);
	}
	argc--;
   	while(argc--) {
		fp = fopen(argv[upto++],"r");
		if (!fp) {
			perror("cat");
			exit(1);
		}
		while((ch = fgetc(fp)) != EOF)
			if (putchar(ch) == EOF) {
				perror("cat");
				exit(1);
			}
	   	fclose(fp);
	}
}

and here's the (unoptimized) compiled version:


#define _AS sep #$20:.as
#define _AL rep #$20:.al
#define _XS sep #$10:.xs
#define _XL rep #$10:.xl
#define _AXL rep #$30:.al:.xl
#define _AXS sep #$30:.as:.xs

	.xl		; make sure it's 16 bit code
	.al

	.(

mreg 	= 1
mreg2 	= 5

	.text

+_main
-_main:

	.(

RZ 	= 8		; RZ = register size: Two pseudo 32 bit registers
LZ 	= 26		; LZ = Local size: size of the local variables for this
			; function

	phd
	tsc		/* make space for local variables */
	sec
	sbc #LZ
	tcs
	tcd		/* set up the DP register as the frame pointer */
	
	stz RZ+1	/* ch = 0; */
	
	lda #1		/* upto = 1; */
	sta RZ+7
	
	lda LZ+6	/* if (argc < 2)  NOTE: could be just      */
	.(		/* cmp #2 : bpl L2                         */
	cmp #2		/* but the compiler doesn't know how far   */
	bmi skip	/* away L2 is.				   */
	brl L2
skip 	.)
	
	pea ^L4		/* fprintf(stderr,"Usage: cat FILE ...\n"); */
	pea !L4
	pea ^___stderr
	pea !___stderr
	jsr @_fprintf
	tsc
	clc
	adc #8
	tcs
	
	pea 1		/* exit(1) */
	jsr @_exit
	pla
L2:
	lda LZ+6	/* argc-- NOTE: dec LZ+6 would be better! */
	dec
	sta LZ+6
	brl L6
L5:
	pea ^L8		/* This rather large bit of code is all for */
	pea !L8		/* fopen(argv[upto++],"r");		    */
	
	lda RZ+7	/* arrays don't translate so well! */
	sta RZ+9
	lda RZ+9
	inc
	sta RZ+7
	ldx RZ+9
	lda #0
	.(
	stx mreg2
	ldy #2
	beq skip
blah 	asl mreg2
	rol
	dey
	bne blah
skip 	ldx mreg2
	.)
	clc
	tay
	txa
	adc LZ+8
	tax
	tya
	adc LZ+8+2
	sta mreg2+2
	stx mreg2
	lda [mreg2]
	tax
	ldy #2
	lda [mreg2],y
	pha
	phx
	jsr @_fopen
	tsc
	clc
	adc #8
	tcs
	stx RZ+11
	sty RZ+11+2

	ldx RZ+11	/* assign it to fp */
	lda RZ+11+2
	sta RZ+3+2
	stx RZ+3

	.(		/* if (!fp) */
	lda RZ+3
	cmp #!0
	bne made
	lda RZ+3+2
	cmp #^0
	beq skip
made 	brl L13
skip 	.)

	pea ^L11	/* perror("cat"); */
	pea !L11
	jsr @_perror
	pla
	pla

	pea 1		/* exit(1) */
	jsr @_exit
	pla

	brl L13
L12:
	pei (RZ+1)	/* putchar(ch); */
	jsr @_putchar
	pla
	stx RZ+15
	
	lda RZ+15	/* if (putchar(ch) == EOF) */
	.(
	cmp #-1
	beq skip
	brl L15
skip 	.)
	
	pea ^L11	/* perror("cat"); */
	pea !L11
	jsr @_perror
	pla
	pla
	
	pea 1		/* exit(1) */
	jsr @_exit
	pla
L15:
L13:
	pei (RZ+3+2)	/* fgetc(fp); */
	pei (RZ+3)
	jsr @_fgetc
	pla
	pla
	stx RZ+17	/* ch = fgetc(fp); */
	lda RZ+17
	sta RZ+1

	lda RZ+17	/* while ((ch = fgetc(fp)) != EOF) */
	.(
	cmp #-1
	beq skip
	brl L12
skip 	.)

	pei (RZ+3+2)	/* fclose(fp); */
	pei (RZ+3)
	jsr @_fclose
	pla
	pla
L6:
	lda LZ+6	/* while(argc--) */
	sta RZ+9
	lda RZ+9
	dec
	sta LZ+6
	lda RZ+9
	.(
	cmp #0
	beq skip
	brl L5
skip 	.)

	ldx #0		/* return from main() */
L1:
	tsc
	clc
	adc #LZ
	tcs
	pld
	rtl
	.)
	
	.text
	
-L11 	.asc "cat",0
-L8 	.asc "r",0
-L4 	.asc "Usage: cat FILE ...",10,0
	.)

As you can see, there's still quite a bit to be optimized as far as the compiler
is concerned, but the code is still quite good.

Having a C compiler and a standard C library that contains the most used
standard functions goes a long way towards being able to port applications from
UNIX and other similar environments. So what I've done is create a 65816
backend for a free ANSI C compiler called LCC.

I'm no longer talking theory here either: a little while ago I decided to
give my standard C library and the compiler a portability test, with some
great results. I've managed to do extremely simple porting jobs on Pasi's C
versions of his gunzip and puzip, Andre Fachat's XA 6502/65816 cross
assembler, and Marco Baye's ACME cross assembler. All of them, apart from
ACME, so far seem to be working exactly as they should. There must be
thousands of open source programs that could easily be ported to JOS, many of
which wouldn't be of much use to anyone, but still!

	         ---------------------------------------------
	        | Multitasking - Seeming to do it all at once.|
	         ---------------------------------------------
	     
We've all had experience with multitasking so I won't bore you too much.
For our purposes, it means being able to do several things at once.

But what actually is a "thing"? They're usually called "processes" or "tasks". I
usually call them processes, so that's what I'll refer to them as.

There are two main types of multitasking: pre-emptive and co-operative. The
latter is as you would expect: processes need to co-operate in order for the
system to work, and can't just "do their own thing". Pre-emptive multitasking
is the more flexible approach, because processes don't need to explicitly hand
the processor over to another process; it simply gets taken away from them if
they use it for too long. So it was a pretty easy choice which kind of
multitasking JOS would have: pre-emptive, of course!

You might think that the C64 already does multitasking, because programs
normally set up interrupt routines to go off while the main program runs, so
it can do more than one thing at once. But that's a very special case of what
I'm referring to here: the ability to run separate, unrelated programs at the
same time, like reading your email while typing in a text editor. We'd all
like to be able to do that, wouldn't we? Particularly if we've got the
processing power and RAM to do it, and the SuperCPU certainly does.

Each process "owns" resources. The resources I'm talking about are simply parts
of the computer and OS like RAM, interrupts, kernel IPC objects, and some other
things.

Along with the resources it owns, each process has a number of attributes. First
of all it needs a unique identifier, so anything that wants to talk to it knows
how to address it. In Unix-like systems, this is called a Process IDentification
(PID). In JOS a PID is just a positive integer, simple.

Besides letting other processes address it, the PID is used so that the OS
can keep track of which resources the process actually owns; when it exits
(or is explicitly killed), the OS can free up those things and let other
processes use them.

Processes can start other processes, so everything except the first process
keeps track of who its parent was in its Parent PID (PPID). You may wonder
what use it is to keep track of the parent. In UNIX it has always been used
to set up IPC, but it really isn't needed in JOS, apart from cosmetic
purposes, since JOS has better IPC mechanisms. That's the first example of
"Just because it's in UNIX doesn't mean it's needed", and there are plenty
of others.

In JOS, a process can own multiple "threads" of execution. Threads are what
most people picture when they think of a process: some code running.

Consider starting a C64 game, which has several different interrupt routines
running concurrently. We certainly wouldn't consider each interrupt routine to
be a separate program, and that's generally the idea behind threads, except
threads are at the mercy of the pre-emptive scheduler. Almost the same result
can be achieved by creating multiple processes, but why go to the hassle of
loading and executing two tightly related processes with 1 thread each, when
you can do the same thing with 1 process that has 2 threads? A good example
of this is JOS's very own web server, which creates new threads whenever a
new connection has been established by a client.

Some new technologies are particularly keen on the use of threads, namely Java
and the BeOS. A good example of using multiple threads is given by BeOS, which
starts a separate thread for every window displayed on the screen, so it can
update its on-screen appearance and remain responsive to the user, while also
doing other processing.

Unix programs have generally just started other processes if they wanted to do
two of their own things at once. Threads are much cleaner and nicer. Threads
themselves have their own attributes, such as priority (the higher the priority
the more processor time it's likely to get), state (whether they are running or
waiting for something), stack and zero page space, and some other things.

I know I've mentioned that JOS uses pre-emptive multitasking, but that doesn't
mean that doing:
		jmp *
is a good idea! Programs should still try and co-operate.

A typical menu program on C64 using the kernel has a structure something like
this:

1. Setup variables and interrupts
2. Set up menu
3. Check for input
4. If no input go back to 3
5. Process input

If you were to run this program on a multitasking system, it would chew up a lot
of processing time and slow everything else down. Polling for input on a
multitasking system is generally a bad thing, but blocking and waiting for input
is a good thing. So instead it would be best to do:

1. Setup variables and interrupts
2. Set up menu
3. Wait for input
4. Process input

Now this is the correct way to do it, as the program only uses up CPU time
when it has actually received some input. But what happens if every process
is waiting? What runs then? Well, there is a special process that runs when
no other process can: the Idle process, which does what its name suggests and
just sits there and idles. Here is the thread code that runs in my idle
process:

nully		jmp nully

For some reason I started calling it the Null process, and it's called that all
throughout JOS...

I have introduced you to a couple of the main ideas behind multitasking, but
wouldn't you like to know how it's done? Well here's how JOS does it..

For starters, since it's pre-emptive multitasking, JOS needs some way of
interrupting the currently running process after it has consumed its allotted
time. The C64 has 4 CIA timers capable of producing IRQs and NMIs, and in
JOS's case I've decided to let it use CIA 1 Timer A, which produces an IRQ.
This of course means that a process could stop itself from being interrupted
by doing an SEI, but if processes behave well that won't happen!

Rather than set this timer to the amount of time a process gets before it is
pre-empted (called a "timeslice"), I double up the use of Timer A as the
system counter, which is used for timing another kind of process resource:
timers. Timers can either count upwards, or count downwards and give off an
alarm. They really need a higher precision than a timeslice, so the hardware
timer is set to 20 milliseconds (about one PAL frame). The timeslice is then
calculated as 3 counts of this timer, i.e. 60 milliseconds. Why don't I use
Timer B for the system timer? Because I want to leave as many resources as
possible open for application and device driver processes.

I mentioned that processes and threads each have their own attributes; these
attributes are stored in Process Control Blocks (PCBs) and Thread Control
Blocks (TCBs).

Every process has a PCB, and every process has at least one thread, which has
its own TCB. There is one process which is always loaded, and that's the Null
process. Each process's PCB and TCBs are kept in everyone's favourite data
structure, the circular doubly-linked list. The Null PCB is always at the
head of the PCB list, and PCBs will only ever be on this one list, since they
are either alive (in the list) or dead (no PCB exists!).

Threads on the other hand can be in various states, but in particular they can
be ready for the CPU, or waiting for something (blocked). When a thread is
ready, it's just waiting for its turn at the CPU, and it goes on the Ready
list, which is a queue. The Null thread is ALWAYS at the back of this queue, so
it only gets to run if nothing else can. The ordering of this queue is up to a
part of the kernel called the Scheduler. 


	    Front				 	Back
	 ------------	       ------------	    --------------
    --	|  Thread A  |  ----- |  Thread B  | ----- |  Null Thread |  --
   |	 ------------	       ------------	    --------------     |
    -------------------------------------------------------------------
	
Some OSes have complex schedulers which take many parameters into account,
like priority and various CPU time measurements. On multi-user OSes like UNIX
this is important, because the scheduler wants to be "fair" to all processes.
But for our purposes, and on many other OSes, it's usually a whole lot simpler
than that: whichever process/thread has the highest priority gets to run. If
two threads have the same priority, it normally comes down to "round robin"
scheduling, where they just take it in turns. JOS doesn't even implement
priorities properly yet, because they don't actually make much difference to
normal processing; at the moment it's just a simple round-robin scheduler
that doesn't care about priorities.
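To make the round-robin idea concrete, here is a minimal C sketch of picking
the next thread from a ready queue. The structure and names are invented for
illustration; JOS actually keeps the Null thread parked at the back of the
ready queue, whereas here it is treated as a special case that runs when the
queue is empty, which amounts to the same thing.

#include <stddef.h>

struct tcb {
	struct tcb *next;	/* link in the ready queue */
	int id;
};

static struct tcb *front = NULL, *back = NULL;
static struct tcb null_thread = { NULL, 0 };	/* the "do nothing" thread */

static void enqueue(struct tcb *t)
{
	t->next = NULL;
	if (back) back->next = t; else front = t;
	back = t;
}

/* Round robin: whoever is at the front runs next.  A pre-empted thread that
   is still runnable simply goes to the back of the line, so threads of equal
   priority take it in turns.  If nothing is ready, the Null thread runs. */
static struct tcb *pick_next(struct tcb *preempted)
{
	struct tcb *t;

	if (preempted && preempted != &null_thread)
		enqueue(preempted);

	t = front;
	if (t == NULL)
		return &null_thread;	/* nothing else can run */

	front = t->next;
	if (front == NULL)
		back = NULL;
	return t;
}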

What if a thread is blocked? It'll go onto a wait queue, and will return to the
ready queue only when it's ready to run. At this stage of JOS, the only thing a
thread will need to block for is IPC.

You may be wondering about the issue of relocatable code; as we all know,
neither the 6502 nor the 65816 is designed for running relocatable code.
Sure, branches are relative to the PC, but nothing else is. So everything
needs to be physically relocated before executing, and to do this properly,
without needing to code in a specific way, a relocatable binary format is
needed. Fortunately for me, Andre Fachat had already designed such a format
for OS/A65, and it fits JOS nicely because it includes 65816 extensions. Of
course you need a special assembler to output this file format, which is
where XA comes in. XA itself has even been compiled for JOS, so self-hosted
development is now possible. The binary format will be discussed in greater
detail in a future article.

Well, it's all very fine having a bunch of processes running, but that's no
operating system.. Who's looking after the devices? Who's managing the memory?
And how do we ask the drivers to do something for us? It's all IPC...

	    --------------------------------------------------
	   | Inter Process Communication - Let's get talking! |
	    --------------------------------------------------

Before I get into the specifics of IPC, I should give an idea of what typically
happens when JOS boots. Because JOS has a very scalable microkernel design, it
can load as many different device drivers and applications at boot time as it
wants, and in fact they can be loaded and removed at any time at all. So there
is no one
bootup procedure in JOS. There are certain things that happen every time,
however.

For starters, JOS has 2 system processes, which are always started at bootup.
They aren't actually loaded off disk because they are part of the microkernel
code. One is the memory manager and the other is the process manager. 

The memory manager, as you would expect, manages all the memory, but it
doesn't manage the process space memory (Bank 0); that's the job of the
microkernel. Process space memory (or kernel memory) is where all the PCBs,
TCBs, stack space and direct page space are located. The memory manager
manages all the other RAM, i.e. RAM in Bank 1 and above, although if there is
no SuperRAM it allocates 00e000-010000 as system RAM instead of using it as
kernel space RAM, since you're more likely to run out of system RAM than
kernel RAM.

I won't go into the specifics of the Memory Manager just yet; I'll just tell
you that it handles the following requests:

Allocate any size block of RAM.
Free RAM.
Allocate any size block of Bank Aligned RAM. (Needed in some cases).
Reallocate RAM.
See how much RAM is left.
See what the largest block is.

All these things are requested via IPC, but there are Shared library routines
(such as malloc, free, realloc etc) for preparing the right IPC messages to
send.

The process manager's main functions are loading new processes + shared
libraries, and looking up device drivers & file-systems. Whenever you open a
file, you first must send a message to the process manager asking it where to
send the open message.

The very first process to start, however, is the "init" process (it's
actually built into the microkernel; "init" isn't a filename). It starts the
2 system processes, then starts a simple Ramdisk process and loads another
process, called "initp", from the ramdisk.

The "initp" process should then load a proper filesystem and disk device driver
also from the ramdisk, and "mount" this filesystem and executes another file
this time called "init". 

Note that "mounting" is preparing a filesystem for use, and all filesystems
should actually be "unmounted" before switching off, because all changes may not
be actually written to disk yet, even though the applications think they are.
I'm guessing this is why Macintoshes refuse to let you take a disk out without
the OSes permission!

The "init" file will usually be a shell script, and is responsible for starting
up most of the drivers. A shell script, if you've never heard of it, is a file
that has lists of commands to be run by the system, or more specifically the
shell program. If you've ever seen MS-DOS .bat files, you'll know what I mean.

A typical init script has to load a user interface, unless of course you're
using your machine as some kind of server, in which case you wouldn't need
one and could save yourself a bit of memory!

The text-based interface would require the console driver (con.drv) and the
shell (sh). The console driver is capable of 4 virtual consoles, which you
can switch between by pressing CBM and 1-4. This lets you exploit
multitasking, as you could be running a different text app on each of the 4
screens. The shell is a pretty basic shell at the moment (like DOS's
command.com), but it's enough to let you load and run any program. It also
has support for pipes, but now I'm off topic..

The init script could instead load the GUI, which I'm sure most people would
prefer to a text based interface!

The script also should load other drivers like: tcp/ip, ppp, digi sound driver,
other filesystems, modem drivers etc... Everything is of course optional, which
is where microkernels really excel over their monolithic counterparts.

Well that's what happens at boot time, but how do the drivers and the
applications communicate? I've been mentioning "messages", and that's all that
JOS's IPC is: message passing. Message passing is a fast and effective way to do
IPC, and for a microkernel this is essential. I chose message passing because
it's the most flexible method, and you can actually implement other types of IPC
by using message passing.

You can think of message passing as an extended subroutine call, but rather
than being a call to a subroutine, it's a call to another process. A process,
or in particular a thread, can "send" a message to another thread; the other
thread "receives" it, and after it has processed it, "replies" to it.

You can't just send a message and expect it to be received straight away; the
receiver has to be ready to receive it, which may not be straight away. If the
receiver isn't ready, the thread that sent the message will block and wait until
it's ready. Once the receiver has received it, it processes the message, and
will issue a reply, which then unblocks the sender, which can then continue
processing. This type of message passing is called "synchronous" message
passing, as it requires synchronization between the two threads. It may help to
think of "sending" as doing a JSR, "receiving" as the Program Counter being
transferred to the routine, and "replying" as executing an RTS. It's a little
more complicated than that, but essentially that's what it's like.

There is a great description of this kind of IPC at http://www.qnx.com/ in their
technical section, with diagrams and all -- highly recommended!

Normally an OS has to copy messages between processes, because each process
gets its own address space and can't view the memory of other processes. But
as we know, the 65816 doesn't have an MMU, so all memory is shared, which
means that messages don't need to be copied -- a significant speed increase
over message passing in OSes with MMUs. Of course it does mean that processes
can accidentally screw up another process's memory, but who cares! :)

All messages in JOS are directed at Channels. Channels are a resource that
allow threads to receive messages from other threads. Generally device
drivers register a channel and use it to receive requests from applications.
Channels are referred to by number; the only channels that have fixed numbers
are the memory manager (0) and the process manager (1). All other channels
are looked up by sending a message to the process manager's channel, i.e.
channel 1.

What exactly is a message? All the JOS system calls for IPC just deal with 24
bit pointers to messages, and the actual message data itself can be anything!
However the first byte of the message should be the message code, and always is
in JOS system messages. You could of course make your own protocol for your own
IPC, but it's probably not a good idea.

Each different kind of driver has its own set of message codes..

#define PROCMSG	$80
#define MEMMSG	$40

#define MMSG_Alloc	0+MEMMSG
#define MMSG_AllocBA	1+MEMMSG
#define MMSG_Free	2+MEMMSG
#define MMSG_Left	3+MEMMSG
#define MMSG_Large	4+MEMMSG
#define MMSG_LeftK	5+MEMMSG
#define MMSG_LargeK	6+MEMMSG
#define MMSG_KillMem	7+MEMMSG
#define MMSG_Realloc	8+MEMMSG

#define PMSG_Spawn	PROCMSG+0
#define PMSG_AddName	PROCMSG+1
#define PMSG_ParseFind	PROCMSG+2
#define PMSG_FindName	PROCMSG+3
#define PMSG_QueryName	PROCMSG+4
#define PMSG_Alarm	PROCMSG+5
#define PMSG_KillChan	PROCMSG+6
#define PMSG_WaitPID	PROCMSG+7

Those are the messages defined for the Process manager and Memory manager.
Each message code defines its own structure; for example, the MMSG_Alloc
message has the structure:

	.word MMSG_Alloc
	.word !Size
	.byte ^Size,0
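
For readers more comfortable with C, those same six bytes could be sketched
as a struct. This is only an illustration of the layout above, not an
official JOS header, and it assumes the compiler adds no padding:

/* Byte-for-byte C view of the MMSG_Alloc message shown above. */
struct mmsg_alloc {
	unsigned short code;	/* MMSG_Alloc			  */
	unsigned short size_lo;	/* low word of the size  (!Size)  */
	unsigned char  size_hi;	/* bank byte of the size (^Size)  */
	unsigned char  pad;	/* the trailing 0		  */
};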

The message codes $e0-$ff are left for processes that want their threads to
communicate with each other.

Anything that wants to receive messages needs to have some code like this:

		
		jsr @S_makeChan		; make a channel System call
		sta Chan		; save it
		
loop		lda Chan		; 
		jsr @S_recv		; receive a message from channel
		stx MsgP		; Save X/Y in MsgP
		sty MsgP+2		; MsgP is a zero page variable
		sta RcvID		; Save RcvID - for replying
		lda [MsgP]
		and #$ff		; 8 bit message code
		cmp #MSGCODE		; check which type
		beq processMes		; and process it
		cmp #MSGCODE2
		beq processMes2
		...
		ldx #-1			; replying with $ffff in X and Y
		txy			; means "message not understood"
		lda RcvID
		jsr @S_reply		; reply and loop back for more messages
		bra loop
		
All device drivers have a message loop like that, which forces them to be
modular, and thus easier to code.
		
Ok now let's see what sending a message would look like:

		lda #PROC_CHAN
		ldx #!Message
		ldy #^Message
		jsr @S_sendChan		; Send the message
		...
		
Message		.word PMSG_WaitPID,2	; Wait for PID 2 to finish.

*note: it's generally a good idea to put messages on the stack, rather than use
global variables, since using the stack is thread safe. No other thread will
accidentally wipe over the message because they each have their own stack.

Just about everything that you consider an OS to be is done in JOS via IPC.
This includes file operations, such as opening and closing, reading and writing
files. How does the filesystem driver know which file you want to access after
you've opened it? It could include a connection number in the IO_READ and
IO_WRITE messages (you guessed it, the message codes for reading and writing!).
That's a little cumbersome, though. There is a better solution: connections.

What is a connection? It's a kernel object which keeps track of the
destination channel of the messages directed at it. It also has an ID
associated with it, so server processes can tell which file, for example, it
refers to. Each process has a so-called "file descriptor list" associated
with it; people who know much about UNIX programming will know about this. In
JOS, this table is really just a connection table: an array of connection
numbers which the process can access. Each element in the array can point to
any connection number, which means that two file descriptors can actually
point to the same file -- and in the case of the first three they usually do.
The first three are STDIN, STDOUT & STDERR, and they usually point to the
screen, but not always!

An example File Descriptor list: (0 = no connection)

   0      1      2     3     4     5     6     7     8     9 .... 32
 ------------------------------------------------------------------
|  1  |   1   |  1  |  2  |  3  |  0  |  0  |  0  |  0  |  0  .... |
 ------------------------------------------------------------------

E.G.
Connection 1 is connected to the /dev/con/1 device (the screen). Thus STDIN,
STDOUT and STDERR all point to this.
Connection 2 is connected to a file "/blah.txt" which is on the 1541 filesystem.
Connection 3 is a tcpip connection to altavista.com.

Connections are global objects, and whenever a process is loaded, it inherits
its file descriptor table from the parent, which is how it receives its STDIN,
STDOUT and STDERR. File descriptors can also be explicitly redirected to other
connections, or just not inherited at all. This is how JOS performs shell
redirection.
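
As a rough sketch of the idea (not JOS's actual data structures), the
per-process file descriptor table is little more than this, and redirection
is just writing a different connection number into a slot:

#define MAX_FDS 32

/* Per-process file descriptor table: each slot holds a connection number,
   0 meaning "no connection".  Slots 0-2 are STDIN, STDOUT and STDERR. */
struct fdtable {
	unsigned char conn[MAX_FDS];
};

/* A child inherits the parent's table when it is loaded... */
void fd_inherit(struct fdtable *child, const struct fdtable *parent)
{
	*child = *parent;
}

/* ...and shell redirection just points a descriptor somewhere else,
   e.g. fd_redirect(&t, 1, file_conn) to send STDOUT to a file. */
void fd_redirect(struct fdtable *t, int fd, unsigned char conn)
{
	t->conn[fd] = conn;
}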

I've discussed JOS's synchronous message passing, but what happens if you don't
want to block and wait for a reply? You might just want to notify a server that
an event has occurred, and don't need to know if it received it, nor what it
thinks about it. 

In this case you can send a pulse. A pulse is a tiny message (just 4 bytes),
which doesn't require a reply. Probably the best property of pulses is that
they can be sent during an interrupt. A good example of doing this is the
console driver, which implements virtual consoles. The console driver starts an
interrupt routine which scans the keyboard and checks for CBM key plus 1-4 and
then sends a pulse message to its channel telling it to switch consoles.

By now you might be thinking "Microkernels must be real slow with all that
process switching", but the switching code is pretty fast, particularly at
20MHz. There isn't as much switching as you would expect either, considering
that IO_READ and IO_WRITE messages deal with buffers as large as 64k, so it's
not as if every single character requires a switch.

	    --------------------------------------------------
	   |     Device Independence - Everything's a file!   |
	    --------------------------------------------------

One of the first things that people learning UNIX have to get used to is that
practically everything is a file. Devices such as the keyboard and screen
(the console) are accessed as files. Why, you may ask? Well, there isn't one
compelling reason, but it's handy to be able to access the console as a file,
especially for debugging. Take, for example, the ability to redirect screen
output to files: a program doesn't have to be explicitly designed for that if
everything, including the console, is a file -- it's just a simple matter of
changing the output file.

Not only are devices files, but filesystems can be "mounted" on any directory,
which gets rid of the need for device numbers. Navigating through different
filesystems is just a simple matter of changing directories. It also means that
applications don't concern themselves with what the actual filesystem and device
is, just that it's there. So applications will work with any devices that have
drivers.

OK, so now you know some of the reasons behind "everything's a file", so how
is it done in JOS? I mentioned that the process manager is in charge of "looking
up" channels, but how does it perform this lookup?

The process manager contains a table with entries for file-systems, devices and
special processes. File-systems are names that end in a '/', device files
usually start with '/dev/' and special processes start with '*'. So the table
may look something like this:

Name		Channel		Unit
/		2		1		; file-system mounted at /
/usr/		2		2		; file-system mounted at /usr/
*digi		3		0		; digi driver
*tcpip		4		0		; tcp/ip
/net/		4		1		; tcp connections
*cbmfsys	5		0		; the cbm file-system
*packet		6		0		; the packet driver (ppp/slip)
/dev/null	1		0		; the process manager handles 
						; this
						
The name and channel fields are self-explanatory, but the Unit field allows a
channel to determine which of its names was used.

Whenever the process manager receives a request to look something up, it will
prepend the process's Current Working Directory to the filename (unless the
name starts with a '/'; special process requests skip this step entirely),
and then parse the name for '.' and '..' directories, which alter the string.

So, for example, if you ask for the file "./hello/./../afile.txt" and your
CWD is "/usr/files/", it would be parsed as:

"/usr/files/afile.txt"

This string is then compared against the table to find the longest full
match; in this case it would find "/usr/", and return channel 2, unit 2, plus
the string "files/afile.txt", which is what's left over after subtracting the
matched prefix.

The great thing about this whole "pathname space" approach is that processes
don't necessarily need to know what they're dealing with, and pieces of the OS
can be loaded and unloaded at will for the ultimate in scalability and
modularity.

You might think that setting up the request and dealing with the responses
every time you want to open a file is a bit tiresome, but it's all handled for
you with the "open" library call.

		pea O_READ
		pea ^devcon1
		pea !devcon1
		jsr @_open	; returns file number in x or -1 on failure
		pla
		pla
		pla

		...
		
devcon1		.asc "/dev/con/1",0

That's all for now. In the next article, I'll be writing about process
loading + shared libraries, networking, terminal IO (console + modems) + some
other things...

Hopefully you will have learned something from this article, and can see the
power that a real multitasking OS, such as JOS, can bring to the SuperCPU. 

Any feedback goes to jmaginni@postoffice.utas.edu.au -- I'm particularly on
the lookout for people who can help with hardware: docs, code, etc...
Also, check the JOS homepage at http://www.jolz64.cjb.net/ and join the JOS
mailing list if you're interested in updates.
.......
....
..
.                                    C=H 19

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

VIC KERNAL Disassembly Project - Part III
Richard Cini
September 1, 1999

Introduction
============

	In the last installment of this series, we examined the two remaining
hard-coded processor interrupt vectors, the IRQ and NMI vectors. Although we
took a complete look at the routines, we did not examine some of the
subroutines that IRQ and NMI call. We'll examine these routines first.
	Having completed the main processor vectors, we'll continue this
series by examining other Kernal routines.

Remaining Subroutines
=====================

	The NMI and IRQ routines together call 11 subroutines, five of which
we previously examined in Part I of this series, and two call the NMI vectors
in the BASIC ROM and A0 Option ROM. So, let's examine the four remaining
subroutines.

UDTIM/IUDTIM
------------

	The IRQ vector calls the update time function UDTIM through the
jump table at the end of the Kernal ROM, while the NMI function skips the
intermediate call through the jump table and directly calls the time function.

UDTIM:
FFEA 4C 34 F7    JMP IUDTIM			;$F734

F734   ;==========================================================
F734   ; IUDTIM - Update Jiffy Clock (internal)	
F734   ;	Called by IRQ; no params; no return
F734   ;				
F734          IUDTIM				
F734 A2 00    		LDX #$00			
F736 E6 A2     		INC CTIMR2	;bump timer tick
F738 D0 06     		BNE UDTIM1	;not 0, move on (no roll)         
F73A E6 A1     		INC CTIMR1	;rolled-over, INC next reg	
F73C D0 02     		BNE UDTIM1	;not 0, move on (no roll)          
F73E E6 A0     		INC CTIMR0	;rolled-over, INC next reg	
F740             
F740          UDTIM1			;done updating registers, 
F740					; check for 24hr roll
F740					; A0-A2 hold max of 4F1A00
F740 38          	SEC		;set carry
F741 A5 A2       	LDA CTIMR2	; get LSB
F743 E9 01       	SBC #$01	; minus 1
F745 A5 A1       	LDA CTIMR1	;
F747 E9 1A       	SBC #$1A	; minus 1Ah
F749 A5 A0       	LDA CTIMR0	;	
F74B E9 4F       	SBC #$4F	; minus 4Fh
F74D 90 06       	BCC UDTIM2	; ok
F74F             
F74F 86 A0       	STX CTIMR0	;24-hr roll-over, so reset
F751 86 A1       	STX CTIMR1	; registers to zero
F753 86 A2       	STX CTIMR2			
F755             
F755             UDTIM2			;no 24-hr rollover-continue
F755 AD 2F 91    	LDA D2ORAH	;check for STOP key
F758 CD 2F 91    	CMP D2ORAH			
F75B D0 F8       	BNE UDTIM2	;not same, check again
F75D             
F75D 85 91       	STA STKEY	;same, save status and exit
F75F 60          	RTS

	UDTIM is called every 1/60th of a second by the IRQ routine, and
begins execution by incrementing the low byte of the jiffy count at Zero Page
location $A2; whenever a register rolls over to zero, the next register up
($A1, then $A0) is incremented in turn. Taken together, the three consecutive
memory locations make up the "jiffy clock" (as the VIC's RTC is sometimes
referred to; a "jiffy" being 1/60 of one second).

	At the label UDTIM1, the code checks for a 24hr roll-over. The three
byte-sized registers (no pun intended) can store the 24-hour jiffy count
of 5,184,000 decimal, or 4F1A00 hex. If the count exceeds this value, the
registers are reset to zero.

	The BASIC TI function accesses the jiffy clock, representing the
count as a decimal number. Similarly, the TI$ function represents the jiffy
clock as a 24-hour HH:MM:SS clock instead of a jiffy count.
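
	For illustration, here is a small C sketch of the arithmetic TI$
effectively performs, turning a jiffy count into an HHMMSS string (just the
conversion -- this is not the BASIC ROM code):

#include <stdio.h>

/* 60 jiffies per second, so 24 hours = 24*60*60*60 = 5,184,000 = $4F1A00. */
static void jiffies_to_hms(unsigned long jiffies, char out[7])
{
	unsigned long secs = jiffies / 60;
	sprintf(out, "%02lu%02lu%02lu",
		secs / 3600, (secs / 60) % 60, secs % 60);
}

int main(void)
{
	char ti[7];
	jiffies_to_hms(0x4F1A00UL - 1, ti);	/* one jiffy before rollover */
	printf("TI$ = %s\n", ti);		/* prints 235959 */
	return 0;
}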

	UDTIM is also responsible for processing the STOP key on behalf of
the IRQ and NMI routines, so if a user program handles either of these
interrupts, the programmer must remember to call UDTIM in order to maintain
the time clock and STOP key functionality.

CCOLRAM
-------

	This short routine is responsible for determining the location of
the color ram. In the VIC, the screen and color memory locations change based
on the amount of RAM installed, as follows:

	Function	Unexpanded		Expanded
	--------	----------		--------
	User BASIC	$1000 00010000		$1200 00010010
	Screen Memory	$1E00 00011110		$1000 00010000
	Color RAM	$9600 10010110		$9400 10010100

	The two least significant bits of the most-significant byte of each
of the screen memory and color RAM pointer registers define the resulting
location. If the bit pattern of the screen memory is "10", the code sets
the color RAM base to page $96. If the bit pattern is "00", the code sets
the color RAM base to page $94.

	The two other possible bit patterns result from screen memory
beginning at $1100 or $1F00, and produce color RAM locations of $9500
and $9700, respectively. The $1100 starting location will actually work,
but result in 256 bytes of wasted user RAM. The $1F00 starting location
will not work since the color RAM locations overlap the I/O Block 2
addresses, which have no RAM associated with them.

EAB2   ;==========================================================
EAB2   ; CCOLRAM - Calculate pointer to color RAM		
EAB2   ;				
EAB2        CCOLRAM				
EAB2 A5 D1       	LDA LINPTR	;get ptr to screen RAM LSB	
EAB4 85 F3       	STA COLRPT	;save it as color LSB	
EAB6 A5 D2       	LDA LINPTR+1	;get screen RAM MSB	
EAB8 29 03       	AND #%00000011	;mask bits 0-1
EABA 09 94       	ORA #%10010100	;OR with $94 to get color
EABA					; RAM pointer
EABC 85 F4       	STA COLRPT+1	;save as color ptr MSB	
EABE 60          	RTS		;exit	
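
	In C-like terms, CCOLRAM boils down to the following (a paraphrase of
the three instructions above, not anything that appears in the ROM):

/* Derive the color RAM pointer from the screen RAM pointer the way CCOLRAM
   does: keep the low byte, then build the high byte by taking the two low
   bits of the screen page and OR-ing them into page $94. */
unsigned int color_ram_ptr(unsigned int screen_ptr)
{
	unsigned char lsb = screen_ptr & 0xFF;
	unsigned char msb = ((screen_ptr >> 8) & 0x03) | 0x94;

	return ((unsigned int)msb << 8) | lsb;
}

With screen memory at $1E00 this returns $9600, and with screen memory at
$1000 it returns $9400, matching the table above.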


ISCNKY
======

	This is the low-level keyboard scan function which is called 
60 times per second by the IRQ routine. ISCNKY scans the keyboard matrix
to retrieve a keypress, maps the key number to its ASCII equivalent, and
places the ASCII value at the end of the keyboard buffer. If IRQs are
disabled, the keyboard scanning is suspended. ISCNKY is accessible to user
programs through the Kernal jump table, although calling it with interrupts
enabled is not recommended.

To retrieve a character from the keyboard, a user program would typically
call GETIN ($FFE4), the buffered keyboard input routine. GETIN returns
the ASCII value of the character at the head of the keyboard buffer, or
zero if no character is available. 

VIA2 is directly connected to the keyboard. Port B is used as the column
strobe and Port A is used as the row input. To read the keyboard matrix,
the code brings all column strobe lines to 0 and reads the row inputs, in
order, until a key is found (or not found). The code also begins decoding
the ASCII using the "unshifted" decoding table. Three other decoding tables
are for shifted, C= (Commodore) keys, and shift+C= keys.

EB1E   ;===========================================================
EB1E   ; ISCNKY - Scan keyboard				
EB1E   ;	Scans keyboard for character. Called by IRQ routine. 
EB1E   ;  ASCII value placed in keyboard buffer.			
EB1E             ISCNKY				
EB1E A9 00       	LDA #$00	; set shft/ctrl flag to 0
EB20 8D 8D 02    	STA SHFTFL			

EB23 A0 40       	LDY #$40	; assume no keys pressed
EB25 84 CB       	STY KEYDN	;  ($40=no keys)

EB27 8D 20 91    	STA D2ORB	; bring all column bits low
EB2A AE 21 91    	LDX D2ORA	; read row inputs
EB2D E0 FF       	CPX #$FF	; any character keys pressed?
EB2F F0 5E       	BEQ PROCK1A	; no, exit
             
EB31 A9 FE       	LDA #%11111110	; begin testing at COL 0
EB33 8D 20 91    	STA D2ORB	; output bit pattern
	
EB36 A0 00       	LDY #$00	; zero character count reg

					; set default translation 
					; table to Table 1
EB38 A9 EA       	LDA #$EA 	;FIXUP2+2;#$5E
EB3A 85 F5       	STA KEYTAB			
EB3C A9 EA       	LDA #$EA 	;FIXUP2+3;#$EC
EB3E 85 F6       	STA KEYTAB+1			
EB40             
EB40             ISCKLP1		; begin testing loop
EB40 A2 08       	LDX #$08	; 8 rows to test in column
EB42 AD 21 91    	LDA D2ORA	; get column
EB45 CD 21 91    	CMP D2ORA	; test again - debounce
EB48 D0 F6       	BNE ISCKLP1	; not equal, retry
EB4A             
EB4A             ISCKLP2		; got bit pattern
EB4A 4A          	LSR A		; shift through carry flag
EB4B B0 16       	BCS ISCNK1+3	; CY=1 for key not pressed
EB4D             
EB4D 48          	PHA		; save column bit pattern
EB4E B1 F5       	LDA (KEYTAB),Y	; .Y is index into ASCII
EB4E					;  translation table
EB50 C9 05       	CMP #$05	; ASCII > 5, move on
EB52 B0 0C       	BCS ISCNK1	;  (<5=shft, C=, STOP, CTRL)
EB54             
EB54 C9 03       	CMP #$03	; ASCII=3 STOP key
EB56 F0 08       	BEQ ISCNK1	; got STOP so skip flag updt 
EB58             
EB58 0D 8D 02    	ORA SHFTFL	; save SHFT, CTRL, C= flag 
EB5B 8D 8D 02    	STA SHFTFL			
EB5E 10 02       	BPL ISCNK1+2	; move on to next row in col
EB60             
EB60             ISCNK1				
EB60 84 CB       	STY KEYDN	; save key#
EB62 68          	PLA		; restore col bit pattern
EB63 C8          	INY		; increment key count
EB64 C0 41       	CPY #$41	; 64 keys scanned?
EB66 B0 09       	BCS ISCNEXIT	; yes, return ASCII value
EB68             
EB68 CA          	DEX		; go on to next row in col
EB69 D0 DF       	BNE ISCKLP2	;  {loop}
EB6B             
EB6B 38          	SEC		; done with first column, so
EB6C 2E 20 91    	ROL D2ORB	;   move on to next column
EB6F D0 CF       	BNE ISCKLP1	;  {loop}
EB71             
EB71             ISCNEXIT		; function evaluation vector
EB71 6C 8F 02    	JMP (FCEVAL)	; CINT1A points this to SHEVAL
EB71					; the shift evaluation code
EB74             ;				
EB74             ; Process key image 				
EB74             ;				
EB74             PROCKY				
EB74 A4 CB       	LDY KEYDN	; get key number (as index)
EB76 B1 F5       	LDA (KEYTAB),Y	; convert key# to ASCII code
EB78 AA          	TAX		; copy ASCII code to .X
EB79 C4 C5       	CPY CURKEY	; is it the same as the
					;  current character?
EB7B F0 07       	BEQ PROCK1	; yes, do repeat eval
EB7D             
EB7D A0 10       	LDY #$10	; set repeat delay 
EB7F 8C 8C 02    	STY KRPTDL		
EB82 D0 36       	BNE PROCK4	; not same key, so exit
EB84             
EB84             PROCK1				
EB84 29 7F       	AND #%01111111	; test for {REVERSE}
EB86 2C 8A 02    	BIT KEYRPT	; do test
EB89 30 16       	BMI PROCK2	;  BIT7 set? reverse only
EB8B 70 49       	BVS PROCK5	;  BIT6 set? alpha or reverse
EB8D             
EB8D C9 7F       	CMP #$7F	; last non-revs'd character
EB8F             
EB8F             PROCK1A				
EB8F F0 29       	BEQ PROCK4
EB91             
EB91 C9 14       	CMP #$14	; {DEL}?		
EB93 F0 0C       	BEQ PROCK2	;  process {DELETE}/INS
EB95             
EB95 C9 20       	CMP #$20	; {SPACE}?
EB97 F0 08       	BEQ PROCK2	;  process {SPACE}
EB99             
EB99 C9 1D       	CMP #$1D	; {<-}?
EB9B F0 04       	BEQ PROCK2	;  process cursor right/L 
EB9D             
EB9D C9 11       	CMP #$11	; {CRS DN}?
EB9F D0 35       	BNE PROCK5	;  process cursor down/U
EBA1             
EBA1             PROCK2				
EBA1 AC 8C 02    	LDY KRPTDL	; get repeat delay
EBA4 F0 05       	BEQ PROCK3	;  if 0, check repeat speed
EBA6             
EBA6 CE 8C 02    	DEC KRPTDL	; not done delaying, so exit
EBA9 D0 2B       	BNE PROCK5	;  {exit}
EBAB             
EBAB             PROCK3				
EBAB CE 8B 02    	DEC KRPTSP	; decrement repeat speed cnt
EBAE D0 26       	BNE PROCK5	; not done delaying, so exit
EBB0             
EBB0 A0 04       	LDY #$04	; delay speed cnt reached 0,
					;  so reset speed count
EBB2 8C 8B 02    	STY KRPTSP	; save it
EBB5 A4 C6       	LDY KEYCNT	; get count of keys in kbd
					;  buffer
EBB7 88          	DEY		; at least one, so exit
EBB8 10 1C       	BPL PROCK5	;  {exit}
EBBA             
EBBA             PROCK4				
EBBA A4 CB       	LDY KEYDN	; get current key number
EBBC 84 C5       	STY CURKEY	; re-save as current	
EBBE AC 8D 02    	LDY SHFTFL	; get current shift pattern
EBC1 8C 8E 02    	STY LSSHFT	; save as last shft pattern 
EBC4 E0 FF       	CPX #$FF	; re-check for any keys down
EBC6 F0 0E       	BEQ PROCK5	; none, so exit
EBC8             
EBC8 8A          	TXA		; restore ASCII code to .A
EBC9 A6 C6       	LDX KEYCNT	; get count of keys in buffer
EBCB EC 89 02    	CPX KBMAXL	; more than maximum allowed?
EBCE B0 06       	BCS PROCK5	; yes, drop current key press
EBD0             
EBD0 9D 77 02    	STA KBUFFR,X	; save ASCII code in buffer
EBD3 E8          	INX		; increment buffer count and
EBD4 86 C6       	STX KEYCNT	;   save it
EBD6             
EBD6             PROCK5				
EBD6 A9 F7       	LDA #$F7	; clear bit for COL3 (STOP key
EBD8 8D 20 91    	STA D2ORB	; is in COL3); save it to VIA
EBDB 60          	RTS		; exit routine


	Part of the keyboard scanning includes evaluating whether or not
modifier keys are pressed. Modifier keys include the SHIFT, Commodore,
and CTRL keys. The ASCII decoding table is changed based on whether or not
one of these keys is pressed. It also looks like the following code went
through several revisions considering the multiple patch areas (filled with 
NOPs). Alternatively, these areas could support alternate decoding schemes
for different languages.

EBDC             ;				
EBDC             ; Evaluate for shift/CTRL/Commodore keys
EBDC             ;				
EBDC             SHEVAL				
EBDC AD 8D 02    	LDA SHFTFL	; 1=SHFT; 2=C> 4=CTRL
EBDF C9 03       	CMP #$03	; C> + shft?
EBE1 D0 2C       	BNE PROCK6A	; no, select proper decode
EBE3             			;  table
EBE3 CD 8E 02    	CMP LSSHFT	; is the pattern the same as
EBE6 F0 EE       	BEQ PROCK5	; last one? Yes, exit.
EBE8             
EBE8 AD 91 02    	LDA SHMODE	; different pattern
EBEB 30 56       	BMI PROCKEX	;  {exit}
EBED             
EBED EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
EBF3 EAEA
EBF5 EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
EBFB EAEA
EBFD EA EA EA    	.db $ea, $ea, $ea			
EC00             
EC00 AD 05 90    	LDA VRSTRT	; get char ROM address
EC03 49 02       	EOR #%00000010	; flip between L/C and U/C
EC05 8D 05 90    	STA VRSTRT	;  ROMs
EC08             
EC08 EA EA EA EA 	.db $ea, $ea, $ea, $ea			
EC0C             
EC0C             PROCK6			; proper ROM is set, so go 
EC0C 4C 43 EC    	JMP PROCKEX	;  on with key image process 
EC0F             
EC0F             PROCK6A		; define correct decode table
EC0F 0A          	ASL A		; multiply index by 2
EC10 C9 08       	CMP #$08	; >= 8 (5 entries)?
EC12 90 04       	BCC $+6		; no, continue
EC14             
EC14 A9 06       	LDA #$06	; yes, assume CTRL table
EC16             
EC16 EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
EC1C EAEA
EC1E EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
EC24 EAEA
EC26 EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
EC2C EAEA
EC2E EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
EC34 EAEA
EC36 EA EA       	.db $ea, $ea
EC38             
EC38 AA          	TAX		; reset pointer to point
EC39 BD 46 EC    	LDA KDECOD,X	;  at right decoding table
EC3C 85 F5       	STA KEYTAB	;  .A is table index
EC3E BD 47 EC    	LDA KDECOD+1,X
EC41 85 F6       	STA KEYTAB+1
EC43             
EC43             PROCKEX	
EC43 4C 74 EB    	JMP PROCKY	; continue processing image	
EC46

EC46		;========================================================
EC46		; KDECOD - Pointers to keyboard decode tables	
EC46		;
EC46             KDECOD				
EC46 5E EC       	.dw KDECD1		;$EC5E Unshifted
EC48 9F EC       	.dw KDECD2		;$EC9F Shifted
EC4A E0 EC       	.dw KDECD3		;$ECE0 Commodore
EC4C A3 ED       	.dw KDECD5		;$EDA3 Control
EC4E 5E EC       	.dw KDECD1		;$EC5E Unshifted
EC50 9F EC       	.dw KDECD2		;$EC9F Shifted
EC52 69 ED       	.dw KDECD4		;$ED69 Decode
EC54 A3 ED       	.dw KDECD5		;$EDA3 Control
EC56 21 ED       	.dw GRTXTF		;$ED21 Graphics/text control
EC58 69 ED       	.dw KDECD4		;$ED69 Decode
EC5A 69 ED       	.dw KDECD4		;$ED69 Decode
EC5C A3 ED       	.dw KDECD5		;$EDA3 Control


	Now, let's look at a few very simple routines just so that we can
check them off of the list:

IIOBASE
=======

	IIOBASE is the internal label behind the Kernal IOBASE function.
Calling IOBASE results in code execution being transferred to IIOBASE:

IOBASE:
FFF3 4C 00 E5    	JMP IIOBASE		;$E500 IOBASE

	IOBASE returns the address of the beginning of the I/O region of
the VIC memory map in the .X and .Y registers. Locations $9110 to $912F are
the addresses reserved for the VIC's two 6522 VIAs. This is the first routine
in the Kernal ROM.

	The value of this function in the VIC is questionable since there
is no way to change the address at which the VIAs appear, and interestingly,
the Kernal code does not call IOBASE at all. The Kernal instead relies on
hard-coded addresses.

	However, one could conclude that the actual location of the VIAs
in the VIC's address space changed during the Kernal development process,
so IOBASE was somehow used to normalize the address. This also enabled code
portability between the VIC and the C64.

	The BASIC ROM appears to call IOBASE in the RND function. The
existence of other calls is unknown at this time since the BASIC ROM has
yet to be disassembled.

E500	;==========================================================
E500	; IIOBASE - Return I/O base address				
E500	;	Returns the IO Base address in .X(LSB) and .Y(MSB)
E500           IIOBASE				
E500 A2 10       	LDX #$10	;return $9110 as IO Base
E502 A0 91       	LDY #$91			
E504 60          	RTS			


ISCREN
======

	ISCREN is the internal label behind the Kernal SCREEN function.
Calling SCREEN results in code execution being transferred to ISCREN:

SCREEN:
FFED 4C 05 E5    	JMP ISCREN	;$E505 SCREEN

E505 ;==========================================================
E505 ; ISCREN - Return screen organization				
E505 ;	Returns the screen organization .X(columns) and .Y(rows)
E505 ;				
E505           ISCREN				
E505 A2 16       	LDX #$16       ;return 22 cols x 23 rows
E507 A0 17       	LDY #$17			
E509 60          	RTS	

	This code returns the row and column organization of the screen in
the .X and .Y registers. It doesn't appear that the Kernal calls this
function to determine the screen size, instead relying on hard-coded 
values under the assumption that the screen is 22x23. So, this function's 
utility appears to be purely for the benefit of user code.

IPLOT
=====

	IPLOT is the internal label behind the Kernal PLOT function. 
Calling PLOT results in code execution being transferred to IPLOT:

PLOT:
FFF0 4C 0A E5    	JMP IPLOT		;$E50A

E50A	;===============================================================
E50A	; IPLOT - Read/set cursor position				
E50A	; On entry:  SEC to read cursor position to .X(row) and .Y(col)	
E50A	;            CLC to save cursor position from .X(row) and .Y(col)
E50A	; 				
E50A           IPLOT				
E50A B0 07       	BCS READPL	;carry set? yes, read position
E50C 86 D6       	STX CURROW	;save row...
E50E 84 D3       	STY CSRIDX	;...and column
E510 20 87 E5    	JSR SCNPTR	;update position
E513             
E513           READPL				
E513 A6 D6       	LDX CURROW	;return row...
E515 A4 D3       	LDY CSRIDX	;...and column
E517 60          	RTS		

	The Kernal again does not call this function, instead managing cursor
movement by changing the values of the current row and current cursor index
(i.e., the cursor's position in the row). Upon storing the new cursor
location, the code commits the changes by jumping to an internal routine
in CINT1 which is responsible for moving the cursor block in screen memory.


Conclusion
==========

	In this installment, we examined several routines, two of which
are integral to the operation of the VIC. The Jiffy clock routine also
scans the STOP key, which is important to overall usability and the ability
to halt a program. The second routine, SCNKEY, is responsible for scanning
the keyboard matrix. That's pretty important, too.
	Next time, we'll examine more routines in the VIC's KERNAL, including
I/O routines. 
.......
....
..
.                                    C=H 19

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

JPEG: Decoding and Rendering on a C64
------------------------------------- Stephen Judd 
						<sjudd@ffd2.com>
				      Adrian Gonzalez
						<adrianglz@globalpc.net>

	In the C64 world there are a disturbing number of cases where
people have said, "It can't be done on a C64."  This goes on for a while
until someone actually takes a look at the task and its requirements,
and says "Not only can it be done, but it can be done easily."  JPEG is
one such case.
	This article is divided into two parts.  In part 1, I discuss
JPEGs and the decoding process.  The primary focus is on several important
issues not covered well, if at all, in existing documentation, especially
the IDCT; the article also covers the principles of decoding JPEGs and
JFIF files.
	In part 2, Adrian discusses Floyd-Steinberg dithering, and how it
can be applied to various C64 graphics modes (and how it can be used to
display jpegs!).  In both articles the actual C64 code and algorithms will
of course be discussed, and the source code is available at

	http://www.ffd2.com/fridge/jpeg

for both the decoder and the renderer.

	The decoder is about 4k of code, the renderer is around 2k, and
there are about 9k of tables.  With the grayscale versions, there is
ample memory left over.  With the color IFLI versions, memory is extremely
tight -- there are 32k of graphics, six 24-bit image buffers.  The Huffman
trees are stored in the screen RAM area.  The renderer crams all the data
into the graphics area, which is why you see garbage while the image is
rendering.  There are a few tens of bytes free in page 0, probably 100-200
bytes free in page 1, and a few tens of bytes free in page 2, and that's it!
Everything else just kind-of barely/exactly fits, and then only for
'typical' jpegs.

	Finally, Errol Smith deserves a special mention as the guy who first
tracked down some decent JPEG documentation.  Errol pointed me in the right
direction and within a few weeks we had JPEGs on a 64.

------
Part I: Decoding jpegs
------

	Decoding jpegs is a fairly straightforward process, and in
recent years some free documentation has become available.  This
article is meant to complement that documentation, by filling in
some of the gaps and detailing some of the broader issues, not to
mention some specific implementation issues.  The first part of this
article covers general jpeg issues: encoding/decoding, Huffman tree
storage, Fourier transforms, JFIF files, and so on.  The second part
covers implementation issues more specific to the C64.

	There are several sources of JPEG documentation online and in
the library.  Out of all of them, I found three that were particularly
useful:

	Cryx's jpeg writeup at http://www.wotsit.org 

	ftp://ftp.uu.net/graphics/jpeg/wallace.ps.gz, an updated
		article from one which appeared in the April 1991 
		"Communications of the ACM" (v34 no.4).
	
	"JPEG Still Image Data Compression Standard" by William B. Pennebaker
	 and Joan L. Mitchell, published by Van Nostrand Reinhold, 1993,
	 ISBN 0-442-01272-1.

The first, Cryx's writeup, is a programmer's description of JPEG files, so
it has good, detailed descriptions of the encoding/decoding process and
the file structure/organization, including a list of all the JFIF segments
and markers.  The second reference is also excellent, and explains most of
the basic principles of JPEGs, the how's and why's of the standard, and has
some helpful examples.  The third reference (the book) is very comprehensive,
but is written in a way which I feel tends to obscure the important points.
Nevertheless, it has an entire chapter on the discrete cosine transform and
several fast DCT algorithms, which is invaluable.  As an additional source
of information, some people might find the IJG's cjpeg/djpeg source code
helpful.

JPEG Encoding/Decoding
----------------------

	It's really simple, folks.

	Start with a grayscale image and divide it up into 8x8 pixel blocks
(just like a C64 bitmap).  The first block is the upper-left corner of the
image; the second block is to the right of the first block, and so on until
the end of the row is reached, at which point the next row begins.
	The next step is to take the two dimensional discrete cosine
transform of each 8x8 component, and filter out the small-amplitude
frequencies.  This will be explained in detail later, but the net result
is that you are left with a lot of zeros in the 64-byte data block, and
a few nonzero elements from which you can reconstruct the main features
of the image.  This filtering process is called the "quantization" step.
	The next step is to RLE-encode the resulting 8x8 block (since most
of the components are zero), and finally to Huffman-encode the RLE-encoded
data.  And that's it.  Done.  Finished.  Repeat Until Done.

	Color pictures are similar, but now each pixel has an 8-bit R, G,
and B value, so there will be three 8x8 blocks, for a total of 24 bits
(not quite like a C64 bitmap...).  The RGB values are converted to
luminance/chrominance values (RGB -> YCrCb), but what's important is that
for each 8x8 section of a color image there are three 64-byte blocks of
data, and each block is encoded as above.
	So to summarize: transform the data, filter ("quantize") the
transformed data, and RLE-encode and Huffman-encode the result.  Do this
for each component, and then move on to the next 8x8 block.  Therefore,
to decode the image data:

	read in the bits,
	find the Huffman code,
	unpack the RLE,
	de-quantize the data,
	and perform the inverse transform,

for each 8x8 block of image data to be plotted to the screen.  Repeat
until done.
	It turns out that there are other methods of JPEG compression
in the standard, such as arithmetic compression, but this is rarely
supported due to legal reasons (lame software patent owned by IBM, AT&T,
and Mitsubishi), and it doesn't seem to offer substantial compression
gains.  There are also different types of jpegs, most importantly
"baseline" or sequential jpegs, and "progressive" jpegs.  In
a progressive jpeg the image is stored in a series of "scans" which go
from lower to higher resolution.  I'll be focusing on baseline jpegs
(which are more common).
	Finally, it turns out that an 8x8 block of image data doesn't
have to correspond to an 8x8 block of pixels.  For example, each byte
of data might represent an average of a 2x2 block of pixels, so an 8x8
block of data might expand to a 16x16 block of pixels.  In a JPEG
the "sampling factor" determines how to expand an 8x8 block of data.
You can see that this can offer substantial compression gains, but will
coarsen the data; on the other hand, if the data is already coarse, it's
a way of getting a whole lot for nothing.  Most color jpegs use one-to-one
pixel mapping for the luminance, and one-to-four (each data byte = 2x2 pixel
block) mapping for the two chrominance components.  From an implementation
standpoint, this means that a decoder typically decodes 16 scanlines at a
time (16x16 pixel chunks).  For more details, see Cryx's document.
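
	As a concrete illustration of the one-to-four case, undoing the 2x2
chrominance subsampling is nothing more than pixel doubling. A C sketch (not
code from the actual decoder, just the idea) is:

/* Expand an 8x8 chrominance block to 16x16 pixels by duplicating each
   sample into a 2x2 block -- the simplest way to undo 2x2 subsampling. */
void upsample_2x2(const unsigned char in[8][8], unsigned char out[16][16])
{
	int x, y;

	for (y = 0; y < 16; y++)
		for (x = 0; x < 16; x++)
			out[y][x] = in[y / 2][x / 2];
}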

	Before a JPEG can be decoded, though, the decoder needs a fair
amount of information, such as the Huffman trees used, the quantization
tables used, information about the image such as its size, whether it's
a color or a grayscale image, and so on.  In a JPEG file, all information
is stored in "segments".

Segments
--------

A JPEG segment looks like the following:

	[header]	Two bytes, starting with $FF
	[length]	Two bytes, in hi/lo order (not usual 6502 lo/hi)
	[data]		Segment data

A list of JPEG (and JFIF) headers can be found in Cryx's document.

Let's have a look at a hex dump of a jpeg file (from unix, use
"od -tx1 file.jpg | more"):

0000000  ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48
0000020  00 48 00 00 ff fe 00 17 43 72 65 61 74 65 64 20
0000040  77 69 74 68 20 54 68 65 20 47 49 4d 50 ff db 00

The first two bytes are $ff $d8 -- these two bytes identify the
file as a jpeg.  All jpegs start with ff d8.
	Next we encounter the header ff e0.  ff e0 is a special header
which identifies this file as a JFIF file.  It turns out that in the
original JPEG standard a specific file format is not given; this
in turn led to different companies using their own formats, to try and
establish the "standard".  The JFIF format was put forwards to remedy
this problem, and is the de-facto standard -- but more on this later.
	In a JFIF file, the JFIF segment always follows the JPEG ID bytes.
You can see that it is length 16, and that that length includes the two
length bytes.  Immediately following the length bytes are the four letters
J F I F and the number 0; following that are some bytes for revision numbers,
the x/y densities, and some thumbnail info.
	The next segment starts with the header ff fe.  This is the
"comment" header; the length is $17 bytes.  Following the length bytes
are the ascii codes for "Created with The GIMP", a popular image
processing program.  The next header is ff db, which is the "Define
Quantization Table" header.  And on it goes, until the actual image
data -- a stream of Huffman-encoded bits -- is reached.
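
	To show how little is involved in walking those segments, here is a
simplified C reader that just prints each marker and its length and skips
over the data. It is a sketch only: no error handling, no handling of padding
bytes, and the filename is obviously just an example.

#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("file.jpg", "rb");
	int c;

	if (!fp) { perror("segwalk"); return 1; }

	fgetc(fp); fgetc(fp);			/* skip the $FF $D8 JPEG ID */

	while ((c = fgetc(fp)) == 0xFF) {
		int marker = fgetc(fp);
		int hi, lo, len;

		if (marker == 0xD9)		/* $FF $D9 = end of image */
			break;
		hi = fgetc(fp);
		lo = fgetc(fp);
		len = (hi << 8) | lo;		/* hi/lo order, not 6502 lo/hi */
		printf("marker ff %02x, length %d\n", marker, len);
		if (marker == 0xDA)		/* $FF $DA = start of scan -- */
			break;			/* compressed bits follow      */
		fseek(fp, len - 2L, SEEK_CUR);	/* skip the segment body */
	}
	fclose(fp);
	return 0;
}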

Huffman Decoding
----------------

	If you don't know anything about Huffman decoding, then I suggest
you read Pasi's nice article in C=Hacking #16, which has a good example.
Briefly, a Huffman tree is a binary tree whose left and right branches
correspond to bits 0 and 1 respectively; starting from the top of the
tree, you read bits and move left or right accordingly until a leaf
is reached, containing the Huffman code value.  Then you start over again
at the top of the tree and decode the next Huffman code.

	In a JPEG, Huffman trees are stored in "Define Huffman Table"
segments (header = ff c4):

0000300                                ff c4 00 1c 00 00
0000320  01 05 01 01 01 00 00 00 00 00 00 00 00 00 00 03
0000340  01 02 04 05 06 00 07 08

The first byte in the DHT segment (00) is an ID byte -- JPEGs can have up to
eight Huffman trees.  This is then followed by 16 bytes, where each byte
represents the number of Huffman codes of lengths 1, 2, 3, ..., up to
length 16, followed by the Huffman code values. In the above example, there
are 0 codes of length 1, 1 code of length 2, 5 codes of length 3, and so
on.  Following these 16 bytes are the Huffman values: 3, 1, 2, 4, ..., 8.
But what are the Huffman codes corresponding to those values?
	It turns out that these trees are so-called "canonical Huffman trees",
and work as follows: to get the next code, add 1 to the current code.
When the length increases, add 1 and shift everything left.  The exception
is that you don't increment until the first code is defined, so the first
code is always zeroes.
	For example, to decode the above DHT segment, start with Huffman
code = 0.  There are no codes of length 1, so we shift it left to get
code = 00 (and don't add 1 because the first code hasn't been defined yet).
There is one code of length 2, so we read the first Huffman value and
assign it to the current code

	Code	Value
	00	  3

That's the only code of length two, so now we move to length 3 by incrementing
and shifting: code = 010.  There are five values of length 3, and the next
five Huffman values are 1, 2, 4, 5, 6, so the Huffman tree is now

	Code	Value
	00	  3
	010	  1
	011	  2
	100	  4
	101	  5
	110	  6

and the rest of the Huffman tree is given by

	1110	  0
	11110	  7
	111110	  8
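
	Spelled out in code, the construction looks like the following Java
sketch (an illustration only, not the routine used in jpx/jpz), using the
counts and values from the example DHT segment:

public class CanonCodes {
    public static void main(String[] s) {
        // codes of each length 1..16, then the values, from the dump above
        int[] counts = {0,1,5,1,1,1,0,0,0,0,0,0,0,0,0,0};
        int[] vals   = {3,1,2,4,5,6,0,7,8};
        int code = 0, v = 0;
        for (int len = 1; len <= 16; len++) {
            for (int i = 0; i < counts[len-1]; i++) {
                String bits = Integer.toBinaryString(code);
                while (bits.length() < len) bits = "0" + bits;  // pad to len bits
                System.out.println(bits + " -> " + vals[v++]);
                code++;                         // next code of the same length
            }
            code <<= 1;                         // length increases: shift left
        }
    }
}

Running it prints the complete table, from "00 -> 3" down to "111110 -> 8".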

What's the best way to implement a Huffman tree?

The most obvious way is to use five bytes per "node", i.e.

	left pointer	(2 bytes)
	right pointer	(2 bytes)
	value		(1 byte)

where the left and right pointers are just offsets to be added to the
current pointer, and if left = right = $FFxx then this is a leaf.  If you
fetch a bit that says "go left", and the left pointer = $FFxx (but right
pointer is valid) then you've hit an invalid Huffman code -- i.e. decoding
error.  This five-byte method is used in jpx (grayscale decoder).
	But there is another rather cool method, first described to me by
Errol Smith, which uses only two bytes per node.  Now, the five-byte method
works fine in jpx, but in the full-color IFLI jpz code -- well, suddenly
memory becomes extremely tight, and without this routine jpz probably
wouldn't have happened on a stock machine.  The routine is also very
efficient, especially if implemented using 16-bit 65816 code.
	The trick is simply to organize the tree such that if the current
node is at location NODE, then the left node is at NODE+2 and the right
node is at NODE+(NODE).  Leaf nodes can be indicated by e.g. setting the
high bit.  So the decoding process is:

	get next bit
	if 0 then pointer = pointer + 2
	if 1 then pointer = pointer + node value
	if high byte of node value < $80 then loop

For example, the first part of the earlier Huffman tree

	00	  3
	010	  1
	011	  2
	100	  4

would be encoded as

0c 00 04 00 03 80 04 00 01 80 02 80 00 00 00 00 04 80
-----|-----|-----|-----|-----|-----|-----|-----|-----|

Try decoding the Huffman values, using the above algorithm.
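
	In case you'd like to check your work, here is a small Java sketch of
the two-byte tree walk (an illustration of the algorithm above, not the
actual jpz routine), with the 18 bytes of the example tree and the bits
supplied as a string of '0'/'1' characters:

public class TreeWalk {
    static int[] tree = {                       // lo/hi byte pairs from above
        0x0c,0x00, 0x04,0x00, 0x03,0x80, 0x04,0x00, 0x01,0x80,
        0x02,0x80, 0x00,0x00, 0x00,0x00, 0x04,0x80
    };
    static int node(int ptr) {                  // 16-bit node value at offset ptr
        return tree[ptr] | (tree[ptr+1] << 8);
    }
    static int decode(String bits) {
        int ptr = 0;
        for (int i = 0; i < bits.length(); i++) {
            ptr += (bits.charAt(i) == '0') ? 2 : node(ptr);  // go left or right
            if ((node(ptr) & 0x8000) != 0)                   // high bit set: leaf
                return node(ptr) & 0xff;
        }
        return -1;                                           // incomplete code
    }
    public static void main(String[] s) {
        System.out.println(decode("00"));       // 3
        System.out.println(decode("010"));      // 1
        System.out.println(decode("011"));      // 2
        System.out.println(decode("100"));      // 4
    }
}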

Astute readers may ask the question: won't you decode incorrectly if
there is no left node?  Even more astute readers can answer it: in a
canonical Huffman tree, the only nodes without left-node pointers are
leaves.
	To see this, consider a counterexample: a tree that looks like

		o
	       /
	      o
               \
		o

This corresponds to Huffman code 01 -- one move left, one move right.
In a canonical Huffman tree, the only way to generate the code 01 is to
increment the code 00; since code 00 has already occurred, there must be
a left-node.  In a canonical Huffman tree, you always create a left-node
before creating a right-node.  So error checking this kind of tree amounts
to checking the right-pointer; the only nodes without left-pointers are leaves.
Moreover, since left-nodes are always created first, you can add nodes in
the order they are created -- you never have to insert nodes between
existing nodes.
	Pretty nifty, eh?

Restart Markers
---------------

	The image data in a jpeg is a stream of Huffman-encoded bits.
The jpeg standard allows for "restart markers" to be periodically inserted
into the stream.  Thus a decoder needs to keep count of how far it is
in the data stream, and periodically re-synchronize the bitstream.  So
far so good -- this is explained in detail in Cryx's document.
	What _isn't_ explained is that the restart markers do not merely
re-synchronize the data stream, but when a restart marker is hit the DC
coefficients need to be reset to zero.  That is, it really does "restart"
the decoder.
	What's a DC coefficient, you may ask?  It's the very first element
in the 8x8 array, and instead of encoding the actual value a jpeg encodes
the _offset_ from the previous value.  That is, the decoded DC element is
added to the current DC value to get the new value.  That value needs
to be reset to zero when a restart marker is hit.
	Most jpegs do not use restart markers, but unless you reset the
coefficient you're going to spend a few months wondering why Photoshop images
don't decode correctly.
	Why is it called the DC coefficient?  You'll have to read the section
on Fourier transforms for the answer.

	Note also that the byte $FF gets special treatment in the image data
stream: if it is immediately followed by $00, the pair $FF $00 represents the
data value $FF (and the $00 is skipped); otherwise the $FF introduces a marker,
such as a restart marker, and is not data at all.  Why do I bring this up?
Because Cryx's document could be interpreted by naive people like myself
as saying this is true throughout a jpeg file, and it's only true within
the image data -- that in other segments, $FF is a perfectly valid byte.
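
	As a tiny illustration (not actual decoder code), fetching data bytes
from the entropy-coded stream might look like this in Java, where any value
returned at $100 or above signals a marker the caller must handle:

import java.io.*;

public class StuffedBytes {
    static int nextDataByte(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        if (b != 0xff) return b;                // ordinary data byte
        int b2 = in.readUnsignedByte();
        if (b2 == 0x00) return 0xff;            // $FF $00 = the data value $FF
        return 0x100 | b2;                      // a marker (e.g. restart marker)
    }
}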

Unpacking the RLE
-----------------

	Once a Huffman code is retrieved and decoded, the resulting byte
represents RLE-compressed data to be uncompressed.  This procedure is
described quite well in Cryx's document, so I'll just refer you to it.
This is repeated until you are left with a 64-byte chunk of data which
needs to be re-ordered and dequantized.  This process is again described
in Cryx's document; briefly, during the encoding process, the original
8x8 data is re-ordered into a 64-byte vector as follows:

	0  1  5  6  ...
	2  4  7  13 ...
	3  8  12 17 ...
	9  11 18 24 ...
	10 19 23 ...
	20 22 ...
	...

That is, the first element in the vector is the (0,0) component of the
8x8 array, the next element is the (1,0) component, the next element is
the (0,1) component, and so on.  The reason for this "zig-zag" ordering
is to enhance the RLE-compression, since it concentrates the lower
frequencies at the beginning of the vector and the higher frequencies --
most of which are typically zero-amplitude -- at the end of the vector
(more on this later).  The decoder thus needs to "un-zigzag" the vector
back into an 8x8 array.  All de-quantization amounts to is multiplying
each element by a corresponding element in a quantization table:

	data[i,j] = data[i,j]*quant[i,j]

The final step is to take the resulting 64-byte chunk and apply the 
inverse discrete cosine transform (IDCT).
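
	In Java, the un-zigzag and dequantization step might look like the
sketch below (illustration only).  The ZIGZAG table is the standard JPEG
ordering: entry i is the position, in row-major order, of the i-th element
of the 64-byte vector.

public class Dezigzag {
    static final int[] ZIGZAG = {
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63
    };
    // vec = the 64 decoded coefficients, quant = the 8x8 quantization table
    static int[][] unzigzag(int[] vec, int[][] quant) {
        int[][] block = new int[8][8];
        for (int i = 0; i < 64; i++) {
            int pos = ZIGZAG[i];
            block[pos/8][pos%8] = vec[i] * quant[pos/8][pos%8];  // dequantize
        }
        return block;
    }
    public static void main(String[] s) {
        int[] vec = new int[64];
        vec[0] = 2; vec[1] = 3; vec[2] = 5;       // a few nonzero coefficients
        int[][] quant = new int[8][8];
        for (int[] row : quant) java.util.Arrays.fill(row, 16);
        int[][] block = unzigzag(vec, quant);
        System.out.println(block[0][0]+" "+block[0][1]+" "+block[1][0]);
        // prints 32 48 80: vec[1] lands at block[0][1], vec[2] at block[1][0]
    }
}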


Fourier Transforms and the (I)DCT
---------------------------------

	Let's begin with the definition you'll see in any document on
JPEGS (hear that?  That's the sound of one thousand eyes simultaneously
glazing over).
	OK, let's back up a moment.  In computers, grasping new ideas is
usually straightforward: you read about it, play around with it a little,
and ah, it makes sense.  Mathematics isn't like that.  These are ideas
that took people decades and centuries to figure out.  College students
spend multiple months, working hundreds of problems, to gain just a basic
working knowledge of a subject.  There's simply a constant learning process.
Fourier transforms represent a fundamentally different way of thinking,
and the timescale for enlightenment in the subject is years, not minutes.
So don't worry if you don't understand everything immediately; the purpose
of this part isn't to make you an instant expert in Fourier transforms, but
rather to give you a toehold into the subject that you can expand on over
time.

	So, let's begin with a definition that you'll see in any
document on JPEGS.  The one-dimensional discrete cosine transform (DCT)
of a function f(x) with eight points (x=0..7) may be written as

			 7	       2*x+1
	F(u) = c(u)/2 * sum f(x) * cos(-----*u*PI),	u = 0..7
			x=0		16

where c(0) = 1/sqrt(2) and c=1 otherwise.  This may look very mysterious to
you, and it should, because it is rather mysterious-looking.  For now,
think of it as some sort of grinder: you insert f(x) into the grinder,
turn the crank, and out pops a new function, F(u).  In other words, the
original function f(x) has been _transformed_ into a new function F(u).
	Notice that we need to perform a separate sum for each value of u:

	F(0) = 1/(2*sqrt(2)) * sum f(x)
	F(1) = 1/2 	     * sum f(x)*cos((2*x+1)*PI/16)
	F(2) = 1/2	     * sum f(x)*cos((2*x+1)*2*PI/16)

and so on.  So there are a total of eight summations, each of which
involves eight summands, for a total of 64 operations to perform.
	One of the important properties about this transform is that it
is _invertible_.  That is, you can take a transformed function F(u),
put it into the other end of the grinder, turn the crank backwards,
and out pops the original function f(x).  Moreover it is _uniquely_
invertible -- for every function f(x), there is one and only one transform
F(u), and vice-versa (the functions f(x) and F(u) are often called
a transform pair).  In this case, the inverse DCT (IDCT) is given by

		      7			 2*x+1
	f(x) = 1/2 * sum c(u)*F(u) * cos(-----*u*PI),	x = 0..7
		     u=0		  16

You'll notice that it is very similar to the forward transform, except
now the sum is over u, and c(u) is inside of the summation; as before,
there are 64 sums total to perform.  Expanding the sum gives

	f(x) = 1/2 * ( 1/sqrt(2) F(0) + F(1) * cos((2*x+1)*PI/16) +
				        F(2) * cos((2*x+1)*2*PI/16) + ...)

For now, just note that the original function f(x) is given by a sum
of the transformed function F(u) times different cosine components.
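
	As a concrete (and deliberately slow) illustration, here are the two
definitions translated directly into Java -- a sketch to play around with,
not the routine used in the decoder.  Running it shows that the IDCT of the
DCT gives back the original eight values:

public class NaiveDct {
    static double c(int u) { return (u == 0) ? 1.0/Math.sqrt(2.0) : 1.0; }

    static double[] dct(double[] f) {            // F(u) = c(u)/2 * sum ...
        double[] F = new double[8];
        for (int u = 0; u < 8; u++)
            for (int x = 0; x < 8; x++)
                F[u] += c(u)/2.0 * f[x] * Math.cos((2*x+1)*u*Math.PI/16.0);
        return F;
    }
    static double[] idct(double[] F) {           // f(x) = 1/2 * sum c(u)F(u)...
        double[] f = new double[8];
        for (int x = 0; x < 8; x++)
            for (int u = 0; u < 8; u++)
                f[x] += 0.5 * c(u) * F[u] * Math.cos((2*x+1)*u*Math.PI/16.0);
        return f;
    }
    public static void main(String[] s) {
        double[] f = {31, 41, 52, 65, 83, 15, 34, 117};
        double[] g = idct(dct(f));               // transform, then invert
        for (int x = 0; x < 8; x++)
            System.out.printf("%8.3f %8.3f%n", f[x], g[x]);
    }
}
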
	The transform of a two-dimensional function f(x,y) is done by
first taking the transform in one direction (e.g. the x-direction)
followed by the transform in the other direction (e.g. the y-direction).
Thus the two-dimensional 8x8 discrete cosine transform of a function
f(x,y) may be written as

         c(u)c(v)     7   7               2*x+1             2*y+1
F(u,v) = --------- * sum sum f(x,y) * cos(-----*u*PI) * cos(-----*v*PI)
             4       x=0 y=0               16                16

 u,v = 0,1,...,7

where, as before, c(0) = 1/sqrt(2) and c=1 otherwise.  The IDCT is then
given by

          1     7   7                        2*x+1             2*y+1
f(x,y) = --- * sum sum c(u)c(v)*F(u,v) * cos(-----*u*PI) * cos(-----*v*PI)
          4    u=0 v=0                        16                16

 x,y = 0,1...7

Note that some documentation (e.g. Cryx's document) incorrectly gives c(u) 
and c(v) as c(u,v) = 1/2 for u=v=0 and c(u,v) = 1 otherwise.
	This is an _extremely_ expensive computation to do, requiring
64 multiplies of cosines (and computations of the arguments of the
cosines) to calculate the value at a _single_ point (x,y), and there are
64 points in each 8x8 block, so, even discounting the argument computation
(i.e. u*pi*(2*x+1)/16) we're looking at 64*64 = 4096 multiplications for
_every_ 8x8 block of pixels (where these are 16-bit multiplications).  On
a C64, in such a case, the decoding time could be measured in hours if not
days.
	But if this were the only way to compute a DCT, then JPEGs would
never have been DCT-based.  There are much faster methods of computing
Fourier transforms, that take advantage of the symmetries of the transform.
You may have heard of the Fast Fourier Transform, which is used in almost
all spectral computing applications; well, there are also fast DCT algorithms.
The one I used is actually an adaptation of the FFT.
	So the first task is: where do we find a fast DCT algorithm?  One
place to look is existing source code, like cjpeg/djpeg.  Unfortunately
I found it pretty incomprehensible, and hence tough to translate to 65816;
it is also pretty large.  And it's basically impossible to debug a routine
that isn't understood (if something goes wrong, then where's the error?).
	The next place to look is the literature -- many papers have
been written on fast DCT routines.  Unfortunately, the ones I found were
quite dense, very general (we only need an 8x8 routine, not an NxN routine),
and again, fairly complicated.
	What is needed is a _simple_, but fast, IDCT algorithm.  Salvation
came in the book by Pennebaker and Mitchell, mentioned at the beginning of
the article and available in the library.  This book has several 8x8 DCT
routines in it, with detailed discussions of the algorithms, both one-
dimensional and two-dimensional.  The 2D one is again fairly lengthy, but
the 1D ones are pretty fast and straightforward -- something like 29 adds
and 13 multiplies to compute 8 components of a 1D DCT.  Moreover the 13
"multiplies" are multiplies by constants, which means table lookups, not
full multiplications.  Compare with at least 1024 full multiplies and adds
using the DCT definition, and you can see that the fast routine is
*hundreds* -- and possibly thousands -- of times faster.  To put this in
perspective, it's the difference between taking 30 seconds to decode a picture
and taking 1-2 hours -- maybe even 10 hours or more -- to decode the same
picture!

	As mentioned earlier, we can do a 2D IDCT by doing a 1D transform
of the rows of some 2D array followed by a 1D transform of the columns (or
vice-versa).  Thus a 1D routine is all that is needed.  Although there
are specialized 2D routines, they are quite large and significantly more
complicated than a 1D routine.  Small and straightforward Good; large
and complicated Bad.  And cjpeg/djpeg makes the observation that they don't
seem to give much speed gain in practice.
	There's just one problem -- the book chapter discusses lots of
_forwards_ DCT routines, but devotes just one paragraph to _inverse_ DCT
routines!  "Just reverse the flowgraph" is the advice given, with a few
hints on reversing flowgraphs.
	To make a long story short, IF you reverse the flowgraph correctly,
AND you overcome the errors/misleading notation in the book, AND you
prepare the coefficients correctly before performing the transformation, 
then yes, by golly, it works!  And working code is awfully sweet after days
of intense frustration!  I have included an easy-to-read Java version of
the 1D IDCT routine at the end of this article.

	At this point, the more experienced programmers are asking, how
do you _know_ it works?  With so many possible 8x8 arrays, how do you test
and debug such a routine?  To answer these questions, it is important to
understand a few things about Fourier transforms.  In the process, we shall
also see why JPEG is based on the DCT, and why it is so effective at
compressing images.

Fourier Transforms for dummies
------------------------------

	There are several ways of thinking about a Fourier transform.
One way to think about it is that you can expand any function in a series
of sines and cosines:

	f(x) = a0 + a1*cos(wx) + a2*cos(2wx) + a3*cos(3wx) + ...
		  + b1*sin(wx) + b2*sin(2wx) + b3*sin(3wx) + ...

where the a0 a1 etc. are constant coefficients (amplitudes) and w 2w etc.
are the frequencies.  In the discrete cosine transform, the function is
expanded solely in terms of cosines:

	f(x) = a0 + a1*cos(wx) + a2*cos(2wx) + a3*cos(3wx) + ...

"Taking the transform" amounts to computing the coefficients a0, a1, a2, etc.
Once you know them, you can reconstruct the original function by adding up
the cosines.

	Now, let's forget about computing the coefficients, and stand back
for a moment and look at that expression.  Each coefficient tells "how much"
of f(x) is in each cosine component -- for example, the value of a2 says
"how much" of f(x) is in the cos(2wx) component.  Conversely, each
coefficient tells us how much of each "frequency" there is in f(x) --
a2 says how much frequency=2w there is, a0 says how much frequency=0 there
is, and so on.
	So another way of thinking about a Fourier transform is that it
transforms a function from the space (or time) domain into the _frequency_
domain -- instead of thinking about how much the function varies with x
(how it varies in space), we can see how it varies with _w_, the frequency;
instead of looking at "how much f" is at a given point in space or time,
we can look at "how much f" is at a given frequency.
	So, imagine measuring something simple, like the voltage coming
out of a wall socket.  A plot of the signal will be a sinusoidal function --
this is a graph of how the signal varies with time.  The Fourier transform
of this signal, however, will have a large spike at 60Hz (or 50Hz if you're
in Europe or .au).  Small amplitudes of other frequencies will probably be
seen, too, indicating noise in the signal.  So a graph of how the signal
varies with _frequency_ might look something like this:

			|
			|
			|
			|
	--^^--^----^----+-^---^----
		      60Hz

That is, lots of zero or very small amplitude frequencies, and a large
frequency amplitude at around 60/50Hz.
	If you've ever seen an equalizer display on a stereo, you've seen
a Fourier transform -- the lights measure how much of the audio signal there
is in a given frequency range.  When the bass is heavy, the lower frequencies
will have large amplitudes.  When there's some high instrument playing (or
lots of distortion), then the high frequencies will have large amplitudes.

	Now we can take this a few steps further.  The frequencies convey
a lot of information.  For example, cos(wx) wiggles very slowly if w is
small, and wiggles very rapidly if w is large (and it doesn't wiggle at
all if w=0).  (If you don't see this, just think of x as an angle which
goes around a circle: if x goes around the circle once, then 7x goes
around the circle seven times).  Therefore, a function which changes slowly
will have a lot of low-frequencies in the transform; a function which changes
rapidly will have large high-frequency components (rapid wiggles give rapid
changes).
	The zero frequency is special.  A constant function will have
only the zero-frequency component (since cos(0x) is a constant).  Moreover,
the zero-frequency represents the average value of the function over a
period of cosine -- this is easy to see because the average value of cos(x),
cos(2x), etc. is 0 over a full period: it is above zero half of the time,
and below zero the other half of the time, and the two halves cancel.

	Now consider an image.  A typical photograph changes fairly
smoothly -- there aren't many sudden sharp changes from black to white.
This means that the transform of some small area of the picture will have
fairly large-amplitude low-frequencies, but not much in the way of high
frequencies.  If those small-amplitude high-frequencies are simply thrown
away, then the image won't change much at all -- the high frequencies
represent super-fine details of the picture.  And that's why JPEG is a
"lossy" algorithm, and why it gets such high compressions -- the idea is
to throw away the fine details and the unnecessary components, and keep
just the major features of the picture.  It's also why JPEG isn't so great
for things like line-art, where the image can change rapidly -- you may
have noticed that things like slanted lines tend to get jagged in a jpeg.

	The important point to remember is that high frequencies correspond
to rapid changes in the image, low frequencies correspond to smooth changes,
and the zero frequency is the "average" value.  Because there were obviously
electrical engineers on the JPEG committee, the zero frequency is referred
to as the "DC component" of the transform, and the nonzero frequencies are
referred to as the "AC components" (for Direct Current and Alternating
Current).

	Finally, for completeness, note that there is a difference between
a discrete Fourier transform and a continuous Fourier transform, namely
that one gives the transform in terms of discrete frequencies (w, 2w, 3w,
etc.) and the other gives the transform as a continuous function of
frequency.  When dealing with discrete data -- like an 8x8 set of values --
we necessarily use a discrete transform.

	Now, how can you test a Fourier transform routine?

Fourier Transforms for smarties
-------------------------------

	The basic question is: how do we know if the IDCT is working
correctly?  Quite simply, by feeding it a problem we already know the
answer to.
	Remember that we are working with transformed data; each element
represents the amplitude of a specific frequency.  Imagine a transformed
vector with a single nonzero element, for example, let a3=10 and all the
other coeffs equal zero.  What will the inverse transform of this vector
be?  Since a3 is the amplitude of cos(3x), the transform will simply be...
a3*cos(3x)!  Similarly, if a1 is the only nonzero coefficient, the transform
will be a1*cos(x).
	The above explanation actually isn't _quite_ right, because of
the form of the IDCT used:

        	1     7                  2*x+1
	f(x) = --- * sum c(u)*F(u) * cos(-----*u*PI)
        	2    u=0                  16

Now it should be easy to see that if, say F(3)=10, and all the other F's
are zero, then the result of the transform -- whatever transform algorithm
is used -- must be

	f(x) = 5*cos(3*PI*(2x+1)/16)

So, for a one-dimensional IDCT, it is easy to test each component separately
and compare the result with the actual answer.  But what about a 2D IDCT
that has many nonzero components?
	There are two important properties of Fourier transforms which
come into play here.  The first is that Fourier transforms are _linear_;
a linear operator L satisfies

	L(c*f1) = c*L(f1), where c = constant
	L(f1 + f2) = L(f1) + L(f2)

That is, constants factor out of the operator, and operating on the sum
of two functions is the same as operating on each function separately
and adding them together.  As a simple example, consider the operators
L1(x) = x and L2(x) = x^2.  The first one satisfies the conditions above;
the second one does not.  Some other linear transforms you may be familiar
with are rotations, and taking the derivative.  You can test for yourself
that the Fourier transform satisfies the above conditions; you can also
look at the fast DCT algorithm and see that it only involves additions and
multiplications by constants, which are all linear operations.
	This property is enormously important here.  It first says that we
can multiply the transformed data by a constant, and the constant will
just multiply the final answer; said another way, if F(3)=10 and all other
F's are zero, then we know that F(3)=const*10 will work too, no matter
what the constant is!  So in testing one component at a time, you can
pretty confidently say "F(3) works" (as opposed to "F(3)=10 works, and
F(3)=11 works").  The _only_ thing that can cause problems is overflows
and other _computer_ issues; the basic algorithm _cannot_.
	Even more importantly, however, is that the transform of the sum
of two functions is equal to the sum of the transforms.  If we know that

	F1 = (0,0,10,0,0,0,0,0)

works, and we know that the transform of

	F2 = (0,0,0,10,0,0,0,0)

works, then we _know_ that the transform of

	F1 + F2 = (0,0,10,10,0,0,0,0)

works!  Moreover, since we can multiply each function by arbitrary constants,
we know that the transform of

	(0,0,a,b,0,0,0,0)

works, _no matter what a and b are_.  So we can _completely_ test a 1D DCT
simply by testing each component _separately_.  The _only_ things that can
cause problems are things like overflow, erroneous multiplications, etc.
	Now, what about a 2D IDCT?  The way a 2D IDCT is computed is by
first transforming in one direction (e.g. the x-direction), then transforming
in the other direction (e.g. the y-direction).  Therefore, we can compute
the 2D IDCT by first transforming each row, then transforming each column
(or vice versa).

	Therefore, once the 1D IDCT works, so does the 2D IDCT.

	So, to summarize: to test the routine completely we simply need
to test each component of a 1D IDCT separately, and compare the result
with the known answer.
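
	In Java, that single-component test might look like the following
sketch (it uses the naive IDCT straight from the definition, purely as an
illustration):

public class IdctTest {
    static double c(int u) { return (u == 0) ? 1.0/Math.sqrt(2.0) : 1.0; }
    static double[] idct(double[] F) {          // naive IDCT, from the definition
        double[] f = new double[8];
        for (int x = 0; x < 8; x++)
            for (int u = 0; u < 8; u++)
                f[x] += 0.5 * c(u) * F[u] * Math.cos((2*x+1)*u*Math.PI/16.0);
        return f;
    }
    public static void main(String[] s) {
        double[] F = new double[8];
        F[3] = 10;                              // a single nonzero component
        double[] f = idct(F);
        for (int x = 0; x < 8; x++) {
            double known = 5.0*Math.cos(3*Math.PI*(2*x+1)/16.0);
            System.out.printf("%8.4f %8.4f%n", f[x], known);  // columns match
        }
    }
}
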
	And if you really want to test it on a 2D set of data, there is
an example DCT array given in the Wallace paper (and the result of the
inverse transform).

Quantization revisited
----------------------

	The quantization step filters out all the small-amplitude frequencies.
A JPEG can have up to four quantization tables; each table is a 64-byte (8x8)
set of integers.  When encoding a JPEG, taking the DCT of an 8x8 block of data
leaves an 8x8 block of amplitudes.  Each amplitude is divided by the
corresponding entry in the quantization table, thus filtering out the small
amplitudes in a weighted fashion.  The quantized amplitudes are then
re-ordered into a 64-byte vector which concentrates the lower frequencies
(the ones more likely to be nonzero) at the beginning of the vector, and
the higher frequencies (more likely to be zero) at the end of the vector.
This last step (zig-zag reordering) clearly increases the efficiency of the
RLE encoding of the amplitude vector.
	The decoder just reverses these steps -- it dequantizes the data
(i.e. multiplies by the quantization coefficients) and re-orders the data,
before performing the IDCT.  Now, you may have noticed that the IDCT routine
has to prepare the coefficients by multiplying (dividing) by a set of
constants:

    for (int i=0; i<8; i++)
        F[i] = S[i]/(2.0*Math.cos(i*ang/2));

(This is done because the algorithm is actually an adapted FFT routine).
In principle, this step can be incorporated into the de-quantization step,
since dequantization is also just multiplying by constants.  In a 1D
transform this is very straightforward, but I see no way to extend it
to the 2D transform.  That is, it is possible to incorporate the above
into the quantization such that, say, the row transforms will not need
preparation, but the column transforms will still need the preparation.
I did not feel that this was a very useful "optimization", and simply
mention it here for completeness.
	Note also that a wise programmer would replace the Math.cos
calls above with constants, if the code were to be actually used in
a decoder.

Miscellaneous
-------------

	You may recall that all JPEGs begin with FFD8, and JFIF files
immediately follow this with the FFE0 JFIF segment.  Although most jpegs
have the JFIF segment, some don't!  For example, some digital cameras do
not include a JFIF header.  But the files decode just fine if you don't
worry about it.
	Moreover, be sure to skip unknown segments using the segment
length byte -- as opposed to, say, moving forwards in the file until
another valid segment header is found.
	When reading some of the other jpeg documentation, you'll read
that the byte $FF is a special byte, to be skipped (unless followed by $00).
Just to be clear, this only applies to the image data -- $FF is a normal
data byte within other segments.  Similarly, restart markers only appear
within the image data.

C64 Implementation
------------------

	As you probably understand by now, and as we shall see below,
jpegs on a C64 are far from being an impossible task.  So to wrap up,
this section will cover the main issues in implementing a jpeg decoder
on a C64, and examine some of the comments regarding jpegs being
"impossible" on a C64.

	One frequently-heard comment was that a C64 doesn't have enough
memory to decode a jpeg, so let's look at the numbers.  From the preceding
discussion, jpegs require memory for

	1 - Quantization tables
	2 - Huffman trees
	3 - Image data

The quantization tables are 64 bytes each, and there are a maximum of
four -- so, no big deal.  Using the two-byte storage method, the Huffman
trees typically take up around 1.5k, and using the 5-byte method they take
on the order of 4k.  The image data is stored in a jpeg on a row-by-row
basis, where each row is some multiple of 8 lines large.  The normal C64
display is 320 pixels wide, so that means an image buffer size of
320x8 = $0A00 bytes per 8 scanlines.
	So, a few K for the Huffman tables, and a few K for the image
buffers.  I think you'll agree that these are hardly massive amounts of
memory.
	Now, as you may recall, the data decoded from a JPEG file is
luma/chroma data -- Y (intensity) CrCb (chroma).  For a grayscale picture,
all that is needed is the intensity -- there's no need to convert to RGB.
You may also recall that, because of sampling factors, a jpeg might decode
to 16x16 blocks of data (or more), which means several 320x8 image buffers
need to be available -- at $0A00 bytes/buffer, there's plenty of buffer
space available.
	For a full-color picture, however, all three components need to
be kept, which means three buffers for each 320x8 row of data, which means
$1E00 bytes per row.  So there's still plenty of room for multiple buffers.
Until, that is, you throw IFLI into the mix -- but more on this later.
The bottom line is that jpegs really don't require much memory.

	Another common comment was that the C64 was far too slow to do
the necessary calculations, especially the discrete cosine transforms.
As was stated earlier, the IDCT routine used in this program needs some
29 adds and 13 multiplies to do a 1D transform.  More importantly, the
"multiplies" are always multiplies by a constant -- which means they
can be implemented using tables.  So, we're talking 29 16-bit adds and
13 16-bit table-lookups for the IDCT, which is really pretty trivial.
	Another important calculation is the dequantization, which means
doing 64 integer multiplications per 8x8 data block.  Each integer is 8-bits
large (and the result can be 16-bits), and the multiplications are done
using the usual fast multiply routine (let f(x)=x^2/4, then 
a*b = f(a+b)-f(a-b)), as described in all the C=Hacking 3D articles.
Again, not a big deal.
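
	For reference, here is the quarter-square identity checked over the
whole 8-bit range in Java (just an illustration of the idea -- on the 64,
f() is of course a table in memory, not a computation):

public class QuarterSquare {
    static int f(int x) { return (x*x)/4; }      // table of x^2/4 on the 64
    public static void main(String[] s) {
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                if (a*b != f(a+b) - f(a-b))
                    System.out.println("mismatch at " + a + "," + b);
        System.out.println("done");              // prints only "done"
    }
}
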
	So, in summary, the mathematical calculations are well within
the grasp of the 64.
	In fact, all the routines are quite straightforward -- only the
IDCT routine is special.

	One important issue is grayscale versus color.  The first program
released, jpx, is grayscale, and for several very good reasons.  Grayscale
is much faster to compute, since no RGB conversion needs to be done (the
intensity Y is exactly the grayscale levels).  It is more memory-efficient,
since the color components may be thrown away, and the bitmap requirements
are modest.  And it is easier and faster to render.
	With some pretty solid fundamental routines and a reasonable
grasp of the important issues, color was a reasonably straightforward
addition to the code, with just one problem: memory.  IFLI requires
32k for the bitmaps.  The IDCT routine uses some 6k of tables.  At least
two image buffers are needed, for almost 16k.  The RGB conversion code
uses table lookups.  The renderer needs memory for image buffers and tables.
The decoder needs memory for Huffman and quantization tables.  When we added
it all up, there just wasn't room.
	With a little more thought and planning, though, a few things
became clear: first, IFLI doesn't use the first three columns, which
means the image buffers only need to be 296x8 x 3 components = $1BC0 bytes
(instead of 320x8 x 3).  Typical jpegs use a maximum sampling factor of 2,
so using just two buffers requires $3780 bytes -- a savings of almost $0600
bytes over a 320-pixel wide bitmap.  Moreover, the needs of the renderer
came out to almost exactly 16K per bitmap, which means that all the data can
be squished into the two IFLI bitmaps and sorted out later.  So by scrimping
here and saving there, and economizing on tables and rearranging memory, we
were able to cram everything into 64k, with just a few hundred bytes to
spare -- pretty neat.

	And that, I think, sums up JPEG decoding on a C64.


/*
 * idct2d.java -- Attempts to implement the IDCT by reversing the flowgraph
 * as given in Pennebaker & Mitchell, page 52.
 *
 * Almost there!
 *
 * SLJ 9/15/99
 */

import java.lang.Math.*;
import java.io.*;
import java.util.*;

public class idct2d {

    // a1=cos(2u), a2=cos(u)-cos(3u), a3=cos(2u), a4=cos(u)+cos(3u), a5=cos(3u)
    // where u = pi/8

    static double ang = Math.PI/8;

//    static double a1=0.7071, a2= 0.541, a3=0.7071, a4=1.307, a5=0.383;
    static double	a1 = Math.cos(2.0*ang),
			a2 = Math.cos(ang)-Math.cos(3.0*ang),
			a3 = Math.cos(2.0*ang),
			a4 = Math.cos(ang)+Math.cos(3.0*ang),
			a5 = Math.cos(3.0*ang);

//    static double[] f = {31, 41, 52, 65, 83, 15, 34, 117},
    static double[] f = {10, 9.24, 7.07, 3.826, 0, -3.826, -7.07, -9.24},
	     F = {0, 0, 0, 0, 0, 0, 0, 256},
	     S = {0, 0, 0, 0, 0, 0, 0, 256};

    static double[][] trans = new double[8][8];

    idct2d() {}

  void calcIdct() {
    double t1, t2, t3, t4;

    // Stage 1

    for (int i=0; i<8; i++)
	F[i] = S[i]/(2.0*Math.cos(i*ang/2));

    F[0] = F[0]*2/Math.sqrt(2.0);

    t1 = F[5] - F[3];
    t2 = F[1] + F[7];
    t3 = F[1] - F[7];
    t4 = F[5] + F[3];
    F[5] = t1;
    F[1] = t2;
    F[7] = t3;
    F[3] = t4;

    //printF();

    // Stage 2

    t1 = F[2] - F[6];
    t2 = F[2] + F[6];
    F[2] = t1;
    F[6] = t2;

    t1 = F[1] - F[3];
    t2 = F[1] + F[3];
    F[1] = t1;
    F[3] = t2;

    //printF();

    // Stage 3

    F[2] = a1*F[2];

    t1 = -a5*(F[5] + F[7]);
    F[5] = -a2*F[5] + t1;
    F[1] = a3*F[1];
    F[7] = a4*F[7] + t1;

    //printF();

    // Stage 4

    t1 = F[0] + F[4];
    t2 = F[0] - F[4];
    F[0] = t1;
    F[4] = t2;

    F[6] = F[2] + F[6];

    //printF();

    // Stage 5

    t1 = F[0] + F[6];
    t2 = F[2] + F[4];
    t3 = F[4] - F[2];
    t4 = F[0] - F[6];
    F[0] = t1;
    F[4] = t2;
    F[2] = t3;
    F[6] = t4;

    F[3] = F[3] + F[7];
    F[7] = F[7] + F[1];
    F[1] = F[1] - F[5];
    F[5] = -F[5];

    //printF();

    // Final stage
    f[0] = (F[0] + F[3]);
    f[1] = (F[4] + F[7]);
    f[2] = (F[2] + F[1]);
    f[3] = (F[6] + F[5]);

    f[4] = (F[6] - F[5]);
    f[5] = (F[2] - F[1]);
    f[6] = (F[4] - F[7]);
    f[7] = (F[0] - F[3]);
  }


    static public void main(String s[]) {

	idct2d test = new idct2d();
	int i,j;

	// Init to test transform in Wallace paper
	for (i=0; i<8; i++)
	  for (j=0; j<8; j++) trans[i][j]=0;
	trans[0][0] = 240;
	trans[0][2] = -10;
	trans[1][0] = -24;
	trans[1][1] = -12;
	trans[2][0] = -14;
	trans[2][1] = -13;

	//First the row transforms
	for (i=0; i<8; i++) {
	  for (j=0; j<8; j++) S[j] = trans[i][j];
	  test.calcIdct();
	  for (j=0; j<8; j++) trans[i][j] = f[j];
	}

	for (i=0; i<8; i++) {
	  System.out.println();
	  for (j=0; j<8; j++) System.out.print((int) trans[i][j]+" ");
	}
	System.out.println();

	System.out.println("Columns:");

	//Now the column transforms
	for (i=0; i<8; i++) {
	  for (j=0; j<8; j++) S[j] = trans[j][i];
	  test.calcIdct();
	  for (j=0; j<8; j++) trans[j][i] = f[j]/4 + 128;
	}

	//Print it out!
	for (i=0; i<8; i++) {
	  System.out.println();
	  for (j=0; j<8; j++) System.out.print((int) (trans[i][j]+0.5)+" ");
	}
	System.out.println();
    }
}
.......
....
..
.                                    C=H 19

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

-------
Part II: Bringing "True Color" images to the 64
-------
         by Adrian Gonzalez <adrianglz@globalpc.net>

The Commodore 64 has a somewhat limited resolution, 16 predefined colors,
and some very peculiar restrictions as to the number of different colors
that can be placed next to each other.  These restrictions make drawing
colorful pictures on the 64 a difficult task, and displaying full color
photographic images almost impossible.

I've been fascinated with bringing full color images to the c64 for a long
time now.  My first image conversion project was a C program that could
convert 16 color IFF pictures to koalapaint format.  I started work on this
project somewhere back in 1992 or so.  It ran on the Amiga, and it was one
of my first 'serious' C projects, so I was basically refining my C skills
while doing it.  After some time I rewrote the converter completely and
added support for Doodle, charsets and a few other things.

Shortly after and with the help of a few friends on the net, I learned about
a "magical" graphic mode called FLI.  Before I could do a FLI converter,
however, somebody on irc #c-64 pointed me to a couple of 'amazing' images
available on an ftp site that were supposedly in some new, colorful vic
mode.  I was reluctant because I thought I had seen the best graphics a c64
could do.  Boy was I wrong.  I was absolutely amazed by this 'new' VIC mode
called IFLI.  Shortly thereafter the thought of doing an IFLI converter grew
stronger and stronger in my head and the idea of a FLI converter practically
vanished.  After several weeks of hard work I came up with my first attempt
at IFLI conversion.  Several years passed until there was a reason to port
this converter/renderer to the c64.  The reason, of course, was Steve Judd's
JPEG decoder.

My involvement with the JPEG project kind of started before Steve even
started to work on it.  About two years ago, Nate Dannenberg asked me to
do a renderer for his QuickCam interface.  I first came up with a 160x100
renderer in 4 grays.  After that I came up with the 2 gray 320x200 hires
renderer that was used first for Nate's Quick cam, and later modified to
work with the first version of Steve's JPEG decoder.  This same renderer
was later hacked into rendering drazlace grayscale images.

The big challenge, of course, was porting the full color IFLI renderer to
the c64.  I don't think I would've ever bothered if it wasn't for jpx.
We faced the obvious restriction of the c64's limited RAM (The IFLI image
itself takes up half the c64's memory!).  Things were tight, but in the end,
it worked out just fine.  But how exactly does the renderer do its magic?
What's all that garbage on the screen while it's rendering?  Well, I'd
like to start off by giving a quick explanation of what dithering is, and
how the renderer uses a particular kind called Floyd-Steinberg dithering.


Floyd-Steinberg Dithering
-------------------------

Dithering is the process of using patterns of two or more colors to
trick the eye into seing a different color.  Let's say that you want to
display 3 shades of gray with just two colors, you could have dither
patterns such as:

. . . .  * . * .  * * * *
. . . .  . * . *  * * * *
. . . .  * . * .  * * * *
. . . .  . * . *  * * * *
. . . .  * . * .  * * * *
. . . .  . * . *  * * * *

Where the dots (.) are black pixels and the asterisks (*) are white
pixels.  If the pixels are small enough, the eye will see the middle
pattern as a shade of gray.  This is the basic concept behind dithering.

Floyd-Steinberg dithering is an 'error diffusion' dither algorithm. 
Basically that means that when drawing an image, if a color in the
source image can't be matched with the available colors we have to use
the closest available color.  After that we have to figure out the
difference between the color we wanted to use (source image color) and the
closest one we had available.  That difference, or error, has to be
distributed (diffused) amongst adjacent pixels.

For example, imagine we have a video chip that can only display black and
white pixels.  Black pixels would be 0% brightness and white pixels 100%
brightness.  Let's say we want to use this chip to display an image with 100
shades of gray.  We can store the image as an array of numbers from 0 to 99,
where 0 represents 0% brightness and 99 represents 100% brightness.  A small
part of our test image could look something like this (5 x 2 pixel chunk of
the image):

00  25  45  75  99
30  50  80  30  10

Without dithering, the best we could do is pick the color closest to the one
we want to display, so we'd end up with something like:

00  00  00  99  99
00  99  99  00  00

Where 00 is black and 99 is white.  Basically, any pixels with brightness
greater or equal to 50 were converted to white (99) and the rest were
converted to black (00), since those are the only two colors our hypothetical
video chip can display.

With Floyd-Steinberg error diffusion dithering we also plot the closest
color we have, but instead of just moving on to the next pixel we calculate
by how much we were off (error) and diffuse that amount among adjacent pixels.
Going back to our test image, the first pixel is completely black so we can
display it right away without incurring any error, because we matched the
color exactly.  The second pixel (25) is dark gray so we plot it with the
closest color we can, in this case, black (00).  We then proceed to compute
the error, which is equal to the color we wanted (25) minus the color we
had available (00), so for this pixel, the error is +25.  We then diffuse
the error (+25) to the adjacent pixels.  F-S dithering uses the following
distribution:

       C.Pix  7E/16 

1E/16  5E/16  3E/16

Where C.Pix is the current pixel, and E is the error.  Basically that
means, add seven sixteenths of the error to the pixel to the right of the
current pixel, five sixteenths of the error to the pixel below the current
pixel, etc.

So in our example, we wanted to plot a dark gray pixel (25) but we only
had black available (00), so the error is +25.  So then we have (rounded
off)

(7/16)E = 11
(5/16)E = 8
(3/16)E = 5
(1/16)E	= 2

When we add this to the original image buffer, we get:

(Original)
00  CP >45< 75  99
30  50  80  30  10

(Diffused)
00  CP >56< 75  99
32  58  85  30  10

Again, CP stands for 'Current pixel'.  After doing these calculations, we're
ready to move on to the next pixel.  You'll notice that the third pixel
(originally 45) would have been plotted as black but now, because of the
error diffusion, the new value is 56 so we'll plot it as white, and the
error will be 56-99 = -43.  We then repeat the procedure:

(7/16)E = -19
(5/16)E = -13
etc

And adjust the buffer accordingly.  Repeat this procedure for each pixel,
processing each scanline from left to right and scanlines from top to
bottom and the result is a nice looking dithered image.  Note that errors
can be positive or negative, so we should prepare for a case such as this:

55 00 00
00 00 00

Get the 55, plot it as white, and we have an error of -44, so that means
that our buffer needs to be able to handle negative values as well.  After
difusing, the buffer would look like:

 CP -20  00
-14  -8  00

Note also that the 1E/16 was discarded because we're at the left edge of
the screen.  The same overflow condition applies to the opposite case:

 44  99  99
 99  99  99

The error +44 will make the values of adjacent pixels greater than 99,
which is the maximum that can be displayed.  The buffer needs to be able to
hold values large enough to accommodate this.

Now let's assume our hypothetical video chip manufacturer came up with a new
video chip that can display 4 grays: black (0), dark gray (33), light gray
(66), and white (99).  If we want to plot an image with 100 shades of gray
we will still always plot the closest color we can, i.e. 0-16 will be
plotted as 0 (black), 17-49 as 33 (dark gray), etc.  The error will be
positive or negative depending on whether we're under or over the color we
wanted to plot.  For example, the color 15 would be plotted as 0 (black),
with an error of +15, while the color 20 would be plotted as 33 (dark gray)
with an error of -13.  And I think I've managed to confuse everybody
including myself, but if you read this paragraph over, it should make at
least some sense.  Always remember the error is computed as the color we
want minus the color we have.
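
Just to tie the pieces together, here's a small Java sketch of the whole
grayscale procedure (an illustration, not the jpz renderer), using the error
distribution described above; note that the integer division here truncates,
where the worked example above rounded:

public class Dither {
    static int closest(int want, int[] palette) {        // nearest gray we have
        int best = 0;
        for (int i = 1; i < palette.length; i++)
            if (Math.abs(want - palette[i]) < Math.abs(want - palette[best]))
                best = i;
        return palette[best];
    }
    static void dither(int[][] img, int[] palette) {
        int h = img.length, w = img[0].length;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int want = img[y][x];
                int have = closest(want, palette);
                int err  = want - have;                   // error = want - have
                img[y][x] = have;
                if (x+1 < w) img[y][x+1] += err*7/16;     // right
                if (y+1 < h) {
                    if (x-1 >= 0) img[y+1][x-1] += err*1/16;  // below-left
                    img[y+1][x] += err*5/16;                  // below
                    if (x+1 < w) img[y+1][x+1] += err*3/16;   // below-right
                }
            }
    }
    public static void main(String[] s) {
        int[][] img = {{ 0,25,45,75,99}, {30,50,80,30,10}};  // the test image
        dither(img, new int[]{0, 99});                       // two-color "chip"
        for (int[] row : img) {
            for (int v : row) System.out.print(v + " ");
            System.out.println();
        }
    }
}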

As if things weren't fun enough, we can also apply this to a full color
(RGB) display where we have 3 buffers, one for each primary color (red green
and blue).  Each buffer contains the corresponding levels of each primary
color for a given pixel.  Everything works exactly the same, except now
colors are specified as triplets, for example:

   R   G   B
(  0,  0,  0) black
( 99,  0,  0) bright red
( 99, 99,  0) bright yellow
( 99, 99, 99) white

When we plot a color we now have to compute three errors, one for each
primary color component.  Each component is used to figure out the error for
its corresponding buffer.  For example, let's say we want to draw a red
pixel (80, 0, 0) but our video chip can only display bright red (99, 20, 0).
The error would still be computed as the color we want minus the color we
can display:

We want:
r1=80, g1= 0, b1=0

We have:
r2=99, g2=20, b2=0

The error would be: (r1-r2, g1-g2, b1-b2) = (-19, -20,  0).  After computing
the error we proceed to distribute it in the same fashion as before, except
that we now have three image buffers, each with its own error to be
distributed among its adjacent pixels.  The best way to visualize this is to
imagine you're displaying 3 independent images, each with its own error.
In the previous example, we would diffuse the -19 in the red buffer, the -20
in the green buffer and the 0 in the blue buffer.

With grayscale images, finding which shade of gray was the closest to the
one we wanted to display was pretty straightforward.  With full color
images, the way to figure out the closest color changes a little bit.  In
order to find which of our available colors is the closest match for the
color we want to display, we need to calculate the 'distance' from the color
we want to each of the colors we have available and use the one with the
shortest distance.  To do this you can imagine the RGB color space as a
cube, with R, G, and B as the three axes.  The origin (0,0,0) is
black, and the corner opposite to the origin (99,99,99) is white, so
figuring out the distance between two colors is as simple as figuring out
the distance between two points in space:

color1 = (r1, g1, b1)
color2 = (r2, g2, b2)
d = sqrt( (r1-r2)^2 + (g1-g2)^2 + (b1-b2)^2 )

Let's say that our video chip can display 5 colors:  black, red, green, blue
and white.  The RGB triplets for these colors would be:

( 0, 0, 0): Black
(99, 0, 0): Red
( 0,99, 0): Green
( 0, 0,99): Blue
(99,99,99): White

Let's also say we want to find out which of these is the closest match for
the color (50,80,10).  We have to compute the distance between this color
and all of our 5 available colors and see which one is the closest.  The
calculations would be as follows:

Black:
sqrt( ( 0-50)^2 + ( 0-80)^2 + ( 0-10)^2 ) = 94.87

Red:
sqrt( (99-50)^2 + ( 0-80)^2 + ( 0-10)^2 ) = 94.35

Green:
sqrt( ( 0-50)^2 + (99-80)^2 + ( 0-10)^2 ) = 54.42

Blue:
sqrt( ( 0-50)^2 + ( 0-80)^2 + (99-10)^2 ) = 129.70

White:
sqrt( (99-50)^2 + (99-80)^2 + (99-10)^2 ) = 103.36

In this case, the color with the shortest distance is Green (54.42).  Note
that we're not interested in knowing the exact distance, just knowing which
color has the smallest distance, so it's safe to toss out the square root
in order to make things faster.  If we don't calculate the square root we
end up with the following squared distances:

Black:  9000
Red:    8901
Green:  2961
Blue:  16821
White: 10683

Of course, Green still has the smallest distance^2, and we're saved from
performing a potentially troublesome (and slow) calculation.
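
Here's the same calculation as a little Java sketch (illustration only),
which reproduces the squared distances above and picks Green:

public class ColorMatch {
    static int[][] palette = {{0,0,0},{99,0,0},{0,99,0},{0,0,99},{99,99,99}};
    static String[] names  = {"Black", "Red", "Green", "Blue", "White"};
    static int dist2(int[] a, int[] b) {          // squared distance, no sqrt
        int dr = a[0]-b[0], dg = a[1]-b[1], db = a[2]-b[2];
        return dr*dr + dg*dg + db*db;
    }
    public static void main(String[] s) {
        int[] want = {50, 80, 10};
        int best = 0;
        for (int i = 0; i < palette.length; i++) {
            System.out.println(names[i] + ": " + dist2(want, palette[i]));
            if (dist2(want, palette[i]) < dist2(want, palette[best])) best = i;
        }
        System.out.println("closest = " + names[best]);   // Green
    }
}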

Based on the previous explanation, we're ready to move on to implementing
Floyd-Steinberg dithering on the C64.  We will need to have the RGB values
for each C64 color handy in order to be able to compute the error and the
closest colors for each pixel we want to draw.

This article would probably end at this point if the C64 would let us
choose any of the 16 colors for any pixel on the screen, but we're not quite
that lucky.


Multicolor Bitmap Mode
----------------------

The VIC-II video chip on the C64 has somewhat strict color limitations.  In
multicolor bitmap mode, the screen has a resolution of 160x200 and it's
divided into 4x8 pixel 'cells'.  Each of these cells can have up to 3
different colors out of the C64's 16 colors plus one background color common
to all cells on the screen.  If we wanted to display a 4x8 cell like this:

4  4  4  3
4  4  3  3
4  3  3  3
3  3  3  0
3  3  3  0
1  3  3  3
1  1  3  3
1  1  1  3

We could choose color 3 as the background color common to all cells, and the
colors 0, 1 and 4 as the colors available to this particular cell (called
foreground, multicolor 0, and multicolor 1).  We can't display any
additional colors on this cell.  This makes multicolor bitmap mode a very
tough choice for displaying true color images.


FLI Mode
--------

Flexible Line Interpretation (FLI) mode is a software graphics mode in which
the video chip is tricked by software in order to achieve higher color
placement freedom.  It is basically the same as multicolor bitmap mode,
except that each 4x8 cell is further divided into eight 4x1 cells.  Each
4x1 cell can have 2 completely independent colors, 1 color common to the
entire 4x8 cell and one background color common to the entire image (some
implementations of FLI change the background color on every scanline as well).
One small downside of FLI mode is that the leftmost 3 columns of cells are
lost due to the trickery used to get the video chip to fetch color data on
every scanline.  This means that the effective display area is reduced from
160x200 to 148x200.


IFLI Mode
---------

IFLI mode or "Interlaced" FLI mode is basically two FLI images alternating
rapidly.  The C64 has a fixed vertical refresh rate of 60 frames per second
for NTSC models and 50 frames per second for PAL models.  This means that
the screen is redrawn 60 times per second on NTSC units and 50 times per
second on PAL units.  IFLI alternates between two FLI images, displaying
each for 1/60th of a second (1/50th for PAL), giving the illusion of a
single blended image with more than 16 colors.  One of the biggest
advantages of IFLI mode is that one of the FLI images is shifted one hires
pixel (1/2 of a multicolor pixel) to the right to give a pseudo 320x200
hires effect.

For example, let's say a little part of each image looks like this:
(11 = one multicolor white pixel, 33 = one multicolor cyan pixel, etc)

Image1
11335577

Image2
 22446688

Alternating these two would give an effect that looks like:
12345678

Except that the colors would also mix and blur slightly, giving the illusion
of more colors than the VIC-II can actually display.  Of course, some color
combinations work better than others.  Don't expect to mix black and white
and get a nice looking shade of gray (you'll get a very flickery shade of
gray because of the alternation).

The renderer in jpz doesn't attempt to mix colors, mainly because I was
never happy with the results I got by doing that.  Instead, it treats the
IFLI display as a 'true' 296x200 display capable of displaying any single
one of the c64's 16 colors in any position.  Note that the 3 column 'bug'
also applies to IFLI, so the resolution is 296x200 instead of 320x200.

The color restrictions are somewhat more complex in IFLI mode.  The renderer
in jpz treats the display as if it was made up of 8x8 cells, with each cell
divided into eight 8x1 cells, and each of those divided into two 4x1 cells
(fun, huh?).  To illustrate this better, look at the following 8x8 cell
sample:

A I A I A I A I
B J B J B J B J
C K C K C K C K
D L D L D L D L
E M E M E M E M
F N F N F N F N
G O G O G O G O
H P H P H P H P

The odd columns belong to a 4x8 cell in the first FLI image and the even
columns belong to a 4x8 cell in the second FLI image like this:

Image 1    Image 2
AAAA       IIII
BBBB       JJJJ
CCCC       KKKK
DDDD       LLLL
EEEE       MMMM
FFFF       NNNN
GGGG       OOOO
HHHH       PPPP

Remember the two images are offset by half a multicolor pixel to give the
pseudo-hires effect.  As for the color restrictions, each 4x1 cell on each
image has 2 completely independent colors, but each 8x8 cell (the
combination of the 4x8 cells from the two images) shares one color, and the
entire image shares one background color.

The renderer in jpz is divided into two parts.  The first part takes the
source RGB image and remaps it to the c64's colors, using Floyd-Steinberg
dithering as described in the first part of this article.  This part outputs
an array of numbers, each of which corresponds to a c64 color.  The second
part of the renderer takes this array of c64 colors and displays it in IFLI
mode as best as it can, taking into consideration the color placement
limitations mentioned above.

The second part of the renderer works with blocks of 8x8 pixels and follows
these steps:

1) Choose one color as common to the entire 8x8 cell
2) Choose two colors for each 4x1 cell
3) Render the 8x8 block (as two 4x8 cells, one on each FLI image)

In step one the renderer has to determine which one of the C64's 16 colors
would be the most helpful when chosen as common to the 8x8 block.  This
means that the common block color should be chosen to aid in 4x1 cells with
more than 2 different colors (remember that 4x1 cells only have 2 completely
independent colors for them).  If we wanted to display a 4x1 cell like

 1 15 12 12
 
We have two independent colors for the cell, which could be chosen as 1 and
15.  We need either the common 8x8 block color or the background color to be
12 so we can correctly display this 4x1 cell.  So how do we decide?  We
create a histogram!

A histogram is nothing more than a count of how many pixels of each color we
have in a particular area (in this case an 8x8 block).  Note that we only
want to count the cases in which the common block color would actually be
helpful for displaying a particular 4x1 cell.  This is easier to explain
with an example 8x8 block:

1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 3 1 3 1 4 1

If we count all the colors in this block we would find 60 ones, one 2, two
3's, and one 4, and we would decide that 1 is the best choice as a common
color for the 8x8 block because it's the most 'popular' color.  A closer
look reveals that this block will be rendered as the following 4x8 blocks:

Image1  Image2
1111    1111
1111    1111
1111    1111
1111    1111
1111    1111
1111    1111
1111    1111
2334    1111

Note that in the last 4x1 cell of image 1 we have 3 different colors.  We
have the ability to choose only two individual colors for this 4x1 cell, so
if we choose 2 and 3, we won't be able to display 4 and our common 8x8 block
color can't help us either.  The best solution in this case is to _not_
count 4x1 cells with 2 or fewer different colors.  This means that the only
cell we would count in our histogram is the last 4x1 cell in image 1.  So
the new histogram would be one 2, two 3's, and one 4.  We would proceed to
choose 3 as the common 8x8 block color and this allows us to render the
entire 8x8 block without a single problem!
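
As a sketch of that selection step (an illustration, not the actual jpz
code -- the cell layout and tie-breaking here are just assumptions), the
histogram might be built like this in Java, where cells[] holds the sixteen
4x1 cells of the 8x8 block (eight per FLI image):

public class BlockHistogram {
    static int pickCommon(int[][] cells) {       // returns a c64 color 0-15
        int[] hist = new int[16];
        for (int[] cell : cells) {
            java.util.Set<Integer> distinct = new java.util.HashSet<>();
            for (int c : cell) distinct.add(c);
            if (distinct.size() <= 2) continue;  // the 2 free colors cover it
            for (int c : cell) hist[c]++;        // only "problem" cells count
        }
        int best = 0;
        for (int c = 1; c < 16; c++)
            if (hist[c] > hist[best]) best = c;
        return best;
    }
    public static void main(String[] s) {
        // the example block: fifteen cells of "1 1 1 1" plus one cell "2 3 3 4"
        int[][] cells = new int[16][];
        for (int i = 0; i < 15; i++) cells[i] = new int[]{1,1,1,1};
        cells[15] = new int[]{2,3,3,4};
        System.out.println(pickCommon(cells));   // prints 3
    }
}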

In theory, the same should be done for the background color, in order to
choose the best background color for the picture we're rendering, but that
would mean that we have to do a histogram for the entire image before
starting to render it.  In practice, we don't have enough memory on the C64
to do this while reserving enough memory for an IFLI display (and decoding a
JPEG), so we choose black as the default background color.

The second step in the process is to choose two colors for each 4x1 cell. 
This is done with the same histogram technique described earlier, except we
have to take into consideration the color we picked as common to the entire
8x8 block so we don't repeat any colors and have the best chances of
representing the original image as closely as possible.   Basically, a
histogram is made for each 4x1 cell, and the top two most popular colors are
picked, assuming they're not the same as the background color (black) or the
common 8x8 block color.  For example, let's say the common 8x8 color is
white (1) and we have a 4x1 cell that looks like this:

1223

The histogram would be:  two pixels of color 2 (red), one pixel of color 1
(white) and one pixel of color 3 (cyan).  In this case, since white is
already our common 8x8 block color, we skip it and pick colors 2 and 3 as
our 4x1 cell colors.  The same skipping is done with black pixels because
black is already available as the background color.

The third and last step is to render the actual image with the correct
bitpairs.  As you may know, multicolor images sacrifice half the horizontal
resolution in favor of more colors.  Basically, bits are paired up to have 4
possible combinations:

00: Background color (black in our case)
01: Upper nybble of screen memory   (4x1 cell color #1)
10: Lower nybble of screen memory   (4x1 cell color #2)
11: Color RAM nybble (Common 8x8 block color)

All that's left to do is to output the corresponding bit pairs in each 4x1
cell to match the colors in the source (remapped) image as closely as
possible.

Depending on the complexity of the source image, there can be a few or a
lot of 4x1 cells where we can't match all the colors.  Remember we only have
2 completely independent colors for each 4x1 cell, and a cell can
potentially have each pixel be a different color.  When this happens, the
best we can do is approximate the colors we can't match with the ones we
have available.  The renderer does this with a color closeness lookup table
to avoid having to compute the color distances in realtime.

The table is basically a list of what colors are most similar to any
particular c64 color, ordered from the most similar to the least.  Let's say
we want to plot the color white (1) but none of our bitpairs for the current
cell can represent it.  We have to look up white in our table and get the
first color closest to it.  If that color isn't available either, we will
fetch the next closest color from the table and try again until we find a
match.

It is worth mentioning that due to the memory limitations of the C64 the
bitmaps are stored in memory in 'packed' form while rendering.  If you go
back to the brief description of FLI mode, you'll remember that the leftmost
3 char columns were lost due to VIC chip limitations.  When rendering, the
bitmaps are stored contiguously in memory, without these 3 char block gaps
in order to have enough room to render the entire image.  After the entire
image is rendered, it is 'unwound' by a small routine and then finally
displayed in its full IFLI glory.  In the stock version of the renderer you
can see this 'unwinding' take place right before the image is displayed. 
Also, the colorful blocks on the screen while the image is being rendered
are the actual buffers where the Floyd-Steinberg dithering is taking place
(note that all of this is invisible in the SCPU version due to the memory
mirroring optimizations provided by the hardware).

Well, that basically wraps up this article.  I hope that it will give the
reader an idea of the enormous amount of calculations that have to take place
in order to be able to convert the images to a format suitable for viewing
on our beloved C64.  I also hope it explains the basic principles behind the
rendering of these images, and why it takes so long for a stock system to
display them.
.......
....
..
.                                    - fin -