Bank-Switched (Double-Buffer) Scrolling on the Commodore 64

First published on: 6th September 2019
Updated on: 10th November 2020

Introduction

Note: This article assumes the reader has some measure of technical knowledge concerning the Commodore 64's hardware scrolling feature. You may also require the focus of a man with a gun held to his head ;-)

As you may know by now, Parallaxian began as something of a technology demonstrator rather than the starting point for a playable game.

Accordingly, when I revived the project in 2018, it became clear that almost none of the original codebase was going to survive the game's reboot.

Among its many shortcomings, the legacy code often hogged clock cycle time, having been programmed with a just-make-it-happen mentality and zero regard for speed of execution.

However, a handful of bytes survived the cull... we're talking something like 1% of the original code as of midsummer 2019, chief among which was the scroll code for the most distant layer of the parallaxed landscape: the mountains, aka "layer 1".

The layer 1 scroll code was compact, meaning it had a tiny RAM footprint, but as is so often the case when dealing with interrupts, RAM-efficiency tends to equate to slothfulness of execution which, for an increasingly task-laden raster interrupt, is pure Kryptonite.

The first image below reveals the extent of the issue, with the solid pale grey bar at the bottom of the screen indicating the amount of raster time (itself a direct representation of the number of clock cycles used in execution) consumed by the old layer 1 scroller.

That's close to 40 raster lines, a huge outlay in execution time just to scroll the distant mountains and handle their map feed... If that could be reduced, it would free up raster time for new things, such as a music player perhaps, or more sophisticated sprite handling.

Clearly, something had to be done.

Raster time consumed by inefficient old layer 1 scroll code

Old Code, Geriatric Speed

First of all, examining the old layer 1 scroll code, it was based on the "redraw" scroll principle (aka "moving window method") rather than the copy-and-feed method used in layers 2, 3 and 4, meaning its character / "software" scroll and the subsequent feed at the far edges of the screen (depending on scroll direction) were fused together into one operation and not the two operations used in the other layers.

This suited the fact that the mountains were, very RAM inefficiently, not mapped as blocks / tiles, but in 6 linear rows consuming 256 bytes apiece... I have no idea why the younger me opted for that configuration, other than it may relate to the fact that this layer's scroller was derived from that used by Phil Nicholson - with his written consent - in his game Deadline (the only time I can recall using someone else's scroll code).

(Just to reiterate, the other parallax layers use the copy-and-feed method of scrolling, as per the diagram below).

The second major point of note was that the layer 1 scroller used RAM-efficient rolled loops in its redraw process, unfortunately pouring glue into the interrupt's task-handling, so my starting point in trimming down the raster time expenditure for this scroller was to experiment with replacing all rolled loops with unrolled code, that is, "speed code", albeit at the expense of losing the RAM-efficiency of the old code.

At this point it would be instructive to pause and consider the implications of using speed code as opposed to rolled loops by looking at the actual code used to perform the layer 1 software scroll - in this case, from right to left - in the 1995 original; figures in square brackets are CPU clock cycles, and * = an extra cycle where a RAM page boundary is crossed:

	LDX $97	[3] ($97 holds map pos.)
	LDY #$00	[2]
LOOP1	LDA $3900,X	[4*] ($3900-$39FF: Row 1 map)
	STA $40A0,Y	[5]
	LDA $3A00,X	[4*]
	STA $40C8,Y	[5]
	LDA $3B00,X	[4*]
	STA $40F0,Y	[5]
	LDA $3C00,X	[4*]
	STA $4118,Y	[5]
	LDA $3D00,X	[4*]
	STA $4140,Y	[5]
	LDA $3E00,X	[4*]
	STA $4168,Y	[5]
	INX	[2]
	INY	[2]
	CPY #$27	[2]
	BNE LOOP1	[2* if branch is taken]

Aside from revealing a penchant for writing 6502 in CAPITALS (in contrast to my lower case web development habit), we can see that this tiny section of code performs the software scroll for all 6 rows of layer 1's mountains using 3 + 2 + ((4 + 5 + 4 + 5 + 4 + 5 + 4 + 5 + 4 + 5 + 4 + 5 + 2 + 2 + 2) x 38) + (4 + 5 + 4 + 5 + 4 + 5 + 4 + 5 + 4 + 5 + 4 + 5 + 2 + 2) clock cycles

= 5 + ( 38 x (60)) + 58 clock cycles

= 5 + 2280 + 58 clock cycles

= 2343 clock cycles in the best case scenario, where there are no page boundary traverses in the code.

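If we divide that figure by 63, we'll get the number of raster lines consumed by this cycle-guzzling scroller: just over 37 raster lines!!! (which is horrendously wasteful of precious interrupt time). Strictly speaking, a full count that also includes the BNE (3 cycles each time the branch is taken, 2 on the final pass) and the final pass's CPY comes to 2461 clock cycles, or about 39 raster lines.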

Unrolling the Loop

Unrolling the scroll code's loop into speed code makes it ultra bloaty in RAM, but the upside is that it executes faster; that would give us something nice and long-winded like this:

COL01_TO_00	LDA ROWA+01	[4]
	STA ROWA+00	[4]
	LDA ROWB+01	[4]
	STA ROWB+00	[4]
	LDA ROWC+01	[4]
	STA ROWC+00	[4]
	LDA ROWD+01	[4]
	STA ROWD+00	[4]
	LDA ROWE+01	[4]
	STA ROWE+00	[4]
	LDA ROWF+01	[4]
	STA ROWF+00	[4]
COL02_TO_01	LDA ROWA+02	[4]
	STA ROWA+01	[4]
	LDA ROWB+02	[4]
	STA ROWB+01	[4]
	LDA ROWC+02	[4]
	STA ROWC+01	[4]
	LDA ROWD+02	[4]
	STA ROWD+01	[4]
	LDA ROWE+02	[4]
	STA ROWE+01	[4]
	LDA ROWF+02	[4]
	STA ROWF+01	[4]
	etc. up to Column 39...
	... finishing with the map feed...
	LDX $97	[3] ($97 holds map pos.)
	LDA L1MAPA,X	[4*]
	STA ROWA+39	[4]
	LDA L1MAPB,X	[4*]
	STA ROWB+39	[4]
	LDA L1MAPC,X	[4*]
	STA ROWC+39	[4]
	LDA L1MAPD,X	[4*]
	STA ROWD+39	[4]
	LDA L1MAPE,X	[4*]
	STA ROWE+39	[4]
	LDA L1MAPF,X	[4*]
	STA ROWF+39	[4]

Suddenly, it's a comparatively massive bite taken out of RAM to perform the layer 1 scroll, but... the number of clock cycles consumed is reduced (in best case, i.e. with no page boundary traverses on the map feed section) to:

(39 x (4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4)) + (4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 3) = 1872 + 51 = 1923 clock cycles.

Dividing that by 63, we get the new raster line consumption for the scroller... 30.5 (approx), which means the updated, bloaty speed code has shaved almost 7 lines of raster time off the layer 1 software scroll process, i.e., an 18% saving on execution time.

(I think it looks more impressive if we call that "approximately 20%").

But it's still too much raster time for a background effect that plays only a supporting role to the gaming experience - albeit an indispensable role - so a better way should be found, within the constraints of maintaining the game's other critical screen display features of the open vertical borders, the vertical parallax effect, the sprite plexors and parallax scrolling on the other landscape layers.

Spreading the Char Scroll Burden

You may have heard of coders using a method called Double Buffering to reduce execution time in software scrolling (where software scrolling is defined as the task of shifting rows or columns of screen data by one charspace in the scroll direction).

The principle is solid: allocate two screen banks to the play area cognisant of the fact that only one can be switched on at any given moment and perform the clock cycle-hungry software scroll on the switched-off screen bank while the pixel perfect hardware scroll is shifting the display fed by the switched-on screen bank smoothly in the required direction; then, when it's time for the software scroll to kick in, we just reset the hardware scroll position and swap screen banks to reveal the completely pre-scrolled char data.
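As a minimal sketch of that swap (not the game's actual code), assume the two screens sit at $4000 and $4400 in the VIC-II bank, so that bit 4 of $D018 selects between them, and that a hypothetical zero-page byte called XPOS holds the current hardware scroll position. Scrolling left by one pixel then boils down to something like this:

```asm
VIC_CTRL2	= $D016	; bits 0-2 = horizontal fine scroll
VIC_MEMPTR	= $D018	; bits 4-7 = screen address in 1K steps
XPOS	= $FB	; hypothetical: hardware scroll position, 0-7

SCROLL_LEFT	DEC XPOS	; move one pixel left
	BPL SET_XSCROLL	; still 0-7, so no swap yet
	LDA #$07	; wrapped below 0: restart the fine scroll...
	STA XPOS
	LDA VIC_MEMPTR
	EOR #$10	; ...and flip bit 4: $4000 <-> $4400
	STA VIC_MEMPTR
SET_XSCROLL	LDA VIC_CTRL2
	AND #$F8	; keep the mode bits, clear the old scroll
	ORA XPOS
	STA VIC_CTRL2
	RTS
```

In a split-screen game the two register writes would more likely live inside the raster handler for the layer concerned than in a standalone routine like this one.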

The downside is we lose 1K of RAM from the 16K graphics bank being used for the game, as that's how much memory a 40 x 25 char screen plus its sprite pointers consumes; and speaking of sprite pointers, we need to alternate them as well with each bank-switching / hardware scroll reset event to maintain consistency in our sprite rendering, but that's the least taxing part of the whole deal.
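The sprite pointers occupy the last 8 bytes of each 1K screen, so one simple way to keep them consistent (a sketch only, assuming screens at $4000 and $4400, and not necessarily how Parallaxian does it) is to copy them from the visible screen to the hidden one just before the swap:

```asm
PTRS_0	= $43F8	; sprite pointers of the screen at $4000
PTRS_1	= $47F8	; sprite pointers of the screen at $4400

SYNC_PTRS	LDX #$07	; eight pointers, one per hardware sprite
SYNC_LOOP	LDA PTRS_0,X
	STA PTRS_1,X
	DEX
	BPL SYNC_LOOP
	RTS
```

This version only copies from the first screen to the second; a matching loop in the opposite direction would be needed for the other half of the cycle.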

More difficult by far is distributing the software scroll efficiently across the switched-off bank in a way that maintains the landscape's integrity when the plane performs a u-turn forcing the various parallaxed layers' scrollers to change direction.

And even harder still is joining up the wraparound in the planet's landscape under this distributed load method; this actually took me weeks to perfect*, largely because I found it awkward to visualise the full panoply of synchronicity in my weary mind... at one stage, I even made a physical model of a landscape loop by taping the ends of a strip of paper together and feeding it through two slots cut in a sheet, not that it helped very much!

I think it's fair to affirm, therefore, that bi-directionally scrolling games are an order of magnitude harder to develop than one-way scrolling games, as the relatively small number of the former in comparison to the latter would seem to confirm! (* = John Rowlands also took weeks to get Mayhem in Monsterland's scroller working as required, so I don't feel so bad admitting my own slowness with this).

That digression aside, the layer 1 scroller spreads the software scroll burden across the "middle six" of the concomitant hardware scroll positions on the switched-on screen bank, as follows:

LEFT1

Hardware scroll position 7 = 2 column map feed (necessary because of the "leapfrog" system described below), plus toggle screen bank
Hardware scroll position 6 = 6 columns scroll
Hardware scroll position 5 = 7 columns scroll
Hardware scroll position 4 = 7 columns scroll
Hardware scroll position 3 = 7 columns scroll
Hardware scroll position 2 = 7 columns scroll
Hardware scroll position 1 = 4 columns scroll
Hardware scroll position 0 = do nothing

RIGHT1

Hardware scroll position 7 = do nothing
Hardware scroll position 6 = 6 columns scroll
Hardware scroll position 5 = 7 columns scroll
Hardware scroll position 4 = 7 columns scroll
Hardware scroll position 3 = 7 columns scroll
Hardware scroll position 2 = 7 columns scroll
Hardware scroll position 1 = 4 columns scroll
Hardware scroll position 0 = 2 column map feed, plus toggle screen bank

To avoid complexity / foul-ups at the boundary conditions, a 6-7-7-7-7-4 pattern is used on the inner six hardware scroll positions rather than trying to evenly spread it across all 8 (5 per pixel shift), the idea being that if the plane changes direction when the hardware scroll is at, say, position 3 going left, which performs a 7 column portion of the distributed scroll on the switched-off bank, it will backtrack on itself by the same 7 columns on the corresponding position going right (i.e. position 3)... something you would never have to worry about in a one-way scrolling game.
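One way to hang a different slice of work on each hardware scroll position is a small jump table indexed by that position. The sketch below is an illustration under assumptions rather than the game's code: XPOS and VECTOR are hypothetical zero-page locations, the slice routines are empty placeholders, and the .byte directive with the < and > operators is written 64tass-style (other assemblers spell these differently). The table follows the LEFT1 list above:

```asm
XPOS	= $FB	; hypothetical: hardware scroll position, 0-7
VECTOR	= $FC	; hypothetical: two free zero-page bytes ($FC/$FD)

DO_SLICE_LEFT	LDX XPOS
	LDA SLICE_LO,X	; fetch the address of this position's routine
	STA VECTOR
	LDA SLICE_HI,X
	STA VECTOR+1
	JMP (VECTOR)	; the routine's RTS returns to our caller

; one table entry per hardware scroll position, 0 to 7
SLICE_LO	.byte <IDLE, <COLS4, <COLS7D, <COLS7C, <COLS7B, <COLS7A, <COLS6, <FEED2
SLICE_HI	.byte >IDLE, >COLS4, >COLS7D, >COLS7C, >COLS7B, >COLS7A, >COLS6, >FEED2

; placeholders: each would hold the unrolled copy for its share of columns
IDLE	RTS	; position 0: do nothing
COLS4	RTS	; position 1: 4 columns
COLS7D	RTS	; position 2: 7 columns
COLS7C	RTS	; position 3: 7 columns
COLS7B	RTS	; position 4: 7 columns
COLS7A	RTS	; position 5: 7 columns
COLS6	RTS	; position 6: 6 columns
FEED2	RTS	; position 7: 2 column map feed, then toggle screen bank
```

A second pair of tables for RIGHT1 would put the idle entry at position 7 and the feed at position 0, with each routine in between undoing the same columns as its leftward counterpart.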

Leapfrogging by 16 pixels

The other key factors with this method are (a) setting up the two screen banks so that they are offset one from the other by 8 pixels (i.e. a single column) and (b) making the actual software scrolling happen in a "leapfrog" fashion, that is, it scrolls by 2 columns (16 pixels) on each software scroll rather than the conventional 1 column so that by the time the hardware scroll is reset, we toggle screen banks to find everything nicely pre-scrolled into place without skipping a heartbeat or batting an eyelid.

Bank-Switched "Leapfrog" Scroll Method
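In speed code terms, the leapfrog simply means each char on the hidden screen jumps two columns instead of one. Here is a hedged sketch of the start and end of one row scrolling left, assuming the hidden screen is the one at $4400 (so ROWA is a made-up address for that screen's first mountain row) and reusing the map location and map position byte from the 1995 code:

```asm
MAPPOS	= $97	; map position, as in the original code
L1MAPA	= $3900	; row 1 of the map, as in the original code
ROWA	= $44A0	; assumed: first layer 1 row on the hidden screen

; first three columns of the row: each char jumps two places left
	LDA ROWA+2
	STA ROWA+0
	LDA ROWA+3
	STA ROWA+1
	LDA ROWA+4
	STA ROWA+2
; ... and so on up to ROWA+39 -> ROWA+37, then the 2 column map feed
	LDX MAPPOS
	LDA L1MAPA,X
	STA ROWA+38
	INX	; wraps from $FF to $00 along with the 256-byte map row
	LDA L1MAPA,X
	STA ROWA+39
```

The copies have to run in ascending column order so that no source char is overwritten before it has been read. How MAPPOS is advanced between the two screens is not shown here and would depend on the wraparound handling described earlier.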

By now, this may sound fiendishly complicated, but that's likely more to do with my written explanation than the technical, coding-side reality; hence the video below, which should (I hope!) make it easier to understand.

So was it all worth it in terms of raster time saved?

If we measure (as per the screen capture below) the grey bar indicating the raster time consumption at full scroll (which happens during any of the 7 column buffered-out scrolls), we see it's now 9 pixels deep at full load.

That's less than 25% of the original 37 pixel deep raster bar for scrolling layer 1, equating to a 75% saving on raster time or, putting it yet another way, the bank-switching / load-distributed scroller for layer 1 executes 4 times faster than the original redraw scroll routine and more than 3 times faster than the speed code version thereof, although I should point out that the new layer 1 design uses 5 rows rather than the original's 6; that said, the time-saving attributable to the new method is absolutely staggering.

UPDATE: 3rd Dec 2020: This has effectively been compressed further down to 1-2 lines of interrupt raster time consumed - see this related microblog post.

Bank-Switching the Panel Zone + Other Parallax Layers

Just as a point of interest, the panel zone also uses bank-switching to hide the fact that it is shunted up / down in accordance with the vertical parallax effect, so rather than only serving as a clock-efficient scroller, this time bank-switching is used as a clock-efficient corrective measure for the panel zone, obviating the need for a huge swathe of unrolled corrective scroll code, which was the original remedy for that side effect of the vertical parallax.

So far, this method hasn't been used with any other layers; layers 2 and 3 have no map feed, but rather simply wrap, whereas the foreground layer, layer 4, still uses speed code for its 3 char rows plus 3 colour rows (i.e. it's a full colour scroll); being a full colour scroller, it's impossible to distribute the colour RAM's software scroll across the hardware scroll positions as, unlike the screen banks, colour RAM is fixed ($D800-$DBFF = 55296-56319).

Layer 4 presents an additional issue that reveals a limitation of this method, i.e., at full speed, layer 4 scrolls at 8 pixels (1 char) per frame, leaving no hardware scroll transitions in which to perform the load-distribution... actually, this limitation applies to a lesser extent to layers 2 and 3 also, which scroll at 2 and 4 pixels per frame respectively when the plane is flying at maximum forward airspeed.

So while it's still feasible to apply the method for layers 2 and 3, it's not going to be possible for the foreground layer (i.e. layer 4) using the hardware scroll shifts.

However, layer 4 has been split into 2 parts, distributed into two separate IRST handlers... At a push, it could conceivably be spread further across handlers, but it won't yield an overall raster time saving.

NOTE: The second video below highlights the optimisations to compress and load distribute the scrolling of layers 2, 3 and 4.

Compensating for Lost Sprite Definitions

Losing that 1K of RAM from the 16K VIC-II bank equates to the loss of 16 sprite definitions, leaving the game with only 10K of readily accessible sprite designs, given that the game uses 2 charsets at any given time, consuming 4K of RAM from the designated VIC-II bank (as per the schema below).

VIC Bank with 2 Screen Buffers + 2 Charsets

But that's nothing to whine about... if anything, it accelerated the plan to optimise the handling of sprite data anyway.

Hence the little cameo of the plane on the airspace indicator no longer has two unique sprite definitions, one for facing left and one for right; just one now suffices and it is mirrored in real time by 3-4 lines of code, a feat made possible because of its simple design (in fact, even that could be optimised further via an on-the-fly roll-in from outside the VIC bank during the plexing process).

The same mirroring principle was applied to the afterburner flames and the laser flak from the chrome domes, which are flipped horizontally via a real-time definition swap from outside the VIC-II bank, which is a thousand times easier, more RAM-efficient and, best of all, quicker, than using an algorithm to flip everything over.
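A definition swap of this kind can be as small as a 63-byte copy loop. The following is a sketch under assumptions, not the game's routine: FACING is a hypothetical direction flag, FLAME_R and FLAME_L are hypothetical addresses holding the two pre-drawn versions outside the VIC-II bank, and SLOT is a hypothetical sprite slot inside the $4000-$7FFF bank.

```asm
FACING	= $FE	; hypothetical: 0 = facing right, otherwise left
FLAME_R	= $8000	; hypothetical: right-facing definition, outside the bank
FLAME_L	= $8040	; hypothetical: its pre-mirrored twin
SLOT	= $7FC0	; hypothetical: sprite slot in the bank (pointer value $FF)

SWAP_FLAME	LDX #$3E	; 63 bytes per sprite, indexed 62 down to 0
	LDA FACING
	BNE LEFT_LOOP
RIGHT_LOOP	LDA FLAME_R,X
	STA SLOT,X
	DEX
	BPL RIGHT_LOOP
	RTS
LEFT_LOOP	LDA FLAME_L,X
	STA SLOT,X
	DEX
	BPL LEFT_LOOP
	RTS
```

The cost is 64 bytes of storage outside the bank for each mirrored twin, which is the trade being made against flipping the bits algorithmically.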

Sprites + Potential Raster Split Issues

Speaking of sprites, there is also the risk of an unpleasant side effect when multiple sprites are rendered in the vicinity of any screen split where a new IRST handler fires and resets the visible screen bank for the screen layer managed by said handler.

That side effect boils down to a delay in the screen bank switchover, which leaves a row of chars (or part thereof) from the wrong screen bank on display.

This has been overcome in Parallaxian using one of the NMI instances, triggered before the problematic screen split happens, to perform the bank switch nice and early; a near-future article on this website (presently being written) should explain the great utility of NMIs in split screen game environments.

However, it's not necessary to solve this issue with NMIs; an intermediate IRST handler should suffice in most conceivable cases.
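The reason an early switch works is that the VIC-II only fetches screen codes on the first raster line of each char row (the "bad line"), so a new screen address written after one bad line does not show until the next. A minimal sketch of such an intermediate handler follows, with several assumptions: the KERNAL IRQ vector at $0314 is in use, both zones share a charset, the raster lines are made-up values below 256, and NEXT_D018 is a hypothetical byte prepared by the scroller.

```asm
SPLIT_LINE	= $72	; hypothetical raster line of the layer's split
EARLY_LINE	= SPLIT_LINE-4	; hypothetical: a few lines ahead of it
NEXT_D018	= $FD	; hypothetical: $D018 value prepared by the scroller

EARLY_IRQ	LDA NEXT_D018
	STA $D018	; select the new screen before the split arrives
	LDA #<SPLIT_IRQ	; chain to the handler at the split itself
	STA $0314
	LDA #>SPLIT_IRQ
	STA $0315
	LDA #SPLIT_LINE
	STA $D012
	LDA #$01
	STA $D019	; acknowledge the raster interrupt
	JMP $EA81	; KERNAL exit: restore registers, RTI

SPLIT_IRQ	LDA #<EARLY_IRQ	; the layer's own $D016 update etc. would go first
	STA $0314
	LDA #>EARLY_IRQ
	STA $0315
	LDA #EARLY_LINE
	STA $D012
	LDA #$01
	STA $D019
	JMP $EA81
```

With the screen switch already done, sprite-induced delays at the split itself can no longer leave chars from the wrong screen bank on display.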

No Source Code Reveal?

You might now reasonably be asking where the source code for this is, considering that I have shown code examples for the older methods.

Well, the quick answer is that the principles have been outlined, so a competent coder will know how to implement it on that basis, or might even know anyway from personal experience of doing the same... but there is also the important point that I just do not want to reveal the "tech" for this before the game is released... it's convoluted, lengthy, uses a small handful of "illegal" op codes (for obsessive extra cycle-saving) and a little part of it is executed via the NMI, so it would be difficult to present that in a way that anyone could just use "off-the-shelf".

And sure, I could show a stripped down version, but still, I would hate to deprive the reverse-engineers of their hobby...

Video 1: Bank-Switched Scrolling on Layer 1

To conclude this article, you can check out the videos below in which I ramble ex tempore through the bank-switching process as described above and attempt to provide a kind of visualisation of it by means of a deliberately out-of-sync version of the scroller, as well as showing the finished effect... I'm not convinced it makes anything clearer, but try to enjoy it anyway!

Video 2: Compressing Scrolling on Layers 2, 3 and 4

____

PS - If you value my work and want to support me, a small donation via PayPal would be nice (and thanks if you do!)