Wednesday, 21 October 2009

Atari Jaguar Homebrew - What's this "Lay off the 68k" and "GPU in Main" Malarkey? (TECHNICAL)

One of the bugbears in the community recently has been based around these two phrases. Now I’m involved in these too, though perhaps not as vocally as a particular other advocate (Mr Scavone I'm looking at you). So what is my position?

Lay off the 68000

The idea of not using the 68k is simple: it’s slow in comparison to the RISC chips, it uses only a quarter of the width of the bus, and it’s a pain.

Is this true?

Yes. I found that by stopping the 68k instead of leaving it in a DIV loop (to minimise BUS accesses) I was able to gain about 5-10% performance for my engines running on the RISC processors.

What did Atari have to say about it?

Well in a memo to Faran Thomason, Leonard Tramiel said “For example, the interleaving of GPU and 68k code has never, in our experience, gained any performance. The best thing that can be done with the 68k for overall system performance is to execute a halt instruction.”

Now for sure Atari weren’t right about everything, as we’ll see when we come to the GPU in main section later on, but in this case I agree with Leonard who was perhaps the most technically knowledgeable of the Tramiel clan.

Does this matter?

Well that completely depends on the kind of game that’s being coded. For a 2D shooter, platformer, puzzle game or ST conversion, for example, then no: there’s plenty of power in the Jag, so you can forget about this optimisation completely and never feel any pain.

If it’s a heavily processor-intensive 3D application then this kind of improvement is worth considering, and I would highly recommend that in those circumstances a STOP #$2000 is a very useful command to issue.
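
For readers unfamiliar with the operand: on the 68000, STOP loads its immediate value into the status register and then halts the processor until an interrupt, trace exception or reset arrives. The small sketch below is Python for illustration only (it is not Jaguar code, and `decode_sr` is a hypothetical helper name). It decodes $2000 using the standard 68000 status register layout, showing that the value keeps the CPU in supervisor mode with the interrupt mask at 0, so any interrupt level can wake it again.

```python
def decode_sr(sr):
    """Split a 16-bit 68000 status register value into its main fields."""
    return {
        "trace": bool(sr & 0x8000),          # bit 15: trace mode
        "supervisor": bool(sr & 0x2000),     # bit 13: supervisor state
        "interrupt_mask": (sr >> 8) & 0x7,   # bits 10-8: interrupt priority mask
        "condition_codes": sr & 0x1F,        # bits 4-0: X, N, Z, V, C flags
    }

# The operand used with STOP in the text above.
stop_operand = decode_sr(0x2000)
print(stop_operand)
```

Whether a given game still needs the 68k to service interrupts after the STOP depends on how its interrupt handling is set up; that part is not covered here.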

GPU in Main

What’s this all about?

There is a bug which prevents JUMP (jump absolute) and JR (jump relative - a short distance) commands working properly within the Jaguar's Main RAM. Atari knew about this and said that it essentially restricts the Jaguar RISC chips to only running programs within their own very small local memory.

Now back in 2004, when I was working on my voxel engine, I was coding it on the GPU, then I switched it to the DSP and then back again; this happened several times as I tried out different ideas. During one of these experiments I noticed data was not changing in RAM where I expected it to... when I looked I found I had a program resident on the DSP which was running on the GPU – that is, the GPU program was running in external RAM. It was therefore easy to transfer this program to Main RAM and run it there also.

Addition or removal of instructions caused problems so I reasoned that this was to do with the alignment of the jumps and the jump to locations. This proved to work for long alignment of JUMP from and to locations and I was thus able to JUMP reasonably reliably… jumping between Main RAM and Local proved slightly trickier until I introduced an extra pipeline/pre-fetch clear with the addition of a MOVEI.

Now my early code (other than the voxel engine, which had worked essentially by chance) proved very awkward to manage and did not ENTIRELY follow the rules I had set out, though by trial and error I was able to word align some jumps to gain reliable execution.

Steve Scavone (Gorf) of 3DSSS in the meantime was able to (completely independently) correctly identify exactly when jump to and from locations should be word or long aligned – something which would immensely reduce the amount of trial and error required in my own code.

Steve Scavone and I therefore formulated a set of rules for the use of GPU code running in Main RAM.

RISC in Main RAM rules:
  1. JUMP from locations must be LONG aligned (addresses ending in 0, 4, 8 or C in hexadecimal). Sometimes, for jumps between Local and Main, PHRASE alignment seems necessary (addresses ending in 0 or 8 in hexadecimal). Recent studies have shown some cases where the JUMP from address in Local has needed to be word aligned; the jump to address was always LONG aligned. There is a further point: if the jump is called very frequently, sometimes even the alignment and MOVEI seem not to help, so one needs to try to keep the jumps a little further apart in time.
  2. In order to help clear the pipeline/pre-fetch for jumps between Main RAM and Local RAM on the RISC chip, it is advisable to immediately precede the JUMP with a MOVEI. (Recent work has indicated that it is sometimes beneficial if the register loaded by the MOVEI is NOT the register used in the JUMP.)
  3. JR from locations seem to be free
  4. External page jumps must be LONG aligned
  5. Internal page jumps must be word offset from long aligned (addresses ending in 2, 6, A or E in hexadecimal)
  6. Two NOPs should come after a JUMP or a JR instruction (experimentation with a single operand instruction instead of the first NOP is possible).
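
The alignment terms used in these rules can be checked mechanically. The sketch below is Python for illustration only (not Jaguar assembly); `classify_alignment` and `is_long_aligned` are hypothetical helper names and the addresses are made-up examples. It only classifies a byte address; which class a particular jump needs is still decided by the rules above.

```python
def classify_alignment(addr):
    """Return the strictest alignment class of a byte address."""
    if addr % 8 == 0:
        return "phrase"   # ends in 0 or 8 (hex); also long aligned
    if addr % 4 == 0:
        return "long"     # ends in 4 or C (hex)
    if addr % 2 == 0:
        return "word"     # ends in 2, 6, A or E: word offset from long aligned
    return "odd"          # not a valid instruction address


def is_long_aligned(addr):
    """True for addresses ending in 0, 4, 8 or C in hexadecimal."""
    return addr % 4 == 0


# Made-up example addresses, purely to show the classification.
for a in (0x4000, 0x4004, 0x4002, 0x400E):
    print(hex(a), classify_alignment(a), is_long_aligned(a))
```

A check like this could be run over the addresses in an assembler listing to flag jumps worth a closer look, which may cut down some of the trial and error.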

Now the question of speed has arisen in the past: is the GPU running from main slower than running in Local and/or slower than the 68k?

In my own coding I've found the speed of GPU running from main to be highly dependent on load on the main BUS (eg. from the Blitter) and also upon the occurrence of page misses caused by the GPU main code flipping from load locations back to its own code pointer location.

Speeds have typically been in the range of 20-90% of the speed of the same code running from GPU local RAM (by which I mean from 5 times slower to almost the same speed). It would therefore be foolish, for example, to put a tight, commonly called routine in main RAM; it's far better called in local, particularly if it is required to perform LOAD or STORE instructions.

Now let's take a look at the 20% speed. That's a pretty large cut in speed... a LOT slower than from local, but faster than the 68k?

Let's look at it from a maths point of view. I believe the speed of the Atari ST 68k at 8MHz was reported as being very roughly 1MIPS. Therefore at 13.3MHz this should be about 1.65MIPS. In theory the GPU can reach 26.6MIPS; in practice this tends to be more like 17MIPS, in other words 10x the speed of the 68k. Even if we run at 20% (about 3.4MIPS) it's still twice as fast as the 68k, and that's not even taking into account the effects on the BUS.
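
The arithmetic above can be written out as a short calculation. This is a Python sketch for illustration only; the constant and function names are hypothetical, and the figures are the rough ones quoted in the paragraph, not measurements.

```python
# Rough figures quoted in the text above.
ST_CLOCK_MHZ = 8.0          # Atari ST 68k clock
ST_MIPS = 1.0               # very rough reported speed at that clock
JAG_68K_CLOCK_MHZ = 13.3    # Jaguar 68k clock
GPU_PRACTICAL_MIPS = 17.0   # practical GPU figure (26.6 is the theoretical peak)

# Scale the ST figure linearly with clock speed.
jag_68k_mips = ST_MIPS * JAG_68K_CLOCK_MHZ / ST_CLOCK_MHZ


def gpu_main_vs_68k(fraction_of_local_speed):
    """How many times faster than the 68k the GPU is at a given fraction of local speed."""
    return GPU_PRACTICAL_MIPS * fraction_of_local_speed / jag_68k_mips


print(f"68k estimate: {jag_68k_mips:.2f} MIPS")
for fraction in (0.2, 0.9):
    print(f"GPU at {fraction:.0%} of local speed: {gpu_main_vs_68k(fraction):.1f}x the 68k")
```

At the 20% end of the range this comes out at roughly twice the 68k estimate, which is the figure used in the paragraph above. MIPS is a crude yardstick, so this is best read as a ballpark rather than a benchmark.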

Is it worth running from main instead of paging code to/from the GPU local?

Ummmmmmm.... Sometimes yes, sometimes no. It's a hard question to answer and depends on a great many factors, including the BUS load and the code in question.

Does any of this matter?

Of course these methods are not vital in order to write good Jaguar games - just useful if one wishes to try to push the hardware to its limits.

In the current project it was plain that we would be pushing the Jaguar quite hard - to that end these and other ideas (such as the Blitter interrupt Stack idea) were adopted very early on in the process. We'd like to think they were worth it, and certainly performance enhancements were clear.

All the Best,
Joe (Atari Owl)


  1. Although not a coder, I always wanted to code Jag games. This information is great. I consider Gorf and you master coders for the Jaguar. I always wanted a 3DO Star Control 2 type game for the Jaguar. I know the Jag could do it, but what would the advantages and disadvantages be compared to this classic game? How would you go about using the Jaguar's power in an Asteroids type game?
    Thanks Joeyb.

  2. Atari Owl said...
    Hi Joeyb

    [This is an improved version of my prior comment]

    You know I've never played a Star Control game before - I'll have a look at what it is.

    Would you be thinking of a 2D Asteroids or 3D like DarXide?

    If it's 2D then a lot of the optimising techniques are less necessary and the question of space becomes more important.

    A strong argument can be made to leave most of the game code on the 68k (keeping many of the moving sprites as objects on the OP rather than blitting them to a screen buffer) and then keep the DSP for sound and the GPU for decompressing graphics, in order to cram as much graphic detail as possible into the smallest amount of memory.

  3. Hi Owl,

    Thanks for writing something I've wanted to read publicly for a long time - a reasoned, balanced and informative rundown of the GPU in main techniques. I'm glad you got around to doing this but I'm especially pleased with the way in which you've done it - not only to outline how it's done & what it achieves, but also to point out that it is not as black & white as either being better or not.

    It does add a level of complexity to proceedings but the benefits in the speed of code execution could make that worthwhile given a suitable application & a programmer of sufficient talent.

    Therefore I suppose it's safe to say GPU in main is no 'quick fix' for poor execution speed of a routine and that there is no replacement for coding ability. Given the right circumstances though, you have proven it can be a benefit to execution speed.

    Anyway, big thumb for that post, just sorry it took so long for me to find & then to ramble on and on in my reply ;)

    Just one point I'd like to raise though (& I'm not being picky, I just want to see if what I think I remember from my uni days in systems architecture class some 18 years ago is still intact or if I might be suffering bitrot). It is regarding the MIPS ratings of the 68k in the Jaguar compared with the GPU. I'm not sure it's a great way to compare processors even when the architecture is similar, but using MIPS to compare CISC vs RISC will give fairly meaningless numbers, will it not? It's sort of like comparing cm & inches - you need more of the former to get to the same place as the latter :) I think someone said that using MIPS ratings with CISC vs RISC processors is a 'Meaningless Indicator of Processor Speed' ;) I don't doubt the GPU totally owns the 68k though as I've seen the difference first hand, yet there are a few occasions where even the slower-clocked 68k can perform tasks even faster than the GPU using SPM.

    Anyway, thanks again for a great read.

  4. Hello There

    Many Thanks for the kind words

    Regarding the MIPS comparison ... ummmm ... yes you have a point given that in many cases (by definition) the CISC has far more capable commands than those on the RISC.

    I'd say though that, based on my own experience at least, I've not had a situation where the GPU running from Main RAM was slower than the 68k for a given task, but that on occasions the difference HAS been somewhat marginal - so it's entirely possible that a case could be found where the 68k was faster.

    In addition, the extra work required to run the GPU from main could, in some cases, well be deemed not worth the effort.

  5. Never once did I ever claim that the GPU main RAM code was the answer to all the Jaguar's problems. It is a more efficient way to gain more power from the Jaguar and nothing more.

    I do love the way certain members of certain cliques still obsess (and outright lie) about how I viewed and portrayed the workaround as the savior of the Jaguar.

    There are many instances where, if used improperly, even the workaround (like any other place one can write a piece of code on any given processor) can prove to be a detriment. My whole point was that if you expect to move more polygons and do a much more efficient job of coding the Jaguar, then the workaround goes a long way to helping this process.

    Anyone else claiming I said otherwise is outright lying and simply enjoying attacking me on forums where I can't (and no longer care to BTW) defend what I said, but this should surprise no one as these certain individuals depend on their ability to get away with stuff no one else would get away with.

    Straight up cowardice if you want to know the truth. I guess when you know you can't defend your attacks when the person you are attacking CAN answer you back, the best thing to do is to conjure up a ton of lies and do it on a forum where the attacked does not have the ability (or care even) to answer back his critics.

  6. I have no wish for this blog to become part of the battleground of this benighted topic. I do not want a battle waged here because it can't be in other fora.

    Everyone has had their say now.

    Please, no more.