The Owl Project: RISC in Main

Two of the bugbears in the Jaguar community recently have been the phrases "Lay off the 68k" and "GPU in Main". I’m involved in both of these too, though perhaps not as vocally as a particular other advocate (Mr Scavone, I'm looking at you). So what is my position?

Lay off the 68000

The idea of not using the 68k is simple: it’s slow in comparison to the RISC chips, it quarters the width of the bus (it uses only 16 bits of the 64-bit bus), and it’s a pain.

Is this true?

Yes. I found that by stopping the 68k, instead of leaving it in a DIV loop (which I had used to minimise BUS accesses), I was able to gain about 5-10% performance for my engines running on the RISC processors.

What did Atari have to say about it?

Well in a memo to Faran Thomason, Leonard Tramiel said “For example, the interleaving of GPU and 68k code has never, in our experience, gained any performance. The best thing that can be done with the 68k for overall system performance is to execute a halt instruction.”

Now for sure Atari weren’t right about everything, as we’ll see when we come to the GPU in Main section later on, but in this case I agree with Leonard, who was perhaps the most technically knowledgeable of the Tramiel clan.

Does this matter?

Well, that depends entirely on the kind of game that’s being coded. For a 2D shooter, platformer, puzzle game or ST conversion, for example, then no: there’s enough power in the Jag that you can forget about this optimisation completely and never feel any pain.

If it’s a heavily processor-intensive 3D application then this kind of improvement is worth considering, and in those circumstances I would highly recommend issuing a STOP #$2000, which halts the 68k until an interrupt arrives.

GPU in Main

What’s this all about?

There is a bug which prevents JUMP (jump absolute) and JR (jump relative - a short distance) commands working properly within the Jaguar's Main RAM. Atari knew about this and said that it essentially restricts the Jaguar RISC chips to only running programs within their own very small local memory.

Now back in 2004, when I was working on my voxel engine, I was coding it on the GPU, then I switched it to the DSP and then back again. This happened several times as I tried out different ideas. During one of these experiments I noticed data was not changing in RAM where I expected it to... when I looked I found I had a program resident on the DSP which was running on the GPU – that is, the GPU program was running in external RAM. It was therefore easy to transfer this program to Main RAM and run it there also.

Addition or removal of instructions caused problems, so I reasoned that this was to do with the alignment of the jumps and of the locations they jump to. This proved to work for long alignment of the JUMP from and JUMP to locations, and I was thus able to JUMP reasonably reliably… jumping between Main RAM and Local proved slightly trickier until I introduced an extra pipeline/pre-fetch clear with the addition of a MOVEI.

Now my early code (other than the voxel engine, which had worked essentially by chance) proved very awkward to manage and did not ENTIRELY follow the rules I had set out, though by trial and error I was able to word align some jumps to gain reliable execution.

Steve Scavone (Gorf) of 3DSSS in the meantime was able to (completely independently) correctly identify exactly when jump to and from locations should be word or long aligned – something which would immensely reduce the amount of trial and error required in my own code.

Steve Scavone and I therefore formulated a set of rules for the use of GPU code running in Main RAM.

RISC in Main RAM rules:

JUMP from locations must be LONG aligned (addresses ending in 0, 4, 8 or C in hexadecimal). Sometimes, for jumps between Local and Main, PHRASE alignment seems necessary (addresses ending in 0 or 8 in hexadecimal). Recent studies have also shown some cases where the JUMP from address in Local needed to be word aligned; the JUMP to address was always LONG aligned. There is a further point: if the jump is taken very frequently, sometimes even the alignment and the MOVEI seem not to help, so one needs to try to keep the jumps a little further apart in time.
In order to help clear the pipeline/pre-fetch for jumps between Main RAM and Local RAM on the RISC chip, it is advisable to immediately precede the JUMP with a MOVEI. (Recent work has indicated that it is sometimes beneficial if the register loaded by the MOVEI is NOT the register used in the JUMP.)
JR from locations seem to be free
External page jumps must be LONG aligned
Internal page jumps must be word offset from long aligned (addresses ending in 2,6,A or E in hexadecimal)
Two NOPs should come after a JUMP or a JR instruction (experimentation with a single operand instruction instead of the first NOP is possible).
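
The alignment terms used in these rules can be checked mechanically. The following is only a minimal Python sketch for classifying an address by its low bits; the function names and the example addresses are hypothetical and chosen purely for illustration, and the sketch does not decide which rule applies to a given jump (that still depends on the cases listed above).

```python
# Minimal sketch: classify an address using the alignment terms from the rules.
# Function names and example addresses are hypothetical, for illustration only.

def is_long_aligned(addr: int) -> bool:
    """LONG aligned: address ends in 0, 4, 8 or C in hexadecimal."""
    return addr % 4 == 0

def is_phrase_aligned(addr: int) -> bool:
    """PHRASE aligned: address ends in 0 or 8 in hexadecimal."""
    return addr % 8 == 0

def is_word_offset_from_long(addr: int) -> bool:
    """Word offset from long aligned: address ends in 2, 6, A or E."""
    return addr % 4 == 2

def describe_alignment(addr: int) -> str:
    """Return the strictest alignment class the address satisfies."""
    if is_phrase_aligned(addr):
        return "phrase"
    if is_long_aligned(addr):
        return "long"
    if is_word_offset_from_long(addr):
        return "word offset from long"
    return "odd"

if __name__ == "__main__":
    # Arbitrary example addresses, not taken from any real program.
    for addr in (0x4008, 0x400C, 0x400E):
        print(f"${addr:06X}: {describe_alignment(addr)}")
```

A check like this could be run over the addresses in an assembler listing to flag jumps worth inspecting before resorting to trial and error.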

Now the question of speed has arisen in the past: is the GPU running from Main slower than running in Local, and/or slower than the 68k?

In my own coding I've found the speed of GPU running from main to be highly dependent on load on the main BUS (eg. from the Blitter) and also upon the occurrence of page misses caused by the GPU main code flipping from load locations back to its own code pointer location.

Speeds have typically been in the range of 20-90% of the speed of the same code running from GPU Local RAM (by which I mean from 5 times slower to almost the same speed). It would therefore be foolish, for example, to put a tight, commonly called routine in Main RAM; it's far better called in Local, particularly if it is required to perform LOAD or STORE instructions.

Now let's take a look at the 20% speed. That's a pretty large cut in speed... a LOT slower than from Local, but faster than the 68k?

Let's look at it from a maths point of view... I believe the speed of the Atari ST 68k at 8MHz was reported as being very roughly 1 MIPS... Therefore at 13.3MHz this should be about 1.65 MIPS... In theory the GPU can reach 26.6 MIPS; in practice this tends to be more like 17 MIPS, in other words 10x the speed of the 68k. Even if we run at 20% it's still twice as fast as the 68k, and that's not even taking into account the effects on the BUS.
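
The back-of-envelope estimate above can be written out as a short Python sketch. All figures are the rough ones quoted in the text (about 1 MIPS for the 8MHz ST, about 17 MIPS for the GPU in practice, 20% as the worst case from Main); the variable names are just for illustration, and linear scaling of the 68k with clock speed is an assumption of the estimate, not a measurement.

```python
# Rough estimate only, using the approximate figures quoted in the text.
ST_CLOCK_MHZ = 8.0           # Atari ST 68k clock
ST_MIPS = 1.0                # very rough reported figure for the ST
JAG_68K_CLOCK_MHZ = 13.3     # Jaguar 68k clock
GPU_PRACTICAL_MIPS = 17.0    # practical GPU figure quoted in the text
WORST_CASE_FRACTION = 0.20   # slowest observed speed from Main vs Local

# Assume 68k throughput scales linearly with clock speed.
jag_68k_mips = ST_MIPS * JAG_68K_CLOCK_MHZ / ST_CLOCK_MHZ

# GPU running from Main RAM at the worst-case 20% of Local speed.
gpu_main_worst_mips = GPU_PRACTICAL_MIPS * WORST_CASE_FRACTION

# How many times faster than the 68k that still is.
ratio = gpu_main_worst_mips / jag_68k_mips

if __name__ == "__main__":
    print(f"68k estimate:         {jag_68k_mips:.2f} MIPS")
    print(f"GPU in Main (20%):    {gpu_main_worst_mips:.2f} MIPS")
    print(f"GPU in Main vs 68k:   {ratio:.1f}x")
```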

Is it worth running from main instead of paging code to/from the GPU local?

Ummmmmmm.... Sometimes yes, sometimes no. It's a hard question to answer and depends on a great many factors, including the BUS load and the code in question.

Does any of this matter?

Of course these methods are not vital in order to write good Jaguar games - just useful if one wishes to try to push the hardware to its limits.

In the current project it was plain that we would be pushing the Jaguar quite hard - to that end these and other ideas (such as the Blitter interrupt Stack idea) were adopted very early on in the process. We'd like to think they were worth it, and certainly performance enhancements were clear.

All the Best,
Joe (Atari Owl)


Wednesday, 21 October 2009

Atari Jaguar Homebrew - What's this "Lay off the 68k" and "GPU in Main" Malarkey? (TECHNICAL)
