Development
On I-War we had to have separate passes (3D transform, clipping and drawing) because we could not run the code from outside the GPU. It was a pain, but the games were done.
Source: Andrew Seed
Memo
These documents are converted from the contents of Eric Smith's development hard disk, originally written in AtariWorks STW format.

File dated 31.8.1995
To: John Skruch, Ted Tahquechi
From: Eric Smith
Subj: Some Notes on Netwar
Here are some ideas on how Netwar might be sped up. These are first impressions, and I haven't had time to go through all the code, so obviously these ideas might not help. Take everything with a big grain of salt. My overall impression is that the biggest room for improvement is in the polygon transformation code rather than the polygon drawing code proper. Have the programmers tried to profile the code at all, to see how much time is being spent in the various pieces of the program? That would probably help them to find the "hot spots" where careful hand-tuning can give a big performance boost.
Global Optimizations
--------------------
(1) Stopping the 68000 usually improves the performance of GPU code, especially for bus intensive things like rendering. This can be accomplished by putting a stop #$2000 in the inner loop of the "waitgpu" function, and changing the GPU code to send an interrupt to the 68000 just after clearing "semaphore" to indicate that the GPU is finished.
(2) It looks like surface normals are re-calculated every frame. Obviously if these were pre-calculated and stored with the object data it would save a lot of time. I confess to being a bit puzzled by the "view vector" calculation and visibility check. This looks like it's a backface removal operation; if so, then using a constant view vector (the direction vector for the viewer) would work, and would save the re-calculation of the view vector for each polygon.
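The idea in (2) can be sketched in C as follows. This is a minimal illustration, not code from the Netwar source: the `Vec3` type and the function names are hypothetical, and it assumes the stored normals and the view direction are expressed in the same coordinate space. With a perspective projection, a single constant view vector is an approximation, since the true direction from the eye to a polygon varies across the screen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-point vector type; not taken from the Netwar source. */
typedef struct { int32_t x, y, z; } Vec3;

/* Dot product, widened to 64 bits so the sum cannot overflow. */
int64_t dot3(Vec3 a, Vec3 b) {
    return (int64_t)a.x * b.x + (int64_t)a.y * b.y + (int64_t)a.z * b.z;
}

/* A polygon faces the viewer when its normal points against the
 * direction the viewer is looking along. */
int is_front_facing(Vec3 normal, Vec3 view_dir) {
    return dot3(normal, view_dir) < 0;
}

/* Counts visible polygons using normals that were computed once and
 * stored with the object, and one view vector shared by every polygon. */
int count_visible(const Vec3 *normals, int count, Vec3 view_dir) {
    assert(normals != NULL || count == 0);
    int visible = 0;
    for (int i = 0; i < count; i++) {
        visible += is_front_facing(normals[i], view_dir);
    }
    return visible;
}
```

The per-polygon cost drops to one dot product and a sign test; nothing is recomputed per frame except the single view vector.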
(3) For small polygons (such as those that appear in enemies), pixel mode rendering can actually be faster than phrase mode, since much less setup is required per line. It might be worth doing a pixel mode version of shaded.s and using it for small objects (i.e. pretty much everything except for arena walls and floors).
"Micro" Optimizations on the Code
---------------------------------
(4) Quite a bit of code in shapes.s seems to be concerned with normalizing the vectors. This can be done without branches via the "normi" instruction. For example, if (r14, r15, r16) is a vector to be normalized to 14 bits, and r0 and r1 are scratch registers, the normalization code goes:

    move    r14,r0      ; find absolute values of x,y,z
    move    r15,r1      ; and or them together
    abs     r0          ; so that we can find how big they
    abs     r1          ; can possibly be...
    or      r0,r1
    move    r16,r0
    abs     r0
    or      r0,r1       ; the highest bit of r1 gives us a rough
                        ; estimate of the vector's magnitude
    normi   r1,r1       ; find how much to shift r1 to make it
                        ; 24 bit
    addqt   #10,r1      ; now normalize it to 14 bits
    sha     r1,r14
    sha     r1,r15
    sha     r1,r16
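The arithmetic behind (4) can be modelled in C roughly as below. This is a sketch of the idea only (OR the magnitudes together, find the highest set bit, apply one common power-of-two shift); it does not claim to reproduce the exact result of the "normi" instruction, and all function names are hypothetical. The loop in `highest_bit` stands in for what the hardware does in a single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Magnitude as an unsigned value; also safe for INT32_MIN. */
uint32_t magnitude(int32_t v) {
    return v < 0 ? 0u - (uint32_t)v : (uint32_t)v;
}

/* Index (0..31) of the highest set bit; m must be nonzero. */
int highest_bit(uint32_t m) {
    assert(m != 0);
    int bit = 0;
    while ((m >>= 1) != 0) {
        bit++;
    }
    return bit;
}

/* Scale v by 2^shift; shift may be negative. Division rounds toward
 * zero, whereas a hardware arithmetic shift rounds toward minus infinity. */
int32_t scale_pow2(int32_t v, int shift) {
    if (shift >= 0) {
        return (int32_t)((int64_t)v * ((int64_t)1 << shift));
    }
    return (int32_t)((int64_t)v / ((int64_t)1 << -shift));
}

/* Shift all three components by the same amount so that the OR of
 * their magnitudes has its top bit at bit 13 (a 14-bit magnitude). */
void normalize14(int32_t *x, int32_t *y, int32_t *z) {
    uint32_t combined = magnitude(*x) | magnitude(*y) | magnitude(*z);
    if (combined == 0) {
        return; /* zero vector: nothing to scale */
    }
    int shift = 13 - highest_bit(combined);
    *x = scale_pow2(*x, shift);
    *y = scale_pow2(*y, shift);
    *z = scale_pow2(*z, shift);
}
```

Because only the highest bit of the combined magnitudes is used, the result is a rough normalization, as the memo says: the largest component ends up with between 13 and 14 significant bits.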
(5) Interleaving GPU code is really important for performance. Generally the GPU interleaving in the renderer (shaded.s, textured.s) looks pretty good, but there are a few places where it could be optimized a bit more. Two consecutive stores to internal RAM can incur a wait state, so re-coding

    store   r18,(r14+15)    ;sets B_COUNT
    store   r22,(r14+14)    ;sets B_CMD
    movefa  r8,r25
    jump    t,(r25)
    nop

as:

    movefa  r8,r25
    store   r18,(r14+15)    ;sets B_COUNT
    jump    t,(r25)
    store   r22,(r14+14)    ;sets B_CMD

gains more than 3 cycles.
(6) I noticed some divide code that does:

    abs     r17
    div     r18,r17
    jr      cs,.fdu_zinc
    neg     r17
    neg     r17
    .fdu_zinc:

Changing this to:

    abs     r17
    jr      cc,.fdu_zinc
    div     r18,r17
    neg     r17
    .fdu_zinc:

will allow the divide to operate in parallel with other code in the case where r17 was positive to begin with, i.e. about half the time. Even better is to duplicate some code in both branches:

    abs     r17
    jr      cc,.fdu_zinc
    div     r18,r17
    ... stuff that doesn't depend on r17 ...
    jr      .fdu_divfinished
    neg     r17
    .fdu_zinc:
    ... stuff that doesn't depend on r17 (same as above) ...
    .fdu_divfinished:

Divides are slow, so quite a lot of useful work can be done while they're executing. The only thing to watch out for here is to make sure that two divide instructions don't overlap; in cases where they might, just put in a spurious read (like "or r17,r17").
(7) In shapes.s the interleaving isn't quite as well optimized; for example:

    movefa  r0,r19      ;scale polygon points
    movefa  r1,r20
    movefa  r2,r21
    movefa  r3,r22
    movefa  r4,r29
    sub     r19,r14
    sub     r20,r15
    sub     r21,r16
    sub     r22,r17
    imult   r29,r14
    imult   r29,r15
    imult   r29,r16
    imult   r29,r17
    sharq   #14,r14
    sharq   #14,r15
    sharq   #14,r16
    sharq   #14,r17
    add     r19,r14
    add     r20,r15
    add     r21,r16
    add     r22,r17

would run nearly twice as fast if re-coded as:

    movefa  r4,r29

    movefa  r0,r19      ;scale polygon points
    movefa  r1,r20
    sub     r19,r14
    sub     r20,r15
    imult   r29,r14
    imult   r29,r15
    sharq   #14,r14
    sharq   #14,r15
    add     r19,r14
    add     r20,r15

    movefa  r2,r21
    movefa  r3,r22
    sub     r21,r16
    sub     r22,r17
    imult   r29,r16
    imult   r29,r17
    sharq   #14,r16
    sharq   #14,r17
    add     r21,r16
    add     r22,r17

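For readers less familiar with the GPU instruction set, the arithmetic that both orderings of the fragment in (7) perform on each value can be written in C as below. This is a sketch under assumptions: it reads r19 to r22 as per-coordinate reference values and r29 as a scale factor in 2.14 fixed point (0x4000 = 1.0), which is how the listing appears to work, and the function names are hypothetical. Reordering the instructions changes only the timing, not this result.

```c
#include <assert.h>
#include <stdint.h>

/* floor(v / 2^14), written so that it does not rely on the
 * implementation-defined behaviour of >> on negative values in C. */
int32_t shift_right_14(int32_t v) {
    if (v >= 0) {
        return v >> 14;
    }
    int64_t m = -(int64_t)v;
    return -(int32_t)((m + 16383) >> 14);
}

/* Scale `value` about `base`: base + (value - base) * scale, with
 * `scale` in 2.14 fixed point. Mirrors the sub / imult / sharq / add
 * sequence; the multiply takes signed 16-bit operands. */
int32_t scale_about(int32_t value, int32_t base, int16_t scale) {
    int32_t diff = value - base;
    /* the difference must fit in 16 bits for the multiply to be exact */
    assert(diff >= INT16_MIN && diff <= INT16_MAX);
    int32_t product = diff * (int32_t)scale;
    return base + shift_right_14(product);
}
```

For example, `scale_about(300, 100, 0x2000)` halves the distance from 100 and yields 200, while a scale of 0x4000 leaves the value unchanged.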
(8) If there is room left in GPU RAM, it might be worth re-coding shaded.s to eliminate jumps (which are slow and expensive) at the cost of duplicating some code, particularly in the innermost loop.