To get a feeling for how the CPU is built, I started determining the standard cell types for the top-left corner of part 03, like this: http://au-ro-ra.net/psx/temp/03-01-a-small.png

The colors mean the following:

Green: I'm very sure it's the correct standard cell.
Yellow: it's in your list of standard cells, but without a label.
Red: it's not in your list of standard cells (or I couldn't find it).
Cyan: it's hard to make out, but I made an educated guess.
Magenta: it's too dirty to identify correctly.

I really want to help out with the project, so should I continue this way? Judging from the relatively low quality of the photographs, I would guess that automatic detection of the standard cells doesn't work adequately.

As far as I understand it, the circuit works as follows (someone please correct me if I'm wrong):

The circuit multiplies a 13-bit by a 12-bit number, both signed. Technically, the result is a 25-bit number; however, the circuit drops a) bits 0-6 and b) bit 25 (the carry into the 25th bit, to be precise). That means a 17-bit number is used internally to accumulate further products. The overall result is then once again stripped of bits 0-3 (i.e. bits 7-10 of the original 25-bit number), leaving a 13-bit number that is used as the input of the second pass (another instance of the multiplier).

I haven't gotten to it yet, but judging from the circuit schematics, at the very end 8 bits of the 17-bit result of the second pass are selected and divided by 2 (with clamping) to arrive at an 8-bit number.

Your example (256*256) would result in the "25-bit" number 65536. Then, bits 0-6 are dropped and you get 512.

If you multiply, for instance, 257*256, you get 65792 (0x10100). Now drop bits 0-6 to get 514 -- this number is used for accumulation. If it were passed to the second stage, however, an additional 4 bits would be dropped, giving 32, which is then multiplied by 12 bits from the scale matrix.
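
To make the truncation chain concrete, here is a minimal C sketch of my reading of the first pass; the function names and the exact masking are my assumptions, not taken from the circuit:

#include <stdio.h>
#include <stdint.h>

/* My reading of the first pass: multiply signed 13-bit by signed
   12-bit, drop bits 0-6 of the "25-bit" product to get the 17-bit
   accumulation term, then drop 4 more bits to form the 13-bit
   second-pass input. Assumes arithmetic right shifts for negatives. */
static int32_t first_pass(int32_t a13, int32_t b12)
{
    int32_t product = a13 * b12;   /* the full "25-bit" product     */
    return product >> 7;           /* drop bits 0-6 -> 17-bit term  */
}

int main(void)
{
    int32_t acc = first_pass(257, 256);    /* 65792 >> 7 = 514      */
    int32_t second_input = acc >> 4;       /* drop 4 more bits = 32 */
    printf("%d %d\n", acc, second_input);  /* prints "514 32"       */
    return 0;
}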

As a side note, the second pass probably uses 13 bits from the previous result and 12 bits from the scale matrix because a) they wanted to reuse the same circuit design and b) (just a guess) the additional bit was more accurate/helpful when applied to the previous result rather than to the scale matrix.

In IDCT.circ, the DFF directly below RES00 doesn't get the clock; if you fix that, the circuit should work (I only tested a few cases, though)!

The RISC core is actually an LR33300 from LSI; here are the first four chapters of its manual: http://www.sm.luth.se/csee/courses/smd/D0013E/manual/
Unfortunately, I couldn't find the rest anywhere online. However, the L64360 uses the same RISC core, and its manual describes a few of the core's registers; the mysterious register 0xfffe0130 is explained on page 14-7 of that manual. I tested a few of its bits a while back, and it is definitely the same register. For example, I could turn "load scheduling" on and off with bit 16.
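
For what it's worth, poking that register from code looks something like the following; the address and bit 16 are from the manual page above, while the macro names and the volatile-pointer idiom are just my own way of writing a memory-mapped register access, not anything official:

#include <stdint.h>

/* Hypothetical sketch: toggling the "load scheduling" bit of the
   register at 0xfffe0130 (page 14-7 of the L64360 manual). The
   macro names are mine. */
#define BIU_CONTROL     (*(volatile uint32_t *)0xfffe0130u)
#define LOAD_SCHEDULING (1u << 16)

static void set_load_scheduling(int enable)
{
    if (enable)
        BIU_CONTROL |= LOAD_SCHEDULING;   /* turn load scheduling on  */
    else
        BIU_CONTROL &= ~LOAD_SCHEDULING;  /* turn load scheduling off */
}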

I THINK I understood the multiplication algorithm used in the IDCT multiplier, and at least the bottom part of the circuit (MUX ARRAYS CONTROL and MUX ARRAYS, as well as the very bottom of the multiplication tree). I'll just write down my thoughts, so this might not be the best explanation. Also, I haven't yet thought about fixed-point or negative numbers, so some aspects might still be missing.

The algorithm leverages the fact that every product of an N-bit number A and an M-bit number B can be expressed as the sum of ceil(N/2) (possibly negative) multiples of B. For example, assume N=4 and an arbitrary M. Then every possible product can be written as the sum of N/2=2 such terms:

0 * B = 0 * B
1 * B = 1 * B
2 * B = (4 - 2) * B
3 * B = (4 - 1) * B
4 * B = 4 * B
...
13 * B = (-4 + 1) * B,   + 16 * B is assumed
...

To achieve this, the circuit looks at bit triples spaced two bits apart (i.e. they overlap by 1 bit); a -1th bit of 0 is assumed. Each 12-bit number has 12/2=6 of these triples, corresponding to the 12/2=6 14-bit outputs L0-L5 of the MUXes. Each triple encodes the factor that the input is multiplied by to form Ln:

000 := 0
001 := 1
010 := 1
011 := 2
100 := -2
101 := -1
110 := -1
111 := "-0"

MUX ARRAYS CONTROL looks at the triples and multiplies the input by 2 (shifts it left by 1) when necessary (Ln_D1,2) - that's why Ln is 14 bits wide. If all three bits of a triple are equal (000 or 111), the output is forced to all 1s, i.e. -1. Outside of MUX ARRAYS CONTROL, the 2nd bit is used to take the complement of the input (all triples with the 2nd bit = 1 have a negative factor; Ln_C1,2). This fits perfectly with the chart above. Observe that for the 2's complement there is still a +1 missing; that one sits in the multiplier tree: whenever a triple's factor is negative or zero (in which case the output is -1, all 1s), a +1 is added. This is checked by the 12-OAIs.
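
Incidentally, this looks like standard radix-4 Booth recoding to me. As a sanity check on the control logic just described, here is a small C program (naming is mine) that derives the zero/shift/complement decisions from each triple and compares the resulting factor against the chart:

#include <stdio.h>

int main(void)
{
    /* factor chart from above, indexed by the triple value */
    static const int chart[8] = { 0, 1, 1, 2, -2, -1, -1, 0 };

    for (unsigned t = 0; t < 8; t++) {
        int zero   = (t == 0) || (t == 7); /* all bits equal: 0 / "-0" */
        int shift  = (t == 3) || (t == 4); /* 011, 100: multiply by 2  */
        int negate = (t >> 2) & 1;         /* 2nd bit set: complement  */
        int factor = zero ? 0 : (negate ? -1 : 1) * (shift ? 2 : 1);
        printf("triple %u%u%u -> factor %+d (chart says %+d)\n",
               (t >> 2) & 1, (t >> 1) & 1, t & 1, factor, chart[t]);
    }
    return 0;
}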

In the end, L0-L5 are summed up to get the final result; each Ln enters the sum with its actual value (Ln << 2*n). Here's an example for our 4-bit numbers:

13 * B = 1101b * B => assume implicit -1th bit: 1101.0
divide into overlapping triples:
L0:   010 := 1 * B
L1: 110   := (-1 * B)
=> ((L0 << 0) + (L1 << 2)) = (1 - (1 << 2)) * B = (1 - 4) * B = -3 * B; +16 * B is assumed, so the result = 13 * B
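
Putting it all together, here is a self-contained C model of my reconstruction (behavioural only, not gate-level): it extracts the six overlapping triples with the implicit 0 below bit 0, forms L0-L5, weights them by 4^n and checks the sum against an ordinary signed multiply:

#include <stdio.h>
#include <stdint.h>

/* factor per triple, as in the chart: 000,001,010,011,100,101,110,111 */
static const int factor[8] = { 0, 1, 1, 2, -2, -1, -1, 0 };

/* Multiply a 12-bit two's-complement value a by b via six partial
   terms L0..L5, each weighted by 4^n. */
static int booth_multiply(int a, int b)
{
    unsigned bits = ((unsigned)a & 0xFFF) << 1; /* append the -1th bit (0) */
    int result = 0;
    for (int n = 0; n < 6; n++) {
        unsigned triple = (bits >> (2 * n)) & 7; /* bits 2n+1..2n-1 of a */
        int ln = factor[triple] * b;             /* the value of Ln      */
        result += ln * (1 << (2 * n));           /* i.e. Ln << 2*n       */
    }
    return result;
}

int main(void)
{
    /* exhaustive check over all signed 12-bit values of a */
    for (int a = -2048; a < 2048; a++) {
        if (booth_multiply(a, 123) != a * 123) {
            printf("mismatch at a = %d\n", a);
            return 1;
        }
    }
    printf("all 4096 values match\n");
    return 0;
}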

However, I still don't know how exactly the tree works (especially as there are still errors in the Logisim circuit). There aren't enough levels to allow for all the additions, so that's probably what the WS elements are for. The AND_FLAG input (as I think nocash has already mentioned) initializes the accumulation when it is 0; when it is 1, the result of the multiplication is added to the result of the multiplication from one cycle before. I think one multiplication takes 2(?) cycles, but since the circuit is pipelined, a result is output and accumulated every cycle.
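
In code, my reading of AND_FLAG amounts to something like this (names assumed):

#include <stdint.h>

/* Sketch of the accumulation as I understand it: AND_FLAG = 0 restarts
   the running sum with the current product, AND_FLAG = 1 adds the
   current product to the previous result; pipelined, this yields one
   accumulated result per cycle. */
static int32_t acc;

static int32_t accumulate(int32_t product, int and_flag)
{
    acc = and_flag ? acc + product : product;
    return acc;
}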