The circuit multiplies a 13- by a 12-bit number, both signed. Technically, the result is a 25-bit number; however, the circuit drops a) bits 0-6 and b) bit 25 (the carry into the 25th bit, to be precise). That means that internally a 17-bit number is used to accumulate further products. The overall result, then, is once again stripped of bits 0-3 (i.e. bits 7-10 in the original 25-bit number) to leave us with a 13-bit number that is used as the input of the second pass (another instance of the multiplier).

I haven't gotten to it yet, but judging from the circuit schematics, in the very end 8 bits of 17-bit result of the second pass are selected and divided by 2 (with clamping) to arrive at an 8-bit number.

Your example (256*256) would result in the "25-bit" number 65536. Then, bits 0-6 are dropped and you get 512.

If you multiply for instance 257*256 you get 65792 (0x10100). Now drop bits 0-6 to get 514 -- this number is used for accumulation. However, if it were to be passed to the second stage, an additional 4 bits are dropped and the result is 32, which is then multiplied by 12 bits from the scale matrix.

As a side note, the second pass probably uses 13 bits from the previous result and 12 bits from the scale matrix because a) they wanted to use the same circuit design again and b) just a guess, but the additional bit was more accurate/helpful when used for the previous result, not the scale matrix.

]]>Unless... are there any details important for emulation envolved? Like rounding errors in the multiplication result?

And very dumb question: Does anybody know yet how many LSBs get stripped from the multiplication results?

Ie. what are the results for 256*256, in first and second multiplier sections?

The results should be right-shifted - so they must be something else than 65536.

NB. Some more info on the MDEC including RLE section was posted here: http://www.psxdev.net/forum/viewtopic.p … &t=551

]]>The algorithm leverages the fact that you can express every product of an N-bit number A and an M-bit number B as the sum of ceil(N/2) (possibly negative) terms of B. For example, assume that N=4 and M arbitrary. Then every possible product can be written as sum of N/2=2 terms:

```
0 * B = 0 * B
1 * B = 1 * B
2 * B = (4 - 2) * B
3 * B = (4 - 1) * B
4 * B = 4 * B
...
13 * B = (-4 + 1) * B, + 16 * B is assumed
...
```

The circuit looks at bit triples at a distance of two bits each (i.e. they overlap by 1 bit) to achieve this; a -1th bit of 0 is assumed. Each 12-bit number has 12/2=6 of these triples that correspond to the the 12/2=6 14-bit outputs L0-L5 of MUXs. Each triple represents a factor that Ln gets multiplied with:

```
000 := 0
001 := 1
010 := 1
011 := 2
100 := -2
101 := -1
110 := -1
111 := "-0"
```

MUX ARRAYS CONTROL looks at the triples and multiplies the input by 2 (shifts it by 1) if necessary (Ln_D1,2) - that's why Ln is 14 bits long. If all bits are 0 or 1, the input is treated as -1 (all 1s). Apart from MUX ARRAYS CONTROL the 2nd bit is used to get the complement of the input (all triples with the 2nd bit = 1 have a negative factor; Ln_C1,2). This fits perfectly with the chart above. Observe that for the 2's complement, there's still a +1 missing; that one is in the multiplier tree: if for a triple, the associated factor is negative or zero (because then the input = -1), +1 is added. This is checked by the 12-OAIs.

In the end, L0-L5 get summed up to get the final result; each Ln has the actual value (Ln << 2*n). Here's an example for our 4-bit numbers:

```
13 * B = 1101b * B => assume implicit -1th bit: 1101.0
divide into overlapping triples:
L0: 010 := 1 * B
L1: 110 := (-1 * B)
=> ((L0 << 0) + (L1 << 2)) = (1 - (1 << 2)) * B = (1 - 4) * B = -3 * B; +16 * B is assumed so = 13*B
```

However, I still don't know how the tree is working exactly (especially as there are still errors in the Logisim circuit). There's not enough levels to allow for all additions, so that's probably what the WS elements are for. The AND_FLAG input (as I think nocash has already mentioned) serves to initialize the accumulation operation if it is 0; if it is 1, the result of the multiplication is added to the result of the multiplication one cycle before. I think one multiplication needs 2? cycles, but being pipelined each cycle a result is output and accumulated.

]]>Anyways, good to know that the YUV-to-RGB stage appears to be simple, aside from that overflow/rounding issue. Is that BOTH happening overflows

(MSBs), and rounding (LSBs)?

Btw. that 12bit fractions (5744,-2928,-1408,7264) are all multiples of 4. So actually, they are having only 10bit fractions.

]]>And, for colored images, the YUV-to-RGB part does still lack some details about fractions/rounding, particulary in the

G=(-0.3437*B)+(-0.7143*R)andR=(1.402*R)andB=(1.772*B)calculations.

The MDEC seems to be using fixed point math with a 12-bit fractional part, just as the GTE and GPU does in places.

Using these coefficients:

RCr ( 5744)

GCr (-2928)

GCb (-1408)

BCb ( 7264)

or in other words:

Red = Y + 1.4023 * Cr

Green = Y - 0.3437 * Cb - 0.7148 * Cr

Blue = Y + 1.7734 * Cb

I've been able to verify all 16.8M combinations of Y/Cb/Cr and the resulting R/G/B values.

However, the MDEC seems to be using too few bits, causing "interesting" overflow/rounding issues for some edge cases, which needs to be handled.

]]>org wrote:PS. So called "MUXs" block looks like barrel shifter, but I'm not sure)

What would that shifter be good for? If the multiplication really passes in a single CLK cycle then I couldn't imagine how/why it could use shifting, unless it's done on the two CLK edges.

I don't understand the circuit at all, but I still think it's hard to believe that it could finish multiplication in a single CLK.

It looks like it done one multiply operation per clock. And it will take 8 clocks to finish multiplication of one number in matrix.

Launch logisim and try to look for yourself. It's real funny to try to guess what is going on there ) Like a puzzle solving.

]]>PS. So called "MUXs" block looks like barrel shifter, but I'm not sure)

What would that shifter be good for? If the multiplication really passes in a single CLK cycle then I couldn't imagine how/why it could use shifting, unless it's done on the two CLK edges.

I don't understand the circuit at all, but I still think it's hard to believe that it could finish multiplication in a single CLK.

By the way, can you tell me how many fractional bits in rle result and in scaletable matrix?

I could... the scaletable consists of JPEG constants (1.000, 1.387, etc.) multiplied by sqrt(2) multiplied by 4000h, so it should have 14bit fractional part. For the RLE numbers I've no clue if they have some nominal number of fractional bits.

But technically... the only that does really matter is how many fractional bits are stripped from the result. As said here, http://board.psxdev.ru/post/57/#p57 the full result should be 24bit..28bit wide (24bit would be minimum, and 28bit would be safe for avoiding overflows), so to get that down to 17bits, the 24bit..28bit result would be right-shifted by 7..11 bits.

In so far, to avoid rounding issues, it may be best to start with values where you know that the lower 11bits of result are all zero, so you can still see the 'intact' result even if some or all of those 11bits get removed. I would start with some simple stuff, set only one bit each factor, then check if and where you see the bit in result, and then try setting some more bits in one (or both) factors...

]]>Only one CLK per multiplication?

The "111111110101 * 0000000000001 = 01010101010101111" stuff looked like incomplete result, as if it were having multiplied only each 2nd bit, so I was thinking that it might multiply the other bits in a second CLK cycle (and add them to result from first cycle).

Using "111111110101" (=negative) as test value looks more complex than needed, I would start with simple positive numbers.

I fixed a lot of bugs in Circuit. It do some calculations but I think there are still a lot of bugs.

multiplication 111111110101 * 0000000000001 now = 11111111111111110

000000000000 * 0000000000000 = 00000000000000000

000001000101 * 0000010100000 = 00000000001010101

By the way, can you tell me how many fractional bits in rle result and in scaletable matrix?

]]>The "111111110101 * 0000000000001 = 01010101010101111" stuff looked like incomplete result, as if it were having multiplied only each 2nd bit, so I was thinking that it might multiply the other bits in a second CLK cycle (and add them to result from first cycle).

Using "111111110101" (=negative) as test value looks more complex than needed, I would start with simple positive numbers.

What do you mean by "if you enable AND_FLAG whole circuit enable some kind of counter mode"?

The main purpose of the AND_FLAG should be initializing the result to ZERO, so eight multiplications will give you this:

ZERO + Result#1 + Result#2 + ... + Result#8

Which, well, yes, that "summing up" could be called "counting up", in case you meant that.

Or if there's some other counter bound to AND_FLAG, that might indicate that multiplication would need more than one CLK. Ie. the circuit might use AND_FLAG for setting result to ZERO, and also for initializing some CLK counter (that would be needed only for the first initial multiplication assuming that that counter would wrap to zero automatically after each multiplication). Only, I can't see any such CLK counter bound to AND_FLAG in the circuit.

I probably don't have time & hardware resources for installing logisim (and java).

]]>https://code.google.com/p/psxdev/source … /IDCT.circ

You need Logisim : http://ozark.hendrix.edu/~burch/logisim/

]]>But if you enable AND_FLAG whole circuit enable some kind of counter mode )

Surely there are some errors in connections ))

How many inputs does have that circuit?

- 12 bit from the left are input RLE result

- 13 bit from the bottom are input from scale table matrix

- 17 bit output from the top are result of pass1

- IDCT_CLK to toggle

- AND_FLAG wtf

The two factors, the old result (for summing up), and the AND_FLAG and IDCT_CLK, and... are there more inputs?

Do you already know how many CLK's it would take to do one single multiplication?

If it takes more than one CLK, then there should be some counter for the separate steps, either inside of the circuit or somewhere externally. And whereever that counter might be located, there should be probably some input to reset that counter?