#### Re: MDEC

Color space conversion is still a hidden gem. But we found 768 bytes SRAM buffer, which is supposed to be 16x16 RGB output macroblock FIFO.

PSXDEV.ru board (english only pls kthx)

Everything you wanted to know about PlayStation, but were afraid to ask :=)



org wrote:

we found 768 bytes SRAM buffer, which is supposed to be 16x16 RGB output macroblock FIFO.

Sounds good. There should also be something for a 64-byte mono output (it might use a fragment of the RGB buffer).

And there should be two 64-byte color buffers (for the low-resolution Cr and Cb data blocks), and maybe one (or four) 64-byte mono buffers (for the hi-resolution Y data blocks).

And a 64x12bit buffer for the decompressed RLE data, and probably some FIFO for incoming uncompressed RLE data.

About my attempts to reproduce the MDEC decompression formula: Grumph. Yesterday I changed some test parameters, and that smashed everything; I am now getting huge errors for both qscale=0 and qscale>0. The old test used only one non-zero RLE value. The new test used 32 non-zero RLE values (still a very simple test case, with all 32 values set to the same value and the whole scaletable also filled with a single value - but even that simple setup is too complex for my poor formula).

Akari wrote:

We'll test the multiplication and sum soon and can give you more precise data )

You mean you are (almost) ready to simulate the circuit? That would be great!

Knowing the 17bit result for some simple multiplications would be neat, something like this:

100h*100h (for knowing how many fractional bits of the result get stripped).

7FFh*001h (for knowing whether anything gets rounded up).

One multiplication (without summing) would be enough, ie. with the incoming sum being forced to zero via the AND gates at the right side of the circuit.

Currently I am assuming that the whole multiplication result is 24bit, so I strip the lower 7 bits to make it a 17bit value. Then I 'mask' it to 17bit size (or actually I sign-expand it from 17bit to 32bit, since my test program uses 32bit maths). However, that 'masking' is somehow smashing the final output. The output is better when allowing the values to exceed 17bits (which probably means that one should strip more than 7 lower bits, so that the more significant bits fit into the 17bit space).

Stripping more than 7 lower bits (=rounding down further) might also fix the rounding issue where I got too large results.
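For reference, the working assumption above could be sketched like this (a guess matching the post, not verified hardware behavior - the shift amount and the 17-bit wrap are exactly what's in question):

```python
def mdec_mul_guess(a: int, b: int) -> int:
    """nocash's assumption: full 24-bit product, strip the lower
    7 bits, then mask/sign-expand to a 17-bit value."""
    prod = a * b                  # assumed <=24-bit signed product
    res = (prod >> 7) & 0x1FFFF   # drop 7 LSBs, mask to 17 bits
    if res & 0x10000:             # sign-expand 17bit -> 32bit maths
        res -= 0x20000
    return res
```

E.g. 100h*100h would give 200h under this assumption, which is exactly the kind of value the simulation could confirm or refute.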

nocash wrote:

You mean you are (almost) ready to simulate the circuit? That would be great!

Maybe Org will simulate the multiplication when he has time. It is possible now.

By the way, it looks like I found the IDCT logic. There are a few counters that activate address selectors on the units, related to the CLK signal. More info soon )

Maybe Org will simulate the multiplication when he has time. It is possible now.

This circuit is way too tough ))) But I'll find the strength to :=)

Well.. I did it

Whole IDCT pass1 :

MUX controls:

MUXs:

And the most complicated part, the multiplier FA/HA tree:

I just toggle IDCT_CLK and it is working, or kind of )))) I experimented with the inputs and it shows something. Need to check it again and fix possible errors.

PS. The so-called "MUXs" block looks like a barrel shifter, but I'm not sure)

To see the high-resolution images, open them as URLs.

Okay...

The lower image looks like it has all bits zero.

The upper image shows 111111110101 * 0000000000001 = 01010101010101111.

Hope that isn't the actual/final result : - )

This stuff is really weird)) Can break your brain

How many inputs does that circuit have?

The two factors, the old result (for summing up), and the AND_FLAG and IDCT_CLK, and... are there more inputs?

Do you already know how many CLKs it would take to do one single multiplication?

If it takes more than one CLK, then there should be some counter for the separate steps, either inside the circuit or somewhere external. And wherever that counter is located, there should probably be some input to reset it?

It takes a single clock to stabilize (I mean IDCT_CLK is changed from 0 to 1 only once to make the result valid).

But if you enable AND_FLAG, the whole circuit enables some kind of counter mode )

Surely there are still some errors in the connections ))

How many inputs does that circuit have?

- 12 bits from the left are the input RLE result

- 13 bits from the bottom are the input from the scale table matrix

- 17 bits output from the top are the result of pass1

- IDCT_CLK to toggle

- AND_FLAG wtf

BTW you can experiment with it yourself. The circuit is uploaded to SVN:

https://code.google.com/p/psxdev/source … /IDCT.circ

You need Logisim : http://ozark.hendrix.edu/~burch/logisim/

Only one CLK per multiplication?

The "111111110101 * 0000000000001 = 01010101010101111" stuff looked like an incomplete result, as if only every 2nd bit had been multiplied, so I was thinking that it might multiply the other bits in a second CLK cycle (and add them to the result from the first cycle).

Using "111111110101" (=negative) as a test value looks more complex than needed; I would start with simple positive numbers.

What do you mean by "if you enable AND_FLAG, the whole circuit enables some kind of counter mode"?

The main purpose of the AND_FLAG should be initializing the result to ZERO, so eight multiplications will give you this:

ZERO + Result#1 + Result#2 + ... + Result#8

Which, well, yes - that "summing up" could be called "counting up", in case that's what you meant.

Or, if there's some other counter bound to AND_FLAG, that might indicate that multiplication needs more than one CLK. I.e. the circuit might use AND_FLAG for setting the result to ZERO, and also for initializing some CLK counter (which would be needed only for the first initial multiplication, assuming that the counter wraps to zero automatically after each multiplication). Only, I can't see any such CLK counter bound to AND_FLAG in the circuit.
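The assumed main purpose of AND_FLAG, as a tiny sketch (function name is mine, and this is only the interpretation discussed here, not confirmed circuit behavior):

```python
def pass1_output(products):
    """ZERO + Result#1 + ... + Result#8: AND_FLAG=0 forces the
    incoming sum to zero before the first multiplication, then
    each further product is added to the previous result."""
    acc = 0              # AND gates force the old sum to zero
    for p in products:   # eight products per pass-1 output value
        acc += p
    return acc
```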

I probably don't have the time & hardware resources for installing Logisim (and Java).

nocash wrote:

Only one CLK per multiplication?

The "111111110101 * 0000000000001 = 01010101010101111" stuff looked like an incomplete result, as if only every 2nd bit had been multiplied, so I was thinking that it might multiply the other bits in a second CLK cycle (and add them to the result from the first cycle).

Using "111111110101" (=negative) as a test value looks more complex than needed; I would start with simple positive numbers.

I fixed a lot of bugs in the circuit. It does some calculations, but I think there are still a lot of bugs.

The multiplication 111111110101 * 0000000000001 now gives 11111111111111110

000000000000 * 0000000000000 = 00000000000000000

000001000101 * 0000010100000 = 00000000001010101

By the way, can you tell me how many fractional bits are in the RLE result and in the scaletable matrix?

Akari wrote:

By the way, can you tell me how many fractional bits are in the RLE result and in the scaletable matrix?

I could... the scaletable consists of the JPEG constants (1.000, 1.387, etc.) multiplied by sqrt(2) and then by 4000h, so it should have a 14bit fractional part. For the RLE numbers I have no clue whether they have some nominal number of fractional bits.

But technically... the only thing that really matters is how many fractional bits are stripped from the result. As said here, http://board.psxdev.ru/post/57/#p57 the full result should be 24bit..28bit wide (24bit would be the minimum, and 28bit would be safe for avoiding overflows), so to get that down to 17bits, the 24bit..28bit result would be right-shifted by 7..11 bits.

So, to avoid rounding issues, it may be best to start with values where you know that the lower 11bits of the result are all zero, so you can still see the 'intact' result even if some or all of those 11bits get removed. I would start with some simple stuff: set only one bit in each factor, check if and where you see that bit in the result, and then try setting some more bits in one (or both) factors...
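The scaletable derivation described here can be checked numerically. A sketch (the exact rounding - truncation vs. round-to-nearest - is not confirmed, so the low bit of each entry may differ):

```python
import math

def scale_entry(jpeg_const: float) -> int:
    """JPEG IDCT constant * sqrt(2) * 4000h, truncated to an
    integer, giving a value with a 14-bit fractional part."""
    return int(jpeg_const * math.sqrt(2) * 0x4000)
```

For example, the constant 1.000 gives 23170 (0x5A82), a value familiar from fixed-point IDCT tables.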

org wrote:

PS. The so-called "MUXs" block looks like a barrel shifter, but I'm not sure)

What would that shifter be good for? If the multiplication really completes in a single CLK cycle then I can't imagine how/why it could use shifting, unless it's done on the two CLK edges.

I don't understand the circuit at all, but I still find it hard to believe that it could finish a multiplication in a single CLK.

nocash wrote:

org wrote:

PS. The so-called "MUXs" block looks like a barrel shifter, but I'm not sure)

What would that shifter be good for? If the multiplication really completes in a single CLK cycle then I can't imagine how/why it could use shifting, unless it's done on the two CLK edges.

I don't understand the circuit at all, but I still find it hard to believe that it could finish a multiplication in a single CLK.

It looks like it does one multiply operation per clock. And it will take 8 clocks to finish the multiplication of one number in the matrix.

Launch Logisim and try it for yourself. It's really funny to try to guess what is going on there ) Like puzzle solving.

nocash wrote:

And, for colored images, the YUV-to-RGB part does still lack some details about fractions/rounding, particularly in the

G=(-0.3437*B)+(-0.7143*R) and R=(1.402*R) and B=(1.772*B) calculations.

The MDEC seems to be using fixed point math with a 12-bit fractional part, just as the GTE and GPU does in places.

Using these coefficients:

RCr ( 5744)

GCr (-2928)

GCb (-1408)

BCb ( 7264)

or in other words:

Red = Y + 1.4023 * Cr

Green = Y - 0.3437 * Cb - 0.7148 * Cr

Blue = Y + 1.7734 * Cb

I've been able to verify all 16.8M combinations of Y/Cb/Cr and the resulting R/G/B values.

However, the MDEC seems to be using too few bits, causing "interesting" overflow/rounding issues for some edge cases, which need to be handled.

I haven't looked into that part yet since I am still having trouble calculating the incoming Y,Cr,Cb values (I guess you don't know how to get those 100% correct either).

Anyways, good to know that the YUV-to-RGB stage appears to be simple, aside from that overflow/rounding issue. Are BOTH happening: overflows (MSBs) and rounding (LSBs)?

Btw. those 12bit fractions (5744,-2928,-1408,7264) are all multiples of 4. So actually, they have only 10bit fractions.

I THINK I understood the multiplication algorithm used in the IDCT multiplier, at least the bottom part of the circuit (MUX ARRAYS CONTROL and MUX ARRAYS, as well as the very bottom of the multiplication tree). I'll just write down my thoughts, so this might not be the best explanation. Also, I haven't thought about fixed-point or negative numbers yet, so there might still be some aspects missing.

The algorithm leverages the fact that you can express every product of an N-bit number A and an M-bit number B as the sum of ceil(N/2) (possibly negative) multiples of B. For example, assume N=4 and M arbitrary. Then every possible product can be written as a sum of N/2=2 terms:

```
0 * B = 0 * B
1 * B = 1 * B
2 * B = (4 - 2) * B
3 * B = (4 - 1) * B
4 * B = 4 * B
...
13 * B = (-4 + 1) * B, + 16 * B is assumed
...
```

The circuit looks at bit triples at a distance of two bits each (i.e. they overlap by 1 bit) to achieve this; a -1th bit of 0 is assumed. Each 12-bit number has 12/2=6 of these triples, which correspond to the 12/2=6 14-bit outputs L0-L5 of MUXs. Each triple represents a factor that Ln gets multiplied with:

```
000 := 0
001 := 1
010 := 1
011 := 2
100 := -2
101 := -1
110 := -1
111 := "-0"
```

MUX ARRAYS CONTROL looks at the triples and multiplies the input by 2 (shifts it by 1) if necessary (Ln_D1,2) - that's why Ln is 14 bits wide. If the triple is all 0s or all 1s, the input is treated as -1 (all 1s). Apart from MUX ARRAYS CONTROL, the 2nd bit is used to take the complement of the input (all triples with the 2nd bit = 1 have a negative factor; Ln_C1,2). This fits perfectly with the chart above. Observe that for the 2's complement there's still a +1 missing; that one is in the multiplier tree: if the factor associated with a triple is negative or zero (because then the input = -1), +1 is added. This is checked by the 12-OAIs.

In the end, L0-L5 get summed up to get the final result; each Ln has the actual value (Ln << 2*n). Here's an example for our 4-bit numbers:

```
13 * B = 1101b * B => assume implicit -1th bit: 1101.0
divide into overlapping triples:
L0: 010 := 1 * B
L1: 110 := (-1 * B)
=> ((L0 << 0) + (L1 << 2)) = (1 - (1 << 2)) * B = (1 - 4) * B = -3 * B; +16 * B is assumed so = 13*B
```
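For what it's worth, the "overlapping triples" scheme described above matches textbook radix-4 (modified) Booth recoding. Here's a small model of it (my own sketch of the recoding, not the circuit itself):

```python
def booth_digits(a: int, n_bits: int = 12):
    """Recode the signed n_bits-wide factor `a` into n_bits/2 digits
    in {-2,-1,0,+1,+2}, one per overlapping bit triple
    (b[2i+1], b[2i], b[2i-1]), with an implicit 0 at bit -1."""
    digits = []
    prev = 0                         # the implicit -1th bit
    for i in range(0, n_bits, 2):
        b0 = (a >> i) & 1
        b1 = (a >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits                    # digit k carries weight 4**k

def booth_multiply(a: int, b: int, n_bits: int = 12) -> int:
    """a*b as the sum of n_bits/2 shifted, possibly negated terms."""
    return sum((d * b) << (2 * k)
               for k, d in enumerate(booth_digits(a, n_bits)))
```

With n_bits=4, the factor -3 (= 1101b as a signed 4-bit value) recodes to the digits [1, -1], i.e. 1*B - 4*B, matching the worked example above.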

However, I still don't know exactly how the tree works (especially as there are still errors in the Logisim circuit). There aren't enough levels to allow for all the additions, so that's probably what the WS elements are for. The AND_FLAG input (as nocash has already mentioned, I think) serves to initialize the accumulation when it is 0; when it is 1, the result of the multiplication is added to the result of the multiplication one cycle before. I think one multiplication needs 2(?) cycles, but since it is pipelined, a result is output and accumulated each cycle.

In IDCT.circ, the DFF directly below RES00 doesn't get the clock; if you fix that, the circuit should work (I only tested a few cases, though)!

Sounds good. Overlapping triples are a bit too abstract for my brain. I might have a rough (and possibly wrong) idea of what you are talking about, but I would prefer not to think too much about it :- )

Unless... are there any details relevant for emulation involved? Like rounding errors in the multiplication result?

And very dumb question: Does anybody know yet how many LSBs get stripped from the multiplication results?

Ie. what are the results for 256*256, in the first and second multiplier sections?

The results should be right-shifted - so they must be something other than 65536.

NB. Some more info on the MDEC including RLE section was posted here: http://www.psxdev.net/forum/viewtopic.p … &t=551

As far as I understand, it works as follows (someone please correct me if I'm wrong):

The circuit multiplies a 13-bit by a 12-bit number, both signed. Technically, the result is a 25-bit number; however, the circuit drops a) bits 0-6 and b) bit 25 (the carry into the 25th bit, to be precise). That means that internally a 17-bit number is used to accumulate further products. The overall result is then once again stripped of bits 0-3 (i.e. bits 7-10 of the original 25-bit number), leaving us with a 13-bit number that is used as the input of the second pass (another instance of the multiplier).

I haven't gotten to it yet, but judging from the circuit schematics, at the very end 8 bits of the 17-bit result of the second pass are selected and divided by 2 (with clamping) to arrive at an 8-bit number.

Your example (256*256) would result in the "25-bit" number 65536. Then, bits 0-6 are dropped and you get 512.

If you multiply, for instance, 257*256 you get 65792 (0x10100). Now drop bits 0-6 to get 514 - this number is used for accumulation. However, if it were passed to the second stage, an additional 4 bits would be dropped and the result would be 32, which is then multiplied by 12 bits from the scale matrix.

As a side note, the second pass probably uses 13 bits from the previous result and 12 bits from the scale matrix because a) they wanted to reuse the same circuit design and b) just a guess, but the additional bit was more accurate/helpful when used for the previous result rather than for the scale matrix.
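The bit-dropping described above, as a sketch (this is the thread's understanding, not verified hardware; the function names are mine):

```python
def pass1_product(a13: int, b12: int) -> int:
    """13-bit x 12-bit signed multiply: drop bits 0-6, keep the
    next 17 bits (dropping any carry beyond them), and reinterpret
    the result as a signed 17-bit number."""
    p = (a13 * b12) >> 7      # "25-bit" product with bits 0-6 dropped
    p &= 0x1FFFF              # wrap to 17 bits (top carry lost)
    if p & 0x10000:           # signed 17-bit reinterpretation
        p -= 0x20000
    return p

def to_pass2_input(acc17: int) -> int:
    """Drop 4 more LSBs to form the 13-bit input of the second pass."""
    return acc17 >> 4
```

E.g. 256*256 gives 512, and 257*256 gives 514, which would enter the second pass as 514 >> 4 = 32, matching the worked examples above.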
