I use m1 to identify cells. There is a lot of variety in m1, so I can be sure that I recognize cells correctly.

I don't need the list of cells now; it's the less time-consuming part of the work. Most of my time goes into creating block diagrams like this one:
http://forum.emu-russia.net/download/file.php?mode=view&id=1012

You can grab any part of the CPU you want and start tracing it. We use the photo of the green m2 layer as the base level and scale all other photos to it. The easiest way to start is to trace from the units.

nocash wrote:
org wrote:

PS. The so-called "MUXs" block looks like a barrel shifter, but I'm not sure )

What would that shifter be good for? If the multiplication really completes in a single CLK cycle, then I can't imagine how or why it would use shifting, unless it's done on both CLK edges.
I don't understand the circuit at all, but I still think it's hard to believe that it could finish multiplication in a single CLK.

It looks like it does one multiply operation per clock, so it takes 8 clocks to finish the multiplication for one number of the matrix.

Launch Logisim and look for yourself. It's really fun trying to guess what is going on there ) Like solving a puzzle.

nocash wrote:

Only one CLK per multiplication?
The "111111110101 * 0000000000001 = 01010101010101111" stuff looked like an incomplete result, as if it had multiplied only every 2nd bit, so I was thinking that it might multiply the other bits in a second CLK cycle (and add them to the result from the first cycle).
Using "111111110101" (= negative) as a test value looks more complex than needed; I would start with simple positive numbers.

I fixed a lot of bugs in the circuit. It does some calculations now, but I think there are still a lot of bugs.

Multiplication 111111110101 * 0000000000001 now gives 11111111111111110

000000000000 * 0000000000000 = 00000000000000000
000001000101 * 0000010100000 = 00000000001010101
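For checking simulator output against test vectors like these, a full-width software reference can help. This is a sketch written for the thread, not taken from the circuit: it assumes both operands are two's-complement (12-bit and 13-bit, as the negative test value implies) and returns the whole product. Which 17 bits of it the hardware actually keeps is still unknown, so the comparison has to pick the bit window by hand. `sign_extend`, `ref_mul_12x13` and `print_bits` are hypothetical helper names.

```c
#include <stdint.h>
#include <stdio.h>

/* Interpret the low `bits` bits of `v` as a two's-complement number. */
static int32_t sign_extend(uint32_t v, int bits)
{
    uint32_t sign = 1u << (bits - 1);
    v &= (1u << bits) - 1;               /* drop anything above the field */
    return (int32_t)(v ^ sign) - (int32_t)sign;
}

/* Hypothetical reference: full signed product of a 12-bit and a 13-bit
   operand given as raw bit patterns (e.g. 0xFF5 = 111111110101). */
static int32_t ref_mul_12x13(uint32_t a12, uint32_t b13)
{
    return sign_extend(a12, 12) * sign_extend(b13, 13);
}

/* Print the low `bits` bits of `v`, MSB first, to compare with the
   binary strings read off the simulation. */
static void print_bits(int32_t v, int bits)
{
    for (int i = bits - 1; i >= 0; i--)
        putchar((((uint32_t)v >> i) & 1u) ? '1' : '0');
    putchar('\n');
}
```

For example, `ref_mul_12x13(0xFF5, 0x001)` is -11 and `ref_mul_12x13(0x045, 0x0A0)` is 69 * 160 = 11040; `print_bits(..., 25)` shows either as a 25-bit pattern.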

By the way, can you tell me how many fractional bits there are in the RLE result and in the scaletable matrix?

nocash wrote:

You mean you are (almost) ready to simulate the circuit? That would be great!

Maybe Org will simulate the multiplication when he has time. It is possible now.

By the way, it looks like I found the IDCT control logic. There are a few counters, tied to the CLK signal, that activate the address selectors on the units. More info soon )

http://wiki.psxdev.ru/images/6/61/Circuit002_logic4.jpg

nocash wrote:

I really hope that I could get signed-div-2 implemented ; - )

Not just "signed-div-2", but "signed-div-2 with clamping", i.e.:

  if( result < -128 )
  {
      result = -128;
  }
  else if ( result > 127 )
  {
      result = 127;
  }

nocash wrote:
Akari wrote:

- Pass 1 has a 13-bit scaletable input and a 12-bit RLE input. While calculating the sum, a 17-bit result is used. When summing is done, only the upper 13 bits of the result are stored.
- Pass 2 uses the 13 bits of the pass 1 result and the upper 12 bits of the scaletable matrix. While calculating the sum, a 17-bit result is used. When summing is done, only the upper 10 bits of the result are passed to the next stage.

Pass 2 is using only 12bit scaletable, not 13bit??? That might explain some of my rounding errors. I would have NEVER imagined that it might use only 12bits there!

I am unsure which 17bits of the multiplication result are used. Theoretically, 12bit*13bit would give 25bit result. But for signed numbers it could be squeezed into 24bit (except that -800h*-1000h would overflow). And after summing up 8 values, the result might grow by factor 8, so it might be needed to be 28bits wide instead of just 25bit. If the result could be 24..28 bits wide - how many LSBs should be removed to get the 17bit value?

Assuming that the result is only 24bits wide, then the overall formula should remove fractional bits like so:

  pass 1, after multiplication:  strip 7bit (to reduce result from 24bit to 17bit)      ;\
  pass 1, after summing:         strip 4bit (to reduce sum from 17bit to 13bit)         ; is there any rounding
  pass 1, before multiplication: strip 1bit (to reduce scaletable from 13bit to 12bit)  ; done in these steps?
  pass 2, after multiplication:  strip 7bit (to reduce result from 24bit to 17bit)      ;
  pass 2, after summing:         strip 7bit (to reduce sum from 17bit to 10bit)         ;/
  after pass 2:                  strip 1bit (signed div2)                               ;<-- with rounding-up
  ------------------------------------------------------------------
  total                         strip 27bits

Looks almost right. In my pseudo source code, I've divided the result by 2000h (= stripped 13bit) in each pass, aka stripped 26bit in total.
When stripping 27bits the result would be too small... but wait, you have said that the input from RLE unit is 12bits? I was thinking that RLE output is signed 11bit. But if it's 12bit, then stripping 27bits on the final result would be just right.
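The bit budget in the table can be checked directly. This only restates the arithmetic from the two paragraphs above: the table strips 27 bits, the pseudo source strips 13 bits per pass, and the one-bit difference is what an extra fractional bit on a 12-bit (instead of 11-bit) RLE input would account for.

$$
\underbrace{7 + 4 + 1}_{\text{pass 1}} + \underbrace{7 + 7}_{\text{pass 2}} + \underbrace{1}_{\text{signed div 2}} = 27,
\qquad
2 \times \log_2(\mathtt{2000h}) = 2 \times 13 = 26
$$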

Though I was quite sure about RLE being 11bits, if there is an extra fractional bit, then it didn't seem to affect my hardware test results. You don't happen to see some 11bit-to-12bit expansion on the pass 1 input (ie. something that creates an extra fractional bit, or an extra sign bit)?

Nope, just 12 wires that come from far, far away :)
I don't know if this is the RLE result or something else, but it is 12 bits for sure.

And after pass 2, the division by 2 (with clamping) strips 2 bits. The overall result is 8 bits: 7 value bits + 1 sign bit.

During multiplication, for all bits beyond the upper 17, only the carry is passed on to the next stage. You can see it in the left part: only the data for the upper 17 bits goes to the D-triggers and further. We don't know yet how the reduction to 17 bits works. It seems like the calculation is done as a sum of 8 numbers: six of them are 13 bits of scaletable data (in the 1st pass), the 7th is data from RLE, and the 8th is the 17 bits of the previous result. The six 13-bit values are formed through some calculations from the 12 RLE inputs.

We don't see any rounding during these calculations.

We will test the multiplication and the sum soon and can give you more precise data )
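One possible reading of "six 13-bit values formed from 12 RLE inputs" is radix-4 Booth recoding. This is an assumption, not something confirmed from the die: a 12-bit two's-complement multiplier splits into exactly six Booth digits in {-2, -1, 0, +1, +2}, and each digit selects 0, ±1 or ±2 times the 13-bit multiplicand, weighted by 4^i. The sketch only demonstrates that arithmetic identity; the 7th (RLE) input and the reduction to 17 bits are not modeled, and `booth_digit` / `booth_mul_12x13` are hypothetical names.

```c
#include <stdint.h>

/* Radix-4 Booth digit i (0..5) of a 12-bit multiplier given as a raw bit
   pattern. Digit i looks at bits 2i+1, 2i and 2i-1 (bit -1 is taken as 0)
   and yields a value in {-2, -1, 0, +1, +2}. */
static int booth_digit(uint32_t y12, int i)
{
    int hi  = (int)((y12 >> (2 * i + 1)) & 1u);
    int mid = (int)((y12 >> (2 * i)) & 1u);
    int lo  = (i == 0) ? 0 : (int)((y12 >> (2 * i - 1)) & 1u);
    return -2 * hi + mid + lo;
}

/* Sum of six partial products (digit * multiplicand), each weighted by 4^i.
   The result equals the signed 12-bit multiplier times x. */
static int32_t booth_mul_12x13(uint32_t y12, int32_t x)
{
    int32_t sum = 0;
    for (int i = 0; i < 6; i++) {
        int32_t partial = booth_digit(y12, i) * x;  /* 0, +-x or +-2x */
        sum += partial * (1 << (2 * i));            /* weight 4^i */
    }
    return sum;
}
```

If the traced selectors really pick between 0, x and 2x (plus a negate), that would be a hint in this direction; if not, the sketch can simply be discarded.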

Akari wrote:
org wrote:

We found a circuit that converts the output of the IDCT. It takes 10 bits as input and gives 8 bits as output.

Solved!!!

This is division by 2 with rounding up. The result is clamped to 127 / -128.

For example (10-bit input, 8-bit output):

1    1
2    1
3    2
4    2
5    3
6    3
7    4
8    4

251    126
252    126
253    127
254    127
255    127
256    127
257    127
258    127
259    127
260    127

509    127
510    127
511    127
-512    -128
-511    -128
-510    -128
-509    -128
-508    -128
-507    -128

-10    -5
-9    -4
-8    -4
-7    -3
-6    -3
-5    -2
-4    -2
-3    -1
-2    -1
-1    0
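Every row of the table matches ceil(x / 2), i.e. (x + 1) / 2 rounded down, followed by saturation to the signed 8-bit range. The sketch below is a C model of that behaviour as inferred from the table, not a gate-level transcription of the circuit; `idct_output_stage` is a hypothetical name. It avoids right-shifting a negative value, since that is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical model of the 10-bit -> 8-bit output stage: divide by 2
   rounding up (ceil(x / 2)), then clamp to -128..127. */
static int8_t idct_output_stage(int32_t x10)
{
    int32_t t = x10 + 1;
    /* floor(t / 2) without using >> on a negative value */
    int32_t q = (t >= 0) ? t / 2 : -((-t + 1) / 2);
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}
```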
nocash wrote:

That feedback could be two things: Either the summing-up part (when multiplying & summing 8 entries from two matrix columns; in that case the feedback would be passed to the sum/addition hardware, not to the actual multiplier). Or, it could be the 2nd pass (when the whole matrix-by-matrix multiplication is done another time).

nocash wrote:

Stores 16 outputs? Are you sure? For a matrix column it should store 8 values. And for a whole matrix it should store 64 values.

It looks like the multiplication and the summing with the previous result are done in one pass through a single network of adders.

There are two identical sum-multipliers, one for pass 1 and one for pass 2. It looks like the matrix multiplication is done column by column to speed up the process: pass 2 can be calculated in parallel as soon as the first 8 sums are ready, with no need to wait until the first matrix multiplication is complete. And it only needs to store 8 records of 13 bits each, instead of 64 x 13.

Unit 1 looks like a dual-port buffer. It has 16 records of 13 bits each. While the first 8 records are used in pass 2, the second 8 records are being calculated. After the second 8 are calculated, the first 8 are not needed anymore, so the next result can be stored there while the second set is used in pass 2.
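The double buffering described above can be modelled as two banks of 8 records that swap roles after every column. This is a hypothetical software model of the behaviour, not the actual unit circuit; the struct and function names are invented, and the 13-bit values are simply held in `int16_t`.

```c
#include <stdint.h>

/* Hypothetical model of the 16-record buffer as 2 banks x 8 records.
   Pass 1 fills the "write" bank while pass 2 reads the other bank;
   the banks swap roles once a column of 8 sums is complete. */
typedef struct {
    int16_t entry[2][8];   /* 13-bit values fit in int16_t */
    int     write_bank;    /* bank currently being filled by pass 1 */
} column_buffer;

static void cb_write(column_buffer *cb, int row, int16_t value13)
{
    cb->entry[cb->write_bank][row] = value13;
}

static int16_t cb_read(const column_buffer *cb, int row)
{
    return cb->entry[cb->write_bank ^ 1][row];   /* the other bank */
}

/* Called when pass 1 has finished one column. */
static void cb_swap(column_buffer *cb)
{
    cb->write_bank ^= 1;
}
```

With this arrangement only 16 x 13 bits of storage are needed, which is consistent with the 16 records seen in the unit.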

nocash wrote:
Akari wrote:

4) The other input to the scheme is 6 bits of something.

That sounds like too few bits (in case you are talking about the second multiplier input).

Input is 12 bits :)

nocash wrote:

Btw. don't know if you already have this in mind: The matrix multiplier should be most likely working like this:

  src=blk, dst=temp_buffer
  for pass=0 to 1
    for x=0 to 7
      for y=0 to 7
        sum=0
        for z=0 to 7
          sum=sum+src[y+z*8]*(scaletable[x+z*8]/8)
        next z
        dst[x+y*8]=(sum+0fffh)/2000h      ;<-- or somehow different rounding in this place and/or other places
      next y
    next x
    swap(src,dst)          ;<-- or maybe actual HW uses another destination buffer in 2nd pass instead of swapping src/dst
  next pass
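For anyone who wants to run the pseudocode above directly, here is a straight C transcription. Two assumptions are added that the pseudocode leaves open: the "/" operations are taken to round down (like an arithmetic right shift) rather than toward zero, and swap(src,dst) is done with pointers, so the result ends up back in `blk` after two passes. Values are not truncated to the 13/17/10-bit widths discussed elsewhere in the thread; `idct_sketch` and `floor_div` are hypothetical names.

```c
#include <stdint.h>

/* Division rounding toward minus infinity (assumption: the pseudocode's
   "/" behaves like an arithmetic right shift). Divisor must be positive. */
static int64_t floor_div(int64_t a, int64_t b)
{
    int64_t q = a / b;
    if (a % b != 0 && a < 0)
        q -= 1;
    return q;
}

/* Two-pass matrix multiply transcribed from the pseudocode.
   blk: 8x8 input, overwritten with the result; scaletable: 8x8 matrix. */
static void idct_sketch(int32_t blk[64], const int32_t scaletable[64])
{
    int32_t temp[64];
    int32_t *src = blk;    /* src=blk */
    int32_t *dst = temp;   /* dst=temp_buffer */

    for (int pass = 0; pass < 2; pass++) {
        for (int x = 0; x < 8; x++) {
            for (int y = 0; y < 8; y++) {
                int64_t sum = 0;
                for (int z = 0; z < 8; z++)
                    sum += (int64_t)src[y + z * 8] *
                           floor_div(scaletable[x + z * 8], 8);
                dst[x + y * 8] = (int32_t)floor_div(sum + 0x0FFF, 0x2000);
            }
        }
        int32_t *t = src;  /* swap(src,dst) */
        src = dst;
        dst = t;
    }
    /* Pass 2 wrote into blk, so the result is already where the caller
       expects it. */
}
```

A quick sanity check: with an artificial scaletable that is 8 * 2000h on the diagonal and 0 elsewhere, each pass is a plain transpose, so the block comes back unchanged.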

I don't understand the decapped circuit too well... as far as I understand, it does seem to have a shift-register for the multiplication, so I guess it's using the good old "shift-and-add" mechanism to do the multiplication (?) and aside from that, there should be another addition mechanism for the "sum" calculation - or, wait, it might be also using the "sum" value as initial value in the "shift-and-add" part; so maybe there is only one addition unit. Hope you'll understand the circuit better than me : - )

I can give you some info to test:
- Pass 1 has a 13-bit scaletable input and a 12-bit RLE input. While calculating the sum, a 17-bit result is used. When summing is done, only the upper 13 bits of the result are stored.
- Pass 2 uses the 13 bits of the pass 1 result and the upper 12 bits of the scaletable matrix. While calculating the sum, a 17-bit result is used. When summing is done, only the upper 10 bits of the result are passed to the next stage.
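Keeping "only the upper N bits" of a wider signed value is a division by 2^(width - N) that rounds down, which gives the 4-bit, 1-bit and 7-bit strips listed in the bullets. A small sketch, assuming plain truncation with no rounding (no rounding has been seen in the circuit so far) and two's-complement values; how the product is reduced to 17 bits in the first place is still unknown, so that step is left out. `upper_bits` and the three wrappers are hypothetical names.

```c
#include <stdint.h>

/* Keep the upper `keep` bits of a `width`-bit signed value by truncation
   (rounding toward minus infinity), without using >> on negative values. */
static int32_t upper_bits(int32_t v, int width, int keep)
{
    int32_t div = (int32_t)1 << (width - keep);
    int32_t q = v / div;
    if (v % div != 0 && v < 0)
        q -= 1;                      /* turn truncation toward 0 into floor */
    return q;
}

/* Pass 1 store:      17-bit sum         -> upper 13 bits */
static int32_t pass1_store(int32_t sum17) { return upper_bits(sum17, 17, 13); }
/* Pass 2 input:      13-bit scaletable  -> upper 12 bits */
static int32_t pass2_scale(int32_t st13)  { return upper_bits(st13, 13, 12); }
/* Pass 2 output:     17-bit sum         -> upper 10 bits */
static int32_t pass2_output(int32_t sum17) { return upper_bits(sum17, 17, 10); }
```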

I'm looking at the next stage right now. It looks like some further rounding is done. I see only a 7-bit output.

Newest findings.

1) The scaletable matrix is stored in UNIT 00, as 32 records of 26 bits each. Later, a multiplexer selects the upper or lower 13 bits (lower part of the scheme).

2) The output of all these calculations is 17 bits, and all of it goes back into the calculations again (the result input is on the right part of the scheme).

3) 13 bits of the output are stored in UNIT 01. This is the only output of all those calculations. UNIT 01 can store 16 such outputs.

4) The other input to the scheme is 6 bits of something.

5) UNIT 00 has one more output going to some other part, so the scaletable matrix is used somewhere else too.
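Storing the 64 scaletable entries of 13 bits as 32 records of 26 bits means two entries per record, with the multiplexer picking one half. Which matrix index lands in which half is not known from the trace; the sketch below simply assumes entry 2k sits in the lower half and entry 2k+1 in the upper half of record k. `pack_record` and `scaletable_entry` are hypothetical names.

```c
#include <stdint.h>

#define ST_BITS 13u
#define ST_MASK ((1u << ST_BITS) - 1u)   /* 0x1FFF */

/* Pack two 13-bit raw patterns into one 26-bit record (assumed layout). */
static uint32_t pack_record(uint32_t lower13, uint32_t upper13)
{
    return ((upper13 & ST_MASK) << ST_BITS) | (lower13 & ST_MASK);
}

/* Fetch entry i (0..63) from the 32 records. The odd/even test stands in
   for the multiplexer select line; returns the raw 13-bit pattern. */
static uint32_t scaletable_entry(const uint32_t records[32], int i)
{
    uint32_t rec = records[i >> 1];
    return (i & 1) ? (rec >> ST_BITS) & ST_MASK : rec & ST_MASK;
}
```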

http://wiki.psxdev.ru/images/thumb/4/4f/Circuit002_logic.jpg/800px-Circuit002_logic.jpg

Full version is here: http://wiki.psxdev.ru/images/4/4f/Circuit002_logic.jpg

nocash wrote:

Uhhhh. But meanwhile... you have changed the circuit picture?
And it is now doing something with 13 of 17 bits (instead of 8 of 11 bits)?
That would smash my idea : - )

I updated the picture with the latest info (now all triggers are used); this input now has 17 bits.
The bottom input is still 13 bits.

The left part was also updated. There are some carry calculations. I still can't understand what is done during the calculations: some strange manipulations without a strict pattern.

nocash wrote:

Cool!

So the multiplier output would be... 10bits? (The arrows pointing upwards in the image).

What are the small blue rectangles? Signals that you haven't figured out where they come from?

LSB is on left of the image, right? (the .png file is a bit too small to read text in it, but I'd guess LSB=left from the wires/arrows).

Is there only one multiplier? Or 8 multipliers (for multiplying a whole matrix column at once)? Just for curiosity, it wouldn't matter too much if they've used parallelism or not. More important would be knowing if there's rounding in the multiplication result and/or in the following sum-up additions.

And is there a second matrix multiplier unit (for the second part of the Temp=RLE*Scaletable (first part), and then RESULT=TEMP*Scaletable (second part) multiplication)?

We don't know for sure if this is really part of IDCT conversion, but it looks like it.
It has 2 inputs, 13 bits and 11 bits wide, and a 10-bit output.

I updated the scheme above so you can see a bit of progress. If you need it, I can send you the original .odt file.

I don't see any second part or anything like that. Only some strange 8 bits that go from the 11-bit data on the right. I don't know what this can be.

http://wiki.psxdev.ru/index.php/CPU_CIRCUIT_002
