
Posted: Mon 31 Jul, 2006 5:26 am
by tr1p1ea
About a year ago I got the routine down to 70 clocks. I didn't make use of sp, just the trick of incrementing the high byte of the buffer pointer to get the next byte.

Then doynax posted his optimised grayscale routine (in April of this year) and ingeniously pointed out that:

((A ^ B) & C) ^ B = (A & C) | (B & ~C)

Very surprised no one bothered to look at this before! :).
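For anyone who wants to convince themselves of the identity, here is a quick sketch in Python. It is not taken from doynax's routine, and the function names and example bytes are made up for illustration. Since the bits of a byte never interact in these operations, checking every 1-bit combination of A, B and C is enough.

```python
def mask_xor(a, b, c):
    """Three-operation form: ((A ^ B) & C) ^ B."""
    return ((a ^ b) & c) ^ b


def mask_and_or(a, b, c):
    """Conventional form: (A & C) | (B & ~C), truncated to 8 bits."""
    return ((a & c) | (b & ~c)) & 0xFF


# Bits are independent, so the eight 1-bit cases cover every width.
single_bit_ok = all(
    mask_xor(a, b, c) == mask_and_or(a, b, c)
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)

# Arbitrary example bytes: the mask picks the top nibble from A
# and the bottom nibble from B.
layer_a, layer_b, mask = 0b11001100, 0b10101010, 0b11110000
merged = mask_xor(layer_a, layer_b, mask)

print(single_bit_ok, format(merged, "08b"))  # True 11001010
```

The point for the Z80 is that the left-hand form needs only xor, and, xor against the same memory operand, with no complemented mask and no temporary.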

His routine used a custom buffer setup as well as being layer interlaced. You can find the thread here: http://kvince83.tengun.net/maxboard/vie ... php?t=1668

My routine still uses the conventional buffer setup. I had another look and got it down to 59. I'm sure it can be optimised further.

Posted: Mon 31 Jul, 2006 8:09 am
by Jim e
Figured you would come in with something about the masks. I never bothered looking for another way to apply masks. If it's correct then it's definitely going to speed things up. I can imagine a trick to get it down to 51 clocks, I believe. At least 10,000 clocks could be killed. However, I'd like to point out it's pointless to go below 64; after that you're just ruining compatibility for the busted LCDs.

I almost want to update RGP now.

Posted: Mon 31 Jul, 2006 8:50 am
by tr1p1ea
Still, it's nice to know that there is room to play with ... plus you could make it app-compatible more easily.

It would be cool if you could outline your 51 cycle idea as well :).

Posted: Mon 31 Jul, 2006 10:52 am
by Jim e
Well, purely stretching for the sake of speed, I'd go with the method from sigma's "The best damn grey routine ever": unrolling the whole thing into one of the safe RAM buffers. I estimate it running at ~40000 clocks and being about 515 bytes.

Code:

loop:
	out ($10),a
;---------------
;repeat 64 times
	ld a,(de)		; 51 tstates for this whole repeated block
	xor (hl)
	and c
	xor (hl)
	out ($11),a
	add hl,sp
	ld e,l
;2 or 3 "inc d" instructions need to be inserted somewhere among the 64 repeats.
;---------------
	dec h
	dec h
	dec h
	dec d
	dec d
	dec d	;only needed if 3 was used before.
	inc b
	ld a,b
	cp $2c
	jp nz,loop	;keep going until b reaches $2c; a = b is the next column command
The basis of the code is that Buffer1 is aligned with Buffer2. As long as that's true it will work, and it is true for current implementations of Gpp and RGP. This could actually be practical to use if it weren't too fast. A slight alteration would let both c and b be used as masks, so the noise could be reduced. Just loop the fucking thing and we get 64 tstates, which is completely reasonable and fast enough.
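As a sanity check on the numbers, here is a rough tally in Python. The per-instruction T-states are the standard Z80 timings. The column count of 12 is an assumption read off the `cp $2c` test (column commands $20 to $2B). The per-column overhead and the extra `inc d` instructions are left out, so treat the totals as ballpark figures only.

```python
# T-states per instruction in the repeated block (standard Z80 timings).
UNROLLED_STEP = [
    ("ld a,(de)", 7),
    ("xor (hl)", 7),
    ("and c", 4),
    ("xor (hl)", 7),
    ("out ($11),a", 11),
    ("add hl,sp", 11),
    ("ld e,l", 4),
]

ROWS = 64      # "repeat 64 times" in the listing
COLUMNS = 12   # assumption: column commands $20..$2B

per_byte = sum(tstates for _, tstates in UNROLLED_STEP)
per_frame = per_byte * ROWS * COLUMNS            # ignores per-column overhead
saved_vs_64 = (64 - per_byte) * ROWS * COLUMNS   # versus a 64-tstate routine

print(per_byte, per_frame, saved_vs_64)  # 51 39168 9984
```

That lines up with the ~40000 estimate, and the difference against a 64-tstate routine is close to the 10,000 clocks mentioned earlier, although that may not be the exact comparison that was meant.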

I think at that point it's overkill. It's quicker than fastcopy, but it's not worth it if it kills compatibility. It could be faster still if the buffer logic changed, but no one wants that.