MaxCoderz

Posted: **Sat 24 Mar, 2007 9:22 am**

hmm ok maybe I was wrong.

Posted: **Tue 07 Aug, 2007 11:25 am**

Change

ld a,KEY_GROUP
out (1),a
nop
nop
in a,(1)

to

Code: Select all

ld a,KEY_GROUP
out (1),a
ld a,(de)
in a,(1)

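This saves 1 byte and 1 T-state: ld a,(de) is 1 byte and 7 T-states, versus 2 bytes and 8 T-states for the two nops, and the value it loads is overwritten by the in a,(1) anyway.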
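For context, here is a minimal sketch of a complete direct-input read using the shorter delay. It is an illustration only. It assumes the TI-83 Plus keypad on port 1, picks $FE (the arrow-key group) as an example value for KEY_GROUP, and uses a hypothetical label, down_pressed. Whether a 7 T-state delay is long enough on faster hardware is not established in this thread.

```z80
    ld   a,$FF          ; reset the keypad (deselect every group)
    out  (1),a
    ld   a,$FE          ; example KEY_GROUP: the arrow keys (assumption)
    out  (1),a
    ld   a,(de)         ; 1-byte, 7 T-state delay; the loaded value is discarded
    in   a,(1)          ; read the group: a pressed key reads as a 0 bit
    bit  0,a            ; bit 0 of this group is the down arrow
    jr   z,down_pressed ; taken when down is held
    ret                 ; down is not pressed
down_pressed:           ; hypothetical label
    ; react to the key here
    ret
```

The delay instruction only has to waste time without touching anything you need, which is why a dummy load into A fits: A is about to be overwritten anyway.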

Posted: **Wed 08 Aug, 2007 1:42 pm**

CalcKing wrote: Expanding on Dwedit's trick, I created a macro system for Bot Attack.

Code: Select all

#define	curVPutS(curPos)	call	cur_vputs \ .dw	curPos

	...

	curVPutS(1+(256*9))			; Draw author's name... ;-)
	.db	"by Peter Wakefield",0

;-----> Optimized vputs routine
; inputs:   bytes following call: CoordLSB,CoordMSB,Null-terminated string
; output: String displayed, smaller than using redundant code
cur_vputs:
   pop   hl
   ld   e,(hl)
   inc   hl
   ld   d,(hl)
   inc   hl
   ld   (pencol),de
   bcall(_vputs)
   jp   (hl)

Why didn't you do it like this:

Code: Select all

#define	print(xcoord,ycoord,string)	call	cur_vputs \ .dw	xcoord+(256*ycoord) \ .db string,0

	...

	print(1,9,"by Peter Wakefield")

I guess it should work just as well (the API uses something like this somewhere I think, can't check now since my server is dead). It shouldn't make a difference in speed or size, but it greatly improves the readability of your code.

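Oh, wait, if you do it like this, it'll also save you a few instructions in the routine. That makes it faster, but each use of the macro grows by one byte (ld de,nn is 3 bytes, the .dw only 2) while the routine shrinks by 4 bytes, so the total is bigger once you use the macro more than four times: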

Code: Select all

#define	print(xcoord,ycoord,string)	ld de,xcoord+(256*ycoord) \ call cur_vputs \ .db string,0

	...

	print(1,9,"by Peter Wakefield")

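;-----> Optimized vputs routine
; inputs:   DE = pen position (E = column, D = row)
;           bytes following call: Null-terminated string
; output: String displayed, smaller than using redundant code
cur_vputs:
   ld   (pencol),de   ; E -> pencol, D -> penrow (the byte after pencol)
   pop   hl           ; HL = return address = start of the inline string
   bcall(_vputs)      ; relies on _vputs leaving HL just past the terminating 0
   jp   (hl)          ; resume execution right after the string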

Posted: **Thu 08 May, 2008 8:11 am**

sigma wrote: If you want a 16-bit loop counter, never, ever do this:

Code: Select all

- ; Loop body
  ;   .
  ;   .
  ;   .
  ld  a, d
  or  e
  jp  nz, -

That would be useful if each pass through the loop must take constant time.

Just nitpicking about the "never, ever".

Juha

Posted: **Fri 09 May, 2008 11:53 pm**

The timing variation introduced by doing it the correct way shouldn't bother you in most (if not all) cases.
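The "correct way" itself is not quoted in this reply. One widely used alternative, shown here only as a sketch, splits the 16-bit count into two 8-bit counters so that most passes end in a plain djnz. It assumes DE holds the iteration count on entry and that the loop body preserves B and D. The label loop16 is hypothetical.

```z80
    ld   b,e            ; B = low byte of the count (inner counter)
    dec  de
    inc  d              ; D = high byte of (count-1), plus 1 (outer counter)
loop16:
    ; loop body goes here (must preserve B and D)
    djnz loop16         ; 13 T-states on most passes
    dec  d
    jp   nz,loop16      ; reached only once every 256 passes
```

With a count of 257, B starts at 1 and D at 2: the first pass falls through djnz, then the outer test loops back for another 256 passes. The timing variation mentioned above comes from that outer step: most passes cost 13 T-states of loop overhead, and every 256th costs 8+4+10 = 22. The ld a,d / or e / jp nz test, together with the dec de it needs, costs 24 on every pass. A count of 0 runs 65536 passes.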

Posted: **Tue 07 Jul, 2009 5:24 pm**

Amazing that I never thought of this before, but in a multiplication you can stop once the operand whose bits you are shifting out and testing has become zero. The test has to come after handling the current bit, not before (a very important difference), because only then is it certain that nothing more will be added to the result. It makes the loop slightly slower, but you get an early exit in many cases, one that often saves more cycles than the test adds, and you don't need a loop counter (which is very cool when you're multiplying bigger things and need all the registers you can get).

Proof of concept:

Code: Select all

DE_times_A:
   ld hl,0
   or a ;have to reset carry
_loop:
   rra
   jr nc,_skip
   add hl,de
_skip:
   sla e
   rl d
   or a   ;or a + jr nz is slightly slower than the usual djnz
      ;as a bonus it resets the carry, needed for the rra
   jr nz,_loop  ;4+12 cc's vs 13 for djnz
   ret

Pro:
* early exit saves a lot of time for small values of A
* BC is not used
* intro is 3 cc's faster (well, that's nothing..)
Con:
* slightly slower loop (3 cc's more per iteration)
Neither:
* exactly the same code size as the usual algorithm

disclaimer: I haven't slept much for a while due to the high temperature, it could be that I'm completely out of my mind. Please notify me if that's the case..
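The same idea carries over to a wider multiplier, where freeing the counter register matters more. This is a sketch only and not part of the proof of concept above. DE_times_BC is a hypothetical routine that returns the low 16 bits of DE × BC in HL and destroys A, BC and DE.

```z80
DE_times_BC:            ; hypothetical name; HL = low 16 bits of DE * BC
    ld   hl,0
_loop16:
    srl  b
    rr   c              ; BC >>= 1, carry = the bit shifted out
    jr   nc,_skip16     ; bit was 0: nothing to add
    add  hl,de
_skip16:
    sla  e
    rl   d              ; DE <<= 1
    ld   a,b
    or   c              ; is anything left in the multiplier?
    jr   nz,_loop16     ; stop after the last set bit has been handled
    ret
```

With DE=3 and BC=5 the loop runs three times and returns HL=15. The exit test of ld a,b / or c / jr nz costs 4+4+12 = 20 cc's per pass against 13 for a djnz, so the early exit has to skip more work to pay for itself than in the 8-bit version. Putting the smaller operand in BC helps.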

Posted: **Wed 08 Jul, 2009 12:11 pm**

So, now for a little cc analysis.
* one iteration of the old version takes 45 or 51 cc's (unless it is the last, then it's 5 less)
* (3*X)-3 cc's are added in the new version, where X is the number of iterations it runs
* 45*(8-X) cc's are saved, because the 8-X skipped iterations would all have taken the 45 cc path
* in the worst case, X=8 and 21 cycles are added.
* in the best case, X=1 (note: it can't be 0), it's 315-0=315 cc's faster
* the second-worst case is X=7: 45-18=27 cc's faster
* all timing differences (X=1 to 8): -315, -267, -219, -171, -123, -75, -27, 21
* the average over all 256 values of A, assuming they are equally likely (that is, 0.5 * 21 + 0.25 * -27 etc), is about -26.6 cc's; A=0 and A=1 both give X=1, so that case weighs 2/256
I apologize in advance for all errors that I will have made.

Why weren't we all using this way before? Or were we, but just not me?

Posted: **Wed 08 Jul, 2009 1:23 pm**

I think most people use this one: http://map.grauw.nl/sources/external/z80bits.html#1.1

Posted: **Wed 08 Jul, 2009 1:27 pm**

Hm ok, I compared it to:

Code: Select all

DE_Times_A:          ; HL = DE × A
    LD     HL, 0      ; Use HL to store the product
    LD     B, 8       ; Eight bits to check
_loop:
    RRCA             ; Check least-significant bit of accumulator
    JR     NC, _skip  ; If zero, skip addition
    ADD    HL, DE
_skip:
    SLA    E         ; Shift DE one bit left
    RL     D
    DJNZ   _loop
    RET

(from Asm in 28 Days, day 15)

Posted: **Wed 08 Jul, 2009 1:44 pm**

tr1p1ea wrote: I think most people use this one: http://map.grauw.nl/sources/external/z80bits.html#1.1

*http://baze.au.com/misc/z80bits.html#1.1

I don't know why Google ranks that outdated .nl one higher than the original.

[TI ASM] Optimizations
