MaxCoderz

Posted: **Sat 04 Oct, 2008 8:50 am**

So I'm coming back to TI programming in ASM after a year or so out of the loop, and it seems a lot more manageable now that I have some proper asm experience under my belt.

I just read and understood several tutorials on interrupts, and I want to make my own greyscale routine to practice with them. I'll probably end up using someone else's, since it's been done very well already by both Duck and Jim e, but I want to do this so that I at least understand how it works.

I've pretty much got the concept: the game runs, and when the interrupt triggers, the greyscale routine very quickly copies the active buffer to the screen, applying a rotating mask to it first so that the pixels flicker at a set rate and appear (due to LCD lag) to be at a set grey level.

What I want to know is simple compared to all of this: How does one interface with the LCD screen directly? As in, how would I write even a regular copy routine, without all the greyscale garbage? Worse, is there a hardware sync I need to worry about?

Thanks for any answers, looking to finally give my Pokemon project the start it deserves.

-gamefreak

Posted: **Sat 04 Oct, 2008 12:31 pm**

The rotating mask is optional.

To communicate with the LCD driver, you give it orders about what position you want the draw cursor to move to, and what direction to move after a write, and whether it's in 8 bit or 6 bit mode. Then there's the data port where you do writes to the screen.

There is a minimum time between any LCD writes. You poll the LCD driver in a loop to see if it's ready to accept another command.

$10 is the LCD control port (and polling port)
$11 is the LCD data port

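Example of LCD status polling:

ld c,$10 ;in f,(c) takes the port number from C, there is no in f,(n) form
lcd_busy_loop:
in f,(c) ;undocumented opcode; or just use any other register you want (e.g. in a,(c)), only the returned flags matter
jp m,lcd_busy_loop ;sign flag = bit 7 = busy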

Posted: **Sat 04 Oct, 2008 1:40 pm**

The LCD hardware is also documented in the SDK manual (TI-83 Plus Specific Information, Drivers Layer, Display, Writing Directly to the Display Driver).

Posted: **Sat 04 Oct, 2008 5:40 pm**

So do I understand correctly that you need to poll the display driver (lcdbusy) before every byte written?

If this is the case, it would make sense to try to "prep" the next byte before doing this, and then to ideally get your byte prep code done in less than the time that lcdbusy would take if called immediately after a write.

So, the obvious question that the SDK doesn't seem to want to answer: how many instructions/cycles does one normally have to wait between writes? I know it specifies 10us, but how does this translate into instructions?

Also, in X increment mode, if it "increments" off the right side of the screen, does it wrap around to the other side? Or do I need to do that myself?

-gamefreak

Posted: **Sat 04 Oct, 2008 7:00 pm**

Do not rely on instruction delays, as the timing changes depending on the speed of the Z80 (which depends on whether the user has selected a speed, or on the state of the batteries) and on the hardware (some calculators have a slower LCD driver). You should poll the busy bit, and (as you have correctly noted) it's good to prepare the byte you're about to output before polling the busy bit.

If you absolutely have to calculate timing: a Z80 running at 6MHz will go through 6,000,000 T-states per second, which is 6 per microsecond. 10 microseconds therefore comes out as 60 T-states.

Note that with LCDs the X axis is frequently vertical with the Y axis horizontal (I'm not sure why). So when auto-incrementing X (the TIOS default) the LCD's internal address counter moves down the display, not across. I'm not sure on the rules of wrapping (and can't see mention of any wrapping on the datasheet), but don't forget that the internal LCD RAM stores a 120x64 pixel bitmap so there's some data outside the area visible on the LCD itself.

Autoincrementing X is more popular than Y, as you only need to set the LCD address 12 times (once for each column) rather than 64 times (once for each row).
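As a rough illustration of the column-by-column approach described in this thread, here is a minimal sketch of a plain (non-greyscale) copy that polls the busy bit before every port access. It is not taken from any existing routine. `buffer`, `copyBuffer`, `lcdWait` and the loop labels are hypothetical names, and it assumes the driver is still in the TI-OS default X auto-increment, 8-bit mode. The command values $80 (row 0) and $20 to $2B (columns 0 to 11) are the same ones used by the greyscale routine posted later in the thread.

```z80
; Sketch only. Assumptions: `buffer` is a hypothetical 768-byte bitmap
; (12 bytes per row, 64 rows); the LCD driver is in the TI-OS default
; X auto-increment, 8-bit mode.
copyBuffer:
        ld hl,buffer
        ld c,$20                ; column command: $20 = column 0
colLoop:
        call lcdWait
        ld a,$80                ; row command: $80 = row 0
        out ($10),a
        call lcdWait
        ld a,c
        out ($10),a             ; select this column
        push hl                 ; remember the top of the column
        ld de,12                ; bytes per buffer row
        ld b,64                 ; rows per column
rowLoop:
        ld a,(hl)               ; prepare the byte before polling
        call lcdWait
        out ($11),a             ; write it; the driver moves down one row
        add hl,de               ; same column, next row in the buffer
        djnz rowLoop
        pop hl
        inc hl                  ; top of the next column
        inc c
        ld a,c
        cp $2C                  ; $20 + 12 columns = finished
        jr nz,colLoop
        ret

; Spin until bit 7 (busy) of the status port clears. Preserves AF.
lcdWait:
        push af
lcdWaitLoop:
        in a,($10)
        rla                     ; busy bit -> carry
        jr c,lcdWaitLoop
        pop af
        ret
```

Setting the row at the top of every column is redundant if the row counter wraps as described, but it costs little and does not depend on where the previous column left the address counter.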

Posted: **Sun 05 Oct, 2008 7:07 am**

The LCD wraps rows just like you would expect, both when going up and down.
The LCD wraps columns when going left and right, but it uses the full 120 pixel width instead of 96 pixels. It is also possible to get to an out-of-range column, either by changing between 8-bit and 6-bit mode, or by writing out-of-range column numbers to the port; those wrap back to zero at column #31.

You should probably also always set the LCD row to zero as the first thing you do, since there is no guarantee that the TIOS left the LCD row at zero.

Posted: **Sun 05 Oct, 2008 1:29 pm**

There are 2 major issues with grayscale. The first is flicker, the other is timing. Both of these dictate the quality of your image and also the amount of CPU time you have left over for the remainder of your program.

Most people know the basic idea behind 4lvl grayscale: constantly display the contents of your dark buffer for a longer amount of time than your light buffer. Black is on 100% of the time, dark gray 66%, light gray 33% and white 0%. Most people have tried the basic way of just switching buffers, and the result is often a flickery mess.

There have been several successful cases over the years where interlacing techniques have greatly reduced the amount of flicker in grayscale. It started off on a byte level, but has been taken to the bit level as well, which is what you see in the most modern grayscale routines like GPP/RGP.

The basic idea is to use a series of masks to interlace 2 bytes from your buffers into one byte ready for output. The masks you use are designed to do exactly what the principle behind 4lvl grayscale is: display the dark buffer for longer than the light buffer. With this in mind, your mask sequence could be:

110110110110110110110110

or broken down to:

11011011 01101101 10110110

And to prepare a byte you would do:

LCD = (darklayer AND mask) OR (light AND inverted mask)

AND mask with byte from Dark Layer
invert mask
AND mask with byte from Light Layer
OR resultant bytes together
output to LCD
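The steps above can be sketched directly in Z80. This is the straightforward version, not an optimised one, and the register assignment is an assumption made for illustration: HL points at the dark layer byte, DE at the matching light layer byte, and C holds the mask for this write. `interlaceByte` is a hypothetical label.

```z80
; Sketch only. Assumed register use (not from the thread):
; HL -> byte in the dark layer, DE -> matching byte in the light layer,
; C = mask for this write.
interlaceByte:
        ld a,(de)       ; A = light byte
        ld b,a
        ld a,c
        cpl             ; A = inverted mask
        and b           ; A = light AND (inverted mask)
        ld b,a          ; keep the light contribution
        ld a,(hl)       ; A = dark byte
        and c           ; A = dark AND mask
        or b            ; merge both layers
        out ($11),a     ; output to LCD (LCD delay/polling not shown)
        ret
; Worked example: dark = %11110000, light = %00001111, mask = %11011011
;   dark AND mask           = %11010000
;   light AND inverted mask = %00000100
;   result                  = %11010100
```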

The reason you invert the mask is that 11011011 will give you 75% of the dark layer, so if you invert (complement) it to 00100100 it will give you 25% of the light layer, which is what you want (you will notice that this is different from the 66%/33% mentioned earlier, but it's still gray).

Then after each frame you circularly rotate through the 3 masks. This is done because if you use the same mask for 2 (or more) successive frames, you will be showing certain bits from your gray buffers for too long; this throws out the timing, ruins the formula, and the result will not be very grayscaley!
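One possible way to cycle the three masks once per frame is to keep them in a small table in RAM and shift them round by one place. This is only a sketch: `masks` and `rotateMasks` are hypothetical names, and `.db` is an assembler-specific directive. Three calls bring the table back to its starting order.

```z80
; Sketch only: rotate a 3-entry mask table by one place per frame.
; After one call the table reads %10110110, %11011011, %01101101.
rotateMasks:
        ld hl,masks
        ld a,(hl)       ; A = mask 0
        inc hl
        ld b,(hl)       ; B = mask 1
        inc hl
        ld c,(hl)       ; C = mask 2
        ld (hl),b       ; slot 2 <- old mask 1
        dec hl
        ld (hl),a       ; slot 1 <- old mask 0
        dec hl
        ld (hl),c       ; slot 0 <- old mask 2
        ret

masks:
        .db %11011011,%01101101,%10110110
```

Because the table is written to, it has to live in RAM, which matters if the code itself ends up running from an APP.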

Timing is another important aspect of grayscale, both for reducing flicker and for improving speed. To achieve the best, most consistent image, you must sync your grayscale routine to the refresh rate of the calc's LCD. Jim e measured this to be around 62Hz, which means that for best results you should aim to have your routine update the LCD 62 times every second (62fps might be easier to understand, if you think of the 'f' as meaning 'grayscale frames' and not 'game/program frames'). The goal for everyone has been to perform the byte preparation (the masking and such) in less than the amount of time required between successive writes to the calc's LCD. In clock cycles this has been measured to be around 64-65ccs (which is a bit over 10 microseconds iirc). Achieving this means that your grayscale routine won't add any overhead over what is required to output a frame to the LCD (any optimising past this stage would just be for kicks).

A typical fastcopy is around 50,000ccs (probably a little less). This is as fast as you can possibly get it, since going any faster will violate the required delays, make the LCD controller unhappy and thus bork your display. This means that to output at the optimal ~62Hz, you are using up 50000 * 62 = 3100000ccs, or 51.67% of your available CPU time at 6MHz, just for the grayscale routine alone.

It is all well and good to poll until the LCD is ready, and this might be a preferable technique if you don't intend to use your grayscale routine for a game (it would be fine for titlescreens etc), since technically you can sync to 62Hz with a routine that takes over 96,000ccs, but this would eat almost 100% of your CPU! This is the sole reason why people have been crazy about optimising grayscale routines: they want as much processing power as they can get for their actual project. Of course, things like ccs are somewhat relative to a few factors like battery life and such. Also, not all calcs use the same LCD controllers, and even if they do, lots are configured differently (hence the need for ALCDFix (possibly one of the most useful TI programs ever made)). This is why ccs aren't always reliable for timing, and why it's not always a good idea to use them. That said, the bonus is that measuring by ccs theoretically can't cause problems with a correctly configured LCD (after using ALCDFix if need be), since slower instructions as a result of lower battery power will only increase your delays between writes a bit and thus keep the LCD controller happy anyway.

Err anyway, the best way to time your routine is of course to use an interrupt that fires at around 62Hz. There are many timer combinations you could come up with to achieve this.

For historical purposes, there was a rather long struggle to achieve 4lvl grayscale in under that magical LCD delay time of around 64-65ccs between writes. GPP had it at 77ccs and RGP had it at 73ccs. GPP had different timing that didn't properly sync to the LCD (~62Hz), so it could randomly use anywhere from 60%-80% or more of your CPU time. RGP used better timing to get it down to around 59% CPU time, I guess. I used a cheap 'inc h' trick to get it down to 70ccs, but that's still like ~55% CPU time. It kinda remained at that for a year or so until someone named doynax posted this magnificent realisation:

Code:

((A ^ B) & C) ^ B = (A & C) | (B & ~C)
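One way to convince yourself of the identity is to look at a single bit position: where the mask bit C is 1, the left side becomes (A ^ B) ^ B = A; where C is 0, it becomes 0 ^ B = B. That is exactly "A where the mask is set, B elsewhere". The snippet below walks through it with arbitrary test values (they are not from the thread), with the expected register contents in the comments.

```z80
; Worked example with arbitrary test values (not from the thread):
; A = %11110000 (dark), B = %00001111 (light), C = %11011011 (mask)
        ld a,%11110000  ; A
        ld b,%00001111  ; B
        ld c,%11011011  ; C
        xor b           ; A ^ B             = %11111111
        and c           ; (A ^ B) & C       = %11011011
        xor b           ; ((A ^ B) & C) ^ B = %11010100
; Right-hand side for comparison:
;   A & C               = %11010000
;   B & ~C              = %00000100
;   (A & C) | (B & ~C)  = %11010100
```

The saving comes from never having to build or store the inverted mask: the second layer is simply XORed in twice.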

He had a custom buffer implementation that ran at an average of 63.3ccs per write, which was extraordinary. The non-standard buffer kind of overcomplicated things and borked the overall routine, but he had made the magical realisation of refining the logic behind it when such a move had never occurred to anyone else. Making use of this new information, it was a simple mod to get my old one down to 59ccs, which was too fast anyway, so I just padded it with an inc hl to round it out to a nice safe 65ccs (there is a 64cc version but 65ccs is safer) ... something like this:

Code:

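inc hl                ; padding (6ccs); the constant held in sp absorbs the extra +1
add hl,sp             ; sp holds a negative constant: back to the next byte of the first buffer
ld a,(hl)             ; A = byte from the first buffer
inc h
inc h
inc h                 ; + 768 = byte in other buffer
xor (hl)              ; A ^ B
and d                 ; mask: (A ^ B) & C
xor (hl)              ; ((A ^ B) & C) ^ B
out ($11),a           ; 65ccs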

Jim was made aware of the logic change and also made use of the 'inc h' trick in his latest version of RGP ... though I'm not sure if it's publicly available. Although it should theoretically be ok to have it run at 64ccs, experiments have shown that it's too close to the limit and often throws calcs out of whack, so 65ccs seems a safe bet ... either way it's close enough (should be about 52% CPU time).

For fun there was a small discussion about how low you could get the interlacing down to (37 I think I managed) ... could be a fun exercise.

That should be a basic outline of bit-interlaced 4lvl grayscale and a little bit of history. I hope I didn't ruin any fun by posting some code. I also hope this stuff makes sense and that I haven't made any fatal errors (past midnight here).

Here's a full 4lvl routine for reference (I dug it up, but it should work ok). It could be improved upon marginally, but the result wouldn't be noticeable enough for me to be bothered. Yeah, it uses some smc, even in parts where it doesn't have to; it can still be made smaller (16 loops of 4 writes) or even APP compatible easily enough, I guess:

Code:

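;--------------------------------------------------------------
; gsRoutine - tr1p1ea
;--------------------------------------------------------------
; 4lvl bit-interlaced grayscale routine
; (sp is repurposed below, so interrupts must be disabled while this runs)
gsRoutine:
        ld (_smc_sp + 1),sp         ; save sp into the operand at _smc_sp (smc)

        ld a,$80
        out ($10),a                 ; set LCD row to 0

        ld a,(_smc_gr_3 + 1)        ; rotate the 3 masks one place for this frame
        ld b,a
        ld a,(_smc_gr_2 + 1)
        ld (_smc_gr_3 + 1),a
        ld a,(_smc_gr_1 + 1)
        ld (_smc_gr_2 + 1),a
        ld a,b
        ld (_smc_gr_1 + 1),a

_smc_gr_1:
        ld d,%11011011              ; operands are rewritten by the rotation above
_smc_gr_2:
        ld e,%01101101
_smc_gr_3:
        ld c,%10110110

        ld hl,gActiveBuf1+756       ; first inc hl / add hl,sp lands on gActiveBuf1+0

        ld a,$20                    ; LCD column 0

gsCopy_1:
        ld (_smc_gr_4 + 1),a        ; remember the current column command (smc)
        ld sp,-757                  ; +1 (inc hl) +768 (inc h x3) -757 = +12 = next row
        ld b,7                      ; 7 x 9 writes = 63 rows, the 64th follows the loop
        out ($10),a                 ; set LCD column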
gsCopy_2:

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and d
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and e
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and c
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and d
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and e
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and c
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and d
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and e
        xor (hl)
        out ($11),a

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and c
        xor (hl)
        out ($11),a

        djnz gsCopy_2

        inc hl
        add hl,sp
        ld a,(hl)
        inc h
        inc h
        inc h
        xor (hl)
        and e
        xor (hl)
        out ($11),a

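        ld sp,-767                  ; back into the first buffer, one column further on
        add hl,sp
_smc_gr_4:
        ld a,$00                    ; operand = column command saved at gsCopy_1 (smc)
        inc a
        cp $2C                      ; $20 + 12 columns = finished
        jp nz,gsCopy_1
_smc_sp:
        ld sp,$0000                 ; operand = original sp saved on entry (smc)
        ret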

Posted: **Sun 05 Oct, 2008 8:07 pm**

I'm *almost* following the replies here, except that I'm thrown off by the interrupts a bit. I was under the impression that interrupts were set in stone, that they automatically fired off 140 times a second or something bizarre like that. I was ready to accept this as a limitation of the TI hardware configuration, but you seem to know how to set the interrupt to a different rate. How do you achieve this?

I'll note that my goal here is to get a routine running in an APP, but for now I'm playing in a "native" asm program, as it seems to be the easiest to debug using Latenite.

So let me see if I understand the logic on the LCD driver:

If I run in "X increment" mode, the display driver will automatically move positive along the Y axis, and will wrap around to the top of the next column along the X axis when it reaches the end? (I think of the coordinates on screen in the same manner as the BASIC Pxl-On() command, similar to most computer graphics libraries, so 0,0 is the top left of the screen.) That doesn't seem to agree with what you said earlier, since if this were the case I would not need to set the coordinates at ALL except for the first time to get it to 0,0.

I'll probably play around with that a lot on my own. I'm most interested in the timer registers for interrupts though, as I can't seem to find any documentation on interrupts or timers in the TI-83P SDK manual.

And with the discussion of greyscale comes another question that may or may not be relevant, but I still want an opinion: for the buffers (I'm going to need to double buffer a greyscale routine, yes?) I will basically need 768x4 bytes of RAM. Ideally, I would like to keep more general saferam open, since I'm going to need to store a lot of other information in RAM while the game is running, namely all of the information needed to describe your items and the pokemon in your party, etc. Is it a good idea to request the RAM from the OS, or should I try to have, say, the active buffers in a different place? Is there a speed gain? Is certain RAM faster than other RAM? (I've seen this on a lot of other systems, not sure here.)

Thanks for all the info so far guys, this is going to probably be successful. ^_^

-gamefreak

*EDIT* OK, I've been playing around with the LCD for a while, and it seems the columns do not wrap around like I thought they would; I do have to set the coordinates back manually after every column. No big loss really.

Interestingly, I find it easier to move from the right to the left (column wise) when copying my buffer. I dunno, I like decrementing my column counter rather than incrementing it, so call me weird. ^_^ I don't suppose it matters which way I go so long as the working buffer is set up in a standard way.

Posted: **Sun 05 Oct, 2008 11:28 pm**

Whoa, tr1p1ea, very insightful post.

Interrupts use port 3 and port 4.

Posted: **Tue 07 Oct, 2008 12:26 am**

Wait, tr1p1ea already posted that once. Why didn't someone just link to the old post? If I recall, Jim e made a bunch of images to illustrate the points as well.

Edit: I couldn't find the post, but here are the images.

The final gray image: [image]

Dark Layer: [image]

Light Layer: [image]

Two ways to achieve this:
Full flipping: [image]

Interlaced flipping: [image]

Posted: **Wed 08 Oct, 2008 1:49 am**

Wow, those illustrate a lot of things quite well. Technically, proper interlacing is done on a bit level (the image shows byte level interlacing, although that would probably work as well) but a good illustration nonetheless.

I'll probably work on this over the week, and then I'll post my code (if it works) and ask for help optimizing it, as I'm not terribly efficient with my z80 yet. Thanks guys, the interrupt info was exactly what I needed.

-gamefreak

Posted: **Wed 15 Apr, 2009 8:18 pm**

ok, this thread is old, but i got an idea:

tr1p1ea wrote: "cheap 'inc h' trick"

can't you just swap the damn ram page?

Posted: **Wed 15 Apr, 2009 9:42 pm**

That's a much bigger piece of RAM than necessary though, and not faster (or is it? show the codez?), so why?
And it has great potential to mess other things up..

And on the plain old 83+, you are very restricted in how you could do this. Port 5 doesn't work the same way.

Posted: **Thu 16 Apr, 2009 2:05 am**

inc h \ inc h \ inc h
12 clock cycles (3 x 4).

ld a,page \ out (port),a
18 clock cycles (7 + 11).

Posted: **Thu 16 Apr, 2009 2:33 pm**

damn...

it was worth a try, tho

[TI-ASM] Understanding Greyscale