Multitasking on PIC in JAL


Multitasking and other fun on the PIC

By Michael Watterson © 2010. All rights reserved. May be copied or printed for personal use only.

Embedded applications usually have some form of Real Time Operating System (RTOS) or multitasking. We examine why the Microchip PIC microcontroller is ill-suited to this approach and describe an alternative methodology for Real Time Systems when programming the PIC in JAL. In passing, various techniques of embedded programming in JAL are illustrated, with some suggestions on selecting a PIC and on when an alternative CPU should be used.

Which PIC and Why is Multitasking difficult?
There is a confusing variety of PIC microcontrollers: 10F, 12F, 16F, 17F, 18F, dsPIC and PIC32. Initially we consider only the 8-bit PIC family (see Choosing a PIC in the appendix), as it is supported by JAL (Just Another Language) and has an architecture that makes traditional multitasking solutions difficult.

The architecture is basically Harvard type, with the program in Flash, and all the RAM shares one address space with the registers. Microchip actually calls the RAM the "register file", and the operand mnemonic is "f". Parts below the 18F (10F, 12F, 16F) allow only small values of f and rely on bank-switching flags, even on parts with 256 bytes of RAM. The 18F series can have 8-bit or 12-bit address sizes for f. This means even the large 18F parts (80 pins) can only really use external RAM for table operations, not for general register instructions, because external RAM sits on the same bus and address space as the program instructions, not the register file. These PICs also have almost all 4096 bytes of "register file" address space in use, as the special function registers share that address space with the 3905 bytes of RAM.

The most serious issue is the stack. A stack is usually a block of RAM used to save and restore return addresses, CPU state and parameters for procedure, function and interrupt calls. Since the PIC doesn't really have RAM in the conventional sense (only the register file), the stack is implemented with dedicated memory and its own pointer, and it holds only return addresses. The lowest-spec (baseline) PICs have only 2 levels. The mid-range 16F series has 8 levels. The 18F has the most at 31 levels. Also, only the 18F has PUSH and POP instructions to manipulate the stack. Even the 18F stack is not enough for traditional task switching, as you can only have one stack. There are of course awkward work-arounds, but they are slow and use a lot of scarce RAM.

If this seems strange, it's because in 1977 the General Instrument PIC1650 was designed as a Programmable Interface Computer for their 16-bit CPU, which had poor I/O features. This part "lives on" as the PIC16F54, which is an almost identical Flash version. The 17F was the first attempt to really enhance the architecture. The 18F is the successful second attempt!

Review of Basic Multitasking concepts.
Even in the earliest days of microprocessors in the mid 1970s it was realised that a single CPU would eventually reach a performance plateau. There was also 30 years' experience of high-level language programming. UNIX was new and shiny, and Batch Time Sharing mainframe OSes were looking a bit tired. But Fortran, C, Pascal, BASIC (originally a cut-down Fortran) and Cobol (the mainstream languages by the late 1970s) had no built-in concepts of multi-threading, multitasking or concurrency, even though multi-user operating systems with live terminals, rather than batch card or tape jobs, were appearing.

Analysing systems using Data Flow Diagrams (DFD), and Digital Signal Processing (DSP), naturally leads to solutions that are more obvious to implement as parallel tasks or processes. Industrial control and other embedded applications are naturally thought of as "real time systems": parallel tasks with known response times. In contrast UNIX was not (and even Linux today still is not) a "Real Time System". This has nothing to do with how fast your CPU is. Even a task that repeats every 2ms and needs only microseconds to execute can't be reliably run on time on OS X, Linux or NT (aka Win2000, XP, Vista, Win7), no matter how fast the CPU is or how little else the user is doing, without writing a hardware-level device driver. Writing a device driver is not trivial and, on some OSes, may require "signing", expensive tools, an SDK, Non-Disclosure Agreements and much study.

In this discussion we don't consider the different levels of concurrency on Intel CPUs, Java or Windows. We will use Task, Process and Thread as synonyms for units of parallel execution, with inter-task communication via signals or other shared memory. Generally we will use the more generic "Task". A Signal is assumed to be "atomic", i.e. no matter when an interrupt occurs the Signal is either set or not. A larger area of shared memory is not atomic and requires a Signal set up as a Mutex (Mutual Exclusion). Even a bit or Boolean variable may not be atomic. Which kind of variable is suitable as a signal depends on the compiler and target. Signals are required for safe inter-task communication and synchronisation.

How did we get where we are?

Early solutions were co-operative multitasking using co-routines, as implemented by Modula-2 (1978, PIM2 in 1983, ISO version 1996), and Occam (released 1983), which was based on David May's work on the "Experimental Language for Distributed Computing" (EPL) and Tony Hoare's "Communicating Sequential Processes" (CSP), started in the late 1970s.

Both have a similar concept of process synchronisation, the Signal. Occam goes much further, being a fully concurrent language.

However it is implemented, and whatever syntax the language or library gives you, the underlying assumption is usually that each task (thread, process) has its own separate memory area, especially the Stack (used for calls and parameters) and Heap (used for temporary variables). With a conventional CPU (i.e. not a PIC) we can have a scheduler that keeps a table of Stack pointers. Switching task is then as simple as saving the current Program Counter, Heap pointer and CPU state on the Stack, saving the Stack pointer in the table entry for that task, restoring a previously saved Stack pointer, and then restoring the CPU state, Heap pointer and Program Counter from that stack.

Each Task (aka process, thread, co-routine) thus has its own non-overlapping block of memory, except perhaps for Signal (Mutex) protected communications / Message or other shared memory.

Each Task runs until it sends or awaits a signal. If no task is awaiting the signal, the sending task is suspended until one is. If a task awaits a signal, it is suspended until the signal is received. This is task synchronisation by semaphore.

A Mutex is simply created by using a signal. A task initialises the resource and "sends" the signal; that task then waits for a signal once the first user has consumed it. Each task that requires the resource awaits the signal before use and sends the signal after completion. The Resource task consumes that and sends a new signal, so that the task using the resource does not have to wait until someone else needs the resource.

Depending on Signal/Semaphore implementation and the language or library, a mutex may be differently implemented.


Penalties on a regular CPU

1. Compared with a purely sequential program, a parallelised program may use much more RAM.
2. On a CPU without Virtual Memory and a MMU, the programmer may have to decide how much memory to allocate to a task at creation. It may be difficult without MMU/Virtual Memory to change the allocation at run time.
3. Without VM/MMU, ensuring that array bound violations and Stack or Heap overflows don't destroy more than one process is nearly impossible.
4. Deadlock is possible in any parallelised system. Avoiding it requires different programming skills.
5. Any shared resource must have Mutual Exclusion (Mutex) so that only one process at a time has access, like a toilet cubicle.
6. Any shared code must be "re-entrant". If it uses a global value to hold state (static variable), this needs to somehow be unique to each process.

Advantages
1. Digital Signal Processing (DSP) is much simpler to implement.
2. Data Flow Diagrams (DFDs), especially those based on real-time sampling of I/O, are much easier to implement.
3. A delay "loop" (or While Busy do Nothing) does not "waste" CPU or block other tasks. It takes a fixed overhead no matter how long the delay.
4. Real-time response / embedded control is much easier to write.
5. Even interrupts could become unnecessary, implemented by Signals instead.
6. If you have an unknown number of cores or distributed CPUs, and there is more than one, then a well designed parallelised program with a suitable Real Time Operating System can map different Tasks to different cores or CPUs.

Problems with Multitasking on PIC
1. There is simply not the RAM for the conventional model.
2. No RAM-based Stack. It's a hardware stack, and there is only one of it. You would have to empty and refill the entire stack on a task switch. Below the 18F (10F, 12F, 16F) you can't push and pop items on the stack at all; it only holds return addresses for calls and interrupts.
3. Even on the 18F, you could only easily switch tasks at the main level of a "task", not within a procedure or function call.

Looks bad? Take a break from Multitasking for a moment and consider JAL.

JAL
Why "Just Another Language" (JAL) rather than C, Basic, Forth or even Pascal or Modula-2?

The design of these languages assumes the ample RAM, accessible Stack and so on that "regular" CPUs have. C, Basic and Forth do exist for the 16F and 18F. Forth obviously uses a fabricated software stack rather than the real Stack pointer. Parameters and function return values are normally passed on the Stack, but on the PIC they have to be RAM based, i.e. in the "general purpose register file". JAL and the JAL compiler, in contrast, were designed especially for the quirky architecture of the 16F. JAL also has some features for embedded programming that are not part of standard C, Basic or Forth:

1) The ChipDef / Chip Include. This allows the JAL compiler and the JAL programmer to use hundreds of different PICs: some 10F, some 12F, most 16F and most 18F. This is currently about 345 CPU models. There is now an automated process to create these from new Microchip datasheets.

2) ALIAS ... IS. This allows pins, ports and registers to be given meaningful, application-dependent names.

3) AT. A variable declaration can be "at" a register file location or at another variable. AT connects a variable name to a particular pin, port or register address, or it can act like a "union".

var  dword  fred 
var byte bill[4] at fred

This means bill[0] is the 1st byte of fred and bill[3] is the last byte of fred. A dword is always 32 bits.

If you need a 64 bit variable you can write
var byte*8 bill

Oddly, there is NO main program in the sense of main() in C. The logic is that when you declare a variable that is a port you may want to set it up with a value or a direction straight away.

var byte bill[5] = "Hello"
alias  GLCD_LED           is pin_E1
alias  GLCD_LED_direction is pin_E1_direction
GLCD_LED_direction = output
GLCD_LED = on

Typically the last part, textually, of your JAL file will be

 
forever loop
	-- do stuff
	-- do more stuff
end loop 

Embedded programs generally have this as the last part of the main program, or else the CPU would either halt or execute NOP (no operation) instructions until the program counter wrapped around and restarted the program.

If you had a Real Time Operating System or scheduler, it might be started here, or the loop might just have "suspend" in it.

Basics of Real Time
An embedded or Real Time system is not about speed per se, but about timely response. It is best considered in terms of data flow and sampling: inputs must be read and outputs updated at a certain rate, and the processing in between has a maximum latency.

1. Timing and Sampling.
Not all tasks have the same time constraints, and not all inputs or outputs require the same response time. First identify the highest-speed input. If it must be polled rather than generating an interrupt, the sample clock must be at least twice as fast as the input (Nyquist). This sets the minimum rate of the master Real Time Clock interrupt. There may be a limited number of hardware counters and interrupts, but really we only need one, and one is more efficient in its use of Stack than several, as several interrupts may otherwise need to overlap.

Identify all the slower events, sampling and timing required. See if there is a simple common denominator that gives a faster, but still reasonable, clock interrupt. It may be that some required intervals are not an exact integer multiple of this master interrupt period. In that case the ISR (Interrupt Service Routine) uses two software counts to achieve fractional multiplication. All the higher-speed events should be an exact integer multiple. In some cases an inexact multiple is fine.

Example
We need to sample a 1200Hz external clock on a pin and read a data pin, so we must sample at 2400Hz or faster, a period of at most 416.67 microseconds. Our tick is 64us. In this case simply over-sample at 256us, i.e. 4 interrupt ticks counted in software. This is 3906.25Hz, which is about 3.25 times the clock and exceeds the minimum x2. Counting 5 ticks gives 320us, which would also work, as would 6 ticks at 384us. But 7 ticks at 448us is too slow. Also the 1200Hz may not be exactly 1200Hz and is not synchronised to the CPU clock, and the CPU clock may not be exactly as expected. So over-sampling at a higher rate reduces jitter and increases the margin if the remote system is in error. USARTs (PC serial ports) typically over-sample at x4 to x16.

For a non-integer multiple we periodically switch the software counter that counts the "ticks" between two reload values, such that the average count is correct. This does introduce some jitter, equal to (difference in counts) x (tick period).
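
The alternating reload can be sketched roughly as follows. This is only an illustration, not code from the application described here: fractional_tick, tick_count, use_long, event_due and the two constants are hypothetical names, and the procedure is assumed to be called once per master tick from the ISR. With a 64us tick, alternating 6 and 7 ticks averages 6.5 ticks, i.e. 416us, with 64us of jitter.

```jal
-- Sketch only: all names below are hypothetical.
const byte COUNT_SHORT = 6          -- ticks in a short interval
const byte COUNT_LONG  = 7          -- ticks in a long interval

var byte tick_count = COUNT_SHORT   -- ticks left in the current interval
var bit  use_long   = true          -- the next interval is the long one
var volatile byte event_due = 0     -- byte semaphore for the main loop

-- Assumed to be called once per master tick from the timer ISR.
procedure fractional_tick() is
   tick_count = tick_count - 1
   if tick_count == 0 then
      -- alternate 6 and 7 ticks so the average interval is 6.5 ticks
      if use_long then
         tick_count = COUNT_LONG
      else
         tick_count = COUNT_SHORT
      end if
      use_long = ! use_long
      event_due = event_due + 1     -- tell the main loop an interval elapsed
   end if
end procedure
```

Other ratios need a different pattern of short and long intervals (for example two short to one long), but the idea is the same: only the long-run average has to be right.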

When the software counter/multiplier has counted down, the counter is reloaded with its default value. The nature of the task decides what happens next. If you have an 18F, you can call a procedure or function that takes longer than the tick time but less than (tick time) x (software count), as the HW stack is 31 deep. On a 10F, 12F or 16F, due to the shallow HW stack (2 to 8 levels only), the task is ideally inline and must be completed before the next tick. If the task repetition time is longer than the time to execute the main "forever loop", you instead increment a semaphore in the ISR. A section in the main loop is skipped if the semaphore is zero; otherwise it executes the extra code and decrements the semaphore (which must be a single byte on an 8-bit PIC).
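
A minimal sketch of that semaphore hand-over, assuming a timer ISR already exists: slow_task_due, on_task_interval, do_slow_task and runs are hypothetical names, and the incrementing procedure is meant to be called (or pasted inline) in the ISR at the point where the software counter for the task expires.

```jal
-- Sketch only: all names below are hypothetical.
var volatile byte slow_task_due = 0   -- byte semaphore shared by ISR and main loop
var byte runs = 0                     -- stands in for the task's real work

-- Called from the timer ISR when the software counter for the task expires.
procedure on_task_interval() is
   slow_task_due = slow_task_due + 1
end procedure

-- The task itself, which may take longer than one tick.
procedure do_slow_task() is
   runs = runs + 1
end procedure

forever loop
   if slow_task_due > 0 then
      do_slow_task()
      slow_task_due = slow_task_due - 1   -- task completed
   end if
   -- other sections of the main loop go here
end loop
```

Because the semaphore counts rather than just flags, a tick that arrives while the task is still running is not lost; the section simply runs again on the next pass through the loop.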

2. Functions that wait internally
This also includes procedures that must wait for an external event or that have "out" parameters.

We are essentially implementing a Data Flow architecture, so everything is in the main forever loop or is called repeatedly by an interrupt. The solution is for the routine (function, procedure) to use either a shared "private" global variable, if several routines use the same resource (say serial I/O), or a non-shared, routine-specific "private" global. The first action is then to test the semaphore or mutex variable. If it is not in use, we set it and do the first part of the task. If it is in use, we check whether it holds "our" value and whether the resource required (time, serial port, whatever) is ready. If it is ready, we service it and set the semaphore/mutex back to "empty". If it is for "us" but the resource isn't ready, we return.

All data has to be returned via "out" parameters. The function returns false if it only did the first part of the "task", if the "resource" needed to complete the task was busy, or if some other function has the "resource". It returns true only when the resource was accessed and the function has reset the semaphore/mutex to "empty".

Thus, in the procedures called by the main "forever loop", the code after the function call is skipped repeatedly until there is valid data in the "out" parameters.

This method makes "blocking" I/O non-blocking for the main "forever loop" and makes "While busy do nothing" and "delay (xxx)" take almost no overhead in the main loop. We have turned the main "forever loop" into a basic round robin scheduler without the overhead of multiple stacks and an RTOS kernel.
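
Stripped of any real hardware, the pattern might look something like the sketch below. Every name in it (port_owner, OWNER_NONE, OWNER_READER, reply_ready, reply_value, read_reply, latest, last_good) is hypothetical, and reply_ready / reply_value merely stand in for whatever "data available" flag and data register the real resource provides.

```jal
-- Sketch only: all names below are hypothetical.
const byte OWNER_NONE   = 0
const byte OWNER_READER = 1

var byte port_owner  = OWNER_NONE    -- shared "private" global used as the mutex
var bit  reply_ready = false         -- stands in for e.g. "serial data available"
var byte reply_value = 0             -- stands in for the received byte

function read_reply(byte out result_byte) return bit is
   var bit done = false
   result_byte = 0                   -- default while no data is available
   if port_owner == OWNER_NONE then
      port_owner = OWNER_READER      -- resource free: claim it, start the request
   elsif port_owner == OWNER_READER then
      if reply_ready then
         result_byte = reply_value   -- second part of the task
         reply_ready = false
         port_owner = OWNER_NONE     -- release the resource
         done = true
      end if
   end if
   -- any other owner: someone else has the resource, so just report false
   return done
end function

var byte latest
var byte last_good = 0

forever loop
   if read_reply(latest) then
      last_good = latest             -- latest is only valid inside this branch
   end if
   -- the rest of the main loop carries on regardless
end loop
```

The caller never waits: each pass through the loop either gets valid data or falls straight through to the next section.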

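Example function:

-- Non-blocking read of the frequency (4 BCD bytes) and mode from the rig.
-- Returns true only when the whole reply has been read.
function CAT_FreqBCDRd(byte out ModMode,  byte out bcdfreq[4]) return bit is
var bit status = false
var byte digit
    -- default "out" values, returned while no reply is available
    bcdfreq[0] =0
    bcdfreq[1] =1
    bcdfreq[2] =0
    bcdfreq[3] =0
    modMode = 16           -- outside the range the caller accepts (newMode < 15)
    if  ( _CAT_QueuedCmd == CMD_NONE ) then
        -- resource is free: send the read command (first part of the task)
        CAT_SendCmd(CAT_NO_PARAM, CMD_FreqRd, true)
        status = false
        --suspend
    elsif (_CAT_QueuedCmd == CMD_FreqRd) then
        -- the queued command is "ours": see if the reply has started to arrive
        if (serial_hw_data_available)  then
            digit = 0      -- get back the frequency
            bcdfreq[digit]= serial_hw_data
            for 3  loop
               digit = digit +1
               if  RigResponds() then
                   bcdfreq[digit]= serial_hw_data
               else
                   exit loop          -- link lost in mid reply
               end if
            end loop
            if RigResponds() then    -- read the Mode
                modMode = serial_hw_data
                status = true
            end if
            _CAT_QueuedCmd = CMD_NONE    -- release the resource
        elsif ( _CAT_retryCount < 1) then
             -- no reply and retries used up: release the resource
             _CAT_QueuedCmd = CMD_NONE
        else
             -- no reply yet: count down the retries
             _CAT_retryCount = _CAT_retryCount -1
        end if
    end if
    -- any other queued command: some other function has the resource
    return (status)
end function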

In this case ( _CAT_QueuedCmd == CMD_NONE ) tests the shared mutex/semaphore to see if the resource (a serial port) is free. The function has to send 5 bytes via serial and wait up to 350ms for the remote device to return 5 bytes. Once the remote responds, the 5 returned bytes come in a block, but we have tests in case the communication link is lost or the remote is turned off in the middle of transmission.

CAT_FreqBCDRd could thus be called many times before it does anything. Since we know there is no point in interrogating the remote device more than 3 or 4 times a second, we have a timer that sends a "signal" to part of our main loop, so most of the time this part of the main loop (shown below) is skipped:

Part of main loop:

  -- loop
        if CheckRadio > 0 then
            case rigModeNow of
            RIGMODE_RX: block
                     if !(rigModeNow  == rigModeOld) then
                          if ! clock_on then
                              DrawClockFace(19,35,17, on, off)
                              clock_on = true
                              DisplayWait()
                              DrawMeterFace(42, 16, S_METER_HEIGHT,S_METER_SCALE, on)
                          end if
                          pttStatus = off
                          if CAT_PTT(pttStatus) then
                               rigModeOld = rigModeNow
                               linkFailTime = LINK_RETRIES
                          end if
                     else
                         RIG_RxPoll()
                     end if

                 end block

CheckRadio is incremented by the Timer ISR.

Then in RIG_RxPoll() we interrogate the remote radio:

Procedure DisplayFrequency() is
var volatile byte rigFreq[4]
var byte rigFreqText[10] = "430.125,00"
var byte modmodeText[3]
var byte newMode
    if CAT_FreqBCDRd(newMode,  rigfreq) then
        if BCD4toString10 (rigFreq, rigFreqText) then
            timedOut = false
            if newMode < 15 then
                modMode = newMode
            end if
            ScreenCharXY(0,0)
            CharStyleDouble = on
            print_string(ScreenChar,rigFreqText)
            CharStyleDouble = off
            ScreenCharXY(20,1)
            CharStyleBold = on
            ScreenChar ="0"
            CharStyleBold = off
        end if
        ScreenCharXY(15,7)
        if RIG_ModeText(modMode,modmodeText) then
            ScreenChar = "n"
        else
            ScreenChar = " "
        end if
        print_string(ScreenChar,modmodeText)
        linkFailTime = LINK_RETRIES
    elsif (linkFailTime < 1) then
        if ! timedOut then
           DisplayWait()
           timedOut = true
        end if
    else
        linkfailtime = linkfailtime -1
    end if
end procedure

If CAT_FreqBCDRd doesn't return "true" after a reasonable time, we assume the communications link is broken or the remote unit is turned off.

To Be Continued! ...

Appendix
Complete "Blink LEDs"
for the 18F4550
include 18f4550

-- even though the external crystal is 20 MHz, the configuration is such that
-- the CPU clock is derived from the 96 MHz PLL clock (div2), therefore set
-- target frequency to 48 MHz
pragma target clock       48_000_000

-- fuses
pragma target PLLDIV        P5          -- divide by 5 - 20MHZ_INPUT
pragma target CPUDIV        P2          -- OSC1_OSC2_SRC_1_96MHZ_PLL_SRC_2
pragma target USBPLL        F48MHZ      -- CLOCK_SRC_FROM_96MHZ_PLL_2
pragma target OSC           HS_PLL
pragma target FCMEN         DISABLED
pragma target IESO          DISABLED
pragma target PWRTE         ENABLED    -- power up timer
pragma target VREGEN        ENABLED     -- USB voltage regulator
pragma target VOLTAGE       V20         -- brown out voltage
pragma target BROWNOUT      DISABLED    -- no brownout detection
pragma target WDTPS         P32K        -- watchdog postscaler setting
pragma target WDT           DISABLED    -- no watchdog
pragma target CCP2MUX       pin_C1      -- CCP2 pin
pragma target PBADEN        DIGITAL     -- digital input port<0..4>
pragma target LPT1OSC       LOW_POWER   -- low power timer 1
pragma target MCLR          EXTERNAL    -- master reset on RE3
pragma target STVR          DISABLED    -- reset on stack over/under flow
pragma target LVP           DISABLED    -- no low-voltage programming
pragma target XINST         ENABLED     -- extended instruction set
pragma target DEBUG         DISABLED    -- background debugging
pragma target CP0           DISABLED    -- code block 0 not protected
pragma target CP1           DISABLED    -- code block 1 not protected
pragma target CP2           DISABLED    -- code block 2 not protected
pragma target CP3           DISABLED    -- code block 3 not protected
pragma target CPB           DISABLED    -- bootblock code not write protected
pragma target CPD           DISABLED    -- eeprom code not write protected
pragma target WRT0          DISABLED    -- table write block 0 not protected
pragma target WRT1          DISABLED    -- table write block 1 not protected
pragma target WRT2          DISABLED    -- table write block 2 not protected
pragma target WRT3          DISABLED    -- table write block 3 not protected
pragma target WRTB          DISABLED    -- bootblock not write protected
pragma target WRTD          DISABLED    -- eeprom not write protected
pragma target WRTC          DISABLED    -- config not write protected
pragma target EBTR0         DISABLED    -- table read block 0 not protected
pragma target EBTR1         DISABLED    -- table read block 1 not protected
pragma target EBTR2         DISABLED    -- table read block 2 not protected
pragma target EBTR3         DISABLED    -- table read block 3 not protected
pragma target EBTRB         DISABLED    -- boot block not protected

enable_digital_io()


-- --------------------- Main Program  ----------------------

alias led1 is   pin_b1
pin_b1_direction = output
alias led2 is  pin_b2
pin_b2_direction = output
alias led3 is  pin_b3
pin_b3_direction = output

const byte NUM_DELAYS = 3


var byte CheckDelay[NUM_DELAYS] = {0, 0, 0}   -- semaphores


var byte timer[NUM_DELAYS] = {0, 0, 0}  --internal delay timer

-- ms counts RTC ticks of 1024 usec; usable range is 1 to 254, because ms +1 must still fit in a byte
function Delay(byte in instance, byte in ms) return bit is
var bit timedout = false
    if instance < Count (CheckDelay) then   -- only bother if it exists
        if CheckDelay[instance] > 0 then
           if timer[instance] < 1  then        -- 1st time ever call
              timer[instance] = ms +1
           elsif timer[instance] ==1 then      -- timeout
              timer[instance] = ms +1
              timedout = true
           else
              timer[instance] = timer[instance] -1
           end if
           CheckDelay[instance] = checkDelay[instance] -1
        end if
    end if
    return (timedout)
end function



procedure BlinkLed1() is
     If Delay(0, 100) then
        led1 =  ! Led1
     end if
end procedure


procedure BlinkLed2() is
     If Delay(1,120) then
        led2 =  ! Led2
     end if

end procedure

procedure BlinkLed3()  is
     If Delay(2,200) then
        led3 =  ! Led3
     end if
end procedure

-- The RTC always increments a Semaphore by 1 if it's time to do a task
-- The Task always decrements the Semaphore by 1 when it has completed.


const TICK_INIT = 48           -- 48 TMR0 overflows x 21.33us = 1024us = 1.024ms

var byte ticks = TICK_INIT          -- timer
  
procedure RTC() is
   pragma interrupt
var byte instance
   if INTCON_TMR0IF then
       INTCON_TMR0IF = off        -- clear the timer 0 interrupt flag
       ticks = ticks -1           -- counts TMR0 overflows (21.33us each at 48 MHz)
       if (ticks < 1) then
    	   ticks = TICK_INIT
    	   for Count (CheckDelay) using instance loop
               CheckDelay[instance] = CheckDelay[instance] +1
           end loop
       end if
   end if
end procedure                   -- end of ISR

-- Main Program

block   
    --RTC setup

    T0CON_T0CS = 0                      -- TMR0 on internal clock
    T0CON_PSA = 1                       -- prescaler not assigned,
                                        -- so no prescaler for TMR0 (= default)
    INTCON_TMR0IE = on                  -- if your PIC freezes, move these lines
    INTCON_GIE = on                     -- to see if the ISR causes trouble
    forever loop
        BlinkLed1()

        BlinkLed2()

        BlinkLed3()
    end loop
end block

Choosing a PIC

10F, 12F, 16F or 18F?
Parts below the 18F series have only 2 to 8 stack entries and only up to approximately 350 bytes of RAM. The 18F series has up to 31 stack levels (still a fixed hardware stack, i.e. you can't change the Stack Pointer to relocate it elsewhere in RAM).
There are about 10 models each of 6, 8 and 14 pin PICs in the 10F, 12F and 16F series.
The 18F series is not available in less than an 18 pin package. The lesser 10F, 12F and 16F series are available with as few as 6 pins, but have no 8-bit HW multiply, no PUSH/POP, only 2 to 8 stack levels and less than 350 bytes of RAM. Most have 2K to 4K words of program memory. The 18F typically have 8K to 64K words. Because an instruction "word" is more than 8 bits but less than 16 bits, the PIC 10F, 12F and 16F series don't store lookup tables efficiently, as most lookup tables use 8-bit or 16-bit values. The 18F uses a 16-bit word, so lookup tables are more efficient. There are hundreds of 18F models.
JAL currently supports over 350 PICs from the 10F, 12F, 16F and 18F families, including 18FxxJxx and 18FxxKxx devices.

We don't consider the 17F here, as it's an obsolete forerunner of the 18F. Nor do we consider the higher-end dsPIC or 24F series, as an ARM or PIC32 (really a MIPS core) is a better choice. Some people mistakenly think the 18F is a 16-bit CPU, as Microchip refers to it as a 16-bit core, but they do group it as an 8-bit CPU. It's less 16-bit than an 8-bit Z80! The 16 bits is only the instruction size. All the PIC 10, 12, 16, 17 and 18 are 8-bit CPUs, slightly similar to the 8051 rather than the 8080, Z80 or even 6502/6800 type family.

For 18 pins or more, only consider the 18F family.

http://www.microchip.com/ParamChartSearch/chart.aspx?branchID=1004&mid=10&lang=en&pageId=74

Selection and Parametric search of all 8bit PIC
http://www.microchip.com/stellent/idcplg?IdcService=SS_GET_PAGE&nodeId=2696&param=en537796

Getting JAL
http://code.google.com/p/jallib/ (Downloads on Right of page)
see also
http://groups.google.com/group/jallib
http://tech.groups.yahoo.com/group/jallist/
http://www.casadeyork.com/jalv2/
http://groups.google.com/group/jaledit
http://groups.google.com/group/jaluino
http://justanotherlanguage.org/
