High density floppy disc access

From BeebWiki
Revision as of 19:37, 6 October 2019 by Regregex (talk | contribs) (Balanced-stack ISRs: updated explanation of NMI latency)
Screenshot of a 510 KB catalogue

It is possible to read, write and format floppy discs on the BBC Micro in high density (that is, at 500 kHz bandwidth in MFM) with suitable hardware and a specially coded floppy disc controller driver. Besides DMA, which would require extensive modifications to the machine, there are many possible approaches to achieving the higher throughput including polling, conventional ISRs and a self-interrupting ISR proposed by Tom Seddon. The concept has been developed and proved with conventional ISRs and a supporting busy loop, driving a slightly modified disc interface from 1984. Naturally any of the techniques mentioned here require a high density disc and disc drive.

Hardware

An interface modified for high density

Prior to the 1983 introduction of the 'high density' 5 ¼-inch floppy drive in the IBM PC-AT, the same signal format was in use on 8-inch floppies in the IBM System/34 from 1978. Some mid-range controllers such as the WD279x series supported this format and were also adopted in early third-party disc interface boards for the BBC Micro. Supplying the clock frequency specified for 8-inch operation is the only modification needed for such boards; allowing frequency selection to permit continued single and double density operation is a useful extra.

Later controllers designed exclusively for double density have been found good for high density work when overclocked. The Ajax controller fitted to the Atari STE is an apparently unmodified WD1772 rated at 16 MHz. Experimenters have had promising though imperfect results doubling the clock rate of a standard WD1770[1]. There is no foreseeable reason why other controller families designed for high density should not be usable.

Balanced-stack ISRs

In this instance, balanced-stack means that the ISR is expected, on average, to complete before the next interrupt, and that it does not manipulate the stack pointer to stave off overflow.

Timing

Diagram of worst case NMI timings. At a conservative rate of 15 µs per interrupt, loading to a 1 MHz address fails, shown by crosses.

The non-maskable interrupt (NMI) service routines have been carefully studied to ensure they meet the floppy drive controller's timing requirements at high speed, which they do in all but one occasional case. There are two constraints from the FDC:

  1. the ISR must become re-entrant within 15 − ε µs of the interrupt;
  2. the ISR must service the FDC within 11.5 µs of the interrupt.

The first time interval is nominally 16 µs but allowance is made for jitter and fast disc drives. The ε (epsilon) means that in the worst case the NMIs will just miss one decision point and just catch the next. There are three further constraints from the 6502 CPU and the BBC Micro's clock system:

  1. 6502 instructions are atomic, and instruction processing may continue for up to 4.5 µs from the onset of an NMI. (This is composed of two clock cycles and one 7-cycle instruction; see below.)
  2. After this instruction, the 6502 executes an NMI sequence for 3.5 µs.
  3. A clock cycle that accesses a 1 MHz memory address is extended to 1 µs, or 1.5 µs if out-of-sync with an underlying 1 MHz monotonic clock (1MHzE). Once the CPU is synchronised with this clock the length of later extended cycles can be predicted by machine code analysis.

It has been found impossible to meet all these constraints, unless we ensure that no instruction longer than 6 cycles is executing when the NMI occurs, which can be arranged by carefully coding a busy loop with interrupts disabled.
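The effect of the instruction-length limit on the FDC service deadline can be checked with a little arithmetic. The following Python sketch uses only the figures quoted above and assumes every cycle runs at 2 MHz with no stretched cycles; the function name `cycles_left_for_fdc` is invented for this illustration, and the sketch covers the service deadline only, not the re-entrancy constraint.

```python
CPU_MHZ = 2.0             # assumption: no cycle is stretched to 1 MHz
FDC_DEADLINE_US = 11.5    # FDC service deadline quoted in the text
NMI_SEQUENCE_CYCLES = 7   # 3.5 us at 2 MHz
LATENCY_CYCLES = 2        # the "two clock cycles" in the 4.5 us figure

def cycles_left_for_fdc(longest_instruction_cycles):
    """Cycles remaining for the ISR to reach its FDC access (hypothetical helper)."""
    used = LATENCY_CYCLES + longest_instruction_cycles + NMI_SEQUENCE_CYCLES
    return FDC_DEADLINE_US * CPU_MHZ - used

print(cycles_left_for_fdc(7))  # 7.0
print(cycles_left_for_fdc(6))  # 8.0
```

Under these assumptions a 7-cycle instruction in progress leaves 7 cycles (3.5 µs) before the deadline, and limiting the busy loop to 6-cycle instructions leaves 8 cycles (4 µs).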

We can exploit the atomic nature of 6502 instructions to buy time. The ISR need only be re-entrant, not complete, by the next interrupt; so as long as the final state-changing instruction is in progress when an NMI arrives, the behaviour will be correct. The time this instruction takes to complete has already been budgeted-for as the 'previous' instruction of the next interrupt, and the stack will not overflow as long as the CPU gets partway through RTI, on average.

According to some documentation, e.g. 64doc by Sonninen et al., to cause an NMI sequence in place of the next instruction the NMI must occur before the last cycle of the current instruction. This is confirmed by traces[1] of 6502 hardware, and caused by the 6502 sampling the NMI input only on certain cycles (usually the last) of each instruction. As we just missed the decision point in the worst case, the FDC service deadline comes 11 µs after the start of that next instruction. In all cases currently, we are clear. If 64doc were incorrect then the timings would still hold true; all the instruction rectangles would just move 1 clock cycle to the left such that each ISR gets 0.5 µs more time to service the floppy drive controller.

The minimum time between Tube data channel accesses is 11.5 µs, which can be increased with NOPs to 14.5 µs.

Reading from disc to I/O memory

Loading slow memory is stable at 16 µs per byte.

A failure mode has been discovered when reading to 1 MHz memory areas under the above timings. As STA absolute,X instructions always take 5 cycles, the ISR may overflow the stack when traversing 1 MHz pages such as FRED and JIM. Also if a 1 MHz address at the end of a page is met within a sector, the next few bytes may be written to the wrong page.

In such cases failure can be avoided as long as NMIs arrive at the nominal interval of 16 µs (see diagram); note that drives with crystal controlled synchronous motors can stay very close to this figure. 2 MHz memory can still be successfully loaded at 15 µs per byte.

6502 interrupt bug

The 6502 has recently been discovered[1][2] to defer its response to IRQs or NMIs occurring just before the last cycle of a taken branch to the same page. In that case the branch completes and the instruction at the destination is executed before the interrupt sequence is begun. The effect (as explained by Nesdev user blargg) is that the branch adds one cycle to the maximum interrupt latency of the destination instruction.

This is in addition to the usual one cycle of latency between sampling the NMI input and fetching the next opcode; here the sample is taken on the penultimate cycle instead of the last, making two cycles of latency in total.

As long as every instruction in the busy loop that is the target of a branch from the same page is shorter in duration than the longest instruction in the loop, there is no impact on the worst-case interrupt latency of the system.

Calculating in the general case

Those who wish to develop their own high density system can check if the worst case timings can be met, as follows. A useful ISR is assumed to have at least these elements: a fetch, an indexed store, a register increment instruction and a conditional branch for incrementing a pointer.

  1. Determine the minimum expected interval between NMIs, in clock cycles. The datasheet will list a nominal value but allow some room for jitter and drive speed variations. 30 is a good number for high density, 60 for double, 120 for single.
  2. Calculate the maximum non-re-entrant service time. Start with 20 cycles: 7 for the NMI sequence, 4 for LDA, 5 for STA,X, 2 for INX and 2 for BNE.
  3. Add the number of cycles in the longest running instruction in your busy loop. If you have interrupts enabled, you should take this to be 7.
  4. If any of the longest running instructions are the destination of a branch from the same page, add 1 cycle.
  5. If your FDC is a WD2791 or WD2795, add 2 cycles.
  6. If your FDC is on the 1 MHz bus, add 2 cycles.
  7. If users will be saving or loading data into 1 MHz areas and expecting it to work, add 3 cycles.
  8. If the total is less than the NMI interval in cycles, the test passes. The ISR will make it to the INC instruction to cross a page boundary, and not overflow the stack the rest of the time.

The experimental system has a score of 31 and so only guarantees the nominal 32 cycle (16 µs) rate; see above.
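The checklist above can be written as a short calculation. This is a minimal Python sketch, not part of the original driver; the function and parameter names are invented here, and the example inputs are just one combination that totals 31, since the text does not give the breakdown for the experimental system.

```python
BASE_CYCLES = 7 + 4 + 5 + 2 + 2   # NMI sequence, LDA, STA,X, INX, BNE = 20

def isr_score(longest_busy_loop_instruction,
              longest_is_same_page_branch_target=False,
              fdc_is_wd2791_or_wd2795=False,
              fdc_on_1mhz_bus=False,
              transfers_touch_1mhz_areas=False):
    """Total worst-case cycles following steps 2 to 7 (hypothetical helper)."""
    total = BASE_CYCLES + longest_busy_loop_instruction
    if longest_is_same_page_branch_target:
        total += 1
    if fdc_is_wd2791_or_wd2795:
        total += 2
    if fdc_on_1mhz_bus:
        total += 2
    if transfers_touch_1mhz_areas:
        total += 3
    return total

def passes(score, nmi_interval_cycles):
    """Step 8: the total must be strictly less than the NMI interval."""
    return score < nmi_interval_cycles

# Illustrative inputs only: 6-cycle busy loop limit, WD2791, 1 MHz areas allowed.
score = isr_score(6, fdc_is_wd2791_or_wd2795=True,
                  transfers_touch_1mhz_areas=True)
print(score, passes(score, 30), passes(score, 32))  # 31 False True
```

A score of 31 fails against the conservative 30-cycle interval but passes against the nominal 32 cycles, which matches the behaviour described for the experimental system.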

Restrictions

The small amount of time available means many of the usual features of a disc ISR must be left out. There is not enough time to count the bytes and discard them after a certain number have been transferred. Thus all transfers are rounded up to whole sectors; this is most critical in OSWORD &7F which can no longer emulate the 8271's Read ID command faithfully, and less so in OSFILE where some memory beyond end-of-file will be overwritten, though in most cases this is not a significant problem. If multiple sector transfer commands (%10x1xxxx) are to be used, the busy loop would have to count the sectors and issue a Force Interrupt after the appropriate number, but the main routine can issue a chain of single sector transfers instead.
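The cost of rounding transfers up to whole sectors is easy to estimate. The Python sketch below is illustrative only; the 256-byte default sector size and the function name are assumptions, not taken from the driver.

```python
def whole_sector_transfer(length, sector_size=256):
    """Return (sectors, bytes transferred, bytes past the requested length).

    sector_size=256 is an assumed default for illustration.
    """
    sectors = -(-length // sector_size)      # ceiling division
    transferred = sectors * sector_size
    return sectors, transferred, transferred - length

# Loading a 1000-byte file: four sectors arrive, so 24 bytes of memory
# beyond end-of-file are overwritten.
print(whole_sector_transfer(1000))  # (4, 1024, 24)
```

A request that is already a multiple of the sector size has no overrun, so callers that can align their buffers avoid the problem entirely.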

Nor is there time to test the FDC status register and determine whether this is a data request (DRQ) or an interrupt request (INTRQ) signalling that the command has terminated. In most WD1770-based boards these two lines are both connected to the CPU's NMI input, but in combination with high speed ISRs this will cause dropped bytes when writing to disc and extra bytes when reading. The INTRQ line will need to be disconnected if using ISRs, which should not but may affect the hardware's compatibility with standard filing system ROMs.

The ISR cannot afford to save the registers it uses. These then become volatile in the main thread and the AUG warns us on p.296, "If they are modified, the main program will suddenly find garbage in its registers in the middle of some important processing. It is probable that a total system 'crash' would result from this." Therefore all disc operations must be confined within the busy loop with maskable interrupts disabled. As the keyboard will not be read, commands cannot be 'typed ahead' while a file is loading; and the busy loop cannot use the volatile registers except for their side effects, such as setting flags.

On the other hand, the ISR can save state in the volatile registers; they now become out-of-bounds to the busy loop altogether. For instance, X can become a running index register so that only the high address byte of a fetch or store instruction would need to be incremented in any interrupt period. The low byte of the address should be kept at &00, so that fetches do not take an extra cycle to cross a page boundary.

Byte-in-hand

With respect to write operations, to help meet the FDC's service time (11.5 µs) each byte can be prefetched so that the ISR can send it to the controller first thing before fetching the next byte. The byte rests 'in hand' in one of the volatile registers between interrupts; the main routine or busy loop prefetches the first byte of each request before sending the first command.

This means that one extra byte is fetched and discarded per request. When saving I/O memory there is a mild risk of a side effect when the area to be saved ends near a memory-mapped register (though consider also the issue of whole-sector transfers above). Likewise with the Tube, and this is the only issue there when OSWORD &7F is the exclusive user of the data channel.
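The ordering of fetches and stores in the byte-in-hand scheme can be modelled in a few lines. This Python sketch is a behavioural model only, with invented names; it records which addresses are read so the extra fetch is visible.

```python
def write_with_byte_in_hand(memory, start, count):
    """Model a write of `count` bytes; returns (bytes sent, addresses fetched)."""
    fetched = []

    def fetch(addr):
        fetched.append(addr)
        return memory[addr]

    in_hand = fetch(start)          # prefetch before the command is issued
    sent = []
    addr = start
    for _ in range(count):          # one pass per interrupt
        sent.append(in_hand)        # send the byte in hand first
        addr += 1
        in_hand = fetch(addr)       # then fetch the next byte
    return sent, fetched

memory = list(range(16))
sent, fetched = write_with_byte_in_hand(memory, 4, 4)
print(sent)     # [4, 5, 6, 7]
print(fetched)  # [4, 5, 6, 7, 8]
```

Four bytes are sent but five addresses are read; the read of address 8 is the discarded prefetch, which is where a memory-mapped register or the Tube data channel would see an unwanted access.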

However if the channel owner (e.g. OSGBPB) calls OSWORD &7F to fetch bytes from the Tube on its behalf, then the owner will find some data has been dropped and subsequent bytes will be out of sequence. In such a case OSGBPB should ideally be structured so that the channel is reopened after OSWORD &7F is called, as EDOSpat now does; however it may initially be easier to revert the ISR to the conventional, fetch-before-store form in the singular case of saving from the Tube. (The above does not apply to OSGBPB saving the file buffers in I/O memory.)

Code

Tested code fragments adapted from the assembly output of EDOSPAT 5.10.

Busy loop

The main thread is confined to this loop while a disc operation is in progress.

              \ Based on NMI disc op routine from EDOS 0.4 by Alan Williams
              \ Target is the Opus WD2791 interface.

              \ On entry A=ROM slot to access
              \          X=Value as required by ISR
              \          Y=FDC command
0D10          .edospat_disc_op
0D10 85 F4    STA mos_romsel_copy
0D12 8D 30 FE STA bbc_romsel

              \ Loop while b7 and b5 both clear.
              \ WD2791 drops some commands otherwise.
0D15          .edospat_disc_op_wait
0D15 AD 80 FE LDA fdc_base+0
0D18 49 5F    EOR # st_sense%EOR disc_op_eor%
0D1A 29 A0    AND # disc_op_and%
0D1C          OPT ps%
0D1C F0 F7    BEQ dest%
0D1E          OPT FNbndrq( ps%, disc_op_bne%, edospat_disc_op_wait)

              \ Disable interrupts
0D1E 08       PHP
0D1F 78       SEI

              \ First address, or load-immediate instructions pasted here.
              \ The address may be in another ROM slot, hence loaded here.
0D20          .edospat_disc_op_addr
0D20 AD 00 0E LDA nmi_form_bytes+0
0D23 EA       NOP

              \ Send command. A and X are now out of bounds
              \ Then wait 50 us for status register to settle
0D24 8C 80 FE STY fdc_base+0
0D27 A0 14    LDY #20
0D29          .edospat_disc_op_settle
0D29 88       DEY
0D2A D0 FD    BNE edospat_disc_op_settle

              \ Loop until controller indicates ready.
0D2C          .edospat_disc_op_loop
0D2C AC 80 FE LDY fdc_base+0
0D2F          OPT ps%
0D2F 10 FB    BPL dest%
0D31          OPT FNbnrdy( ps%, st_sense%, edospat_disc_op_loop)

              \ Loop until controller indicates not busy also.
0D31          .edospat_disc_op_test_busy
0D31 84 A0    STY edos_disc_op_temp
0D33 46 A0    LSR edos_disc_op_temp
0D35          OPT ps%
0D35 90 F5    BCC dest%
0D37          OPT FNbbusy( ps%, st_sense%, edospat_disc_op_loop)

              \ Save ISR's A and X for the next call.
0D37          .edospat_disc_op_exit
0D37 8E 21 0D STX edospat_disc_op_addr+1
0D3A 8D 23 0D STA edospat_disc_op_addr+3
0D3D A9 A2    LDA #&A2   \ =LDX immediate
0D3F 8D 20 0D STA edospat_disc_op_addr
0D42 A9 A9    LDA #&A9   \ =LDA immediate
0D44 8D 22 0D STA edospat_disc_op_addr+2

              \ Our state is saved, restore interrupt state
0D47 28       PLP

              \ Page the EDOS ROM back in
0D48 AD 60 0D LDA edos_disc_op_rom
0D4B          .edospat_disc_op_switch_rom
0D4B 85 F4    STA mos_romsel_copy
0D4D 8D 30 FE STA bbc_romsel

              \ Restore X and Y to values on entry
0D50 98       TYA
0D51 A6 A1    LDX edos_idcount
0D53 AC 63 0D LDY edos_disc_op_cmd

              \ Present controller status in A and set flags
0D56 49 FF    EOR # st_sense%

              \ Return to OSWORD &7F routine
0D58 60       RTS

Interrupt service routines

              \ Read from disc to I/O memory

              \ On entry X=low byte of destination address
              \          ?&0D07=high byte
0D00          .nmi_rdio
0D00 AD 83 FE LDA fdc_base+3
0D03 49 FF    EOR # sense%
0D05          .nmi_rdio_addr
0D05 9D 00 0D STA mos_nmi AND &FF00,X
0D08 E8       INX
0D09 D0 03    BNE nmi_rdio_exit
0D0B EE 07 0D INC nmi_rdio_addr+2
0D0E          .nmi_rdio_exit
0D0E 40       RTI
              \ Write from I/O memory to disc

              \ On entry A=contents of source address
              \          X=low byte of source address
              \          ?&0D0D=high byte
0D00          .nmi_wrio
0D00 49 FF    EOR # sense%
0D02 8D 83 FE STA fdc_base+3
0D05 E8       INX
0D06 D0 03    BNE nmi_wrio_addr
0D08 EE 0D 0D INC nmi_wrio_addr+2
0D0B          .nmi_wrio_addr
0D0B BD 00 0D LDA mos_nmi AND &FF00,X
0D0E          .nmi_wrio_exit
0D0E 40       RTI
              \ Read from disc to coprocessor

              \ No entry conditions
0D00          .nmi_rdtu
0D00 AD 83 FE LDA fdc_base+3
0D03 49 FF    EOR # sense%
0D05 8D E5 FE STA tube_host_fifo_3
0D08 40       RTI
              \ Write from coprocessor to disc

              \ On entry A=byte to be written
0D00          .nmi_wrtu
0D00 49 FF    EOR # sense%
0D02 8D 83 FE STA fdc_base+3
0D05 AD E5 FE LDA tube_host_fifo_3
0D08 40       RTI
              \ Read sector IDs to coprocessor

              \ On entry X=bytes remaining (initially X=4)
0D00          .nmi_idtu
0D00 AD 83 FE LDA fdc_base+3
0D03 49 FF    EOR # sense%
0D05 CA       DEX
0D06 30 03    BMI nmi_idtu_exit
0D08 8D E5 FE STA tube_host_fifo_3
0D0B          .nmi_idtu_exit
0D0B 40       RTI
              \ Verify disc

              \ No entry conditions
0D00          .nmi_veri
0D00 2C 83 FE BIT  fdc_base+3
0D03 40       RTI
              \ Format disc from RLE table in pages &0E..&0F

              \ On entry A=current byte from table (pre-inverted)
              \          X=current table index (initially 0)
0D00          .nmi_form
0D00 8D 83 FE STA fdc_base+3
0D03 DE 00 0F DEC nmi_form_counts,X
0D06 D0 04    BNE nmi_form_exit
0D08 E8       INX
0D09          .nmi_form_addr
0D09 BD 00 0E LDA nmi_form_bytes,X
0D0C          .nmi_form_exit
0D0C 40       RTI
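
The run-length behaviour of the format ISR can be modelled outside the machine. The Python sketch below mirrors the instruction sequence of `nmi_form` (store, decrement the count, advance on zero); the function name and the example table are invented for illustration, and the bytes are treated as opaque values, so the pre-inversion mentioned in the listing is not modelled.

```python
def expand_rle(byte_table, count_table, total):
    """Return the first `total` bytes the modelled ISR would send.

    Tables must extend one entry past the last run used, as the ISR
    always loads the next table byte when a count reaches zero.
    """
    counts = list(count_table)
    x = 0                      # X: current table index
    a = byte_table[0]          # A: byte in hand, preloaded by the busy loop
    out = []
    for _ in range(total):     # one pass per interrupt
        out.append(a)                          # STA fdc_base+3
        counts[x] = (counts[x] - 1) & 0xFF     # DEC nmi_form_counts,X
        if counts[x] == 0:                     # BNE skips when non-zero
            x = (x + 1) & 0xFF                 # INX
            a = byte_table[x]                  # LDA nmi_form_bytes,X
    return out

# Hypothetical table: three runs plus one padding entry.
print(expand_rle([0x4E, 0x00, 0xA1, 0xFF], [3, 2, 1, 0], 6))
# [78, 78, 78, 0, 0, 161]
```

Because the count is decremented before it is tested, a count of 0 in the table behaves as a run of 256 bytes in this model.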

Polling

Actively testing the FDC's status register to determine when a byte is ready is an equally valid method to sustain high speed data throughput. Polling-based transfers would require a tight loop with interrupts disabled, which we have shown is the minimum requirement with stack-balanced ISRs. The absence of NMI sequences and RTI instructions frees a small amount of time to implement more features, such as measuring and cutting off the transfer. With this approach the DRQ pin would have to be disconnected but INTRQ could remain attached (and a suitable ISR supplied) if desired.

Using SO

A negative-going edge on pin 38 of the 6502 CPU sets the overflow flag in the status register. A BVC instruction branching to itself produces a loop that exits within 3 clock cycles of lowering this pin. By connecting DRQ to this pin through an inverter the CPU can execute a loop transferring a byte to the FDC every 8 µs or slightly less. This is fast enough to consider octal density, also known as extended density (ED) or the 2.8 MB format. In that case all addresses touched must be at 2 MHz and no page boundaries can be crossed, limiting sectors to 128 or 256 bytes. This time the INTRQ signal, similarly inverted, will be needed to raise an NMI to break out of the loop.

        \ On entry X=low byte of address - 1
        \          (or &FF if sector size = 256 bytes)
        \          user=(address - 0) AND &FF00

.disc_op
        SEI
        CLV
.disc_op_loop
        BVC disc_op_loop
.disc_op_service
repeat(11) {
        LDA fdc_data
        CLV
        INX
        STA user,X
        BVC disc_op_loop
}
        LDA fdc_data
        CLV
        INX
        STA user,X
        BVS disc_op_service
        BVS disc_op_service
        BVS disc_op_service
        BVC disc_op_loop
        BVS disc_op_service

This is the fastest and most resilient polling loop possible, limited by the range of relative branching. It supports any mean request interval longer than 7.88 µs: in the worst case twelve requests 6½ to 8½ µs apart can cause the code to reach the penultimate or last instruction and loop back in 189 cycles. Typical request sequences with less jitter will be able to have a mean interval closer to the absolute limit of 7.54 µs.
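The figures above follow from standard 6502 cycle counts. The Python sketch below reproduces the per-byte cost of one unrolled unit and the 7.88 µs mean; it assumes a 2 MHz clock throughout and branches that stay within a page, and the helper names are invented for this illustration.

```python
CPU_MHZ = 2.0                                 # all addresses at 2 MHz
LDA_ABS, CLV, INX, STA_ABS_X = 4, 2, 2, 5     # standard 6502 cycle counts
BRANCH_NOT_TAKEN, BRANCH_TAKEN = 2, 3         # same-page branch

def unit_cycles(branch_taken):
    """Cycles for LDA / CLV / INX / STA,X plus the closing branch."""
    branch = BRANCH_TAKEN if branch_taken else BRANCH_NOT_TAKEN
    return LDA_ABS + CLV + INX + STA_ABS_X + branch

def mean_interval_us(total_cycles, requests):
    return total_cycles / requests / CPU_MHZ

print(unit_cycles(False) / CPU_MHZ)   # 7.5
print(unit_cycles(True) / CPU_MHZ)    # 8.0
print(mean_interval_us(189, 12))      # 7.875
```

A unit whose closing BVC falls through, because the next request has already arrived, costs 15 cycles (7.5 µs); a taken branch costs 16 cycles before the wait loop is re-entered. The worst case of 189 cycles over twelve requests gives the 7.875 µs mean, quoted as 7.88 µs.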

Self-interrupting ISR

This method has potential when DRQ cannot be disconnected. The concept, due to Tom Seddon, is that NMIs occur in bursts and the ISR is repeated almost back-to-back; so it may as well be made long enough to interrupt itself, freeing the time taken by RTI to implement more features. As the busy loop can be suspended entirely during a burst and the ISR itself is re-entrant after a critical time, there is no need to stack and unstack a working set for every interrupt and we are pretty much back to polling, with the CPU doing the branch for us. From Tom's Mailing List post:

The NMI routine would go a bit like this:

                            \ +7 (7) -- NMI overhead
&D00    LDA FDC_DATA        \ +6 (13)
        STA (&C0),Y         \ +6 (19)
        INY                 \ +2 (21)
        BNE NO_BUMP:INC &C1 \ +7 (worst case) 28 <-- point A
        .NO_BUMP TXS        \ +2 (30)
        NOP                 \ +2 (32)
        NOP:NOP:NOP         \ +6 (cater for best-case timings)
        NOP                 \ the lucky NOP
        JMP READ_END

However, pushing the PC and status register is an unwanted side effect of receiving an NMI, and the ISR must reset the stack pointer with TXS, in place of the RTI. This saves four cycles at the tail end but gives us no more free registers in the ISR. The real advantage is that the ISR can wait for the next call with a string of NOPs, eliminating up to 5 cycles' latency at the head end. Care should be taken to assemble enough NOPs to cover the single density data rate (one NMI every 64 µs). Also the registers remain stable in the busy loop, letting us reduce the maximum length of instructions there. For instance bit 0 of the status register can be polled with:

.test_busy
        LDA #&01        \2 (3) cycles
        BIT fdc_status  \4
        BNE test_busy   \3

rather than

.test_busy
        LDY fdc_status  \4 (5) cycles
        STY temp        \3
        LSR temp        \5
        BCS test_busy   \3

(A taken branch adds one cycle of interrupt latency to the destination instruction.)

Bear in mind that in case of errors such as Sector not found, no DRQs may be issued at all. It remains to be seen how the extra time and freedom can be applied.

Applications

Other ways to use HD discs

Martin Barr reports that commodity 3½-inch high density discs work reliably when formatted at 500 kHz in FM. Often, high density discs cannot hold single- or double-density data for any length of time, even if the high density hole in the jacket is covered. With true high density counted as a third format, this fourth, atypical signal type was originally used on 8-inch floppies by the IBM 3740.

With the WDC x7xx family and the Intel 8271, obtaining this type is a simple matter of doubling the clock frequency while keeping the ~DDEN input high, if present. Existing ISRs capable of double density service can be reused as the data rate is the same. A double-density sector format such as ADFS, DDOS or Watford DDFS can be employed with few changes, or single density Acorn DFS which uses half of each track and so needs less media rotation per track, increasing access speed. One modification required for the double-density filing systems is that the disc formatting utility must prepare single density preambles to each sector header and data area.

References

  1. 6502.org forum post by Hias Reichl, 3 September 2010
  2. Nesdev forum post by blargg, 18 June 2010

beardo 16:39, 8 October 2010 (UTC)