快速签名的16位除以7表示6502

Question

快速签名的16位除以7表示6502

mar*_*964 13 assembly 6502 division integer-division micro-optimization

我正在研究一个6502 cpu的汇编语言程序,我发现我需要一个快速尽可能快的七分程序,特别是一个可以获得16位分红的程序.

我熟悉这里发现的例程,但是对七分法例程进行了概括,发现它非常复杂,粗略地检查了一般算法(使用整数除法)

x/7~ =(x + x/8 + x/64 ...)/ 8

表示要处理16位范围,由于6502的单个累加器寄存器和6502上各个存储器位移的相对慢速,可能需要100多个周期才能完成.

我认为查找表可能会有所帮助,但在6502上,我当然只限于256字节或更少的查找表.为此,可以假设存在两个256字节的查找表xdiv7和xmod7,当使用无符号的单字节值作为表的索引时,可以快速获得字节除以7或模数的结果分别为7.但是,我不确定如何利用它们来查找完整16位范围的值.

与此同时,我还需要一个模7算法,尽管理想情况下,可以通过除法得到的解决方案也会产生mod7结果.如果需要额外的预计算表,只要所有表的总内存需求不超过约3k,我就可以添加这些表.

虽然我最终需要一个带符号的除法算法,但是一个无符号算法就足够了,因为我可以根据需要将它推广到一个有符号的例程.

任何帮助将不胜感激.

Answer 1

use*_*109 7

注意:正如@Damien_The_Unbeliever在评论中指出的那样,upperHigh和lowerLow表是相同的.所以它们可以组合成一个表.但是,这种优化会使代码更难以阅读,并且解释起来更难,因此将表格组合起来留给读者练习.

下面的代码显示了在将16位无符号值除以7时如何生成商和余数.解释代码(IMO)的最简单方法是使用示例,所以让我们考虑除以0xa7327.预期结果是:

quotient = 0x17e2
remainder = 4

Run Code Online (Sandbox Code Playgroud)

我们首先将输入视为两个8位值,即upper字节和lower字节.该upper字节0xa7和lower字节0x32.

我们从upper字节计算商和余数:

0xa700 / 7 = 0x17db
0xa700 % 7 = 3

Run Code Online (Sandbox Code Playgroud)

所以我们需要三个表:

upperHigh 存储商的高字节: upperHigh[0xa7] = 0x17
upperLow 存储商的低字节: upperLow[0xa7] = 0xdb
upperRem 存储剩余部分: upperRem[0xa7] = 3

我们从lower字节计算商和余数:

0x32 / 7 = 0x07
0x32 % 7 = 1

Run Code Online (Sandbox Code Playgroud)

所以我们需要两个表:

lowerLow 存储商的低字节: lowerLow[0x32] = 0x07
lowerRem 存储剩余部分: lowerRem[0x32] = 1

现在我们需要汇总最终答案.剩余部分是两个剩余部分的总和.由于每个余数都在[0,6]范围内,因此总和在[0,12]范围内.因此,我们可以使用两个13字节查找将总和转换为最终余数和进位.

商的低字节是该进位和来自lowerLow和upperLow表的值的总和.注意,总和可以生成进入高字节的进位.

商的高字节是该进位和upperHigh表中的值的总和.

所以,要完成这个例子:

remainder = 1 + 3 = 4              // simple add (no carry in)
lowResult = 0x07 + 0xdb = 0xe2     // add with carry from remainder
highResult = 0x17                  // add with carry from lowResult

Run Code Online (Sandbox Code Playgroud)

实现它的汇编代码包括7个表查找,一个无加载指令和两个附加携带指令.

#include <stdio.h>
#include <stdint.h>

uint8_t upperHigh[256];  // index:(upper 8 bits of the number)  value:(high 8 bits of the quotient)
uint8_t upperLow[256];   // index:(upper 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t upperRem[256];   // index:(upper 8 bits of the number)  value:(remainder when dividing the upper bits by 7)
uint8_t lowerLow[256];   // index:(lower 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t lowerRem[256];   // index:(lower 8 bits of the number)  value:(remainder when dividing the lower bits by 7)
uint8_t carryRem[13]    = { 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 };
uint8_t combinedRem[13] = { 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5 };

void populateLookupTables(void)
{
    for (uint16_t i = 0; i < 256; i++)
    {
        uint16_t upper = i << 8;
        upperHigh[i] = (upper / 7) >> 8;
        upperLow[i] = (upper / 7) & 0xff;
        upperRem[i] = upper % 7;

        uint16_t lower = i;
        lowerLow[i] = lower / 7;
        lowerRem[i] = lower % 7;
    }
}

void divideBy7(uint8_t upperValue, uint8_t lowerValue, uint8_t *highResult, uint8_t *lowResult, uint8_t *remainder)
{
    uint8_t temp = upperRem[upperValue] + lowerRem[lowerValue];
    *remainder = combinedRem[temp];
    *lowResult = upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp];
    uint8_t carry = (upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp]) >> 8;  // Note this is just the carry flag from the 'lowResult' calcaluation
    *highResult = upperHigh[upperValue] + carry;
}

int main(void)
{
    populateLookupTables();

    uint16_t n = 0;
    while (1)
    {
        uint8_t upper = n >> 8;
        uint8_t lower = n & 0xff;

        uint16_t quotient1  = n / 7;
        uint16_t remainder1 = n % 7;

        uint8_t high, low, rem;
        divideBy7(upper, lower, &high, &low, &rem);
        uint16_t quotient2 = (high << 8) | low;
        uint16_t remainder2 = rem;

        printf("n=%u q1=%u r1=%u q2=%u r2=%u", n, quotient1, remainder1, quotient2, remainder2);
        if (quotient1 != quotient2 || remainder1 != remainder2)
            printf(" **** failed ****");
        printf("\n");

        n++;
        if (n == 0)
            break;
    }
}

Run Code Online (Sandbox Code Playgroud)

可能是个愚蠢的问题,但是`upperHigh`和`lowerLow`的内容是不同的？ (4认同)
@ markt1964:在asm中,只需将两个标签放在同一个地方,不要*实际上*复制内存.在C中,`#define`或`static const uint8_t*upperHigh = lowerLow;` (4认同)

Answer 2

alv*_*ngo 5

在 8 位除以 7 的无符号整数除法例程中：

;Divide by 7 (From December '84 Apple Assembly Line)
;15 bytes, 27 cycles
  sta  temp
  lsr
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr

Run Code Online (Sandbox Code Playgroud)

对大约 100 个带班次的周期的估计相当准确：到最后一个 ror 为止有 104 个周期，总共 106 个周期（不包括）rts，整个函数有 112 个周期。
注意：在组装 C64 并使用 C64 的 VICE 模拟器后，我发现算法失败，例如 65535 给出 9343，正确答案是 9362。

   ; for 16 bit division  by 7
   ; input:
  ;   register A is low byte
  ;   register X is high byte
  ; output 
  ;   register A is low byte
  ;   register X is high byte
  ;
  ; memory on page zero
  ; temp     is on page zero, 2 bytes
  ; aHigh    is on page zero, 1 byte
  --
  sta temp
  stx temp+1
  stx aHigh
  --
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  ---
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  --
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a     -- 104 cycles
  ;-------
  ldx aHigh  ; -- 106
  rts        ; -- 112 cycles

Run Code Online (Sandbox Code Playgroud)

归档时间：	7 年，6 月前
查看次数：	640 次
最近记录：	7 年，1 月前