How can I write raw machine code for x86 without using assembly?

Maz*_*ion 0 x86 machine-code low-level

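I want to be able to write raw machine code, without assembly or any other kind of higher-level language, that can be put directly on a flash drive and run. I already know that for this to work I need to format the master boot record headers onto the drive (which I have managed to do manually). I have done this, and I have successfully got a line of text to display on screen using assembly code in the first sector (in this case the first 512 bytes) of the drive my code is on. However, I want to be able to write raw hex code onto the drive, as I did for the MBR formatting, without any tools such as assembly to help me. I know there is a way to do this, but I can't really find anything that doesn't involve assembly. Where can I find information on this? Googling machine code or x86 programming comes up with assembly, which is not what I want.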

Pet*_*des 5

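If what you really want is a better understanding of x86 machine code, I'd recommend starting by looking at assembler output, to see what bytes it assembled into the output file for each line of asm source.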

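nasm -fbin -l listing.txt foo.asm will give you a listing that includes the raw hex bytes and the source line, or use nasm -fbin -l/dev/stdout foo.asm | less to pipe the listing straight into a text viewer. See this chroma-key blend function in 13 bytes of x86 machine code that I wrote on codegolf.SE for an example of what the output looks like.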

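You can also disassemble a binary after creating it normally. ndisasm works on flat binaries and produces the same format of hex bytes + asm instruction. Other disassemblers such as objdump are also usable: Disassembling a flat binary file using objdump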

Semi-related: How to turn hex code into x86 instructions


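Intel's x86 manuals fully specify how instructions are encoded: see the vol.2 insn set reference manual, Chapter 2 Instruction Format, for the breakdown of prefixes, opcode, ModR/M + optional SIB, optional displacement, and immediate.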

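Given that, you can read the per-instruction documentation on how to encode each one, e.g. D1 /4 (shl r/m32, 1) means the opcode byte is D1 and the /r field of ModRM must be 4. (The /r field works as 3 extra opcode bits for some instructions.)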

There is also an appendix that maps opcode bytes back to instructions, along with other sections in that manual.

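You can of course use a hex editor to type in the encodings you work out by hand, creating a 512-byte binary file without using an assembler. But it's a pointless exercise.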


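See also Tips for golfing in x86 machine code for lots of quirks of x86 instruction encoding: e.g. there are single-byte encodings for inc/dec of a full register (except in 64-bit mode). It is focused on instruction length, of course, but unless you insist on looking up the actual encodings yourself, the interesting part is which forms of instructions have different or special encodings available. Several answers on that tips Q&A have output from objdump -d showing the machine code bytes and AT&T syntax disassembly.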


old*_*mer 5

Just to paint a picture...

First off, you are not going to find a guide to programming in machine code that doesn't have assembly associated with it, and that should be obvious. Any decent instruction reference (and most that you will find) contains the assembly for some assembler along with the machine code, because you really need some way to refer to a given bit pattern, and assembly language is that language.

So look up nop, for example, and you find the bit pattern 10010000, or 0x90. So if I want to add the instruction nop to my program, I add the byte 0x90. Even if you go back to very early processors, you still wanted to program in assembly language, hand assemble with pencil and paper, and then use DIP switches to clock the program into memory before trying to run it, because it just makes sense. Decades later, even to demonstrate machine code programming, particularly with a painful instruction set like x86, you start with assembly, assemble, then disassemble, then talk about it, so here goes:

top:
    mov ah,01h
    jmp one
    nop
    nop
one:
    add ah,01h
    jmp two
two:
    mov bx,1234h
    nop
    jmp three
    jmp three
    jmp three
three:
    nop
    jmp top

nasm -f aout so.s -o so.elf
objdump -D so.elf

00000000 <top>:
   0:   b4 01                   mov    $0x1,%ah
   2:   eb 02                   jmp    6 <one>
   4:   90                      nop
   5:   90                      nop

00000006 <one>:
   6:   80 c4 01                add    $0x1,%ah
   9:   eb 00                   jmp    b <two>

0000000b <two>:
   b:   66 bb 34 12             mov    $0x1234,%bx
   f:   90                      nop
  10:   eb 04                   jmp    16 <three>
  12:   eb 02                   jmp    16 <three>
  14:   eb 00                   jmp    16 <three>

00000016 <three>:
  16:   90                      nop
  17:   eb e7                   jmp    0 <top>

So just the first couple of instructions describe the problem and why asm makes so much sense...

The first one you can easily program in machine code: b4 01 is mov ah,01h. We go to the overloaded instruction mov in the documentation and find immediate operand to register: 1011 w reg, followed by the data. We have one byte of data, so it is not a word and the word bit is not set. We have to look up reg to find ah, which gives us b4, and the immediate is 01h. Not that bad. But now the jump: I want to jump over some stuff, well, how much stuff? Which jump do I want to use? Do I want to be conservative and use the one with the fewest bytes?

I can see that I want to jump over two instructions, and we can easily look up the nops to know they are one-byte instructions, 0x90. So intra-segment direct short should work, as the assembler chose: 0xEB. But what is the offset? 0x02, to jump over the two BYTES of instructions between where I am and where I want to go.

So you can go through the rest of the instructions I have assembled here and look them up in the Intel documentation to see which bytes the assembler chose and why.

Now, I am looking at the Intel 8086/8088 manual right now. The intra-segment direct short instruction comments on sign extension; the intra-segment direct one does not say sign extended. The processor at this time was 16 bits, but you had a few more bits of segment. So by only reading the manual, having no access to the design engineers, and using no debugged assembler for reference, how would I know if I could have used the 16-bit direct jump for that last instruction, which is branching backward? In this case the assembler chose the byte-sized offset, but what if...

I'm using a 16-bit manual but 32/64-bit tools, so I have to consider that, but I could and did do this:

three:
    nop
db 0xe9,0xe7,0xff,0xff,0xff

instead of jmp top.

00000016 <three>:
  16:   90                      nop
  17:   e9 e7 ff ff ff          jmp    3 <top+0x3>

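For the 8086 that would have been the three-byte form, 0xe9 followed by a 16-bit displacement (0xe9,0xe7,0xff with the same displacement). Note that the displacement counts from the end of the jump instruction, so reusing 0xe7 from the two-byte jump does not actually reach top; the disassembly above shows the five-byte version landing on top+0x3. To hit top, the five-byte form needs 0xe4,0xff,0xff,0xff.

Back to the first few instructions, written as raw bytes: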

   db 0xb4,0x01
   db 0xeb,0x02
   db 0x90
   db 0x90

So now what if I wanted to change one of the nops being jumped over to a mov?

   db 0xb4,0x01
   db 0xeb,0x02
   db 0xb4,0x11
   db 0x90

But it's broken; now I have to fix the jump:

   db 0xb4,0x01
   db 0xeb,0x03
   db 0xb4,0x11
   db 0x90

Now change that to an add

   db 0xb4,0x01
   db 0xeb,0x03
   db 0x80,0xc4,0x01
   db 0x90

Now I have to change the jump again

   db 0xb4,0x01
   db 0xeb,0x04
   db 0x80,0xc4,0x01
   db 0x90

But had I programmed that jmp one in assembly language, I wouldn't have to deal with that; the assembler does it. It gets worse when your jump is right on the cusp of the distance limit and you have, say, some other jumps within that loop. You have to go through the code several times to see whether any of those other jumps are 2 or 3 or 4 bytes, and whether that pushes your longer jumps over the edge from one size to the next.

a:
...
jmp x
...
jmp a
...
x:

As we pass jmp x, do we allocate 2 bytes for it? Then we get to jmp a and allocate two bytes for it as well, and at that point we may have figured out all the rest of the instructions between jmp a and a:, and it just fits in a two-byte jump. But then eventually we get to x: and find that jmp x needs to be 3 bytes. That pushes jmp a too far, so now it has to be a three-byte jmp, which means we have to go back to jmp x and adjust for the additional byte from jmp a being three bytes now instead of the assumed 2.

The assembler does all of this for you. If you want to program machine code directly, first and foremost, how are you going to keep track of the hundreds of different instructions without some natural-language notes?

So I can do this

    mov ah,01h
top:
    add ah,01h
    nop
    nop
    jmp top

then

nasm so.s -o so
hexdump -C so
00000000  b4 01 80 c4 01 90 90 eb  f9                       |.........|
00000009

Or I can do this:

#include <stdio.h>
unsigned char data[]={0xb4,0x01,0x80,0xc4,0x01,0x90,0x90,0xeb,0xf9};
int main ( void )
{
    FILE *fp;
    fp=fopen("out.bin","wb");
    if(fp==NULL) return(1);
    fwrite(data,1,sizeof(data),fp);
    fclose(fp);
}

I want to add a nop to the loop:

    mov ah,01h
top:
    add ah,01h
    nop
    nop
    nop
    jmp top

vs

#include <stdio.h>
unsigned char data[]={0xb4,0x01,0x80,0xc4,0x01,0x90,0x90,0x90,0xeb,0xf8};
int main ( void )
{
    FILE *fp;
    fp=fopen("out.bin","wb");
    if(fp==NULL) return(1);
    fwrite(data,1,sizeof(data),fp);
    fclose(fp);
}

If I was really trying to write in machine code I would have to do something like this:

unsigned char data[]={
0xb4,0x01, //mov ah,01h
//top:
0x80,0xc4,0x01, //add ah,01h
0x90, //nop
0x90, //nop
0x90, //nop
0xeb,0xf8 //jmp top
};

To remain sane. There are some instruction sets I have used, and some I have made for myself for fun, that were easier to program in machine code, but even those are better done with comments in pseudocode using assembly mnemonics...

If your goal is simply to end up with a blob of machine code in some format, bare metal or otherwise, rather than a program in some Windows or Linux file format, you use assembly language, and in one or two steps of the toolchain you get from the assembly source to the binary machine code. Worst case, you write an ad hoc program that takes the output of the toolchain and manipulates those bits into other bits. You don't toss out the available tools in order to write raw bits by hand; you just reformat the output file.