Chapter 3: Machine-Level Representation of Programs
这章主要介绍了汇编,不过切入点是汇编与 C 语言之间的关系,也就是所谓的 a Programmer’s Perspective
Contents
- 3.1 A Historical Perspectivey
- 3.2 Program Encodings
- 3.3 Data Formats
- 3.4 Accessing Information
- 3.5 Arithmetic and Logical Operations
- 3.6 Control
- 3.7 Procedures
- 3.8 Array Allocation and Access
- 3.9 Heterogeneous Data Structures
- 3.10 Combining Control and Data in Machine-Level Programs
- 3.11 Floating-Point Code
- 3.11.1 Floating-Point Movement and Conversion Operations.
- 3.11.2 Floating-Point Code in Procedures
- 3.11.3 Floating-Point Arithmetic Operations
- 3.11.4 Defining and Using Floating-Point Constants
- 3.11.5 Using Bitwise Operations in Floating-Point Code
- 3.11.6 Floating-Point Comparison Operations
- 3.11.7 Observations about Floating-Point Code
- 3.12 Summary
3.1 A Historical Perspectivey
汇编一路发展,一路向下兼容(弄得语法好别扭。。。
本书是第三版,主要介绍 64 位的汇编,而且用的语法是 AT&T 而不是 Intel。
32bits vs 64bits
- 32 位的汇编少了一波 64 位的寄存器/指令
- 因此不像 64 位汇编那样依赖寄存器,比如传参时只会用栈(在 64 位中,前 6 个参数会用寄存器)
- 传统的 32 位汇编在函数调用时,会用 %ebp来记录栈帧起始地址(在现代的 32 位汇编与 64 位汇编中, %ebp只是一个普通的 callee-saved register,在函数调用时可以不用它来记录栈帧起始地址,而直接通过栈顶 %esp或者 %rsp来定位参数,例外:3.10.5)
Intel vs ATT
- The Intel code omits the size designation suffixes. We see instruction push and mov instead of pushq and movq.
- The Intel code omits the ‘%’ character in front of register names, using rbx instead of $rbx.
- The Intel code has a different way of describing locations in memory——for examble, QWORD PTR [rbx] rather than (%rbx).
- Instruction with multiple operands list them in the reverse order.
3.2 Program Encodings
preprocessor -> compiler -> assembler -> linker
The assembler converts the assembly code into binary object-code files
*.o
Object code is one form of machine code——it contains binary representations of all of the instructions, but the addresses of global values are not yet filled in.
3.2.1 Machine-Level Code
The machine code for x86-64 differs greatly from the original C code.
Parts of the processor state are visible that normally are hidden from the C programmer:
- program counter (commonly referred to as the PC, and called %rip in x86-64) …
- The integer register file …
- The condition code registers …
- A set of vector register …
3.2.2 Code Examples
Several features about machine code and its disassembled representation are worth noting:
- x86-64 instructions can range in length from 1 to 15 bytes.
The instruction encoding is designed so that commonly used instructions and those with fewer operands require a smaller number of bytes than do less common ones or ones with more operands. - The instruction format is designed in such a way that from a given starting position, there is a unique decoding of the bytes into machine instructions.
- The disassembler determines the assembly code based purely on the byte sequences in the machine-code file.
- The dissembler users a slightly different naming convention for the instructions than does the assembly code generated by GCC.
3.2.3 Notes on Formatting
All of the lines beginning with ‘.’ are directives to guide the assembler and linker.
3.3 Data Formats
C declaration | Intel data type | Assembly-code suffix | Size (bytes) |
---|---|---|---|
char | Byte | b | 1 |
short | Word | w | 2 |
int | Double word | l | 4 |
long | Quad word | q | 8 |
char * | Quad word | q | 8 |
float | single precision | s | 4 |
double | Double precision | l | 8 |
Sizes of C data types in x86-64.
With a 64-bit machine, pointers are 8 bytes long.
3.4 Accessing Information
When instructions have registers as destinations, two conventions arise for what happens to the remaining bytes in the register for instruction that generate less than 8 bytes:
Those that generate 1- or 2-byte quantities leave the remaining bytes unchanged.
Those that generate 4-byte quantities set the upper 4 bytes of the register to zero.
(The latter convention was adopted as part of the expansion from IA32 to x86-64.)
(这里的 remaining bytes 是指 64位寄存器的剩余字节,比如在
movswl %ax, %eax里,remaining bytes 就指
%rax的高 32 位)
3.4.1 Operand Specifiers
Type | Form | Operand value | Name |
---|---|---|---|
Immediate | \($Imm\) | \(Imm\) | Immediate |
Register | \(r_a\) | \(R[r_a]\) | Register |
Memory | \(Imm\) | \(M[Imm]\) | Absolute |
Memory | \((r_a)\) | \(M[R[r_a]]\) | Indirect |
Memory | \(Imm(r_b)\) | \(M[Imm + R[r_b]]\) | Base + displacement |
Memory | \((r_b, r_i)\) | \(M[R[r_b] + R[r_i]]\) | Indexed |
Memory | \(Imm(r_b, r_i)\) | \(M[Imm + R[r_b] + R[r_i]]\) | Indexed |
Memory | \(( , r_i, s)\) | \(M[R[r_i] \cdot s]\) | Scaled indexed |
Memory | \(Imm( , r_i, s)\) | \(M[Imm + R[r_i] \cdot s]\) | Scaled indexed |
Memory | \((r_b, r_i, s)\) | \(M[R[r_b] + R[r_i] \cdot s]\) | Scaled indexed |
Memory | \(Imm(r_b, r_i, s)\) | \(M[Imm + R[r_b] + R[r_i] \cdot s]\) | Scaled indexed |
Operand forms.
Opreands can denote immediate (constant) values, register values, or values from memory.
The scaling factor s must be either 1, 2, 4, or 8.
We use the notation \(r_a\) to denote an arbitrary register a and indicate its value with the reference \(R[r_a]\), viewing the set of registers as an array R indexed by register identifiers.
3.4.2 Data Movement Instructions
Instruction | Effect | Description |
---|---|---|
MOV S, D | D \(\leftarrow\) S | Move |
movb | Move byte | |
movw | Move word | |
movl | Move double word | |
movq | Move quad word | |
movabsq I, R | R \(\leftarrow\) I | Move absolute quad word |
Single data movement instructions.
x86-64 imposes the restriction that a move instruction cannot have both operands refer to memory location.
The size of the register must match the size designated by the last character of the instruction.
The regular movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers. This value is then sign extended to produce the 64-bit value for the destination.
( movq 的源是寄存器、内存时,要 64 位,是立即数时,要 32 位)
The movabsq instruction can have an arbitrary 64-bit immediate value as its source operand and can only have a register as a destination.
( movabsq 的源只能是立即数,目的只能是寄存器)
Instruction | Effect | Description |
---|---|---|
MOVZ S, R | R \(\leftarrow\) ZeroExtend(S) | Move with zero extension |
movzbw | Move zero-extended byte to word | |
movzbl | Move zero-extended byte to doubler word | |
movzwl | Move zero-extended word to double word | |
movzbq | Move zero-extended byte to quad word | |
movzwq | Move zero-extended word to quad word |
Zero-extending data movement instructinos.
These instructions have a register or memory location as the source and a register as the destination.
Note the absence of explicit instruction (movzlq) to zero-extend a 4-byte source value to an 8-byte destination (which can be implemented using a movl instruction having a register as the destination).
Instruction | Effect | Description |
---|---|---|
MOVS S, R | R \(\leftarrow\) SignExtend(S) | Move with sign extension |
movsbw | Move sign-extended byte to word | |
movsbl | Move sign-extended byte to doubler word | |
movswl | Move sign-extended word to double word | |
movsbq | Move sign-extended byte to quad word | |
movswq | Move sign-extended word to quad word | |
movslq | Move sign-extended double word to quad word | |
cltq | %rax \(\leftarrow\) SignExtend(%eax) | Sign-extend %eax to %rax |
Sign-extending data movement instructinos.
The MOVS instructions have a register or memory location as the source and a register as the destination.
The cltq instruction specific to registers %eax and %rax
cltq has the same effect as the instruction movslq %eax, %rax, but it has a more compact encoding.
3.4.3 Data Movement Example
Dereferencing a pointer involves copying that pointer into a register, and then using this register in a memory reference.
3.4.4 Pushing and Popping Stack Data
Instruction | Effect | Description |
---|---|---|
pushq S | R[%rsp] \(\leftarrow\) R[%rsp] - 8 M[R[%rsp]] \(\leftarrow\) S | Push quad word |
popq D | D \(\leftarrow\) M[R[%rsp]] R[%rsp] \(\leftarrow\) R[%rsp] + 8 | Pop quad word |
Push and pop instructions.
3.5 Arithmetic and Logical Operations
Instruction | Effect | Description |
---|---|---|
leaq S, D | D \(\leftarrow\) &S | Load effective address |
INC D | D \(\leftarrow\) D + 1 | Increment |
DEC D | D \(\leftarrow\) D - 1 | Decrement |
NEG D | D \(\leftarrow\) -D | Negate |
NOT D | D \(\leftarrow\) ~D | Complement |
ADD S, D | D \(\leftarrow\) D + S | Add |
SUB S, D | D \(\leftarrow\) D - S | Subtract |
IMUL S, D | D \(\leftarrow\) D * S | Multiply |
XOR S, D | D \(\leftarrow\) D ^ S | Exclusive-or |
OR S, D | D \(\leftarrow\) D | S | Or |
AND S, D | D \(\leftarrow\) D & S | And |
SAL k, D | D \(\leftarrow\) D << k | Left shift |
SHL k, D | D \(\leftarrow\) D << k | Left shift (same as SAL) |
SAR k, D | D \(\leftarrow\) D <<\(_A\) k | Arithmetic right shift |
SHR k, D | D \(\leftarrow\) D <<\(_L\) k | Logical right shift |
Integer arithmetic operations.
The effective address (leaq) instruction is commonly used to perform simple arithmetic.
The remaining ones are more standard unary or binary operations.
We use the notation >>_A and >>_L to denote arithmetic and logical right shift, respectively.
Note the nonintuitive ordering of the operands with ATT-format assembly code.
Most of the operations are given as instruction classes, as they can have different variant with different operand sizes. (Only leaq has no other size variants.)
3.5.1 Load Effective Address
The destination operand must be a register.
3.5.2 Unary and Binary Operations
As with the MOV instructions, the two operands cannot both be memory location.
3.5.3 Shift Operations
第一个操作数(移多少位)可以为立即数或寄存器.
如果为寄存器,则只能是
%cl,且其中太高位会被忽略,比如
%cl为
0xFF时,
salb最多只会左移 7 位,而不会左移 255 位
3.5.4 Discussion
We see that most of the instructions above can be used for either unsigned or two’s-complement arithmetic.
Only right shifting requires instructions that differentiate between signed versus unsigned data.
3.5.5 Special Arithmetic Operations
Instruction | Effect | Description |
---|---|---|
imulq S | R[%rdx]:R[%rax] \(\leftarrow\) S \(\times\) R[%rax] | Signed full multiply |
mulq S | R[%rdx]:R[%rax] \(\leftarrow\) S \(\times\) R[%rax] | Unsigned full multiply |
cqto | R[%rdx]:R[%rax] \(\leftarrow\) SignExtend(R[%rax]) | Convert to oct word |
idivq S | R[%rdx] \(\leftarrow\) R[%rdx]:R[%rax] mod S R[%rax] \(\leftarrow\) R[%rdx]:R[%rax] / S | Signed divide |
div S | R[%rdx] \(\leftarrow\) R[%rdx]:R[%rax] mod S R[%rax] \(\leftarrow\) R[%rdx]:R[%rax] / S | Unsigned divide |
Special arithmeti operations.
These operations provide full 128-bit multiplication and division, for both signed and unsigned numbers.
The pair of register %rdx and %rax are viewed as forming a single 128-bit oct word
前面的 imulq有可能会溢出截断,所以这里添加了对 128 位的支持,高 64 位存在 %rdx里,低 64 位存在 %rax里
同时这里有 Signed 和 Unsigned 两种 imulq(前面的截断版的 imulq则不用区分)
整除支持 128 位(同样要分开存)整除 64 位,商存在 %rax里,余数存在 %rdx里
整除也有 Signed 和 Unsigned 两种
3.6 Control
3.6.1 Condition Codes
The CPU maintains a set of single-bit condition code registers describing attributes of the most recent arithmetic or logical operation.
- CF: Carry flag.
The most recent operation generated a carry out of the most significant bit.
Used to detect t overflow for unsigned operations. - ZF: Zero flag.
The most recent operation yielded zero. - SF: Sign flag.
The most recent operation yielded a negative value. - OF: Overflow flag.
The most recent operation caused a two’s-complement overflow——either negative or positive.
All instructions listed in 3.5 (except
leaq) caused the condition codes to be set.
In addition, there are two instruction classes that set condition codes without altering any other registers.
Instruction | Based on | Description |
---|---|---|
CMP \(S_1, S_2\) | \(S_2 - S_1\) | Compare |
cmpb | Compare byte | |
cmpw | Compare word | |
cmpl | Compare double word | |
cmpq | Compare quad word | |
TEST \(S_1, S_2\) | \(S_2 \& S_1\) | Test |
testb | Test byte | |
testw | Test word | |
testl | Test double word | |
testq | Test quad word |
Comparison and test instructions.
These instructions set the condition codes without updating any other registers.
3.6.2 Accessing the Condition Codes
Instruction | Synonym | Effect | Set condition |
---|---|---|---|
sete D | setz | D \(\leftarrow\) ZF | Equal / zero |
setne D | setnz | D \(\leftarrow\) ~ZF | Not equal / not zero |
sets D | D \(\leftarrow\) SF | Negative | |
setns D | D \(\leftarrow\) ~SF | Nonnegative | |
setg D | setnle | D \(\leftarrow\) ~(SF^OF) & ~ZF | Greater (signed >) |
setge D | setnl | D \(\leftarrow\) ~(SF^OF) | Greater or equal (signed >=) |
setl D | setnge | D \(\leftarrow\) SF^OF | Less (signed <) |
setle D | setng | D \(\leftarrow\) (SF^OF) | ZF | Less or equal (signed <=) |
seta D | setnbe | D \(\leftarrow\) ~CF & ~ZF | Above (unsigned >) |
setae D | setnb | D \(\leftarrow\) ~CF | Above or equal (unsigned >=) |
setb D | setnae | D \(\leftarrow\) CF | Below (unsigned <) |
setbe D | setna | D \(\leftarrow\) CF | ZF | Below or equal (unsigned <=) |
The SET instructions.
Each instruction sets a single byte to 0 or 1 based on some combination of the condition codes.
Some instructions have "synonyms," that is, alternate names for the same machine instruction.
3.6.3 Jump Instructions
Instruction | Synonym | Jump condition | Description |
---|---|---|---|
jmp Label | 1 | Direct jump | |
jmp *Operand | 1 | Indirect jump | |
je Label | jz | ZF | Equal / zero |
jne Label | jnz | ~ZF | Not equal / not zero |
js Label | SF | Negative | |
jns Label | ~SF | Nonnegative | |
jg Label | jnle | ~(SF^OF) & ~ZF | Greater (signed >) |
jge Label | jnl | ~(SF^OF) | Greater or equal (signed >=) |
jl Label | jnge | SF^OF | Less (signed <) |
jle Label | jng | (SF^OF) | ZF | Less or equal (signed <=) |
ja Label | jnbe | ~CF & ~ZF | Above (unsigned >) |
jae Label | jnb | ~CF | Above or equal (unsigned >=) |
jb Label | jnae | CF | Below (unsigned <) |
jbe Label | jna | CF | ZF | Below or equal (unsigned <=) |
The jump instructions.
These instructions jump to a labeled destination when the jump condition have "synonyms," alternate names for the same machine instruction.
3.6.4 Jump Instruction Encodings
PC relative encode the difference between the address of the target instruction and the address of the instruction immediately following the jump.
A second encoding method is to given an “absolute” address, using 4 bytes to directly specify target.
3.6.5 Implementing Conditional Branches with Conditional Control
The general form of an if-else statement in C is given by the template:
1 2 3 4 |
if (test-expr) then-statement else else-statement |
For this general form, the assembly implementation typically adheres to the following form, where we use C syntax to describe the control flow.
1 2 3 4 5 6 7 8 |
t = test-expr; if (!t) goto false then-statement goto done; false: else-statement done: |
3.6.6 Implementing Conditional Branches with Conditional Moves
Instruction | Synonym | Move condition | Description |
---|---|---|---|
cmove S, R | cmovz | ZF | Equal / zero |
cmovne S, R | cmovnz | ~ZF | Not equal / not zero |
cmovs S, R | SF | Negative | |
cmovns S, R | ~SF | Nonnegative | |
cmovg S, R | cmovnle | ~(SF^OF) & ~ZF | Greater (signed >) |
cmovge S, R | cmovnl | ~(SF^OF) | Greater or equal (signed >=) |
cmovl S, R | cmovnge | SF^OF | Less (signed <) |
cmovle S, R | cmovng | (SF^OF) | ZF | Less or equal (signed <=) |
cmova S, R | cmovnbe | ~CF & ~ZF | Above (unsigned >) |
cmovae S, R | cmovnb | ~CF | Above or equal (unsigned >=) |
cmovb S, R | cmovnae | CF | Below (unsigned <) |
cmovbe S, R | cmovna | CF | ZF | Below or equal (unsigned <=) |
The conditional move instructions
These instructions copy the source value S to its destination R when the move condition holds.
Some instructions have "synonyms," alternate names for the same machine instruction.
都已经有 Conditional Control 了,为什么还要有 Conditional Move 呢?
因为 CPU 是流水线式执行指令的,所以 je指令还没执行完(比如刚刚取完址)时,就会加载执行后几条指令
可是 je执行完之前,是不能确定接下来执行什么指令的,CPU 只能猜测,现代 CPU 的猜测正确率有大约 90%
如果猜错了,流水线上已经预执行的指令就要清除掉,流水线要从新从头开始执行后几条指令,这将浪费约 15-30 个时钟周期的时间
而 cmove则不会有这个问题,它的下一条指令是预先就确定的
Consider the following general form of conditional expression and assignment:
v = test-expr ? then-expr : else-expr;
The standard way to compile this expression using conditional control transfer would have the following form:
1 2 3 4 5 6 7 |
if (!test-expr) goto false v = then-statement goto done; false: v = else-statement done: |
Based on a conditional move, this can be described by the following abstract code:
1 2 3 4 |
v = then-expr; ve = else-expr; t = test-expr; if (!t) v = ve; |
可以发现,如果
then-expr或者
else-expr耗时太长(比猜测失败浪费的时间多),第二种方法反而会更慢(不过这只是少数情况)
还有,如果其中一个分支会有 side-effect 或者 error condition,也不能用第二种方法,比如:
return (xp ? *xp : 0);
3.6.7 Loops
Do-While Loops
The general form of a do-while statement is as follows:
1 2 3 |
do body-statement while (test-expr) |
This general form can be translated into conditionals and goto statement as follows:
1 2 3 4 5 |
loop: body-statement t = test-expr; if (t) goto loop; |
While Loops
The general form of a do-while statement is as follows:
1 2 |
while (test-expr) body-statement |
The first translation method, which we refer to as jump to middle:
1 2 3 4 5 6 7 |
goto test; loop: body-statement test: t = test-expr; if (t) goto loop; |
The second translation method, which we refer to as guarded do:
1 2 3 4 5 6 7 8 9 |
t = test-expr if (!t) goto done; loop: body-statement t = test-expr; if (t) goto loop; done: |
Using this implementation strategy, the compiler can often optimize the initial test, for example, determining that the test condition will always hold.
For Loops
The general form of a for loop is as follows:
1 2 |
for (init-expr; test-expr; update-expr) body-statement |
The behavior of such a loop is identical to the following code using a while loop:
1 2 3 4 5 |
init-expr; while (test-expr) body-statement update-expr; } |
The code generated by GCC for a for loop then follows one of our two translation strategies for while loops, depending on the optimization level.
3.6.8 Switch Statements
A switch statement not only make the C code more readable, but also allow an efficient implementation using a data structure called a jump table.
A jump table is an array where entry i is the address of a code segment implementing the action the program should take when the switch index equals i.
In the assembly code, the jump table is indicated by the following declarations, to which we have added comment:
1 2 3 4 5 6 7 8 9 10 |
.section .rodata .align 8 # Align address to multiple of 8 .L4: .quad .L3 # Case 100: loc_A .quad .L8 # Case 100: loc_def .quad .L5 # Case 100: loc_B .quad .L6 # Case 100: loc_C .quad .L7 # Case 100: loc_D .quad .L8 # Case 100: loc_def .quad .L7 # Case 100: loc_D |
The following example has a number of interesting features, including case labels that do not span a contiguous range, case with multiple label, and cases that fall through to other cases.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
void switch_eg(long x, long n, long *dest) { long val = x; switch (n) { case 100: val *= 13; break; case 102: val += 10; /* Fall through */ case 103: val += 11; break; case 104: case 106: val *= val; break; default: val = 0; } *dest = val; } |
Recall that the operator ‘&’ creates a pointer for a data value.
In making this extension, the authors of GCC created a new operator && to create a pointer for a code location.
The computed goto is supported by GCC as an extension to the C language.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
void switch_eg_impl(long x, long n, long *dest) { /* Table of code pointers */ static void *jt[7] = { &&loc_A, &&loc_def, &&loc_B, &&loc_C, &&loc_D, &&loc_def, &&loc_D }; unsigned long index = n - 100; long val; if (index > 6) goto loc_def; /* Multiway branch */ goto *jt[index]; loc_A: /* Case 100 */ val = x * 13; goto done; loc_B: /* Case 102 */ x = x + 10; /* Fall through */ loc_C: /* Case 103 */ val = x + 11; goto done; loc_D: /* Case 104, 106 */ val = x * x; goto done; loc_def: /* Default case */ val = 0; done: *dest = val; } |
The compiler first shifts the range to between 0 and 6 by subtracting 100 from n.
It further simplifies the branching possibilities by treating index as an unsigned value, making use of the fact that negative numbers in a two’s-complement representation map to large positive numbers in an unsigned representation.
It can therefore test whether index is outside of the range 0-6 by testing whether it is greater than 6.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
switch_eg: subq $100, %rsi compq $7, %rsi ja .L8 jmp *.L4(,%rsi,8) .L3: leaq (%rdi,%rdi,2), %rax leaq (%rdi,%rax,4), %rdi jmp .L2 .L5: addq $10, %rdi .L6: addq $11, %rdi jmp .L2 .L7: imulq %rdi, %rdi jmp .L2 .L8: movl $0, %edi .L2: movq %rdi, (%rdx) ret |
3.7 Procedures
For discussion purposes, suppose procedure P calls procedure Q, and Q then executes and return back to P.
These actions involve one or more of the following mechanisms:
- Passing control.
The program counter must be set to the starting address of the code for Q upon entry and then set to the instruction in P following the call to Q upon return. - Passing data.
P must be able to provide one or more parameters to Q, and Q must be able to return a value back to P.
Allocating and deallocating memory.
Q must need to allocate space for local variables when it begins and then free that storage before it returns.
3.7.1 The Run-Time Stack
We consider the return address to be part of P’s stack frame, since it holds state relevant to P.
x86-64 procedures allocate only portions of stack frames they require.
For example, many procedures have six or fewer argument, and so all of their parameters can be passed in registers.
Indeed, many functions do not even require a stack frame. (When all of the local variables can be held in registers and the function does not call any other functions.)
3.7.2 Control Transfer
The added suffix ‘q’ in
callq and
retq simply emphasizes that these are x86-64 versions of call and return instructions, not IA32. In x86-64 assembly code, both versions (with or without ‘q’) can be used interchangeably.
Like jumps, a call can be either direct or indirect. In assembly code, the target of a direct call is given as a label, while the target of an indirect call is given by ‘*’ followed by an operand specifier.
3.7.3 Data Transfer
When Q returns back to P, the code for P can access the returned value in register
%rax
With x86-64, up to 6 integral arguments can be passed via registers.
1st | 2rd | 3rd | 4th | 5th | 6th | |
---|---|---|---|---|---|---|
64 bits | %rdi | %rsi | %rdx | %rcx | %r8 | %r9 |
32 bits | %edi | %esi | %edx | %ecx | %r8d | %r9d |
16 bits | %di | %si | %dx | %cx | %r8w | %r9w |
8 bits | %dil | %sil | %dl | %cl | %r8b | %r9b |
When a function has more than 6 integral arguments, the other ones are passed on the stack.
When passing parameters on the stack, all data sizes are rounded up to be multiples of eight.
3.7.4 Local Storage on the Stack
At times,however, local data must be stored in memory:
- There are not enough registers to hold all of the local data.
- The address operator ‘&’ is applied to a local variable, and hence we must be able to generate an address for it.
- Some of the local variables are arrays or structures and hence must be accessed by array or structure references.
Typically, a procedure allocates space on the stack frame by decrementing the stack pointer.
3.7.5 Local Storage in Registers
By convention, registers
%rbx,
%rbp, and
%r12 –
%r15 are classified as callee-saved registers.
When procedure P calls procedure Q, Q must preserve the values of these registers.
Procedure Q can preserve a register value by either not changing ti at all or by pushing the original value on the stack, altering it, and then popping the old value from the stack before returning.
3.7.6 Recursive Procedures
The conventions we have described for using the registers and the stack allow x86-64 procedures to call themselves recursively just like any other function call.
3.8 Array Allocation and Access
3.8.1 Basic Principles
3.8.2 Pointer Arithmetic
C allows arithmetic on pointers, where the computed value is scaled according to the size of the data type referenced by the pointer.
Suppose the starting address of integer array E and integer index x are stored in register %rdx and %rcx, resprectively.
Expression | Type | Value | Assembly code |
---|---|---|---|
E | int * | \(X_E\) | movq %rdx, %rax |
E[0] | int | \(M[X_E]\) | movl (%rdx), %eax |
E[i] | int | \(M[X_E + 4i]\) | movl (%rdx, %rcx, 4), %eax |
&E[2] | int * | \(X_E + 8\) | leaq 8(%rdx), %rax |
E+i-1 | int * | \(X_E + 4i - 4\) | leaq -4(%rdx, %rcx, 4), %rax |
*(E+i-3) | int | \(M[X_E + 4i - 12]\) | movl -12(%rdx, %rcx, 4), %eax |
&E[i]-E | long | i | movq %rcx, %rax |
3.8.3 Nested Arrays
Array element D[i][j] is at memory address \(\&D[i][j] = x_D + L(C \times i + j)\), where L is the size of data type T in bytes.
3.8.4 Fixed-Size Arrays
The C compiler is able to make many optimizations for code operating on multi-dimensional arrays of fixed size.
1 2 3 4 5 6 7 8 |
/* Compute i, k of fixed matrix product */ int fix_prod_ele(fix_matrix A, fix_matrix B, long i, long k) { long j; int result = 0; for (j = 0; j < N; j++) result += A[i][j] *B[j][k]; return result; } |
1 2 3 4 5 6 7 8 9 10 11 12 |
/* Compute i, k of fixed matrix product */ int fix_prod_ele_opt(fix_matrix A, fix_matrix B, long i, long k) { int *Aptr = &A[i][0]; int *Bptr = &B[0][k]; int *Bend = &B[N][k]; do { result += *Aptr * *Bptr; Aptr++; Bptr += N; } while (Bptr != Bend); return result; } |
3.8.5 Variable-Size Arrays
Historically, programmers requiring variable-size arrays had to allocate storage for these arrays using function such as malloc or calloc, and they had to explicitly encode the mapping of multidimensional arrays int single-dimension ones via row-major indexing, as expressed in 3.8.3
ISO C99 introduced the capability of having array dimension expressions that are computed as the array is being allocated.
So, for example, we can write a function to access element i,j of an n*n array as follows:
1 2 3 |
int val_ele(long n, int A[n][n], long i, long j) { return A[i][j]; } |
The parameter n must precede the parameter A[n][n], so that the function can compute the array dimensions as the parameter is encountered.
Gcc generates code for this referencing function as
1 2 3 4 5 6 |
; n in %rdi, A in %rsi, i in %rdx, j in %rcx var_ele: imulq %rdx, %rdi ; Compute n*i leaq (%rsi, %rdi, 4), %rax ; Compute x_A + 4(n*i) movl (%rax, %rcx, 4), %eax ; Read from M[x_A + 4(n*i) + 4j] ret |
We see that referencing variable-size arrays requires only a slight generalization over fixed-size ones. The dynamic version must use a multiplication instruction to scale i by n (which may incur significant performance penalty), rather than a series of shifts and adds.
When variable-size arrays are referenced within a loop, the compiler can often optimize the index computations by exploiting the regularity of the access patterns (like the codes in 3.8.4).
3.9 Heterogeneous Data Structures
C provides two mechanisms for creating data types by combining objects of different types: structures, declared using keyword struct, aggregate multiple objects into a single unit: unions, declared using keyword union, allow an object to be referenced using several different types.
3.9.1 Structures
To access the fields of a structure, the compiler generates code that adds the appropriate offset to the address of the structure. The selection of the different fields of a structure is handled completely at compile time. The machine code contains no information about the field declarations or the names of the fields.
3.9.2 Unions
Unions provide a way to circumvent the type system of C (at the same time, bypass the safety provided by the C type system).
Unions can also be used to access the bit patterns of different data types.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
unsigned long double2bits(double d) { union { double d; unsigned long u; } temp; temp.d = d; return temp.u; } /* on a little-endian machine, word0 will become the low-order 4 bytes of d */ double uu2double(unsigned word0, unsigned word1) { union { double d; unsigned u[2]; } temp; temp.u[0] = word0; temp.u[1] = word1; return temp.d; } |
3.9.3 Data alignment
Many computer systems place restrictions on the allowable addresses for the primitive data types, requiring that the address for some objects must be a multiple of K (typically 2, 4, or 8). Such alignment restrictions simplify the design of the hardware forming the interface between the processor and the memory system.
For example, suppose a processor always fetches 8 bytes from memory with an address that must be a multiple of 8. If we can guarantee that any double will be aligned to have its address be a multiple of 8, then the value can be read or written with a single memory operation. Otherwise, we may need two.
The x86-64 hardware will work correctly regardless of the alignment of data. However, Intel recommends that data be aligned to improve memory system performance.
Their alignment rule is based on the principle that any primitive object of K bytes must have an address that is multiple of K.
The compiler places directives in the assembly code indicating the desired alignment for global data. For examble,
.align 8 ensures that the data following it will start with an address that is a multiple of 8.
For code involving structures, the compiler may need to insert gaps in the field allocation to ensure that each structure element satisfies its alignment requirement. In additions, the compiler may need to add padding to the end of the structure.
1 2 3 4 5 6 |
struct S2 { int i; int j; char c; }; struct S2 d[4]; |
The elements of d will have address x, x + 12, x + 24, x + 36. As long as x is a multiple of 4, all of the alignment restrictions will be satisfied.
3.10 Combining Control and Data in Machine-Level Programs
In this section, we look at ways in which data and control interact with each other.
3.10.1 Understanding Pointers
- Every pointers has an associated type.
Pointer types are not part of machine code; they are an abstraction provided by C to help programmers avoid addressing errors. - Every pointer has a value.
The value is an address of some object of the designated type. - Pointers are created with the ‘&’ operator.
- Pointers are dereferenced with the ‘*’ operator.
- Arrays and pointers are closely related.
- Casting from one type of pointer to another changes its type but not its value.
- Pointers can also point to functions.
The value of a function pointer is the address of the first instruction in the machine-code representation of the function.
3.10.2 Life in the Real World: Using the GDB Debugger
3.10.3 Out-of-Bounds Memory References and Buffer Overflow
We have seen that C does not perform any bounds checking for array references, and that local variables are stored on the stack along with state information such as saved register values and return addresses. This combination can lead to serious program errors, where the state stored on the stack gets corrupted by a write to an out-of-bounds array elements. When the program then tries to reload the register or execute a ret instruction with this corrupted state, things can go seriously wrong.
Typically, the program is fed with a string that contains the byte encoding of some executable code, called the exploit code, plus some extra bytes that overwrite the return address with a pointer to the exploit code. The effect of executing the ret instruction is then to jump to the exploit code.
1 2 3 4 5 |
void echo() { char buf[8]; gets(buf); puts(buf); } |
3.10.4 Thwarting Buffer Overflow Attacks
Stack Randomization
In order to insert exploit code into a system, the attacker needs to inject both the code as well as a pointer to this code as part of the attack string. Generating this pointer requires knowing the stack address where the string will be located.
The idea of stack randomization is to make the position of the stack vary from one run of a program to another.
This is implemented by allocating a random amount of space between 0 and n bytes on the stack at the start of a program, for example, by using the allocation function alloca, which allocates space for a specified number of bytes on the stack. This allocated space is not used by the program, but it causes all subsequent stack locations to vary from one execution of a program to another. The allocation range n needs to be large enough to get sufficient variations in the stack address, yet small enough that it does not waste too much space in the program.
Stack randomization is one of a larger class of techniques known as address-space layout randomization, of ASLR. With ASLR, different parts of the program, including program code, library code, stack, global variables, and heap data, are loaded into different regions of many each time a programs is run.
However, the attacker can overcome randomization by brute force. A common trick is to include a long sequence of nop instructions (called nop sled) before the actual exploit code. As long as the attacker can guess an address somewhere within this sequence, the program will run through the sequence and then hit the exploit code. If we set up a 256-byte nop sled, then the randomization over \(n=2^{23}\) can be cracked by enumerating \(2^{15}\) starting address, which is feasible.
Stack Corruption Detection
The canary value, also referred as a guard value, is generated randomly each time the program runs.
Recent versions of GCC try to determine whether a function is vulnerable to a stack overflow and insert this type of overflow detection automatically.
Limiting Executable Code Regions
In typical programs, only the portion of memory holding the code generated by the compiler need be executable. The other portions can be restricted to allow just reading and writing.
Historically, the x86 architecture merged the read and execute access controls into a single 1-bit flag, so that any page marked as readable was also executable.
More recently, AMD introduced an NX (for “no-execute”) bit into memory protection for its 64-bit processors, separating the read and execute access modes, and Intel followed suit.
The checking of whether a page is executable is performed in hardware, with no penalty in efficiently.
However, some types of programs require the ability to dynamically generate and execute code. For example, just-in-time compilation techniques dynamically generate code for programs written in interpreted languages, such as Java, to improve execution performance.
3.10.5 Supporting Variable-Size Stack Frames
We have examined the machine-level code for a variety of functions so far, but they all have the property that the compiler can determine in advance the amount of space that must be allocated for their stack frames.
Some functions, however, require a variable amount of local storage. This can occur, for example, when the function calls alloca, a standard library function that can allocate an arbitrary number of bytes of storage on the stack. It can also occur when the code declares a local array of variable size.
To manage a variable-size stack frame, x86-64 code use register %rbp to serve as a frame pointer (sometimes referred as a base pointer). In the earlier versions of x86 code, the frame pointer was used with every function call. With x86-64 code, it is used only in cases where the stack frame may be of variable size. Historically, most compilers used frame pointers when generating IA32 code. Recent versions of GCC have dropped this convention. Observe that it is acceptable to mix code that uses frame pointers with code that does not, as long as all functions treat %rbp as a callee-saved register.
1 2 3 4 5 6 7 8 |
long vframe(long n, long idx, long *q) { long i; long *p[n]; p[0] = &i; for (i = 1; i < n; i++) p[i] = q; return *p[idx]; } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
; n in %rdi, idx in %rsi, q in %rdx vframe: pushq %rbp movq %rsp, %rbp subq $16, %rsp leaq 22(,%rdi,8), %rax andq $-16, %rax subq %rax, %rsp leaq 7(%rsp), %rax shrq $3, %rax leaq 0(,%rax,8), %r8 movq %r8, %rcx ; ... ; i in %rax and on stac, n in %rdi, p in %rcx, q in %rdx .L3: movq %rdx, (%rcx, %rax, 8) addq $1, %rax movq %rax, -8(%rbp) .L2: movq -8(%rbp), %rax cmpq %rdi, %rax jl .L3 ; ... leave ret |
3.11 Floating-Point Code
x86-64 floating point is based on SSE of AVX, including conventions for passing procedure arguments and return values.
The AVX floating-point architecture allows data to be stored in 16 YMM registers, named %ymm0-%ymm15. Each YMM register is 256 bits long.
When operating on scalar data, these registers only hold floating-point data, and only the low-order 32 bits (for float) or 64 bit (for double) are used. The assembly code refers to the registers by their SSE XMM register names %xmm0-%xmm15, where each XMM register is the low-order 128 bits of the corresponding YMM register.
The scalar instructions operate on individual data values, while some instructions operate on packed data, meaning that they update the entire destination XMM register. Once again, our only interest for scalar data is the effect these instructions have on the low-order 4 or 8 bytes of the destination.
(由于浮点数相关的指令名太丑,下面将忽略)
3.11.1 Floating-Point Movement and Conversion Operations.
GCC uses the scalar movement operations only to transfer data from memory to an XMM register or from an XMM register to memory.
For transferring data between two XMM registers, it uses one of two different instructions for copying the entire contents of on XMM register to another.
The sets of instructions for conversion are all scalar instructions.
When converting floating-point values to integers, they perform truncation, rounding values toward zero.
3.11.2 Floating-Point Code in Procedures
- Up to eight floating-pointing arguments can be passed in XMM registers %xmm0-%xmm7.
These registers are used in the order the argument are listed from left to right.
Additional floating-point argument can be passed on the stack. - A function that return a floating-point value does so in register %xmm0.
- All XMM registers are caller saved.
- When a function contains a combination of pointer, integer, and floating-point arguments, the pointers and integers are passed in general-purpose register, while the floating-point values are passed in XMM registers.
3.11.3 Floating-Point Arithmetic Operations
The second source operand and the destination operands must be XMM register.
3.11.4 Defining and Using Floating-Point Constants
AVX floating-point operations cannot have immediate values as operands. In instead, the compiler must allocate and initialize storage for any constant values. The code then reads the values from memory.
1 2 3 4 |
vmulsd .LC2(%rip), %xmm0, %xmm0 ; multiply by 1.8 .LC2: .long 3435973837 .long 1073532108 |
3.11.5 Using Bitwise Operations in Floating-Point Code
These operations all act on packed data, applying the bitwise operation to all the data in the two source registers.
3.11.6 Floating-Point Comparison Operations
For floating-point comparisons, the parity flag PF is set when either operand is NaN. By convention, any comparison in C is considered to fail when one of the arguments is NaN, and this flag is used to detect such a condition. For example, even the comparison
x == x yields 0 when x is NaN.
Instructions such as jp, ja, and jb are used to conditionally jump on various combinations of flags.
3.11.7 Observations about Floating-Point Code
AVX2 has the potential to make computations run faster by performing parallel operations on packed data. Compiler developers are working on automating the conversion of scalar code to parallel code, but currently the most reliable way to achieve higher performance through parallelism is to use the extensions to the C language supported by GCC for manipulating vectors of data. See Web Aside OPT:SIMD on page 546 to see how this can be done.
3.12 Summary
In this chapter, we have peered beneath the layer of abstraction provided by the C language to get a view of machine-level programming.
By having the compiler generate an assembly-code representation of the machine-level program, we gain insights into both the compiler and its optimization capabilities, along with the machine, its data types, and its instruction set (will be useful in Chapter 5).
We have also gotten a more complete picture of how the program stores data in different memory regions (will be useful in Chapter 12).
We have only examined the mapping of C onto x86-64, but much of what we have covered is handled in a similar way for other combinations of language and machine.