Chapter 3: Machine-Level Representation of Programs

这章主要介绍了汇编,不过切入点是汇编与 C 语言之间的关系,也就是所谓的 a Programmer’s Perspective

Contents

3.1 A Historical Perspectivey

汇编一路发展,一路向下兼容(弄得语法好别扭。。。
本书是第三版,主要介绍 64 位的汇编,而且用的语法是 AT&T 而不是 Intel。

32bits vs 64bits

  • 32 位的汇编少了一波 64 位的寄存器/指令
  • 因此不像 64 位汇编那样依赖寄存器,比如传参时只会用栈(在 64 位中,前 6 个参数会用寄存器)
  • 传统的 32 位汇编在函数调用时,会用 %ebp来记录栈帧起始地址(在现代的 32 位汇编与 64 位汇编中, %ebp只是一个普通的 callee-saved register,在函数调用时可以不用它来记录栈帧起始地址,而直接通过栈顶 %esp或者 %rsp来定位参数,例外:3.10.5

Intel vs ATT

  • The Intel code omits the size designation suffixes. We see instruction push and mov instead of pushq and movq.
  • The Intel code omits the ‘%’ character in front of register names, using rbx instead of $rbx.
  • The Intel code has a different way of describing locations in memory——for examble, QWORD PTR [rbx] rather than (%rbx).
  • Instruction with multiple operands list them in the reverse order.

3.2 Program Encodings

preprocessor -> compiler -> assembler -> linker
The assembler converts the assembly code into binary object-code files *.o
Object code is one form of machine code——it contains binary representations of all of the instructions, but the addresses of global values are not yet filled in.

3.2.1 Machine-Level Code

The machine code for x86-64 differs greatly from the original C code.
Parts of the processor state are visible that normally are hidden from the C programmer:

  • program counter (commonly referred to as the PC, and called %rip in x86-64) …
  • The integer register file
  • The condition code registers …
  • A set of vector register …

3.2.2 Code Examples

Several features about machine code and its disassembled representation are worth noting:

  • x86-64 instructions can range in length from 1 to 15 bytes.
    The instruction encoding is designed so that commonly used instructions and those with fewer operands require a smaller number of bytes than do less common ones or ones with more operands.
  • The instruction format is designed in such a way that from a given starting position, there is a unique decoding of the bytes into machine instructions.
  • The disassembler determines the assembly code based purely on the byte sequences in the machine-code file.
  • The dissembler users a slightly different naming convention for the instructions than does the assembly code generated by GCC.

3.2.3 Notes on Formatting

All of the lines beginning with ‘.’ are directives to guide the assembler and linker.


3.3 Data Formats

C declarationIntel data typeAssembly-code suffixSize (bytes)
charByteb1
shortWordw2
intDouble wordl4
longQuad wordq8
char *Quad wordq8
floatsingle precisions4
doubleDouble precisionl8

Sizes of C data types in x86-64.
With a 64-bit machine, pointers are 8 bytes long.


3.4 Accessing Information

CSAPP Figure 3.2 Ineger registers
Figure 3.2: Ineger registers: The low-order portions of all 16 registers can be accessd as byte, word, double word, and quad word quantities.

When instructions have registers as destinations, two conventions arise for what happens to the remaining bytes in the register for instruction that generate less than 8 bytes:
Those that generate 1- or 2-byte quantities leave the remaining bytes unchanged.
Those that generate 4-byte quantities set the upper 4 bytes of the register to zero.
(The latter convention was adopted as part of the expansion from IA32 to x86-64.)
(这里的 remaining bytes 是指 64位寄存器的剩余字节,比如在 movswl %ax, %eax里,remaining bytes 就指 %rax的高 32 位)

3.4.1 Operand Specifiers

TypeFormOperand valueName
Immediate\($Imm\)\(Imm\)Immediate
Register\(r_a\)\(R[r_a]\)Register
Memory\(Imm\)\(M[Imm]\)Absolute
Memory\((r_a)\)\(M[R[r_a]]\)Indirect
Memory\(Imm(r_b)\)\(M[Imm + R[r_b]]\)Base + displacement
Memory\((r_b, r_i)\)\(M[R[r_b] + R[r_i]]\)Indexed
Memory\(Imm(r_b, r_i)\)\(M[Imm + R[r_b] + R[r_i]]\)Indexed
Memory\(( , r_i, s)\)\(M[R[r_i] \cdot s]\)Scaled indexed
Memory\(Imm( , r_i, s)\)\(M[Imm + R[r_i] \cdot s]\)Scaled indexed
Memory\((r_b, r_i, s)\)\(M[R[r_b] + R[r_i] \cdot s]\)Scaled indexed
Memory\(Imm(r_b, r_i, s)\)\(M[Imm + R[r_b] + R[r_i] \cdot s]\)Scaled indexed

Operand forms.
Opreands can denote immediate (constant) values, register values, or values from memory.
The scaling factor s must be either 1, 2, 4, or 8.


We use the notation \(r_a\) to denote an arbitrary register a and indicate its value with the reference \(R[r_a]\), viewing the set of registers as an array R indexed by register identifiers.

3.4.2 Data Movement Instructions

InstructionEffectDescription
MOV   S, DD \(\leftarrow\) SMove
  movbMove byte
  movwMove word
  movlMove double word
  movqMove quad word
movabsq   I, RR \(\leftarrow\) IMove absolute quad word

Single data movement instructions.


x86-64 imposes the restriction that a move instruction cannot have both operands refer to memory location.
The size of the register must match the size designated by the last character of the instruction.
The regular movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers. This value is then sign extended to produce the 64-bit value for the destination.
( movq 的源是寄存器、内存时,要 64 位,是立即数时,要 32 位)
The movabsq instruction can have an arbitrary 64-bit immediate value as its source operand and can only have a register as a destination.
movabsq 的源只能是立即数,目的只能是寄存器)

InstructionEffectDescription
MOVZ   S, RR \(\leftarrow\) ZeroExtend(S)Move with zero extension
  movzbwMove zero-extended byte to word
  movzblMove zero-extended byte to doubler word
  movzwlMove zero-extended word to double word
  movzbqMove zero-extended byte to quad word
  movzwqMove zero-extended word to quad word

Zero-extending data movement instructinos.
These instructions have a register or memory location as the source and a register as the destination.


Note the absence of explicit instruction (movzlq) to zero-extend a 4-byte source value to an 8-byte destination (which can be implemented using a movl instruction having a register as the destination).

InstructionEffectDescription
MOVS   S, RR \(\leftarrow\) SignExtend(S)Move with sign extension
  movsbwMove sign-extended byte to word
  movsblMove sign-extended byte to doubler word
  movswlMove sign-extended word to double word
  movsbqMove sign-extended byte to quad word
  movswqMove sign-extended word to quad word
  movslqMove sign-extended double word to quad word
cltq%rax \(\leftarrow\) SignExtend(%eax)Sign-extend %eax to %rax

Sign-extending data movement instructinos.
The MOVS instructions have a register or memory location as the source and a register as the destination.
The cltq instruction specific to registers %eax and %rax


cltq has the same effect as the instruction movslq %eax, %rax, but it has a more compact encoding.

3.4.3 Data Movement Example

Dereferencing a pointer involves copying that pointer into a register, and then using this register in a memory reference.

3.4.4 Pushing and Popping Stack Data

InstructionEffectDescription
pushq   SR[%rsp] \(\leftarrow\) R[%rsp] - 8
M[R[%rsp]] \(\leftarrow\) S
Push quad word
popq   DD \(\leftarrow\) M[R[%rsp]]
R[%rsp] \(\leftarrow\) R[%rsp] + 8
Pop quad word

Push and pop instructions.


3.5 Arithmetic and Logical Operations

InstructionEffectDescription
leaq   S, DD \(\leftarrow\) &SLoad effective address
INC   DD \(\leftarrow\) D + 1Increment
DEC   DD \(\leftarrow\) D - 1Decrement
NEG   DD \(\leftarrow\) -DNegate
NOT   DD \(\leftarrow\) ~DComplement
ADD   S, DD \(\leftarrow\) D + SAdd
SUB   S, DD \(\leftarrow\) D - SSubtract
IMUL  S, DD \(\leftarrow\) D * SMultiply
XOR   S, DD \(\leftarrow\) D ^ SExclusive-or
OR    S, DD \(\leftarrow\) D | SOr
AND   S, DD \(\leftarrow\) D & SAnd
SAL   k, DD \(\leftarrow\) D << kLeft shift
SHL   k, DD \(\leftarrow\) D << kLeft shift (same as SAL)
SAR   k, DD \(\leftarrow\) D <<\(_A\) kArithmetic right shift
SHR   k, DD \(\leftarrow\) D <<\(_L\) kLogical right shift

Integer arithmetic operations.
The effective address (leaq) instruction is commonly used to perform simple arithmetic.
The remaining ones are more standard unary or binary operations.
We use the notation >>_A and >>_L to denote arithmetic and logical right shift, respectively.
Note the nonintuitive ordering of the operands with ATT-format assembly code.


Most of the operations are given as instruction classes, as they can have different variant with different operand sizes. (Only leaq has no other size variants.)

3.5.1 Load Effective Address

The destination operand must be a register.

3.5.2 Unary and Binary Operations

As with the MOV instructions, the two operands cannot both be memory location.

3.5.3 Shift Operations

第一个操作数(移多少位)可以为立即数或寄存器.
如果为寄存器,则只能是 %cl,且其中太高位会被忽略,比如 %cl0xFF时, salb最多只会左移 7 位,而不会左移 255 位

3.5.4 Discussion

We see that most of the instructions above can be used for either unsigned or two’s-complement arithmetic.
Only right shifting requires instructions that differentiate between signed versus unsigned data.

3.5.5 Special Arithmetic Operations

InstructionEffectDescription
imulq   SR[%rdx]:R[%rax] \(\leftarrow\) S \(\times\) R[%rax]Signed full multiply
mulq   SR[%rdx]:R[%rax] \(\leftarrow\) S \(\times\) R[%rax]Unsigned full multiply
cqtoR[%rdx]:R[%rax] \(\leftarrow\) SignExtend(R[%rax])Convert to oct word
idivq    SR[%rdx] \(\leftarrow\) R[%rdx]:R[%rax] mod S
R[%rax] \(\leftarrow\) R[%rdx]:R[%rax] / S
Signed divide
div     SR[%rdx] \(\leftarrow\) R[%rdx]:R[%rax] mod S
R[%rax] \(\leftarrow\) R[%rdx]:R[%rax] / S
Unsigned divide

Special arithmeti operations.
These operations provide full 128-bit multiplication and division, for both signed and unsigned numbers.
The pair of register %rdx and %rax are viewed as forming a single 128-bit oct word


前面的 imulq有可能会溢出截断,所以这里添加了对 128 位的支持,高 64 位存在 %rdx里,低 64 位存在 %rax
同时这里有 Signed 和 Unsigned 两种 imulq(前面的截断版的 imulq则不用区分)
整除支持 128 位(同样要分开存)整除 64 位,商存在 %rax里,余数存在 %rdx
整除也有 Signed 和 Unsigned 两种


3.6 Control

3.6.1 Condition Codes

The CPU maintains a set of single-bit condition code registers describing attributes of the most recent arithmetic or logical operation.

  • CF: Carry flag.
    The most recent operation generated a carry out of the most significant bit.
    Used to detect t overflow for unsigned operations.
  • ZF: Zero flag.
    The most recent operation yielded zero.
  • SF: Sign flag.
    The most recent operation yielded a negative value.
  • OF: Overflow flag.
    The most recent operation caused a two’s-complement overflow——either negative or positive.

All instructions listed in 3.5 (except leaq) caused the condition codes to be set.
In addition, there are two instruction classes that set condition codes without altering any other registers.

InstructionBased onDescription
CMP   \(S_1, S_2\)\(S_2 - S_1\)Compare
  cmpbCompare byte
  cmpwCompare word
  cmplCompare double word
  cmpqCompare quad word
TEST   \(S_1, S_2\)\(S_2 \& S_1\)Test
  testbTest byte
  testwTest word
  testlTest double word
  testqTest quad word

Comparison and test instructions.
These instructions set the condition codes without updating any other registers.

3.6.2 Accessing the Condition Codes

InstructionSynonymEffectSet condition
sete       DsetzD \(\leftarrow\) ZFEqual / zero
setne     DsetnzD \(\leftarrow\) ~ZFNot equal / not zero
sets       DD \(\leftarrow\) SFNegative
setns     DD \(\leftarrow\) ~SFNonnegative
setg       DsetnleD \(\leftarrow\) ~(SF^OF) & ~ZFGreater (signed >)
setge     DsetnlD \(\leftarrow\) ~(SF^OF)Greater or equal (signed >=)
setl        DsetngeD \(\leftarrow\) SF^OFLess (signed <)
setle      DsetngD \(\leftarrow\) (SF^OF) | ZFLess or equal (signed <=)
seta       DsetnbeD \(\leftarrow\) ~CF & ~ZFAbove (unsigned >)
setae     DsetnbD \(\leftarrow\) ~CFAbove or equal (unsigned >=)
setb       DsetnaeD \(\leftarrow\) CFBelow (unsigned <)
setbe     DsetnaD \(\leftarrow\) CF | ZFBelow or equal (unsigned <=)

The SET instructions.
Each instruction sets a single byte to 0 or 1 based on some combination of the condition codes.
Some instructions have "synonyms," that is, alternate names for the same machine instruction.

3.6.3 Jump Instructions

InstructionSynonymJump conditionDescription
jmp     Label1Direct jump
jmp     *Operand1Indirect jump
je       LabeljzZFEqual / zero
jne     Labeljnz~ZFNot equal / not zero
js       LabelSFNegative
jns     Label~SFNonnegative
jg       Labeljnle~(SF^OF) & ~ZFGreater (signed >)
jge     Labeljnl~(SF^OF)Greater or equal (signed >=)
jl        LabeljngeSF^OFLess (signed <)
jle      Labeljng(SF^OF) | ZFLess or equal (signed <=)
ja       Labeljnbe~CF & ~ZFAbove (unsigned >)
jae     Labeljnb~CFAbove or equal (unsigned >=)
jb       LabeljnaeCFBelow (unsigned <)
jbe     LabeljnaCF | ZFBelow or equal (unsigned <=)

The jump instructions.
These instructions jump to a labeled destination when the jump condition have "synonyms," alternate names for the same machine instruction.

3.6.4 Jump Instruction Encodings

PC relative encode the difference between the address of the target instruction and the address of the instruction immediately following the jump.
A second encoding method is to given an “absolute” address, using 4 bytes to directly specify target.

3.6.5 Implementing Conditional Branches with Conditional Control

The general form of an if-else statement in C is given by the template:

For this general form, the assembly implementation typically adheres to the following form, where we use C syntax to describe the control flow.

3.6.6 Implementing Conditional Branches with Conditional Moves

InstructionSynonymMove conditionDescription
cmove       S, RcmovzZFEqual / zero
cmovne     S, Rcmovnz~ZFNot equal / not zero
cmovs       S, RSFNegative
cmovns     S, R~SFNonnegative
cmovg       S, Rcmovnle~(SF^OF) & ~ZFGreater (signed >)
cmovge     S, Rcmovnl~(SF^OF)Greater or equal (signed >=)
cmovl        S, RcmovngeSF^OFLess (signed <)
cmovle      S, Rcmovng(SF^OF) | ZFLess or equal (signed <=)
cmova       S, Rcmovnbe~CF & ~ZFAbove (unsigned >)
cmovae     S, Rcmovnb~CFAbove or equal (unsigned >=)
cmovb       S, RcmovnaeCFBelow (unsigned <)
cmovbe     S, RcmovnaCF | ZFBelow or equal (unsigned <=)

The conditional move instructions
These instructions copy the source value S to its destination R when the move condition holds.
Some instructions have "synonyms," alternate names for the same machine instruction.


都已经有 Conditional Control 了,为什么还要有 Conditional Move 呢?
因为 CPU 是流水线式执行指令的,所以 je指令还没执行完(比如刚刚取完址)时,就会加载执行后几条指令
可是 je执行完之前,是不能确定接下来执行什么指令的,CPU 只能猜测,现代 CPU 的猜测正确率有大约 90%
如果猜错了,流水线上已经预执行的指令就要清除掉,流水线要从新从头开始执行后几条指令,这将浪费约 15-30 个时钟周期的时间
cmove则不会有这个问题,它的下一条指令是预先就确定的

Consider the following general form of conditional expression and assignment:
v = test-expr ? then-expr : else-expr;

The standard way to compile this expression using conditional control transfer would have the following form:

Based on a conditional move, this can be described by the following abstract code:

可以发现,如果 then-expr或者 else-expr耗时太长(比猜测失败浪费的时间多),第二种方法反而会更慢(不过这只是少数情况)
还有,如果其中一个分支会有 side-effect 或者 error condition,也不能用第二种方法,比如: return (xp ? *xp : 0);

3.6.7 Loops

Do-While Loops
The general form of a do-while statement is as follows:

This general form can be translated into conditionals and goto statement as follows:

While Loops
The general form of a do-while statement is as follows:

The first translation method, which we refer to as jump to middle:

The second translation method, which we refer to as guarded do:

Using this implementation strategy, the compiler can often optimize the initial test, for example, determining that the test condition will always hold.

For Loops
The general form of a for loop is as follows:

The behavior of such a loop is identical to the following code using a while loop:

The code generated by GCC for a for loop then follows one of our two translation strategies for while loops, depending on the optimization level.

3.6.8 Switch Statements

A switch statement not only make the C code more readable, but also allow an efficient implementation using a data structure called a jump table.
A jump table is an array where entry i is the address of a code segment implementing the action the program should take when the switch index equals i.
In the assembly code, the jump table is indicated by the following declarations, to which we have added comment:

The following example has a number of interesting features, including case labels that do not span a contiguous range, case with multiple label, and cases that fall through to other cases.

Recall that the operator ‘&’ creates a pointer for a data value.
In making this extension, the authors of GCC created a new operator && to create a pointer for a code location.
The computed goto is supported by GCC as an extension to the C language.

The compiler first shifts the range to between 0 and 6 by subtracting 100 from n.
It further simplifies the branching possibilities by treating index as an unsigned value, making use of the fact that negative numbers in a two’s-complement representation map to large positive numbers in an unsigned representation.
It can therefore test whether index is outside of the range 0-6 by testing whether it is greater than 6.


3.7 Procedures

For discussion purposes, suppose procedure P calls procedure Q, and Q then executes and return back to P.
These actions involve one or more of the following mechanisms:

  • Passing control.
    The program counter must be set to the starting address of the code for Q upon entry and then set to the instruction in P following the call to Q upon return.
  • Passing data.
    P must be able to provide one or more parameters to Q, and Q must be able to return a value back to P.
    Allocating and deallocating memory.
    Q must need to allocate space for local variables when it begins and then free that storage before it returns.

3.7.1 The Run-Time Stack

CSAPP Figure 3.25 General stack frame structure
Figure 3.25: General stack frame structure: The stack can be used for passing arguments, for storing return information, for saving registers, and for local storage. Portions may be omitted when not needed.

We consider the return address to be part of P’s stack frame, since it holds state relevant to P.
x86-64 procedures allocate only portions of stack frames they require.
For example, many procedures have six or fewer argument, and so all of their parameters can be passed in registers.
Indeed, many functions do not even require a stack frame. (When all of the local variables can be held in registers and the function does not call any other functions.)

3.7.2 Control Transfer

The added suffix ‘q’ in callq and retq simply emphasizes that these are x86-64 versions of call and return instructions, not IA32. In x86-64 assembly code, both versions (with or without ‘q’) can be used interchangeably.
Like jumps, a call can be either direct or indirect. In assembly code, the target of a direct call is given as a label, while the target of an indirect call is given by ‘*’ followed by an operand specifier.

3.7.3 Data Transfer

When Q returns back to P, the code for P can access the returned value in register %rax
With x86-64, up to 6 integral arguments can be passed via registers.

 1st2rd3rd4th5th6th
64 bits%rdi%rsi%rdx%rcx%r8%r9
32 bits%edi%esi%edx%ecx%r8d%r9d
16 bits%di%si%dx%cx%r8w%r9w
8 bits%dil%sil%dl%cl%r8b%r9b

When a function has more than 6 integral arguments, the other ones are passed on the stack.
When passing parameters on the stack, all data sizes are rounded up to be multiples of eight.

3.7.4 Local Storage on the Stack

At times,however, local data must be stored in memory:

  • There are not enough registers to hold all of the local data.
  • The address operator ‘&’ is applied to a local variable, and hence we must be able to generate an address for it.
  • Some of the local variables are arrays or structures and hence must be accessed by array or structure references.

Typically, a procedure allocates space on the stack frame by decrementing the stack pointer.

3.7.5 Local Storage in Registers

By convention, registers %rbx, %rbp, and %r12%r15 are classified as callee-saved registers.
When procedure P calls procedure Q, Q must preserve the values of these registers.
Procedure Q can preserve a register value by either not changing ti at all or by pushing the original value on the stack, altering it, and then popping the old value from the stack before returning.

3.7.6 Recursive Procedures

The conventions we have described for using the registers and the stack allow x86-64 procedures to call themselves recursively just like any other function call.


3.8 Array Allocation and Access

3.8.1 Basic Principles

3.8.2 Pointer Arithmetic

C allows arithmetic on pointers, where the computed value is scaled according to the size of the data type referenced by the pointer.
Suppose the starting address of integer array E and integer index x are stored in register %rdx and %rcx, resprectively.

ExpressionTypeValueAssembly code
Eint *\(X_E\)movq %rdx, %rax
E[0]int\(M[X_E]\)movl (%rdx), %eax
E[i]int\(M[X_E + 4i]\)movl (%rdx, %rcx, 4), %eax
&E[2]int *\(X_E + 8\)leaq 8(%rdx), %rax
E+i-1int *\(X_E + 4i - 4\)leaq -4(%rdx, %rcx, 4), %rax
*(E+i-3)int\(M[X_E + 4i - 12]\)movl -12(%rdx, %rcx, 4), %eax
&E[i]-Elongimovq %rcx, %rax

3.8.3 Nested Arrays

Array element D[i][j] is at memory address \(\&D[i][j] = x_D + L(C \times i + j)\), where L is the size of data type T in bytes.

3.8.4 Fixed-Size Arrays

The C compiler is able to make many optimizations for code operating on multi-dimensional arrays of fixed size.

3.8.5 Variable-Size Arrays

Historically, programmers requiring variable-size arrays had to allocate storage for these arrays using function such as malloc or calloc, and they had to explicitly encode the mapping of multidimensional arrays int single-dimension ones via row-major indexing, as expressed in 3.8.3
ISO C99 introduced the capability of having array dimension expressions that are computed as the array is being allocated.
So, for example, we can write a function to access element i,j of an n*n array as follows:

The parameter n must precede the parameter A[n][n], so that the function can compute the array dimensions as the parameter is encountered.
Gcc generates code for this referencing function as

We see that referencing variable-size arrays requires only a slight generalization over fixed-size ones. The dynamic version must use a multiplication instruction to scale i by n (which may incur significant performance penalty), rather than a series of shifts and adds.
When variable-size arrays are referenced within a loop, the compiler can often optimize the index computations by exploiting the regularity of the access patterns (like the codes in 3.8.4).


3.9 Heterogeneous Data Structures

C provides two mechanisms for creating data types by combining objects of different types: structures, declared using keyword struct, aggregate multiple objects into a single unit: unions, declared using keyword union, allow an object to be referenced using several different types.

3.9.1 Structures

To access the fields of a structure, the compiler generates code that adds the appropriate offset to the address of the structure. The selection of the different fields of a structure is handled completely at compile time. The machine code contains no information about the field declarations or the names of the fields.

3.9.2 Unions

Unions provide a way to circumvent the type system of C (at the same time, bypass the safety provided by the C type system).
Unions can also be used to access the bit patterns of different data types.

3.9.3 Data alignment

Many computer systems place restrictions on the allowable addresses for the primitive data types, requiring that the address for some objects must be a multiple of K (typically 2, 4, or 8). Such alignment restrictions simplify the design of the hardware forming the interface between the processor and the memory system.
For example, suppose a processor always fetches 8 bytes from memory with an address that must be a multiple of 8. If we can guarantee that any double will be aligned to have its address be a multiple of 8, then the value can be read or written with a single memory operation. Otherwise, we may need two.
The x86-64 hardware will work correctly regardless of the alignment of data. However, Intel recommends that data be aligned to improve memory system performance.
Their alignment rule is based on the principle that any primitive object of K bytes must have an address that is multiple of K.
The compiler places directives in the assembly code indicating the desired alignment for global data. For examble, .align 8 ensures that the data following it will start with an address that is a multiple of 8.
For code involving structures, the compiler may need to insert gaps in the field allocation to ensure that each structure element satisfies its alignment requirement. In additions, the compiler may need to add padding to the end of the structure.

The elements of d will have address x, x + 12, x + 24, x + 36. As long as x is a multiple of 4, all of the alignment restrictions will be satisfied.


3.10 Combining Control and Data in Machine-Level Programs

In this section, we look at ways in which data and control interact with each other.

3.10.1 Understanding Pointers

  • Every pointers has an associated type.
    Pointer types are not part of machine code; they are an abstraction provided by C to help programmers avoid addressing errors.
  • Every pointer has a value.
    The value is an address of some object of the designated type.
  • Pointers are created with the ‘&’ operator.
  • Pointers are dereferenced with the ‘*’ operator.
  • Arrays and pointers are closely related.
  • Casting from one type of pointer to another changes its type but not its value.
  • Pointers can also point to functions.
    The value of a function pointer is the address of the first instruction in the machine-code representation of the function.

3.10.2 Life in the Real World: Using the GDB Debugger

3.10.3 Out-of-Bounds Memory References and Buffer Overflow

We have seen that C does not perform any bounds checking for array references, and that local variables are stored on the stack along with state information such as saved register values and return addresses. This combination can lead to serious program errors, where the state stored on the stack gets corrupted by a write to an out-of-bounds array elements. When the program then tries to reload the register or execute a ret instruction with this corrupted state, things can go seriously wrong.
Typically, the program is fed with a string that contains the byte encoding of some executable code, called the exploit code, plus some extra bytes that overwrite the return address with a pointer to the exploit code. The effect of executing the ret instruction is then to jump to the exploit code.

CSAPP-Figure-3.40
Stack organization for echo function.character array buf is just below part of the saved state. An out-of-bounds write to buf can corrupt the program state.

3.10.4 Thwarting Buffer Overflow Attacks

Stack Randomization
In order to insert exploit code into a system, the attacker needs to inject both the code as well as a pointer to this code as part of the attack string. Generating this pointer requires knowing the stack address where the string will be located.
The idea of stack randomization is to make the position of the stack vary from one run of a program to another.
This is implemented by allocating a random amount of space between 0 and n bytes on the stack at the start of a program, for example, by using the allocation function alloca, which allocates space for a specified number of bytes on the stack. This allocated space is not used by the program, but it causes all subsequent stack locations to vary from one execution of a program to another. The allocation range n needs to be large enough to get sufficient variations in the stack address, yet small enough that it does not waste too much space in the program.
Stack randomization is one of a larger class of techniques known as address-space layout randomization, of ASLR. With ASLR, different parts of the program, including program code, library code, stack, global variables, and heap data, are loaded into different regions of many each time a programs is run.
However, the attacker can overcome randomization by brute force. A common trick is to include a long sequence of nop instructions (called nop sled) before the actual exploit code. As long as the attacker can guess an address somewhere within this sequence, the program will run through the sequence and then hit the exploit code. If we set up a 256-byte nop sled, then the randomization over \(n=2^{23}\) can be cracked by enumerating \(2^{15}\) starting address, which is feasible.

Stack Corruption Detection

CSAPP-Figure-3.42
Stack organization for echo function with stack protector enabled. A special “canary” value is positioned between array buf and the saved state. The code checks the canary value to determine whether or not the stack state has been corrupted.

The canary value, also referred as a guard value, is generated randomly each time the program runs.
Recent versions of GCC try to determine whether a function is vulnerable to a stack overflow and insert this type of overflow detection automatically.

Limiting Executable Code Regions
In typical programs, only the portion of memory holding the code generated by the compiler need be executable. The other portions can be restricted to allow just reading and writing.
Historically, the x86 architecture merged the read and execute access controls into a single 1-bit flag, so that any page marked as readable was also executable.
More recently, AMD introduced an NX (for “no-execute”) bit into memory protection for its 64-bit processors, separating the read and execute access modes, and Intel followed suit.
The checking of whether a page is executable is performed in hardware, with no penalty in efficiently.
However, some types of programs require the ability to dynamically generate and execute code. For example, just-in-time compilation techniques dynamically generate code for programs written in interpreted languages, such as Java, to improve execution performance.

3.10.5 Supporting Variable-Size Stack Frames

We have examined the machine-level code for a variety of functions so far, but they all have the property that the compiler can determine in advance the amount of space that must be allocated for their stack frames.
Some functions, however, require a variable amount of local storage. This can occur, for example, when the function calls alloca, a standard library function that can allocate an arbitrary number of bytes of storage on the stack. It can also occur when the code declares a local array of variable size.
To manage a variable-size stack frame, x86-64 code use register %rbp to serve as a frame pointer (sometimes referred as a base pointer). In the earlier versions of x86 code, the frame pointer was used with every function call. With x86-64 code, it is used only in cases where the stack frame may be of variable size. Historically, most compilers used frame pointers when generating IA32 code. Recent versions of GCC have dropped this convention. Observe that it is acceptable to mix code that uses frame pointers with code that does not, as long as all functions treat %rbp as a callee-saved register.

CSAPP-Figure-3.44
Stack frame structure for function vframe.

3.11 Floating-Point Code

x86-64 floating point is based on SSE of AVX, including conventions for passing procedure arguments and return values.
The AVX floating-point architecture allows data to be stored in 16 YMM registers, named %ymm0-%ymm15. Each YMM register is 256 bits long.
When operating on scalar data, these registers only hold floating-point data, and only the low-order 32 bits (for float) or 64 bit (for double) are used. The assembly code refers to the registers by their SSE XMM register names %xmm0-%xmm15, where each XMM register is the low-order 128 bits of the corresponding YMM register.
The scalar instructions operate on individual data values, while some instructions operate on packed data, meaning that they update the entire destination XMM register. Once again, our only interest for scalar data is the effect these instructions have on the low-order 4 or 8 bytes of the destination.
(由于浮点数相关的指令名太丑,下面将忽略)

3.11.1 Floating-Point Movement and Conversion Operations.

GCC uses the scalar movement operations only to transfer data from memory to an XMM register or from an XMM register to memory.
For transferring data between two XMM registers, it uses one of two different instructions for copying the entire contents of on XMM register to another.
The sets of instructions for conversion are all scalar instructions.
When converting floating-point values to integers, they perform truncation, rounding values toward zero.

3.11.2 Floating-Point Code in Procedures

  • Up to eight floating-pointing arguments can be passed in XMM registers %xmm0-%xmm7.
    These registers are used in the order the argument are listed from left to right.
    Additional floating-point argument can be passed on the stack.
  • A function that return a floating-point value does so in register %xmm0.
  • All XMM registers are caller saved.
  • When a function contains a combination of pointer, integer, and floating-point arguments, the pointers and integers are passed in general-purpose register, while the floating-point values are passed in XMM registers.

3.11.3 Floating-Point Arithmetic Operations

The second source operand and the destination operands must be XMM register.

3.11.4 Defining and Using Floating-Point Constants

AVX floating-point operations cannot have immediate values as operands. In instead, the compiler must allocate and initialize storage for any constant values. The code then reads the values from memory.

3.11.5 Using Bitwise Operations in Floating-Point Code

These operations all act on packed data, applying the bitwise operation to all the data in the two source registers.

3.11.6 Floating-Point Comparison Operations

For floating-point comparisons, the parity flag PF is set when either operand is NaN. By convention, any comparison in C is considered to fail when one of the arguments is NaN, and this flag is used to detect such a condition. For example, even the comparison x == x yields 0 when x is NaN.
Instructions such as jp, ja, and jb are used to conditionally jump on various combinations of flags.

3.11.7 Observations about Floating-Point Code

AVX2 has the potential to make computations run faster by performing parallel operations on packed data. Compiler developers are working on automating the conversion of scalar code to parallel code, but currently the most reliable way to achieve higher performance through parallelism is to use the extensions to the C language supported by GCC for manipulating vectors of data. See Web Aside OPT:SIMD on page 546 to see how this can be done.


3.12 Summary

In this chapter, we have peered beneath the layer of abstraction provided by the C language to get a view of machine-level programming.
By having the compiler generate an assembly-code representation of the machine-level program, we gain insights into both the compiler and its optimization capabilities, along with the machine, its data types, and its instruction set (will be useful in Chapter 5).
We have also gotten a more complete picture of how the program stores data in different memory regions (will be useful in Chapter 12).
We have only examined the mapping of C onto x86-64, but much of what we have covered is handled in a similar way for other combinations of language and machine.