从头开始编写 64 位模式的 BIOS 引导加载程序

从头开始编写 64 位模式的 BIOS 引导加载程序
Writing a BIOS bootloader for 64-bit mode from scratch

原始链接: https://thasso.xyz/2024/07/13/setting-up-an-x86-cpu.html

作者解释了为 X86\_64 CPU 创建最小 64 位引导加载程序所采取的步骤。它们从由 BIOS 加载的引导扇区开始，该引导扇区以 16 位实模式初始化处理器，并仅提供 512 字节可供使用。该引导扇区负责引导引导加载程序的其他部分。在 16 位实模式下，引导扇区利用分段来访问 20 位地址空间，允许寻址高达 1 MB 的内存。该过程涉及使用多种工具，包括“Intel 64 和 IA-32 架构软件开发人员手册”、汇编器（在本例中为“nasm”）和 QEMU。他们解释说，复位后CPU进入“实模式”，其中默认操作数大小为16位。然后，重新启动时执行的第一个代码位于 BIOS 提供的引导扇区中，该引导扇区基于第一个扇区末尾的幻数模式 (0xaa55) 进行定位。一旦加载到地址 0x7c00 的内存中，该引导扇区就可以为继续创建引导加载程序的其余部分提供必要的基础。基础知识就位后，他们概述了从 16 位实模式过渡到 64 位长模式的下一个目标。目前，开发过程分为多个阶段： * 第 1 阶段：引导扇区（计算机启动时 BIOS 执行的初始代码）。其主要目标是加载引导加载程序的第二阶段。 * 第 2 阶段：从 16 位实模式切换到 32 位保护模式的代码。 + 这使他们不再依赖 BIOS 功能。 + 作者利用了一系列磁盘读取，但出于安全考虑，他们警告说，如果不采取适当的预防措施，这样做可能会导致问题。本文的下一步重点是通过 BIOS 执行磁盘操作，特别是扩展在 osdev.org 上找到的有关该主题的信息。最后，本文最后提供了一些有关进一步进入引导过程的附加细节，包括加载“阶段 2”以供稍后跳转，以及

所提供的文本讨论了一种提议的新文件系统，旨在简化引导加载程序中的编码并降低系统复杂性。系统使用扇区号来识别文件，而不是传统的文件名。每个文件在主文件表（MFT）中都有一个称为“inode”的索引号。作者建议添加一个标志来指示文件是否永远不应分段，并为有关卷的元数据保留前 23 个 inode 编号。该系统与当前主流系统之间的主要区别是： 1. 文件系统内的直接链接不需要文件名，而是使用扇区号。 2. 可能使用标志来确保文件保持不碎片状态。 3. 包含仅供人类使用的文件名选项。 4. 可选择使用唯一数字标识符来实现更高级别的功能。关于该计划提出的问题是，除了不符合主流做法之外，专家是否预见到还会出现任何问题。提出的其他要点包括文件系统支持的最大文件大小和文件数量方面的潜在限制。作者建议，起始操作系统的极其简单的设计将由 (sector_count, start_sector, fname_len, fname, 0) 组成，而更复杂的设计可能需要检查其他文件系统，例如 EXT2、FAT32、NTFS 等。其他评论提到了错误处理、系统稳定性、磁盘可靠性和用户错误的重要性。比较了因过大的文件、分区或大量文件而面临挑战的数据中心。关于最大文件名长度、文件大小和每个文件系统的文件数量的限制提出的问题被认为是确定必要的实现细节的潜在影响因素。其他观察涉及文件系统使用的历史示例，包括为关键系统组件（例如引导加载程序、操作系统内核或基本数据库）指定的连续文件。一些文件系统要求特定系统文件不分段，尽管缺乏元数据来表示它们。最后，讨论了用于引导现代硬盘驱动器的 MBR（主引导记录）和 GPT（GUID 分区表）之间的差异，以及 UEFI（统一可扩展固件接口）与 BIOS（基本输入输出系统）相比所提供的优势。后者包括降低复杂性和

原文

People say there are things that are complex and there are things that are just complicated. Complexity is considered interesting, complicatedness is considered harmful. The process of setting up an x86_64 CPU is mostly complicated.

I’ll describe one way to go from a boot sector loaded by the BIOS with the CPU in 16-bit real mode to the CPU set up in 64-bit long mode. The setup is pretty bare-bones and there’s tons more to do.

To follow along, you need the Intel 64 and IA-32 Architectures Software Developer’s Manual, an assembler (I used nasm), and QEMU. If you don’t have an x86_64 CPU, you should still be able to run everything I describe by emulating an x86 CPU in QEMU. I assume you know x86 assembly and the syntax that nasm uses. I like the nasm tutorial by Ray Toal for getting started.

I was surprised by how readable some of the Intel manual is. The initial chapters in volume 1 do a really good job at providing an overview of the system and explaining the terms used throughout the other volumes. But volume 3: System Programming Guide is most relevant to this discussion. There is an overview of all the operating modes in volume 3, section 2.2 Modes of Operation. The path we’re taking is highlighted in red.

Overview of x86 mode transitions

For everything up to 32-bit mode, take a look at “Writing a Simple Operating System – from Scratch”. It’s unfinished but still very good.

After a reset, the x86 CPU is in “real mode”. That mode has a default operand size of 16 bits. You get a 20-bit address space and thus the ability to address 1MB of memory by using segmentation. Real mode is pretty much a backward compatibility mode for the Intel 8086 chip from 1978.

After the BIOS the first code that runs is that in the boot sector. The BIOS searches the system for a disk where the first sector ends in the magic number 0xaa55 (i.e., the byte 0x55 followed by the byte 0xaa). It loads that “boot sector” to memory at address 0x7c00.

So the BIOS gives us 512 bytes to work with. We need to use these bytes in order to bootstrap the rest of the bootloader. One can fit a surprising amount of stuff in 512 bytes, but it’s easiest to just load some more data from disk first. Fortunately, routines defined by the BIOS remain available to us as long as we’re in real mode.

Let’s set up a simple boot sector. It will just print a message to the screen using BIOS routines and then hang. This way, we know that the tooling works.

This is the assembly we need:

;; src/boot_sector.s

    section .boot_sector
    global __start

    [bits 16]

__start:
    mov bx, hello_msg
    call print_string

end:
    hlt
    jmp end

;; Uses the BIOS to print a null-termianted string. The address of the
;; string is found in the bx register.
print_string:
    pusha
    mov ah, 0x0e ; BIOS "display character" function

print_string_loop:
    cmp byte [bx], 0
    je print_string_return

    mov al, [bx]
    int 0x10 ; BIOS video services

    inc bx
    jmp print_string_loop

print_string_return:
    popa
    ret

hello_msg: db "Hello, world!", 0

Plus this Makefile:

# Makefile

.PHONY: all clean boot

NASM := nasm -f elf64

BUILD_DIR := build
SRC_DIR := src

SRC := $(wildcard $(SRC_DIR)/*.s)
OBJS := $(patsubst $(SRC_DIR)/%.s, $(BUILD_DIR)/%.o, $(SRC))
BOOT_IMAGE := $(BUILD_DIR)/boot_image

all: $(BOOT_IMAGE)

boot: $(BOOT_IMAGE)
	qemu-system-x86_64 -no-reboot -drive file=$<,format=raw,index=0,media=disk

$(BOOT_IMAGE): $(BUILD_DIR)/linked.o
	objcopy -O binary $< $@

$(BUILD_DIR)/linked.o: $(OBJS)
	ld -T linker.ld -o $@ $^

$(BUILD_DIR)/%.o: $(SRC_DIR)/%.s
	@mkdir -p $(dir $@)
	$(NASM) $< -o $@

clean:
	$(RM) -r $(BUILD_DIR)

The linker script linker.ld is important because it makes sure that the code in our boot sector is relocated to the right address in the final image. Specifically, the bootloader loads the boot sector to address 0x7c00 in memory. So that’s the base address to relocate the boot sector to. In addition, the linker will add the magic number at the end of the boot sector. Other guides I’ve seen do both the offset and the magic number inside the boot sector assembly source file by using features of the assembler, but that’s somewhat hackish.

# linker.ld

MEMORY
{
    boot_sector (rwx) : ORIGIN = 0x7c00, LENGTH = 512
}

ENTRY(__start)
SECTIONS
{
    .boot_sector : { *(.boot_sector); } > boot_sector
    .bootsign (0x7c00 + 510) :
    {
        BYTE(0x55)
        BYTE(0xaa)
    } > boot_sector
}

Running make boot should result in a QEMU window and the “Hello, World!” message should be displayed.

QEMU screenshot of the hello world message

We can split the bootloader into two stages. Stage 1 is the code in the boot sector. It is everything that the BIOS loads for us. The sole purpose of stage 1 is to load stage 2 into memory. Stage 1 does this by using BIOS-provided routines to load stage 2 into memory.

In stage 2, we’ll switch from 16-bit real mode to 32-bit protected mode. In protected mode, we can’t use BIOS routines anymore. Without BIOS routines, loading sectors from a disk would become much more involved. So we’ll load a number of sectors from disk into memory and hope for the best. Of course, this is an unsafe technique, but it works for now.

This is how one can access the disk using BIOS. There’s an osdev.org page on this.

;; src/boot_sector.s

;; ...

__start:
    ;; ...

    mov si, disk_address_packet
    mov ah, 0x42 ; BIOS "extended read" function
    mov dl, 0x80 ; Drive number
    int 0x13 ; BIOS disk services
    jc error_reading_disk

ignore_disk_read_error:
    SND_STAGE_ADDR equ (BOOT_LOAD_ADDR + SECTOR_SIZE)
    jmp 0:SND_STAGE_ADDR

error_reading_disk:
    ;; We accept reading fewer sectors than requested
    cmp word [dap_sectors_num], READ_SECTORS_NUM
    jle ignore_disk_read_error

    mov bx, error_reading_disk_msg
    call print_string

    end:
    ;; ...

And at the end of boot_sector.s put this data:

;; src/boot_sector.s

;; ...

    align 4
disk_address_packet:
    db 0x10 ; Size of packet
    db 0 ; Reserved, always 0
dap_sectors_num:
    dw READ_SECTORS_NUM ; Number of sectors read
    dd (BOOT_LOAD_ADDR + SECTOR_SIZE) ; Destination address
    dq 1 ; Sector to start at (0 is the boot sector)

READ_SECTORS_NUM equ 64
BOOT_LOAD_ADDR equ 0x7c00
SECTOR_SIZE equ 512

hello_msg: db "Hello, world!", 13, 10, 0
error_reading_disk_msg: db "Error: failed to read disk with 0x13/ah=0x42", 13, 10, 0

Lastly we need a stage 2 to jump to and we need to update the linker script. The Makefile remains unchanged.

;; src/stage2.s

    section .stage2

    [bits 16]

    mov bx, stage2_msg
    call print_string

end:
    hlt
    jmp end

    print_string:
        ;; ...

stage2_msg: db "Hello from stage 2", 13, 10, 0

I just copied the print_string function so we can test if the jump works. Because this specific function only works with BIOS in real mode, it won’t be of any use to stage 2 once we have switched to protected mode.

Finally the linker script:

# linker.ld

MEMORY
{
    boot_sector (rwx) : ORIGIN = 0x7c00, LENGTH = 512
    stage2 (rwx) : ORIGIN = 0x7e00, LENGTH = 32768 # 512 * 64
}

ENTRY(__start)
SECTIONS
{
    .boot_sector : { *(.boot_sector); } > boot_sector
    .bootsign (0x7c00 + 510) :
    {
        BYTE(0x55)
        BYTE(0xaa)
    } > boot_sector
    .stage2 : { *(.stage2); } > stage2
}

QEMU screenshot of the stage 2 message

Next, we’ll switch the CPU from real mode (16-bit) to protected mode (32-bit). In protected mode, segmentation is used by default to implement memory protection. Before switching to protected mode, you need to define a Global Descriptor Table (GDT) that contains segment descriptors for all the segments you want to define. Usually, paging is used in favor of segmentation. In fact, in 64-bit long mode, you need to use paging. But for the initial switch to protected mode, segmentation is required.

The Intel manual describes the “flat model” as a very simple segmentation model that can be implemented in the GDT. The “flat model” comprises a code segment and a data segment. Both of these segments are mapped to the entire linear address space (their base addresses and limits are identical). Using the simplest of all models is fine, since we just want to get to long mode and abandon segmentation in favor of paging.

The GDT is defined as a contiguous structure in memory. You fill a chunk of memory with the right data and give the CPU the address and the length of the memory chunk. The format of the GDT structure is described in the Intel manual.

From section “3.4.5 Segment Descriptors”:

Screenshot of segment descriptor layout

The GDT is just an array of segment descriptors with a “null descriptor” at the start that’s used to catch invalid translations. The fields in the segment descriptor are described in detail in section “3.4.5 Segment Descriptors” of volume 3 of the Intel manual.

We define the GDT like this:

;; include/gdt32.s

    ;; Base address of GDT should be aligned on an eight-byte boundary
    align 8

gdt32_start:
    ;; 8-byte null descriptor (index 0).
    ;; Used to catch translations with a null selector.
    dd 0x0
    dd 0x0

gdt32_code_segment:
    ;; 8-byte code segment descriptor (index 1).
    ;; First 16 bits of segment limit
    dw 0xffff
    ;; First 24 bits of segment base address
    dw 0x0000
    db 0x00
    ;; 0-3: segment type that specifies an execute/read code segment
    ;;   4: descriptor type flag indicating that this is a code/data segment
    ;; 5-6: Descriptor privilege level 0 (most privileged)
    ;;   7: Segment present flag set indicating that the segment is present
    db 10011010b
    ;; 0-3: last 4 bits of segment limit
    ;;   4: unused (available for use by system software)
    ;;   5: 64-bit code segment flag indicates that the segment doesn't contain 64-bit code
    ;;   6: default operation size of 32 bits
    ;;   7: granularity of 4 kilobyte units
    db 11001111b
    ;; Last 8 bits of segment base address
    db 0x00

gdt32_data_segment:
    ;; Only differences are explained ...
    dw 0xffff
    dw 0x0000
    db 0x00
    ;; 0-3: segment type that specifies a read/write data segment
    db 10010010b
    db 11001111b
    dw 0x00

gdt32_end:

;; Value for GDTR register that describes the above GDT
gdt32_pseudo_descriptor:
    ;; A limit value of 0 results in one valid byte. So, the limit value of our
    ;; GDT is its length in bytes minus 1.
    dw gdt32_end - gdt32_start - 1
    ;; Start address of the GDT
    dd gdt32_start

CODE_SEG32 equ gdt32_code_segment - gdt32_start
DATA_SEG32 equ gdt32_data_segment - gdt32_start

Switching to protected mode is very easy now. We load the GDT pseudo-descriptor into the GDTR register so that the base address and length of our GDT are known to the system. Lastly, we do a far jump to flush the instruction pipeline.

;; src/stage2.s

    section .stage2

    [bits 16]

;; ...

    ;; Load GDT and switch to protected mode

    cli ; Can't have interrupts during the switch
    lgdt [gdt32_pseudo_descriptor]

    ;; Setting cr0.PE (bit 0) enables protected mode
    mov eax, cr0
    or eax, 1
    mov cr0, eax

    ;; The far jump into the code segment from the new GDT flushes
    ;; the CPU pipeline removing any 16-bit decoded instructions
    ;; and updates the cs register with the new code segment.
    jmp CODE_SEG32:start_prot_mode


    [bits 32]
start_prot_mode:
    ;; Old segments are now meaningless
    mov ax, DATA_SEG32
    mov ds, ax
    mov ss, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

;; ...

%include "include/gdt32.s"

Interrupts are disabled during the switch. After the entire setup is complete, interrupts can be enabled again. This would require extra setup work.

Now that we’re in protected mode, we can’t use the BIOS routines anymore. To print text, we can write straight to the VGA buffer instead.

;; src/stage2.s

;; ...

;; Writes a null-terminated string straight to the VGA buffer.
;; The address of the string is found in the bx register.
print_string32:
    pusha

    VGA_BUF equ 0xb8000
    WB_COLOR equ 0xf

    mov edx, VGA_BUF

print_string32_loop:
    cmp byte [ebx], 0
    je print_string32_return

    mov al, [ebx]
    mov ah, WB_COLOR
    mov [edx], ax

    add ebx, 1              ; Next character
    add edx, 2              ; Next VGA buffer cell
    jmp print_string32_loop

print_string32_return:
    popa
    ret

Best print something so that we know the switch worked. To do that, add a string with the message and a call to print_string32 to the code. The print_string32 function is super basic, so the message always shows up in the top left corner of the display.

QEMU screenshot of the protected mode message

For this part, refer to “10.8.5 Initializing IA-32e Mode”. Note that Intel calls the 64-bit mode “IA-32e” while AMD refers to it as “long mode” in the AMD64 manual.

Before switching to long mode, the CPU must be in protected mode and paging must be enabled. We have protected mode now, but we are missing paging.

I love paging. It’s just very cool. But I’d do a poor job at explaining the concept itself. Philipp Oppermann’s Introduction to Paging from the “Writing an OS in Rust” blog was really useful for me personally. OSTEP also talks about paging starting chapter 18, although it doesn’t go into the specifics of paging on x86 like Philipp Oppermann’s post does.

In long mode with Physical Address Extension enabled (PAE, we’ll do that below ), a four level page table is used. The below code generates such a page table at a given address.

;; src/stage2.s

;; Builds a 4 level page table starting at the address that's passed in ebx.
build_page_table:
    pusha

    PAGE64_PAGE_SIZE equ 0x1000
    PAGE64_TAB_SIZE equ 0x1000
    PAGE64_TAB_ENT_NUM equ 512

    ;; Initialize all four tables to 0. If the present flag is cleared, all other bits in any
    ;; entry are ignored. So by filling all entries with zeros, they are all "not present".
    ;; Each repetition zeros four bytes at once. That's why a number of repetitions equal to
    ;; the size of a single page table is enough to zero all four tables.
    mov ecx, PAGE64_TAB_SIZE ; ecx stores the number of repetitions
    mov edi, ebx             ; edi stores the base address
    xor eax, eax             ; eax stores the value
    rep stosd

    ;; Link first entry in PML4 table to the PDP table
    mov edi, ebx
    lea eax, [edi + (PAGE64_TAB_SIZE | 11b)] ; Set read/write and present flags
    mov dword [edi], eax

    ;; Link first entry in PDP table to the PD table
    add edi, PAGE64_TAB_SIZE
    add eax, PAGE64_TAB_SIZE
    mov dword [edi], eax

    ;; Link the first entry in the PD table to the page table
    add edi, PAGE64_TAB_SIZE
    add eax, PAGE64_TAB_SIZE
    mov dword [edi], eax

    ;; Initialize only a single page on the lowest (page table) layer in
    ;; the four level page table.
    add edi, PAGE64_TAB_SIZE
    mov ebx, 11b
    mov ecx, PAGE64_TAB_ENT_NUM
set_page_table_entry:
    mov dword [edi], ebx
    add ebx, PAGE64_PAGE_SIZE
    add edi, 8
    loop set_page_table_entry

    popa
    ret

Paging supersedes segmentation for managing virtual address spaces, permissions, etc. A Global Descriptor Table with segment descriptors is still needed though, and the segment descriptors must be modified slightly to enable long mode-specific features.

This is another GDT that also implements the flat model. It’s almost identical to the GDT for protected mode. Just two bits were changed.

;; include/gdt64.s

    align 16
gdt64_start:
    ;; 8-byte null descriptor (index 0).
    dd 0x0
    dd 0x0

gdt64_code_segment:
    dw 0xffff
    dw 0x0000
    db 0x00
    db 10011010b
    ;;   5: 64-bit code segment flag indicates that this segment contains 64-bit code
    ;;   6: must be zero if L bit (bit 5) is set
    db 10101111b
    db 0x00

gdt64_data_segment:
    dw 0xffff
    dw 0x0000
    db 0x00
    ;; 0-3: segment type that specifies a read/write data segment
    db 10010010b
    db 10101111b
    dw 0x00

gdt64_end:

gdt64_pseudo_descriptor:
    dw gdt64_end - gdt64_start - 1
    dd gdt64_start

CODE_SEG64 equ gdt64_code_segment - gdt64_start
DATA_SEG64 equ gdt64_data_segment - gdt64_start

With the page table and the GDT in place, the switch from protected mode to long mode can be performed.

;; src/stage2.s

;; ...

start_prot_mode:
    ;; ...

    ;; Build 4 level page table and switch to long mode
    mov ebx, 0x1000
    call build_page_table
    mov cr3, ebx            ; MMU finds the PML4 table in cr3

    ;; Enable Physical Address Extension (PAE). This is needed to allow the switch
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax

    ;; The EFER (Extended Feature Enable Register) MSR (Model-Specific Register) contains fields
    ;; related to IA-32e mode operation. Bit 8 if this MSR is the LME (long mode enable) flag
    ;; that enables IA-32e operation.
    mov ecx, 0xc0000080
    rdmsr
    or eax, 1 << 8
    wrmsr

    ;; Enable paging (PG flag in cr0, bit 31)
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax

    mov ebx, comp_mode_msg
    call print_string32

    ;; New GDT has the 64-bit segment flag set. This makes the CPU switch from
    ;; IA-32e compatibility mode to 64-bit mode.
    lgdt [gdt64_pseudo_descriptor]

    jmp CODE_SEG64:start_long_mode

    ;; ...

    [bits 64]

start_long_mode:
    htl
    jmp start_long_mode

    ;; ...

%include "include/gdt64.s"

    ;; ...

comp_mode_msg: db "Entered 64-bit compatibility mode", 0

Again, the “success message” should show up in the top left corner. Write a small VGA driver if this annoys you.

QEMU screenshot of the success message

C code can easily be intergrated into this setup. E.g, this might become an OS kernel.

/* src/kernel.c */

#define VGA_COLUMNS_NUM 80
#define VGA_ROWS_NUM 25

#define ARRAY_SIZE(arr) ((int)sizeof(arr) / (int)sizeof((arr)[0]))

void _start_kernel(void) {
	volatile char *vga_buf = (char *)0xb8000;
	const char msg[] = "Hello from C";
	int i;

	for (i = 0; i < VGA_COLUMNS_NUM * VGA_ROWS_NUM * 2; i++)
		vga_buf[i] = '\0';

	for (i = 0; i < ARRAY_SIZE(msg) - 1; i++) {
		vga_buf[i * 2] = msg[i];
		vga_buf[i * 2 + 1] = 0x07; /* White on black */
	}
}

Update src/stage2.s:

;; src/stage2.s

    ;; ...

    [bits 64]

start_long_mode:
    mov ebx, long_mode_msg
    call print_string64

    extern _start_kernel
    call _start_kernel

end64:
    hlt
    jmp end64

    ;; ...

The linker script:

# linker.ld

MEMORY
{
    boot_sector (rwx) : ORIGIN = 0x7c00, LENGTH = 512
    stage2 (rwx) : ORIGIN = 0x7e00, LENGTH = 512
    kernel (rwx) : ORIGIN = 0x8000, LENGTH = 0x10000
}

ENTRY(__start)
SECTIONS
{
    .boot_sector : { *(.boot_sector); } > boot_sector
    .bootsign (0x7c00 + 510) :
    {
        BYTE(0x55)
        BYTE(0xaa)
    } > boot_sector
    .stage2 : { *(.stage2); } > stage2
    .text : { *(.text); } > kernel
    .data : { *(.data); } > kernel
    .rodata : { *(.rodata); } > kernel
    .bss :
    {
        *(.bss)
        *(COMMON)
    } > kernel
}

Lastly, the Makefile needs to change. Here, I only included the lines that have changed.

# Makefile

# ...

CC := gcc
CFLAGS := -std=c99 -ffreestanding -m64 -mno-red-zone -fno-builtin -nostdinc -Wall -Wextra

# ...

SRC := $(wildcard $(SRC_DIR)/*)
OBJS := $(patsubst $(SRC_DIR)/%, $(BUILD_DIR)/%.o, $(SRC))

# ...

$(BUILD_DIR)/%.s.o: $(SRC_DIR)/%.s
	@mkdir -p $(dir $@)
	$(NASM) $< -o $@

$(BUILD_DIR)/%.c.o: $(SRC_DIR)/%.c
	@mkdir -p $(dir $@)
	$(CC) $(CFLAGS) -c $< -o $@

# ...

QEMU screenshot of the message printed by C code

Cool if you actually came along this far. The code is on GitHub.

从头开始编写 64 位模式的 BIOS 引导加载程序 Writing a BIOS bootloader for 64-bit mode from scratch

从头开始编写 64 位模式的 BIOS 引导加载程序
Writing a BIOS bootloader for 64-bit mode from scratch