XNU initialises a two-level kernel table structure to hold information on exclave resources discovered during boot. Each resource is exclusively of one resource type and holds information on a resource that exists in the secure world, or in both worlds.
The root_table identifies domains by name, with each domain referencing a second level table holding all the resources for that domain. Domains and their resources discovered so far include:
com.apple.kernel — this domain contains many resources used by the kernel including:
- com.apple.service.ConclaveLauncherControl — conclave launcher service
- com.apple.service.ConclaveLauncher_Debug — debug service
- com.apple.service.ExclaveIndicatorController — service for secure indicator lights
- com.apple.service.LogServer_XNUProxy — service for logging
- com.apple.service.FrameMint — service used to boot ExclaveKit
- com.apple.storage.backend — Shared memory buffer used by exclave services to do file IO from XNU space via upcalls (more details below)
- Conclave Manager x — One per conclave, used to control a conclave
- Conclave Manager y …
com.apple.darwin — No open-source components use this domain
com.apple.conclave.name — There is one domain per conclave.
- service_x
- service_y
- audio buffer
- shared memory buffer
- etc
com.apple.driver.name — One domain per device driver — existence of these domains is based on comments, not actually seen in open-sourced code. I suspect these are just per-driver conclaves.
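To make the two-level layout concrete, here is a minimal Python model of it. This is only a sketch of the shape described above, not XNU's data structure: the class names, the `kind` labels and the idea of keying resources by name alone are my own assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    kind: str  # hypothetical type label, e.g. "service" or "shared_memory"


@dataclass
class Domain:
    name: str
    resources: dict = field(default_factory=dict)  # resource name -> Resource


class RootTable:
    """First level: domain name -> Domain (the second-level table)."""

    def __init__(self):
        self.domains = {}

    def add(self, domain_name, resource):
        domain = self.domains.setdefault(domain_name, Domain(domain_name))
        domain.resources[resource.name] = resource

    def lookup(self, domain_name, resource_name, kind):
        domain = self.domains.get(domain_name)
        if domain is None:
            return None
        resource = domain.resources.get(resource_name)
        # A resource is of exactly one type, so a lookup for another type fails.
        if resource is None or resource.kind != kind:
            return None
        return resource


root_table = RootTable()
root_table.add("com.apple.kernel", Resource("com.apple.service.FrameMint", "service"))
root_table.add("com.apple.kernel", Resource("com.apple.storage.backend", "shared_memory"))
```

A lookup names the domain, the resource and the expected type, so asking for `com.apple.service.FrameMint` as anything other than a service returns nothing.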
A conclave is a type of resource that itself can contain multiple resources. However, it is much more than just a container of resources. Conclaves allow a group of services and other resources to have shared access to each other, and Mach tasks are limited in what (if any) conclaves they can call upon.
Each conclave has a Conclave Manager (another type of exclave resource), located in the kernel domain.
Conclaves have a lifecycle: their Conclave Manager is first attached to a Mach task, and the conclave is then launched. A conclave can also be stopped and detached. Transitional states such as launching and stopping exist between these points in the lifecycle.
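The lifecycle can be pictured as a small state machine. The sketch below is my own guess at it: the state names and the exact set of legal transitions are assumptions based only on the description above, not taken from the XNU source.

```python
from enum import Enum, auto


class ConclaveState(Enum):
    # Hypothetical state names
    NONE = auto()       # no task attached
    ATTACHED = auto()
    LAUNCHING = auto()  # transitional
    LAUNCHED = auto()
    STOPPING = auto()   # transitional
    STOPPED = auto()


# Assumed legal transitions: a simple chain from attach through to detach.
ALLOWED = {
    (ConclaveState.NONE, ConclaveState.ATTACHED),
    (ConclaveState.ATTACHED, ConclaveState.LAUNCHING),
    (ConclaveState.LAUNCHING, ConclaveState.LAUNCHED),
    (ConclaveState.LAUNCHED, ConclaveState.STOPPING),
    (ConclaveState.STOPPING, ConclaveState.STOPPED),
    (ConclaveState.STOPPED, ConclaveState.NONE),  # detach
}


class Conclave:
    def __init__(self):
        self.state = ConclaveState.NONE

    def transition(self, new_state):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
```

Modelling the transitions as an explicit set makes it easy to see that, for example, a launched conclave cannot be attached a second time.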
The XNU posix_spawn() function can call task_add_conclave() to attach a task and a conclave manager resource together. This is a 1:1 relationship — only one task can be attached to a conclave manager and vice versa. Only launchd and tasks with the com.apple.private.exclaves.conclave-spawn entitlement may spawn a conclave. The com.apple.private.exclaves.conclave-host entitlement is largely similar, but I believe only entitles a task to attach itself, rather than being able to spawn a new task for this purpose.
The kernel looks up the conclave manager resource associated with the targeted conclave in the com.apple.kernel domain. It then saves a tightbeam endpoint for the conclave manager in the conclave’s resource struct. All future control of the conclave is directed at this endpoint. Tightbeam appears to be an RPC framework for communication between exclave components.
Note this attachment is to a task — not a thread. Execution of services will be covered later.
Conclave manager tasks are not allowed to have kernel domain privileges.
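The attachment rules above (who may spawn, the lookup in the kernel domain, and the 1:1 relationship) can be sketched as follows. Only the entitlement string comes from the source; the classes, the `add_conclave` helper, the dictionary standing in for the com.apple.kernel domain and the placeholder endpoint string are all hypothetical.

```python
SPAWN_ENTITLEMENT = "com.apple.private.exclaves.conclave-spawn"


class Task:
    def __init__(self, name, entitlements=(), is_launchd=False):
        self.name = name
        self.entitlements = set(entitlements)
        self.is_launchd = is_launchd
        self.conclave = None  # at most one attached conclave manager


class ConclaveManager:
    def __init__(self, name):
        self.name = name
        self.task = None      # at most one attached task
        self.endpoint = None  # stands in for the saved tightbeam endpoint


def may_spawn_conclave(spawner):
    return spawner.is_launchd or SPAWN_ENTITLEMENT in spawner.entitlements


def add_conclave(spawner, task, kernel_domain, conclave_name):
    """Toy version of the attach step; returns True on success."""
    if not may_spawn_conclave(spawner):
        return False
    manager = kernel_domain.get(conclave_name)  # lookup in com.apple.kernel
    if manager is None:
        return False
    if manager.task is not None or task.conclave is not None:
        return False  # the relationship is strictly 1:1
    manager.task, task.conclave = task, manager
    manager.endpoint = f"endpoint:{conclave_name}"  # placeholder, not a real handle
    return True
```

A second attempt to attach a different task to the same manager fails, which is the 1:1 property in action.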
Once attached, a conclave may be launched. The launch attempt must be performed from the conclave manager task attached to the conclave. Attempts to launch conclaves also wait until exclaves have fully booted (into state EXCLAVES_BS_BOOTED_EXCLAVEKIT — more on this later).
A new Mach trap (i.e. system call) for exclave functionality has been added to XNU, and it ends up in the _exclaves_ctl_trap() function. The call is overloaded: an operation parameter selects what it does. The relevant operation to launch a conclave is EXCLAVES_CTL_OP_LAUNCH_CONCLAVE.
The launch operation calls a redacted function, conclave_launcher_conclavecontrol_launch(), and passes it the tightbeam connection to the conclave manager to perform the launch. I suspect this requests the initialisation of executable code and resources for the conclave within the secure world.
In production, conclave hosts can be tainted when launched, and an exit() may then cause a kernel panic.
As mentioned, the _exclaves_ctl_trap() function handles a new Mach trap for exclave functionality. The call is overloaded, with its action dependent on an operation parameter, and it generally verifies entitlements to the operations called. The operations are:
- EXCLAVES_CTL_OP_BOOT: called twice during the system boot process, firstly to start exclaves boot stage 2, and then to boot the ExclaveKit stage. The caller must be launchd or have the com.apple.private.exclaves.boot entitlement.
All operations below, at minimum, require the current task to have the com.apple.private.exclaves.kernel-domain entitlement, or be the relevant conclave manager task.
- EXCLAVES_CTL_OP_LAUNCH_CONCLAVE: launch a conclave, discussed earlier
- EXCLAVES_CTL_OP_LOOKUP_SERVICES: look up an exclave service and copy its struct to a userspace buffer. It first looks in the exclave domain of the current task; if that fails, it checks the Darwin domain and then the kernel domain, provided the task is entitled to do so
- EXCLAVES_CTL_OP_ENDPOINT_CALL: call the endpoint for an exclave service in the current task’s domain. This results in the current thread switching from kernel mode to the secure world and executing specific code there
- EXCLAVES_CTL_OP_NAMED_BUFFER_CREATE: create a named buffer resource
- EXCLAVES_CTL_OP_NAMED_BUFFER_COPYIN: copy data from a userspace buffer to a kernel buffer (that is shared with exclaves)
- EXCLAVES_CTL_OP_NAMED_BUFFER_COPYOUT: copy data from a kernel buffer (that is shared with exclaves) to a userspace buffer
- EXCLAVES_CTL_OP_AUDIO_BUFFER_CREATE: create an audio buffer
- EXCLAVES_CTL_OP_AUDIO_BUFFER_COPYOUT: copy data from an audio buffer to a userspace buffer
- EXCLAVES_CTL_OP_SENSOR_CREATE: create a sensor resource (e.g. camera, microphone)
- EXCLAVES_CTL_OP_SENSOR_START
- EXCLAVES_CTL_OP_SENSOR_STOP
- EXCLAVES_CTL_OP_SENSOR_STATUS
- EXCLAVES_CTL_OP_NOTIFICATION_RESOURCE_LOOKUP: create a notification resource (TBD, but likely for coordination/scheduling)
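The entitlement gate and the service lookup order described in this list can be summarised in a short Python sketch. The operation names and entitlement strings are from the source; the function names, their parameters and the dictionary-of-dictionaries table are assumptions of mine, and the real trap certainly performs further per-operation checks.

```python
BOOT_ENTITLEMENT = "com.apple.private.exclaves.boot"
KERNEL_DOMAIN_ENTITLEMENT = "com.apple.private.exclaves.kernel-domain"


def may_call(op, entitlements, is_launchd=False, is_conclave_manager_task=False):
    """Minimum check applied before an operation is dispatched."""
    if op == "EXCLAVES_CTL_OP_BOOT":
        return is_launchd or BOOT_ENTITLEMENT in entitlements
    return KERNEL_DOMAIN_ENTITLEMENT in entitlements or is_conclave_manager_task


def lookup_service(name, tables, task_domain, entitled_domains):
    """Search the task's own domain first, then Darwin, then kernel.

    `tables` maps domain name -> {service name -> service}.
    `entitled_domains` lists the fallback domains the task may search.
    """
    for domain in (task_domain, "com.apple.darwin", "com.apple.kernel"):
        if domain != task_domain and domain not in entitled_domains:
            continue  # not entitled to fall back to this domain
        service = tables.get(domain, {}).get(name)
        if service is not None:
            return domain, service
    return None
```

For example, a task in a conclave domain that is not entitled to the kernel domain gets `None` for a service that only exists in com.apple.kernel.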
Downcalls are calls to exclave Services’ endpoints in the secure world — this is where secure world code execution happens.
There is a great deal of complexity in these calls, primarily around managing thread/IPC contexts and scheduling the current thread to execute code in the secure world.
- Downcalls switch the current thread into the secure world and start executing at an entry point in secure code, rather than asking some other thread to perform work on behalf of the current thread.
- Calling tasks must have kernel domain entitlements or be the conclave manager task attached to the service’s conclave.
- Conclaves have a maximum of 128 services that can be called
- It appears that threads are scheduled into the secure kernel (via the sk_enter() function) by XNU. XNU appears to handle the scheduling of all threads in the secure world, with SK potentially not having any independent threads of its own.
- A thread executing in the secure world can perform a temporary upcall to XNU, which returns the thread to kernel mode for the upcall, before a mandatory return back to the secure world context. More detail on upcalls will be provided further below.
- Threads executing in the secure world can do normal scheduler type things like yield, wait, be suspended, or be interrupted. When this happens, the thread leaves the secure world and returns to the XNU kernel context. From there it must be rescheduled back into the secure world by exclave scheduling code in XNU. The thread will continue to be rescheduled into the secure world as necessary until the downcall is completed.
- If a secure world thread is panic()ing on a CPU core (which will call on XNU to panic via SPTM), fresh tasks are no longer scheduled into the secure world on other cores and they wait for a timeout period. If everything goes correctly, the waiting threads will never finish their wait. However if the timeout expires, the waiting threads will then … panic() :)
- XNU appears to handle all interrupt processing, rather than SK. When XNU is finished handling an interrupt, the interrupted thread is returned to the secure world if it was executing there. Directing interrupts to either the insecure or secure kernel is an ARM TrustZone feature.
- IPC structures for the downcall are set up with request and response buffers before entering the secure world through the redacted sk_enter() call.
- Interrupts and pre-emption are disabled while finalising the IPC request structure and calling sk_enter(). This is because there is only one of these structures per core. I suspect the redacted path travelled after calling sk_enter() and entering the secure world copies the request from the per-cpu structure into secure world memory, and then re-enables interrupts and pre-emption on the core. The alternative would be ugly. A similar process happens in reverse for protecting the per-cpu response structure.
- Disconcertingly, the downcall response can come back via a different CPU’s per-core response buffer, as the downcall may have been interrupted, upcalled, or yielded and needed rescheduling.
- Coordination of a thread’s exclave status (to avoid SK re-entry etc) occurs via th_exclaves_state — a bitfield in the thread structure.
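The "keep rescheduling until the downcall completes" behaviour is easier to see in a toy model. In the sketch below a Python generator stands in for secure-world code, and each `yield` stands for a reason the thread drops back to XNU early. Everything here is hypothetical, including the function names and the way CPU ids are supplied; it only illustrates why the response can arrive on a different core from the one the request left on.

```python
def secure_world_work():
    """Stand-in for secure-world code: each `yield` is a reason the thread
    leaves the secure world before the downcall has finished."""
    yield "interrupted"
    yield "yielded"
    return "response"


def downcall(work, cpus):
    """Re-enter the secure world until the work completes.

    `cpus` is an iterator of the CPU id the thread happens to be on at each
    entry. Returns (response, cpu the response arrived on, early exits).
    """
    thread = work()
    exits = []
    while True:
        cpu = next(cpus)  # the thread may land on a different core each time
        try:
            # Enter the secure world; control came back to XNU early.
            exits.append((cpu, next(thread)))
        except StopIteration as done:
            return done.value, cpu, exits  # the downcall completed
```

Running it with cores 0, 2 and 1 in turn shows the request leaving on core 0 and the response being collected on core 1.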
A thread running in the secure world due to a downcall may need assistance from XNU, and this can be achieved through an upcall to the exclaves upcall handler via the Tightbeam framework. Upcalls are limited to specific functions within XNU. A thread desiring an upcall returns to the insecure world where the specific upcall handler is called. While in this state, the thread cannot return to user mode (for obvious reasons) nor perform another downcall to the secure world, i.e. it is not allowed to “re-enter” exclaves. Instead the thread will be returned to the secure world at the point where it performed the upcall.
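The re-entry rule can be modelled with a per-thread bitfield. The field name th_exclaves_state is from the source, but the flag names and values below, and all of the helper functions, are my own placeholders rather than XNU's definitions.

```python
# Hypothetical flag values for the th_exclaves_state bitfield
STATE_NONE = 0x0
STATE_DOWNCALL = 0x1  # thread has a downcall in progress
STATE_UPCALL = 0x2    # thread is temporarily back in XNU for an upcall


class Thread:
    def __init__(self):
        self.th_exclaves_state = STATE_NONE


def begin_downcall(thread):
    if thread.th_exclaves_state != STATE_NONE:
        raise RuntimeError("exclaves re-entry is not allowed")
    thread.th_exclaves_state |= STATE_DOWNCALL


def end_downcall(thread):
    thread.th_exclaves_state &= ~STATE_DOWNCALL


def run_upcall(thread, handler):
    if not thread.th_exclaves_state & STATE_DOWNCALL:
        raise RuntimeError("upcalls only happen during a downcall")
    thread.th_exclaves_state |= STATE_UPCALL
    try:
        return handler(thread)
    finally:
        # The thread always goes back to the secure world afterwards.
        thread.th_exclaves_state &= ~STATE_UPCALL


def may_return_to_user(thread):
    return thread.th_exclaves_state == STATE_NONE
```

An upcall handler that tries to call `begin_downcall()` on its own thread gets an error, and `may_return_to_user()` stays false until the original downcall has ended.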
Allowed upcalls discovered in the source end up inside the following functions:
Memory
exclaves_memory_upcall_alloc(npages, kind, completion);
exclaves_memory_upcall_free(pages, npages, kind, completion);
File storage
exclaves_storage_upcall_root(exclaveid, completion);
exclaves_storage_upcall_open(fstag, rootid, name, completion);
exclaves_storage_upcall_close(fstag, fileid, completion);
exclaves_storage_upcall_create(fstag, rootid, name, completion);
exclaves_storage_upcall_read(fstag, fileid, descriptor, completion);
exclaves_storage_upcall_write(fstag, fileid, descriptor, completion);
exclaves_storage_upcall_remove(fstag, rootid, name, completion);
exclaves_storage_upcall_sync(fstag, op, fileid, completion);
exclaves_storage_upcall_readdir(fstag, fileid, buf, length, completion);
exclaves_storage_upcall_getsize(fstag, fileid, completion);
exclaves_storage_upcall_sealstate(fstag, completion);
DriverKit
exclaves_driverkit_upcall_irq_register(id, index, completion);
exclaves_driverkit_upcall_irq_remove(id, index, completion);
exclaves_driverkit_upcall_irq_enable(id, index, completion);
exclaves_driverkit_upcall_irq_disable(id, index, completion);
exclaves_driverkit_upcall_timer_register(id, completion);
exclaves_driverkit_upcall_timer_remove(id, timer_id, completion);
exclaves_driverkit_upcall_timer_enable(id, timer_id, completion);
exclaves_driverkit_upcall_timer_disable(id, timer_id, completion);
exclaves_driverkit_upcall_timer_set_timeout(id, timer_id, duration, completion);
exclaves_driverkit_upcall_timer_cancel_timeout(id, timer_id, completion);
exclaves_driverkit_upcall_lock_wl(id, completion);
exclaves_driverkit_upcall_unlock_wl(id, completion);
exclaves_driverkit_upcall_async_notification_signal(id, notificationID, completion);
exclaves_driverkit_upcall_mapper_activate(id, mapperIndex, completion);
exclaves_driverkit_upcall_mapper_deactivate(id, mapperIndex, completion);
exclaves_driverkit_upcall_notification_signal(id, mask, completion);
DriverKit Apple Neural Engine
exclaves_driverkit_upcall_ane_setpowerstate(id, desiredState, completion);
exclaves_driverkit_upcall_ane_worksubmit(id, requestID, taskDescriptorCount, submitTimestamp, completion);
exclaves_driverkit_upcall_ane_workbegin(id, requestID, beginTimestamp, completion);
exclaves_driverkit_upcall_ane_workend(id, requestID, completion);
Conclaves
exclaves_conclave_upcall_suspend(flags, completion);
exclaves_conclave_upcall_stop(flags, completion);
exclaves_conclave_upcall_crash_info(shared_buf, length, completion);
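Every function in the list above takes a completion as its last argument, and only listed functions can be reached. A rough Python model of that pattern is a dispatch table keyed by name. The handler body, the `dispatch_upcall` helper and the way the completion collects its result are invented for illustration; the real handlers are reached through Tightbeam, not through a dictionary.

```python
def memory_upcall_alloc(npages, kind, completion):
    # Hypothetical body: hand back fake page numbers through the completion.
    completion(list(range(npages)))


# Only functions registered here can be reached by an upcall.
UPCALL_HANDLERS = {
    "exclaves_memory_upcall_alloc": memory_upcall_alloc,
}


def dispatch_upcall(name, *args):
    handler = UPCALL_HANDLERS.get(name)
    if handler is None:
        raise PermissionError(f"upcall {name!r} is not registered")
    result = []
    handler(*args, result.append)  # the completion is always the last argument
    return result[0]
```

Calling `dispatch_upcall("exclaves_memory_upcall_alloc", 3, "some-kind")` returns three fake page numbers, while any name outside the table is refused.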
References to XNUProxy abound, yet I haven’t been able to pin down definitively what and where it is. Options I have considered include:
- It’s an exclave domain of its own, something like com.apple.xnuproxy
- It’s an exclave service or bunch of services that runs in the com.apple.kernel domain, serving particular types of downcalls.
- It’s a subsystem in SPTM for making downcalls to the secure world…
Comments in Exclaves_L4.h state that the XNU Proxy makes the following exclaves reachable (aside from testing ones, usually featuring the word “HELLO” in them):
- EXCLAVES_XNUPROXY_EXCLAVE_USERAPP/2/3 (templated user app…)
- EXCLAVES_XNUPROXY_EXCLAVE_AUDIODRIVER
- EXCLAVES_XNUPROXY_EXCLAVE_EXCLAVEDRIVERKIT
- EXCLAVES_XNUPROXY_EXCLAVE_SECURERTBUDDY_AOP (RT Buddy for Always On Processor)
- EXCLAVES_XNUPROXY_EXCLAVE_SECURERTBUDDY_DCP (for Display Coprocessor)
- EXCLAVES_XNUPROXY_EXCLAVE_CONCLAVECONTROL (conclave launcher control)
- EXCLAVES_XNUPROXY_EXCLAVE_CONCLAVEDEBUG
- EXCLAVES_XNUPROXY_EXCLAVE_SECURERTBUDDY_AOP_EDK (ExclaveDriverKit connection for Always On Processor)
- EXCLAVES_XNUPROXY_EXCLAVE_SECURERTBUDDY_DCP_EDK (ExclaveDriverKit connection for Display CoProcessor)
Note RTBuddys are for communicating with RTKit, yet another Apple Operating System, that runs on the Display Coprocessor, Apple Neural Engine, NVMe controller, SMC Controller, Smart Keyboards, Siri Remote, Apple Pencil, AirPods, AirTags… and I assume the AOP.
Booting exclaves when the system is starting requires a delicately coordinated dance between the insecure and secure worlds. Anything going wrong usually ends up in a panic().
Booting occurs in three stages. Stage one is not visible in the open-source code, but is likely a secure boot process where SK is loaded into memory and its code signatures are verified before being made executable. At the end of a successful stage one boot, the boot status is EXCLAVES_BS_NOT_STARTED. Stage two then performs the following steps:
- Initialises upcall server by creating a tightbeam endpoint for upcalls
- Enters secure world with a special call to collect boot information from secure kernel
- Enters secure world again with normal endpoint call but not sure why… possibly to trigger the kernel domain to start
- Initialises the exclave scheduler
- Initialises the XrtHostedXNU kext
- Initialises callbacks (I think into the above kext)
- Boots the scheduler: sets up per-cpu request and response memory for the boot CPU core only, and binds to the boot core
- Loops, calling into the secure world to see if it needs memory allocations, until it responds that all exclaves are booted
- Initialises multicore by setting up per-cpu request and response memory for all cores
- Initialises XNU Proxy — creates a cache of buffers for IPC calls, creates some thread contexts, sets up a tightbeam endpoint for downcalls to the xnuproxy
- Initialises an exclaves panic kernel thread
- Discovers all static exclave resources and builds the root_table of domains and resources.
- Creates tightbeam endpoints for all Conclave Manager resources and calls an initialisation process for each one.
- Populates a bitmap of valid conclave service ids (from 0 to 127) for each conclave.
- At kernel build time, a list of boot tasks was stored in the __DATA_CONST segment. These are now sorted by priority and each boot task function is called. I likely only have a very partial picture here, but these tasks include creating an endpoint for each of the exclave indicator controller service, the storage backend service, the logserver, and for stackshots.
- Boot status is now EXCLAVES_BS_BOOTED_STAGE_2
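One of the steps above populates a bitmap of valid service ids (0 to 127) for each conclave. A minimal sketch of such a bitmap, using a single Python integer as the bit store, looks like this. The class and method names are hypothetical; only the 128-id range comes from the source.

```python
MAX_SERVICES = 128  # service ids run from 0 to 127


class ServiceBitmap:
    """One bit per service id; a set bit means the id is valid for the conclave."""

    def __init__(self):
        self.bits = 0

    def set(self, service_id):
        if not 0 <= service_id < MAX_SERVICES:
            raise ValueError(f"service id {service_id} out of range")
        self.bits |= 1 << service_id

    def is_valid(self, service_id):
        if not 0 <= service_id < MAX_SERVICES:
            return False
        return bool((self.bits >> service_id) & 1)
```

A bitmap like this makes it cheap to check a caller-supplied service id before attempting a downcall to it.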
The third stage, which boots ExclaveKit, makes multiple calls regarding “framemint”. This is suggestive of the SK being based on seL4.
- The “com.apple.service.FrameMint” service is looked up and a tightbeam endpoint is created for it
- A redacted function, framemint_framemint__init() is called
- A redacted framemint_framemint_populate() function is called. I guess this triggers all sorts of exciting activity in the secure world
- Boot status is now EXCLAVES_BS_BOOTED_EXCLAVEKIT
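Putting the boot statuses together gives a simple progression. The sketch below uses the three status names from the source, but the numeric values, the `ExclavesBoot` class and the rule that stages advance strictly one at a time are assumptions. The real kernel makes conclave launches wait for the final status, whereas this model only exposes a predicate.

```python
from enum import IntEnum


class BootStatus(IntEnum):
    # Names from the source; numeric values are placeholders for ordering only
    EXCLAVES_BS_NOT_STARTED = 0
    EXCLAVES_BS_BOOTED_STAGE_2 = 1
    EXCLAVES_BS_BOOTED_EXCLAVEKIT = 2


class ExclavesBoot:
    def __init__(self):
        self.status = BootStatus.EXCLAVES_BS_NOT_STARTED

    def boot(self, target):
        """Advance exactly one stage; anything else is treated as an error."""
        if target != self.status + 1:
            raise RuntimeError(
                f"cannot go from {self.status.name} to {BootStatus(target).name}"
            )
        self.status = BootStatus(target)

    def can_launch_conclaves(self):
        return self.status >= BootStatus.EXCLAVES_BS_BOOTED_EXCLAVEKIT
```

The two `boot()` calls mirror the two EXCLAVES_CTL_OP_BOOT invocations made during system start-up.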
SPTM “types” memory pages to control access to them via its different subsystems. Existing types include:
- XNU_USER_EXEC
- XNU_USER_DEBUG
- XNU_USER_JIT
- XNU_ROZONE
- XNU_KERNEL_RESTRICTED
- plus types for TXM, DART, etc.
Exclaves have added:
- SK_DEFAULT (exclusive to SK — inaccessible to XNU)
- SK_IO (also exclusive to SK — inaccessible to XNU)
- SK_SHARED_RO (memory shared between SK and XNU, read only for XNU)
- SK_SHARED_RW (memory shared between SK and XNU, read+write for XNU)
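From XNU's point of view, the four new types boil down to a small permission table. The Python sketch below encodes just what is listed above. The `xnu_may_access` helper and the "r"/"w" encoding are hypothetical, and SPTM's real enforcement happens in page table management rather than in a lookup like this.

```python
# XNU's access to each exclave page type, as described in the list above
XNU_ACCESS = {
    "SK_DEFAULT": set(),          # exclusive to SK
    "SK_IO": set(),               # exclusive to SK
    "SK_SHARED_RO": {"r"},        # shared, read only for XNU
    "SK_SHARED_RW": {"r", "w"},   # shared, read+write for XNU
}


def xnu_may_access(page_type, access):
    """Return True if XNU may perform `access` ("r" or "w") on the page type."""
    if page_type not in XNU_ACCESS:
        raise KeyError(f"unknown page type {page_type}")
    return access in XNU_ACCESS[page_type]
```

So `xnu_may_access("SK_SHARED_RO", "w")` is false, while the same request against SK_SHARED_RW is allowed.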