VMX root-mode compatible message tracing
How we transfer the buffers safely from the kernel and vmx-root to the debugger
If you need to change and customize message tracing options, please visit here.
The message tracing is derived from part 8 of our Hypervisor From Scratch series:
Without a doubt, one of the hardest parts of designing a hypervisor is sending a message from vmx root-mode to vmx non-root mode. This is because of many limitations: for example, you can't touch any pageable buffer, and most of the NT functions are not safe at any IRQL, as they might access buffers that reside in the paged pool.
The problems don't end there; there are plenty of other limitations to deal with.
This section is inspired by Chapter 6: Kernel Mechanisms (High IRQL Synchronization) from the Windows Kernel Programming book by Pavel Yosifovich which is a really amazing book if you want to start with kernel programming.
Concepts
This section describes some operating-system concepts you should know before starting.
What’s a spinlock?
A spinlock is a bit in memory that provides atomic test-and-modify operations. When a CPU tries to acquire a spinlock that is not currently free, the CPU keeps spinning on the spinlock, busy-waiting for it to be released by another CPU; that is, it keeps checking until the thread that acquired it first releases it.
Test-and-Set
You probably read about Test and Set in university. Still, in case you didn’t, in computer science, the test-and-set instruction is an instruction used to write 1 (set) to a memory location and return its old value as a single atomic (i.e., non-interruptible) operation. If multiple processes may access the same memory location, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process’s test-and-set is finished.
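As a rough illustration, C11's `atomic_flag` exposes exactly this primitive. This is only a portable analogue — the kernel-mode code discussed later uses the `_interlockedbittestandset` intrinsic instead — but the semantics are the same:

```c
#include <stdatomic.h>

/* Test-and-set sketch on top of C11 atomics. Returns the OLD value:
   0 means the flag was free and we now own it,
   1 means someone else had already set it. */
static int TestAndSet(atomic_flag *Flag)
{
    return atomic_flag_test_and_set(Flag) ? 1 : 0;
}
```

The whole point is that both the read of the old value and the write of 1 happen as one non-interruptible operation, so two CPUs can never both see 0.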
What do we mean by “Safe”?
The word “safe” is used a lot in hypervisor development. By “safe,” we mean something that works all the time and won't cause a system crash or halt. Managing code in vmx root-mode is tricky: interrupts are masked (disabled), transferring a buffer from vmx root-mode to vmx non-root mode needs extra effort, and we should be cautious and avoid executing certain APIs in order to stay safe.
What is DPC?
A Deferred Procedure Call (DPC) is a Windows mechanism that allows high-priority tasks (e.g., an interrupt handler) to defer required but lower-priority tasks for later execution. This permits device drivers and other low-level event consumers to perform the high-priority part of their processing quickly and schedule non-critical additional processing for execution at a lower priority.
DPCs are implemented by DPC objects which are created and initialized by the kernel when a device driver or some other kernel-mode program issues requests for DPC. The DPC request is then added to the end of a DPC queue. Each processor has a separate DPC queue. DPCs have three priority levels: low, medium, and high. By default, all DPCs are set to medium priority. When Windows drops to an IRQL of Dispatch/DPC level, it checks the DPC queue for any pending DPCs and executes them until the queue is empty or some other interrupt with a higher IRQL occurs.
This is the description of DPCs from MSDN:
Because ISRs must execute as quickly as possible, drivers must usually postpone the completion of servicing an interrupt until after the ISR returns. Therefore, the system provides support for deferred procedure calls (DPCs), which can be queued from ISRs and which are executed at a later time and at a lower IRQL than the ISR.
There are two posts about DPCs here and here; you can read them for more information.
Challenges
For example, vmx root-mode is not a HIGH_IRQL interrupt (we discuss this in the Discussion section), but as it disables all interrupts, we can think of it as a HIGH_IRQL state. The problem is that most synchronization functions are designed to work at IRQLs less than DISPATCH_LEVEL.
Why is it problematic? Imagine you have a single-core processor, and your function requires a spinlock (say, to protect a buffer that needs to be accessed). The function raises the IRQL to DISPATCH_LEVEL. Now the Windows scheduler can't interrupt the function until it releases the spinlock and lowers the IRQL back to PASSIVE_LEVEL or APC_LEVEL. During the execution of the function, a vm-exit occurs; thus, we're in vmx root-mode now. As mentioned, the vm-exit behaves as if it were a HIGH_IRQL interrupt.
Now, what if we want to access that buffer in vmx root mode? Two scenarios might occur.
We wait on a spinlock that was previously acquired by a thread in vmx non-root mode, and thus we have to wait forever: a deadlock occurs.
We enter the function without honoring the lock (while another thread is inside the function at the same time), which results in a corrupted buffer and invalid data.
Another limitation comes from Windows' design: putting a thread into a waiting state cannot be done at IRQL DISPATCH_LEVEL or higher. This is because when you acquire a spinlock in Windows, it raises the IRQL to 2 (DISPATCH_LEVEL) if it's not already there, acquires the spinlock, performs the work, and finally releases the spinlock and lowers the IRQL back.
If you look at functions like KeAcquireSpinLock and KeReleaseSpinLock, they take an IRQL in their arguments. First, KeAcquireSpinLock saves the current IRQL to the parameter supplied by the caller, then raises the IRQL to DISPATCH_LEVEL and sets a bit. When the function finishes its work with the shared data, it calls KeReleaseSpinLock and passes that old IRQL parameter, so this function unsets the bit and restores (lowers) the old IRQL.
Windows has four kinds of spinlocks:
KeAcquireSpinLock – KeReleaseSpinLock : This pair can be called at IRQL <= DISPATCH_LEVEL.
KeAcquireSpinLockAtDpcLevel – KeReleaseSpinLockFromDpcLevel : This pair can be called at IRQL = DISPATCH_LEVEL only. It's more optimized if you are already at IRQL 2, as it doesn't save the old IRQL, and it's specially designed to work in DPC routines.
KeAcquireInterruptSpinLock – KeReleaseInterruptSpinLock: This pair is used for hardware interrupts, e.g., in an Interrupt Service Routine (ISR), or by drivers with an interrupt source.
ExInterlockedXxx : These functions raise the IRQL to HIGH_LEVEL and perform their task; they don't need a release function, as nothing can interrupt us at HIGH_LEVEL.
But unfortunately, things are more complicated when it comes to vmx root-mode. There is no IRQL in vmx root-mode; IRQL is an operating-system concept, so we can't use any of the above functions, and things get even worse if we want to use our message tracing mechanism across multiple cores!
For these reasons, we have to design our custom spinlock.
Designing A Spinlock
Designing a spinlock for a multi-core system inherently requires hardware support for atomic operations, meaning the hardware (usually the processor) must guarantee that the operation is performed by only one logical (hyper-threaded) core at a time and is non-interruptible.
There is an article here that describes different kinds of spinlock with different optimizations, also it’s implemented here.
The design of this mechanism in the processor is beyond the scope of this article. We simply use an intrinsic function provided by Windows called “_interlockedbittestandset“.
This makes our implementation super simple. We just need to use the following function, and it’s the responsibility of the processor to take care of everything.
We should use the volatile keyword on the parameters too; otherwise, the compiler is free to optimize accesses as if the variable weren't volatile.
Now we need to spin! If the above function was not successful, then we have to keep the CPU checking to see when another processor releases the lock.
If you wonder what _mm_pause() is, it's equivalent to the PAUSE instruction in x86.
The PAUSE instruction is commonly used in spinlock test loops, while another thread owns the spinlock, to mitigate the cost of the tight loop.
PAUSE notifies the CPU that this is a spinlock wait loop, so memory and cache accesses may be optimized. See also pause instruction in x86 for some more details about avoiding the memory-order mis-speculation when leaving the spin-loop. PAUSE may stop CPU for some time to save power. Older CPUs decode it as REP NOP, so you don’t have to check if it’s supported. Older CPUs will simply do nothing (NOP) as fast as possible.
For releasing the lock, there is nothing special to do, so simply unset it without caring for any other processor as there is no other processor that wants to unset it.
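Putting the pieces together, a portable sketch of such a custom spinlock might look like the following. This is only an analogue built on C11 atomics — the actual driver uses `_interlockedbittestandset` and `_mm_pause` — and the function names here are illustrative:

```c
#include <stdatomic.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CPU_PAUSE() _mm_pause()   /* the PAUSE instruction in the wait loop */
#else
#define CPU_PAUSE() ((void)0)     /* no-op on non-x86 targets */
#endif

/* Try once: returns 1 on success (the lock was free), 0 if already held. */
static int SpinlockTryLock(atomic_flag *Lock)
{
    return !atomic_flag_test_and_set_explicit(Lock, memory_order_acquire);
}

/* Spin until the lock is ours, pausing between attempts. */
static void SpinlockLock(atomic_flag *Lock)
{
    while (atomic_flag_test_and_set_explicit(Lock, memory_order_acquire))
        CPU_PAUSE();
}

/* Release: a plain atomic clear. No other core may clear a lock it
   doesn't own, so no test-and-set is needed here. */
static void SpinlockUnlock(atomic_flag *Lock)
{
    atomic_flag_clear_explicit(Lock, memory_order_release);
}
```

Note the acquire/release memory ordering: it plays the same role as the implicit fencing of the interlocked intrinsic, ensuring the protected buffer's reads and writes stay inside the critical section.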
The “volatile” keyword tells the compiler that the value of the variable may change at any time without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. There are lots of examples here if you have a problem with understanding “volatile“.
Message Tracer Design
To solve the deadlock challenge described above, I created two message pools for saving messages. The first pool is used as storage for vmx non-root messages (buffers), and the second pool is used for vmx-root messages.
We have the following structure that describes the state of each of these two pools.
Generally, we'll save the buffer as illustrated below; each chunk of the message comes with a BUFFER_HEADER that describes that chunk.
Other information about the buffer, like the current index to write and the current index to send, is saved in the above structure.
The BUFFER_HEADER is defined like this,
We save the used length of the chunk and a bit that determines whether we have sent it before or not.
The Operation Number is a number that will be sent to user mode to show the type of the buffer that came from the kernel. In other words, it's a number that indicates the intention (and structure) of the buffer, so the user-mode application will know what to do with it.
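To make the layout concrete, a hypothetical C sketch of the chunk header and the pool-state structure described above might look like this (field and type names are illustrative, not the project's actual definitions):

```c
#include <stdint.h>

/* Header that precedes each chunk in a pool (illustrative layout). */
typedef struct _BUFFER_HEADER {
    uint32_t OperationNumber;  /* intention/structure of the buffer        */
    uint32_t BufferLength;     /* used length of this chunk                */
    uint32_t Valid;            /* 1 = unread message, 0 = already consumed */
} BUFFER_HEADER;

/* Per-pool state: one instance for vmx non-root, one for vmx-root. */
typedef struct _LOG_BUFFER_INFORMATION {
    uint64_t BufferStartAddress;  /* base of the chunk array         */
    uint64_t BufferEndAddress;    /* end of the chunk array          */
    uint32_t CurrentIndexToWrite; /* next chunk to fill              */
    uint32_t CurrentIndexToSend;  /* next chunk to hand to user mode */
} LOG_BUFFER_INFORMATION;
```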
The following Operation Numbers are currently defined:
Each of them indicates a different type of message, and the last one shows that a batch of messages has been accumulated in this buffer. This message tracing is designed to send any kind of buffer from both vmx root-mode and the OS to user mode, so it's not limited to sending messages; we can send buffers with custom structures and different Operation Numbers.
The last thing about our message tracing: it can be configured with the following constants, which you can change in order to get better performance for your particular use case.
You can configure things like the maximum number of chunks in a buffer and the size of each chunk. Tuning these values matters because, if no thread consumes (reads) the chunks and the pools become full, the mechanism replaces the oldest unread buffer. Hence, if you can't consume the pools frequently, it's better to specify a higher MaximumPacketsCapacity so that you won't lose anything.
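As an illustration, such constants could look like the following; the values and the placeholder header are examples only, not the project's defaults:

```c
#include <stdint.h>

typedef struct _BUFFER_HEADER {   /* placeholder for the header described above */
    uint32_t OperationNumber, BufferLength, Valid;
} BUFFER_HEADER;

/* Illustrative tuning knobs. */
#define MaximumPacketsCapacity 1000  /* chunks kept per pool       */
#define PacketChunkSize        1000  /* usable bytes in each chunk */

/* Total bytes to allocate for one pool: every chunk carries a header. */
#define LogBufferSize \
    (MaximumPacketsCapacity * (PacketChunkSize + sizeof(BUFFER_HEADER)))
```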
Initialization Phase
In the initialization phase, we allocate space for the above structure (2 times, one for vmx non-root and one for vmx-root) and then allocate the buffers to be used as the storage for saving our messages.
We have to zero them all, and also call KeInitializeSpinLock to initialize the spinlock. We use this spinlock only for vmx non-root, and this function makes sure that the value of the lock is unset. We do the same for our custom spinlock (VmxRootLoggingLock): just unset it.
You might ask what “BufferLockForNonImmMessage” is; it's another lock that we'll use as an optimization (see later).
All in all, we have the following code.
Sending Phase (Saving Buffer and adding them to pools)
In a regular Windows routine, generally we shouldn't be at an IRQL higher than DISPATCH_LEVEL. There is no case where our log manager needs to be used at higher IRQLs, so we don't care about them. Thus, we have two different approaches here: first, we acquire the lock using KeAcquireSpinLock in vmx non-root, as it's the Windows-optimized way to acquire a lock; for vmx root-mode, we acquire the lock using our previously designed spinlock.
As I told you above, we want to address the problem that a vm-exit might occur while we hold a lock, so it's not possible to use the same spinlock in both modes, as a deadlock might happen.
Now we have to see whether we are operating from vmx non-root or vmx root; based on this condition, we select our lock and the index of the buffer that we want to put our message into.
I'm not going to explain each step, as it's easy; it's just managing buffers and copying data from one buffer to another, and the code is well commented, so you can read it. Instead, I'll explain the tricky parts of our message tracing.
After creating a header for our new message buffer, we will copy the bytes and change the information about buffer’s indexes. The last step here is to see whether any thread is waiting to receive our message or not.
If no thread is waiting for our message, then there is nothing more to do here. But if there is a thread in the IRP Pending state (I'll explain it later), then we use KeInsertQueueDpc so that our DPC is added to the DPC queue, which Windows will subsequently execute at IRQL == DISPATCH_LEVEL.
It means that our callback function will be executed by Windows later and, of course, Windows executes our function in vmx non-root mode, so it's safe. I'll describe this callback and how we create a DPC later.
Finally, we have to release the locks so that other threads can enter.
Reading Phase (Read buffers and send them to user-mode)
It’s time to read the previously filled buffer! The fact that we add a DPC in the previous function “LogSendBuffer” shows that the “LogReadBuffer” is executed in vmx non-root mode so we can freely use most of the APIs (not all of them).
Theoretically, we have a problem here: if we want to read a buffer from the vmx root-mode pool, it might cause a deadlock, since we acquire a vmx root-mode lock and a vm-exit might occur; we would then spin on this lock in vmx root-mode forever. But practically, there is no deadlock here. Can you guess why?
It's because our LogReadBuffer executes at DISPATCH_LEVEL, so the Windows scheduler won't interrupt us, and our function executes without any interruption. Also, we're not doing anything fancy here; I mean, we're not executing anything (like CPUID) that causes a vm-exit in our code, so practically there is nothing to cause a deadlock here. Still, we should keep in mind that we're not allowed to run code that causes a vm-exit.
We compute the header address based on the previous information and set the valid bit to zero to indicate that this buffer has already been used.
Then we copy the buffer into the buffer specified in the arguments, and also put the Operation Number at the top of the target buffer so that future functions will know the intention of this buffer. We can also use DbgPrint to show the messages in the kernel debugger. Using DbgPrint at DISPATCH_LEVEL (in vmx non-root mode) is safe. We might need to call DbgPrint multiple times, as this function transfers a maximum of 512 bytes by default; even though you can change that limit, we assume the default size is selected.
Finally, we have to reset some information regarding the buffer, clear the buffer's messages (it's not necessary to zero the buffer, but to make debugging easier, I prefer to zero it), and release the locks.
Checking for new messages
Checking for a new message is simple; we just need to check the current message index based on the previous information and see whether its header is valid or not. If it's valid, we have a new message; if not, some function previously read the message, and there is no new message.
For checking for new messages, we don't even need to acquire a lock, because basically we don't write anything, and in our case reading doesn't need a lock.
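A lock-free check along those lines might be sketched like this (a hypothetical user-mode analogue; the function and field names are illustrative):

```c
#include <stdint.h>

typedef struct _BUFFER_HEADER {   /* placeholder for the header described earlier */
    uint32_t OperationNumber, BufferLength, Valid;
} BUFFER_HEADER;

/* A message is pending iff the header at the current send index still has
   its Valid bit set; reading a single flag needs no lock. */
static int LogCheckForNewMessage(const char *Pool, uint32_t FullChunkSize,
                                 uint32_t CurrentIndexToSend)
{
    const BUFFER_HEADER *Header = (const BUFFER_HEADER *)
        (Pool + (uint64_t)FullChunkSize * CurrentIndexToSend);
    return Header->Valid != 0;
}
```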
Sending messages to pools
Previously, we saw how to save (send) buffers and read them. Each message is a buffer of strings, so finally we have to use “LogSendBuffer” to send our buffer, but we need some extra effort to send a well-formed message.
va_start and va_end are used to support multiple arguments to one function, e.g like DbgPrint or printf.
You can use a combination of KeQuerySystemTime, ExSystemTimeToLocalTime, and RtlTimeToTimeFields to get the current system time (see the example), then put the pieces together with sprintf_s.
There is a particular reason why we use a sprintf-like function instead of the RtlString* functions; the reason is described in the Discussion section. The next step is computing the length using strnlen_s.
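For illustration, a user-mode analogue of this formatting step could look like the following (the kernel-mode code uses KeQuerySystemTime, ExSystemTimeToLocalTime, RtlTimeToTimeFields, and sprintf_s instead; the function name here is hypothetical):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Prefix a log message with the local time, then the caller's text. */
static int FormatLogLine(char *Out, size_t OutSize, const char *Message)
{
    time_t Now = time(NULL);
    struct tm Local;
#if defined(_WIN32)
    localtime_s(&Local, &Now);
#else
    localtime_r(&Now, &Local);
#endif
    /* e.g. "(13:37:00)\tmessage text" */
    int Written = snprintf(Out, OutSize, "(%02d:%02d:%02d)\t%s",
                           Local.tm_hour, Local.tm_min, Local.tm_sec, Message);
    /* Like strnlen_s in the driver, never report more than the capacity. */
    return (Written < 0 || (size_t)Written >= OutSize) ? -1 : Written;
}
```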
Finally, we have a vital optimization here. Logically, we create two kinds of messages: one is the “Immediate Message”, which we send directly into the pool, and the other is the “Non-Immediate Message”, for which we gather messages in a separate buffer and append new messages to it until its capacity is full (we shouldn't exceed the PacketChunkSize limit).
This way, we don't send each message to user mode separately; instead, we send multiple messages in one buffer to user mode and gain a visible performance improvement. For example, with PacketChunkSize == 1000 bytes, we send about 6 messages per buffer (on average; it depends on each message's size). As you probably know, the CPU has to do a lot of work to switch from kernel mode to user mode, and creating a new IRP packet is also a heavy task.
You can also change the configuration, e.g., increase PacketChunkSize so that more messages are held in the temporary buffer, but generally this delays the time at which you see the messages.
Also, since we operate on a shared staging buffer, we need another spinlock here.
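A minimal sketch of this batching path, written as a user-mode analogue under stated assumptions (SendToPool stands in for LogSendBuffer; the names, flush policy, and the assumption that each message fits in one chunk are illustrative):

```c
#include <string.h>

#define PACKET_CHUNK_SIZE 1000   /* mirrors PacketChunkSize in the article */

typedef struct {
    char   Staging[PACKET_CHUNK_SIZE];  /* non-immediate messages pile up here */
    size_t Used;
} MESSAGE_BATCHER;

static int SendCount;  /* counts flushes; a stand-in for real pool traffic */

/* Stand-in for LogSendBuffer: hand one full chunk to the message pool. */
static void SendToPool(const char *Buffer, size_t Length)
{
    (void)Buffer; (void)Length;
    SendCount++;
}

/* Append a non-immediate message; flush the batch first if it no longer fits. */
static void LogSendNonImmediate(MESSAGE_BATCHER *B, const char *Msg)
{
    size_t Len = strlen(Msg);               /* assumes Len <= PACKET_CHUNK_SIZE */
    if (B->Used + Len > sizeof(B->Staging)) {
        SendToPool(B->Staging, B->Used);    /* chunk is full: send the batch */
        B->Used = 0;
    }
    memcpy(B->Staging + B->Used, Msg, Len); /* append the new message */
    B->Used += Len;
}
```

The win is exactly the one described above: many small messages cost one kernel-to-user transition instead of one each.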
Putting it all together, we have the following code:
Receiving buffers and messages in user-mode
Receiving buffers in user mode is done using an IOCTL. First, we create another thread in our user-mode application. This thread is responsible for bringing the kernel-mode buffers to user mode and then operating based on the Operation Number.
This thread executes the following function. We use IRP Pending for transferring data from kernel mode to user mode. IRP Pending is primarily used for transferring packets: you send an IRP to the kernel, and the kernel marks the packet as pending. Whenever a buffer is available to send to user mode, the kernel completes the IRP request, the IOCTL function returns to user mode, and execution continues.
It's somewhat like waiting on an object. We could also use Windows events and trigger the event whenever a buffer is available, but IRP Pending is better, as it's designed for the purpose of sending data to user mode.
What we have to do is allocate a buffer for the kernel-mode code and use DeviceIoControl to request the packet. When the packet from the kernel is received, we process it and switch on the Operation Number.
IOCTL and managing user-mode requests
When the IOCTL arrives on the kernel side, DrvDispatchIoControl from the major functions is called. This function retrieves a pointer to the caller's I/O stack location in the specified IRP.
From the IRP stack, we can read the IOCTL code and the buffer addresses; at this point we perform the necessary checks and pass the arguments to LogRegisterIrpBasedNotification.
To register an IRP notification, we first check whether any other thread is already pending by checking GlobalNotifyRecord. If there is such a thread, we complete the IRP and return to user mode, because in our design we ignore multiple threads requesting the buffers; only one thread can read the kernel-mode buffers.
Second, we initialize a custom structure that describes the state. The following structure is responsible for saving Type, DPC Object, and target buffer.
To fill this structure, we initialize a DPC object by calling KeInitializeDpc; this function takes the callback that should be called later (LogNotifyUsermodeCallback) and the parameter that will be passed to it (NotifyRecord).
We first check the vmx non-root pools to see if anything new is available; otherwise, we check the vmx root-mode buffer. This precedence exists because vmx non-root buffers are more important. After all, we spend most of the time in vmx root-mode, so we might see thousands of messages from vmx-root while we have fewer messages from vmx non-root; if we checked the vmx root message buffer first, we might lose some messages from vmx non-root or never find time to process them.
If any new message is available then we directly add a DPC to the queue (KeInsertQueueDpc).
If there isn’t any new message available, then we simply save our Notify Record for future use, and also we mark IRP to pending state using IoMarkIrpPending and return STATUS_PENDING.
Usermode notify callback
As you see in the above code, we add DPCs to the queue in two functions (LogRegisterIrpBasedNotification and LogSendBuffer). This way, we won't miss anything, and everything is processed as messages are generated. For example, if there is a thread waiting for a message, then LogSendBuffer notifies it about the new message; if there isn't, LogSendBuffer can't do anything, but as soon as a new thread comes to the kernel, it checks for new messages. Think about it one more time. It's beautiful.
Now it’s time to read the packets from kernel pools and send them to the user-mode.
When LogNotifyUsermodeCallback is called, we are sure that we're at DISPATCH_LEVEL and in vmx non-root mode.
In this function, we check whether the parameters sent to the kernel are valid; this is necessary because user mode provides them. For example, we check the IRP stack's Parameters.DeviceIoControl.InputBufferLength and Parameters.DeviceIoControl.OutputBufferLength to make sure they are not null, and check whether the SystemBuffer is null or not.
Then we call LogReadBuffer with the user-mode buffers, so this function fills the user-mode buffer and adds the Operation Number in a suitable place. Also, Irp->IoStatus.Information provides the buffer length to user mode.
The last step here is to complete the IRP, so I/O Manager sends the results to the user-mode, and the thread can continue to its normal life.
The reasons why we can access the user-mode buffer in any process context (DPCs might run in a random user-mode process) and why we use DPCs rather than something like APCs are discussed in the Discussion section.
The following code demonstrates what we talked about it above.
Uninitialization Phase
Nothing special here; we just de-allocate the previously allocated buffers. Keep in mind that we should initialize the message tracer in the very first function of our driver so we can use it and, of course, uninitialize it at the very end, when we don't have any messages anymore.