Posted: 2026-04-19
Tags: DOS, Networking, Segmented memory is hard
A few weeks ago I was revisiting my instructions for running a SLIP connection between a DOS PC and Linux. If you are not familiar with SLIP, it stands for Serial Line Internet Protocol, and it lets you run TCP/IP over a PC serial port. TCP/IP is much faster over Ethernet, but a serial port can work too.
There are several packet drivers for DOS that let you make SLIP connections. One that I use often is "EtherSLIP", which is handy because it emulates an Ethernet packet driver even though it is really just SLIP over a serial port. The emulation allows you to use programs designed for Ethernet packet drivers unmodified; otherwise, you'd have to run programs that are designed specifically for SLIP packet drivers. All of the mTCP programs expect Ethernet, but they don't actually know what is happening "under the covers", so any packet driver that emulates Ethernet works too. (Besides EtherSLIP there is also a Token Ring packet driver that emulates Ethernet.) EtherSlip is included in the Crynwr packet driver collection, which covers most classic ISA Ethernet cards.
I used Telnet for my testing, and something was wrong with my cabling: the connection was slow and dropping packets like crazy. (It turned out to be a hardware problem.) When Telnet exited it gave me this error message:
*** NULL assignment detected
Well, that doesn't sound good. The compiler I use (Open Watcom 1.9) checks the heap at the end of a program to let you know if there was heap corruption, but this is a different error message. I dug through the PDF documentation and I found an explanation in the "Watcom C/C++ Programmer's Guide." Here is a summary of the problem:
- Normally it is an error to use a NULL pointer, and on a real operating system you will get a signal or an exception if you try to read or write using one. 16-bit DOS has no memory protection to catch this, so the access is allowed even though it is an error.
- While you can't detect reads using a NULL pointer, the compiler has a trick for trying to detect writes using it. The compiler's runtime library reserves 32 bytes at the start of the data segment and fills them with a known pattern. At the end of the program the runtime checks to see if those 32 bytes have been altered. If they have, then something might have used a NULL pointer to do it. (A NULL near pointer is offset 0 in the data segment, so a write through it lands in this area. Nothing else should point there.)
If you get the warning message, then something clobbered those first 32 bytes in the data segment and you probably have a bug. Not getting the warning doesn't prove you are bug free: a write outside of those 32 bytes will not be detected by this mechanism.
Ok, so here is the situation:
- This only happens when using EtherSLIP. I've never even seen this error before and I've been using this compiler for 15 years.
- It seems to be triggered only when I have packets getting lost, which then requires retrying sending those packets.
- The machine I'm using is an 8088-class machine, so I can't use the Open Watcom debugger to catch the code that is causing this.
My first attempt: Lots of if-checks
The compiler run-time is telling me that I'm writing using a NULL pointer, so all I need to do is add some trace points on the suspected path and write a warning if I see a NULL pointer being used. That is simple to do but somewhat tedious as I might have to add a trace point for every pointer I use. But the suspect code path (resending a lost packet) is not that complicated so I started with this approach. Here is a sample of what I did:
void near TcpSocket::resendPacket( TcpBuffer *buf ) {
if ( buf == NULL ) {
TRACE_WARN(("Whoops: resendPacket tried to reference a NULL pointer."));
return;
}
TcpPacket_t* packetPtr = &buf->headers;
...
I kept running the code and recreating the problem, but I never got my warning message. So I kept adding trace points in my code until I eventually determined that this approach was not working and I would need to try something different.
My second attempt: Detect the corruption earlier
The compiler can detect the corruption, but it only runs the check when the program exits. To get closer to the problem I can do the same check and do it while the program is running, hopefully narrowing down when and where it happens.
To get started I first looked at the compiler source code to see exactly what it was doing, and I found what I needed in bld/clib/startup/a/cstrt086.asm. (I've slightly simplified and reformatted it here for clarity.)
This is where the 32 bytes of reserved storage are defined. (It is allocated as 16 words of 0x0101.)
assume ds:DGROUP
INIT_VAL equ 0101h
NUM_VAL equ 16
_NULL segment para public 'BEGDATA'
__nullarea label word
dw NUM_VAL dup(INIT_VAL)
public __nullarea
_NULL ends
Here is the error message that I was seeing:
;
; miscellaneous code-segment messages
;
NullAssign db '*** NULL assignment detected',0
And here is the code that checks the storage for changes at the end of the program:
__exit proc near
public "C",__exit
push ax
mov dx,DGROUP
mov ds,dx
cld ; check lower region for altered values
lea di,__nullarea ; set es:di for scan
mov es,dx
mov cx,NUM_VAL
mov ax,INIT_VAL
repe scasw
pop ax ; restore return code
je ok
;
; low memory has been altered
;
mov bx,ax ; get exit code
mov ax,offset NullAssign ; point to msg
mov dx,cs ; . . .
...
That code defines 32 bytes of 0x01 at the beginning of the data segment, and they can be addressed using the variable name "__nullarea". The bytes are present and initialized before the program starts. At the end of the program the __exit routine will be called and it will check to see that those 32 bytes are still 0x01. If they are not, you will get an error message.
I created a callable function in C that does the same thing:
extern "C" uint8_t _nullarea;
uint8_t *_nullareap = &_nullarea;
bool failed = false;
extern "C" void nullCheck( const char *loc ) {
// Only generate a trace message the first time it is detected.
if (failed == true) return;
bool good = true;
for ( int i=0; i < 32; i++ ) {
if ( _nullareap[i] != 0x01 ) {
good = false;
break;
}
}
if ( good == false ) {
TRACE_WARN(("Null check failed at %s\n", loc));
Utils::dumpBytes( Trace_Stream, _nullareap, 32 );
failed = true;
}
}
And then, I inserted a call to this code in various places in my Telnet code to try to narrow down where the problem was happening.
Eventually I got to the code that calls the packet driver to send a packet on the wire:
nullCheck("Packet_send_pkt sendattempt");
int86x( Packet_int, &inregs, &outregs, &segregs);
nullCheck("Packet_send_pkt after soft int");
The second call to nullCheck was tripping. So it was not a problem in my Telnet code; it was something in the packet driver, which is why my if-checks for NULL pointers never showed anything.
The trace showed me the following:
2026-04-17 16:41:13.76 Nullarea is at 318a:0000
. . .
2026-04-17 16:41:28.59 W Null check failed at Packet_send_pkt after soft int
Buffer address: 318a:0000
01 01 00 02 12 00 56 34 01 01 01 01 01 01 01 01 ......V4........
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ................
Interestingly, only six bytes were corrupted and the contents of the six bytes were the same in every trace that I looked at. Sometimes the six bytes would move around but that was probably due to the changes I was making to Telnet to add the code or the trace points.
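Spotting the altered bytes by eye is easy in a 32-byte dump, but a check like nullCheck could also report the span directly. Here is a minimal C++ sketch of that idea. `GuardDiff` and `findCorruption` are hypothetical names, not part of mTCP or the Open Watcom runtime, and the sketch assumes the guard area was filled with 0x01 the way __nullarea is.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: describes the altered region of a guard area.
struct GuardDiff {
    size_t offset;  // index of the first byte that no longer holds the fill value
    size_t length;  // first through last altered byte, inclusive
    bool altered;   // false if every byte still holds the fill value
};

// Scan a guard area that was initialized to `fill` (0x01 for __nullarea)
// and report the span between the first and last altered byte.
GuardDiff findCorruption(const uint8_t *area, size_t size, uint8_t fill = 0x01) {
    GuardDiff d = {0, 0, false};
    size_t first = size;  // "not found yet"
    size_t last = 0;
    for (size_t i = 0; i < size; i++) {
        if (area[i] != fill) {
            if (first == size) first = i;
            last = i;
        }
    }
    if (first < size) {
        d.offset = first;
        d.length = last - first + 1;
        d.altered = true;
    }
    return d;
}
```

Applied to the dump above it reports offset 2 and length 6. A stray write that happens to store 0x01 is invisible to this scan, just as it is to the runtime's own check.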
Knowing that this was only happening on the packet send path I looked for those six bytes in the trace and found this right at the very top:
2026-04-17 16:41:13.22 mTCP telnet Version: Apr 17 2026
2026-04-17 16:41:13.27 PACKETINT=0x60 MAC=00.02.12.00.56.34 MTU=1400
2026-04-17 16:41:13.27 IPADDR=192.168.2.122 NETMASK=255.255.255.255 GATEWAY=192.168.2.121
2026-04-17 16:41:13.33 Debug level: 0xff, DOS Version: 6.00
2026-04-17 16:41:13.33 Tcp: Allocated 1 sockets, MTU is 1400, My MSS is 1360
2026-04-17 16:41:13.33 NAMESERVER=192.168.2.1
2026-04-17 16:41:13.38 DOS Sleep calls enabled: int 0x28:1 int 0x2f,1680:0
So the six bytes of corruption in the _nullarea are the MAC address of the simulated Ethernet device that EtherSLIP is providing. This is a very powerful clue - I now knew to look in EtherSlip on the send path, specifically where it might be copying a MAC address.
Some quick notes on x86 programming
Before we pick apart the packet driver code and expose the bug, we should review some x86 architecture.
Classic x86 architecture has 16-bit registers and a segmented memory model that allows you to address up to 1MB of memory. A segment is a region of memory that starts on a 16-byte boundary (a paragraph); on a classic IBM PC there are 64K possible segments, each spaced 16 bytes apart. Segment values are stored in special registers called segment registers.
To address a single byte of memory you combine a segment register and a 16-bit offset to construct a pointer. The segment defines the start of the memory region (always at a paragraph boundary) and the offset lets you reach any byte in that region, up to the range of the offset. Since the offset is a 16-bit value, that lets you address up to 64KB. To go outside of that 64KB region you have to change the segment register to point at a different paragraph.
Sixteen-bit x86 has four segment registers, each shown here with the offset register it is commonly paired with:
| Segment register | Offset register |
|---|---|
| CS (Code segment) | IP (Instruction Pointer) |
| SS (Stack segment) | SP (Stack Pointer) |
| DS (Data Segment) | SI (Source Index) |
| ES (Extra Segment) | DI (Destination Index) |
In addition to these registers, there are four general purpose registers (AX, BX, CX, and DX), another stack offset register (BP), and a flags register.
A full pointer is usually written as the segment register and the offset register, such as "DS:SI" or "ES:DI". If the segment register is not specified, the instruction implies a default: usually DS for data accesses, and SS for stack accesses through SP or BP.
Here is a more concrete example of how this mechanism works. Assume we want to look for the second parallel port I/O address in the BIOS data area. Documentation tells us that is located at address 0040:000A, which is in segment and offset notation. We can convert this to a "flat" address by multiplying the segment by 0x10 (as it points to a 16 byte paragraph) and then adding the offset value. 0x0040 * 0x10 + 0x000A = 0x0040A, which is the address when looking at it as a flat 1MB address space.
When writing code to read this address we would set DS to be 0x0040 and SI to 0x000A. We could then use an instruction to read that memory and move the value into a register. While having to combine two registers to address memory is cumbersome, it does allow us to address more than 64KB when using 16 bit registers.
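The conversion from segment:offset to a flat address is simple enough to write as a function. This is a C++ sketch for illustration only (`linearAddress` is a hypothetical name). Masking the result to 20 bits models the 8088, where an address past 1MB wraps back around to zero; a 286 or later with the A20 line enabled would not wrap.

```cpp
#include <cstdint>

// Convert a real-mode segment:offset pair to a 20-bit linear address.
// The segment names a 16-byte paragraph, so it is multiplied by 0x10 (shift by 4).
// The 0xFFFFF mask models the 8088's wraparound at 1MB.
uint32_t linearAddress(uint16_t segment, uint16_t offset) {
    return ((static_cast<uint32_t>(segment) << 4) + offset) & 0xFFFFFu;
}
```

For the example above, `linearAddress(0x0040, 0x000A)` gives 0x0040A, the LPT2 entry in the BIOS data area.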
A new segment starts at every paragraph in memory and an offset can reach up to 64KB past the start of its segment, so two different segment:offset pairs can point at the same physical byte of memory. For example, 003F:0032 points at the same byte of memory as 0040:0022 and 0041:0012:
(0x003F * 0x10) + 0x0032 = 0x00422
(0x0040 * 0x10) + 0x0022 = 0x00422
(0x0041 * 0x10) + 0x0012 = 0x00422
That aliasing of memory addresses means you have to be careful when comparing pointers; you can have two pointers that look to be different, but effectively point at the same byte of memory. High level languages like C hide most of this complexity from you, but in assembly language we have to deal with it directly.
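One way to compare far pointers safely is to normalize them first: convert to a linear address, then rebuild a segment:offset pair whose offset is between 0 and 15. Every address below 1MB has exactly one such form, so normalized pointers can be compared field by field. A C++ sketch with hypothetical names (`FarPtr`, `normalize`, `sameByte`), assuming the pointers do not wrap past 1MB:

```cpp
#include <cstdint>

// A real-mode far pointer, split into its two 16-bit halves.
struct FarPtr {
    uint16_t seg;
    uint16_t off;
};

// Rewrite a pointer so its offset is 0..15. Assumes seg*16 + off < 0x100000;
// a pointer that wraps past 1MB would not normalize correctly here.
FarPtr normalize(FarPtr p) {
    uint32_t linear = (static_cast<uint32_t>(p.seg) << 4) + p.off;
    FarPtr n;
    n.seg = static_cast<uint16_t>(linear >> 4);   // whole paragraphs
    n.off = static_cast<uint16_t>(linear & 0xF);  // leftover bytes
    return n;
}

// True if both pointers refer to the same physical byte.
bool sameByte(FarPtr a, FarPtr b) {
    FarPtr na = normalize(a);
    FarPtr nb = normalize(b);
    return na.seg == nb.seg && na.off == nb.off;
}
```

All three pointers in the example above normalize to 0042:0002.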
And now a look at an ARP packet
We're going to be talking about ARP so we should review the ARP protocol format quickly.
ARP is used to find the hardware address of a machine on the same network segment. Given an IP address, a machine can broadcast a query to the entire local network and ask for the hardware address (the Ethernet MAC) of that machine. If there is a machine on the local network with that IP address, it will respond back directly to the requesting machine.
Here is a diagram of the ARP packet format, including where it sits in an Ethernet frame:
(Source: https://homepages.uc.edu/~thomam/Net1/Packet_Formats/arp.html)
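The same layout can also be written down as a C++ struct. This sketch assumes IPv4 over Ethernet (6-byte hardware addresses, 4-byte protocol addresses, in RFC 826's field order); the struct and field names are illustrative, not taken from mTCP or EtherSLIP. Every field is a byte array, so the compiler has no reason to insert padding, and multi-byte values are stored big-endian as on the wire. The 42-byte frame is shorter than Ethernet's 60-byte minimum (not counting the CRC), which is why the ARP packets in the traces below are padded to 60 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Ethernet frame carrying an ARP packet for IPv4 (RFC 826 field order).
struct EthArpFrame {
    // Ethernet header: 14 bytes
    uint8_t ethDst[6];   // destination MAC
    uint8_t ethSrc[6];   // source MAC
    uint8_t ethType[2];  // 0x0806 = ARP
    // ARP packet: 28 bytes
    uint8_t htype[2];    // hardware type, 1 = Ethernet
    uint8_t ptype[2];    // protocol type, 0x0800 = IPv4
    uint8_t hlen;        // hardware address length, 6
    uint8_t plen;        // protocol address length, 4
    uint8_t oper[2];     // 1 = request, 2 = reply
    uint8_t sha[6];      // sender hardware (MAC) address
    uint8_t spa[4];      // sender protocol (IP) address
    uint8_t tha[6];      // target hardware address
    uint8_t tpa[4];      // target protocol address
};

static_assert(sizeof(EthArpFrame) == 42, "14-byte Ethernet header + 28-byte ARP packet");
static_assert(offsetof(EthArpFrame, htype) == 14, "ARP packet follows the Ethernet header");
static_assert(offsetof(EthArpFrame, sha) == 22, "sender MAC");
static_assert(offsetof(EthArpFrame, tpa) == 38, "target IP");
```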
On a SLIP connection ARP is not needed: SLIP is a point-to-point protocol, so there is only ever one machine to talk to. That machine is also logically serving as a gateway, so all traffic has to go through it. But EtherSlip is emulating an Ethernet adapter so it has to do things that are normal and expected on Ethernet, including handling ARP requests.
ARP handling inside the packet driver
The EtherSLIP packet driver is open source software and it is distributed as part of the Crynwr packet driver collection. One of the reasons to love open-source software is that decades later you can look at it and modify it for what you need, or find the bug that you think it might have.
Even with source code available, this was not a pleasant task. The changes to the SLIP driver that added Ethernet emulation were made in 1991 and the comments were sparse. The code is in x86 assembly language and it jumps around quite a bit.
Based on my debugging and traces I started my search for the bug in the code that lets a program send packets to the network. At a high level, the send_pkt routine in the packet driver takes a pointer to a packet to be sent and a length, formats the data for the SLIP protocol, and sends it down the wire. SLIP has its own packet framing requirements so it is not sending the packets verbatim. The packets being sent include an Ethernet header which needs to be removed, as that makes no sense for a SLIP connection.
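SLIP framing is defined in RFC 1055: an END byte (0xC0) marks the end of each packet, and any END or ESC (0xDB) byte inside the data is replaced by a two-byte escape sequence. Here is a C++ sketch of the send side under those rules, including dropping the 14-byte Ethernet header first. It is not EtherSLIP's code and `slipEncodeFromEthernet` is a hypothetical name; sending a leading END to flush line noise follows the RFC's sample code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// SLIP special characters from RFC 1055.
const uint8_t SLIP_END     = 0xC0;  // packet delimiter
const uint8_t SLIP_ESC     = 0xDB;  // escape
const uint8_t SLIP_ESC_END = 0xDC;  // ESC ESC_END stands for a data byte 0xC0
const uint8_t SLIP_ESC_ESC = 0xDD;  // ESC ESC_ESC stands for a data byte 0xDB

const size_t ETH_HEADER_LEN = 14;   // dst MAC (6) + src MAC (6) + EtherType (2)

// Encode an Ethernet frame for a SLIP line: drop the Ethernet header,
// escape the IP datagram, and wrap it in END bytes.
std::vector<uint8_t> slipEncodeFromEthernet(const uint8_t *frame, size_t len) {
    std::vector<uint8_t> out;
    out.push_back(SLIP_END);  // flush any line noise at the receiver
    for (size_t i = ETH_HEADER_LEN; i < len; i++) {
        uint8_t c = frame[i];
        if (c == SLIP_END) {
            out.push_back(SLIP_ESC);
            out.push_back(SLIP_ESC_END);
        } else if (c == SLIP_ESC) {
            out.push_back(SLIP_ESC);
            out.push_back(SLIP_ESC_ESC);
        } else {
            out.push_back(c);
        }
    }
    out.push_back(SLIP_END);  // end of packet
    return out;
}
```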
ARP packets get special handling in this code. If the packet driver code sees an ARP request from the application it doesn't try to send it on the wire. Instead it generates a simulated ARP response just like you were on Ethernet, satisfying the calling application. (That is a feature of EtherSlip as it is emulating an Ethernet packet driver, not a part of SLIP.)
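To make that concrete, here is what a well-formed simulated reply contains, as a C++ sketch. It assumes the 42-byte IPv4-over-Ethernet ARP layout; `buildArpReply` and its `peerMac` parameter (the MAC address the driver claims for the other end of the link) are hypothetical, and this is not a translation of EtherSLIP's assembly. It copies the request first and then overwrites fields, naming the source and destination buffer explicitly for every copy.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte offsets within an Ethernet frame carrying an IPv4 ARP packet.
const size_t ETH_DST = 0, ETH_SRC = 6;
const size_t ARP_OPER = 20, ARP_SHA = 22, ARP_SPA = 28, ARP_THA = 32, ARP_TPA = 38;
const size_t ARP_FRAME_LEN = 42;

// Build a simulated ARP reply in `reply` from the ARP request in `request`.
// Both buffers must hold at least ARP_FRAME_LEN bytes and must not overlap.
void buildArpReply(const uint8_t *request, uint8_t *reply, const uint8_t peerMac[6]) {
    std::memcpy(reply, request, ARP_FRAME_LEN);         // start from a copy of the request

    std::memcpy(reply + ETH_DST, request + ETH_SRC, 6);  // send it back to the requester
    std::memcpy(reply + ETH_SRC, peerMac, 6);            // as if it came from the peer

    reply[ARP_OPER] = 0x00;                              // opcode 2 = reply (big-endian)
    reply[ARP_OPER + 1] = 0x02;

    std::memcpy(reply + ARP_SHA, peerMac, 6);            // sender MAC: the answer
    std::memcpy(reply + ARP_SPA, request + ARP_TPA, 4);  // sender IP: the address queried
    std::memcpy(reply + ARP_THA, request + ARP_SHA, 6);  // target MAC: the requester
    std::memcpy(reply + ARP_TPA, request + ARP_SPA, 4);  // target IP: the requester
}
```

Given the first 42 bytes of the ARP request shown in the trace below and a `peerMac` of 00:00:00:22:34:66, the ARP portion (bytes 14 through 41) of the result matches the ARP portion of the simulated reply in the later trace.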
mTCP is somewhat SLIP aware and will generally not send ARP packets when it is on a SLIP connection. It does this by stuffing a fake ARP entry for the other end of the SLIP connection into the ARP cache. This works because the other end of the SLIP connection is also the gateway and all packets go to the gateway, so if the gateway hardware address is known there is never a need to send an ARP request. However, mTCP was sending two different ARP requests even though it was on a SLIP connection:
- At program startup the program always ARPs its own assigned address to see if another machine responds. If that were to happen it would result in an address collision, and a warning message would be generated.
- During a retry mTCP will ARP the next hop address before resending the packet.
The ARP request at program startup does not cause problems, as EtherSlip does not generate a simulated ARP response packet when it sees that specific condition. EtherSlip will generate simulated ARP responses for the second case. mTCP could be made more explicitly aware of SLIP connections and never send ARP packets under any condition when it knows SLIP is in use - I will make that improvement. However, everything should work as-is given that EtherSlip simulates ARP responses.
So what went wrong?
Now we get to the good part ...
There was a TCP packet dropped and that packet needed to be resent. Before resending a TCP packet mTCP will send an ARP request to the next hop address, making sure everything is up to date. The packet driver detected the ARP request and went down the path of simulating an ARP response. First it asked mTCP for a receive buffer for the simulated ARP response it was going to create. Then it started creating the simulated ARP response in that buffer.
Here is the relevant code with some extra comments to annotate it:
; Brutman: On entry to this section of code ...
; ds:si is the packet to be sent, now back to pointing at the Ethernet header.
; bx has a copy of si
; cx is the packet length, including the Ethernet header.
; es:di is the user buffer where we will copy the fake ARP response to.
; dx has a copy of di
;
; Set up ARP Reply by first copying the ARP Request packet.
rep movsb
;
; Skip Ethernet header
add bx,14
add dx,14
;
; Swap target and source protocol addresses from ARP request to ARP
; reply packet.
push es ; mods by Joe Doupnik
mov si,ds
mov es,si
mov si,bx ; incoming packet interior
sub si,2+6 ; walk back to originator's Ethernet address
mov cx,6 ; six bytes of Ethernet address
mov di,dx ; outgoing packet
sub di,2+6+6 ; Ethernet destination address
rep movsb ; copy originator's address as new dest
pop es ;
So on entry to this code DS:SI points at the ARP request from the program and ES:DI points at the receive buffer for the simulated ARP response. The actual values for all of the registers are in the trace:
The outgoing ARP request buffer is at 318A:3CB4
2026-04-17 16:41:26.51 Packet: Sending 60 bytes, dumping 60
Buffer address: 318a:3cb4
FF FF FF FF FF FF 00 02 12 00 56 34 08 06 00 01 ..........V4....
08 00 06 04 00 01 00 02 12 00 56 34 C0 A8 02 7A ..........V4...z
FF FF FF FF FF FF C0 A8 02 79 6D 54 43 50 20 62 .........ymTCP b
79 20 4D 20 42 72 75 74 6D 61 6E 00 y M Brutman.
The receive buffer address given to the packet driver for the simulated ARP response is at 489C:0002. (This appears in the trace when the simulated ARP response is given to Telnet.) Note the very different segment registers and offsets; these buffers are not close to each other in memory.
; Set up ARP Reply by first copying the ARP Request packet.
rep movsb
The REP MOVSB instruction copies the request into the response byte-for-byte, starting at the Ethernet header. It uses DS:SI as the source and ES:DI as the target pointers.
; Skip Ethernet header
add bx,14 ; These numbers are decimal
add dx,14
After the request is copied into the response buffer the code will move some fields around and fix other fields to make it look like an ARP response. It starts by advancing BX and DX past the Ethernet headers in the ARP request and response respectively. (BX and DX here are being used as offsets into segments, just as SI and DI are.) That is done with the two ADD instructions. So now we have the following in the registers:
Source buffer: DS=318A, SI=3CF0
Target buffer: ES=489C, DI=003E
BX (offset into the source buffer) = 3CC2
DX (offset into the target buffer) = 0010
SI and DI have both advanced 0x3C bytes; this is a side effect of the REP MOVSB instruction. But that is ok as we can compute their original values by subtracting 0x3C or using their copies in BX and DX, which have been modified to point past the Ethernet header.
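REP MOVSB with the direction flag clear behaves like the loop below: copy the byte at DS:SI to ES:DI, increment SI and DI, and decrement CX until it reaches zero. Here is a C++ model over a simulated 1MB address space, using hypothetical names `Regs` and `repMovsb`; it ignores the direction flag and interrupts.

```cpp
#include <cstdint>
#include <vector>

// The subset of 8086 registers that REP MOVSB uses.
struct Regs {
    uint16_t ds, si, es, di, cx;
};

// Model of REP MOVSB with the direction flag clear. `mem` must hold 1MB.
// SI and DI wrap within 16 bits, and afterwards point just past the data.
void repMovsb(Regs &r, std::vector<uint8_t> &mem) {
    while (r.cx != 0) {
        uint32_t src = ((uint32_t(r.ds) << 4) + r.si) & 0xFFFFF;  // DS:SI
        uint32_t dst = ((uint32_t(r.es) << 4) + r.di) & 0xFFFFF;  // ES:DI
        mem[dst] = mem[src];
        r.si++;
        r.di++;
        r.cx--;
    }
}
```

Starting from DS:SI = 318A:3CB4, ES:DI = 489C:0002 and CX = 0x3C (the 60-byte request), the model ends with SI = 3CF0 and DI = 003E, matching the registers above.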
Next we have this sequence of instructions:
push es ; mods by Joe Doupnik
mov si,ds
mov es,si
That saves ES onto the stack so it can be restored later and copies DS into ES, using SI as an intermediate storage location. (x86 doesn't let you copy a segment register directly into another segment register.)
mov si,bx ; incoming packet interior
sub si,2+6 ; walk back to originator's Ethernet address
These two instructions then move BX (currently pointing at the start of the ARP request header) into SI, then move SI backward 8 bytes so now SI is an offset to the source MAC address of the Ethernet frame that holds the ARP request.
mov cx,6 ; six bytes of Ethernet address
mov di,dx ; outgoing packet
sub di,2+6+6 ; Ethernet destination address
These instructions do something similar to the simulated ARP response buffer. They take DX (currently pointing at the simulated ARP response header), move it into DI, and then subtract 14 from DI making it point to the beginning of the Ethernet frame for the simulated ARP response. CX is also set to 6, setting up the next instruction to copy six bytes.
Here are the registers at the time of the copy:
Source buffer: DS=318A, SI=3CBA
Target buffer: ES=318A, DI=0002
BX (offset into the source buffer) = 3CC2
DX (offset into the target buffer) = 0010
And here comes the copy instruction:
rep movsb ; copy originator's address as new dest
Hopefully you see the mistake: DS was copied into ES before the copy. The source pointer (DS:SI) is correct and points into the ARP request, but the destination pointer (ES:DI) is bogus. DI holds an offset that only makes sense within the response buffer's segment (489C), yet ES now holds the request buffer's segment (318A). By accident, 318A:0002 is two bytes into the start of the data segment, right where the compiler runtime is looking for writes through a NULL pointer. This wasn't a write using a NULL pointer, but the effect is the same: data corruption where it doesn't belong.
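The faulty copy can be replayed with the register values from the trace. In this C++ sketch (hypothetical `lin` and `copyOriginatorMac`, over a simulated 1MB address space), the `buggy` flag reproduces the `mov es,si` sequence. With it set, the six MAC bytes land at 318A:0002 through 318A:0007, inside the 32-byte guard area that starts at 318A:0000; with it cleared, they land at the start of the reply buffer at 489C:0002. It models only this one copy, not the rest of the driver.

```cpp
#include <cstdint>
#include <vector>

// Linear address of a real-mode segment:offset pair (8088 wraps at 1MB).
uint32_t lin(uint16_t seg, uint16_t off) {
    return ((uint32_t(seg) << 4) + off) & 0xFFFFF;
}

// Replay "copy originator's address as new dest". DS/BX describe the ARP
// request and ES/DX the reply buffer, with BX and DX already advanced past
// the 14-byte Ethernet header. `mem` must hold 1MB.
void copyOriginatorMac(std::vector<uint8_t> &mem, uint16_t ds, uint16_t bx,
                       uint16_t es, uint16_t dx, bool buggy) {
    if (buggy) es = ds;                                     // mov si,ds / mov es,si
    uint16_t si = static_cast<uint16_t>(bx - (2 + 6));      // request's Ethernet source MAC
    uint16_t di = static_cast<uint16_t>(dx - (2 + 6 + 6));  // reply's Ethernet destination
    for (uint16_t cx = 6; cx != 0; cx--) {                  // rep movsb with CX = 6
        mem[lin(es, di)] = mem[lin(ds, si)];
        si++;
        di++;
    }
}
```

With `buggy` set, the guard area ends up holding 01 01 00 02 12 00 56 34 01 ..., the same pattern as the NULL check dump earlier.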
If we look at the simulated ARP response we can see that the source MAC address from the ARP request was not copied into the simulated ARP response:
2026-04-17 16:41:26.89 Packet: Received 60 bytes, dumping 60
Buffer address: 489c:0002
FF FF FF FF FF FF 00 02 12 00 56 34 08 06 00 01 ..........V4....
08 00 06 04 00 02 00 00 00 22 34 66 C0 A8 02 79 ........."4f...y
00 02 12 00 56 34 C0 A8 02 7A 6D 54 43 50 20 62 ....V4...zmTCP b
79 20 4D 20 42 72 75 74 6D 61 6E 00 y M Brutman.
The first six bytes (the Ethernet destination address, still FF FF FF FF FF FF from the broadcast ARP request) are where that REP MOVSB should have copied data to. Everything else is manipulated to look like an ARP response, as expected.
This is just a bug - you can't mix and match segments and offsets like this and expect to get a good pointer. Depending on the compiler and how the calling program is written, that bug could have corrupted something more critical, possibly causing a crash or data corruption. It was just lucky that in this instance the errant MAC address landed in a place that was being monitored by the compiler runtime.
Fixing the bug
Before fixing the bug it is important to look at it in context and see what might have caused the bug. The rest of the code continues to restore ES to the correct segment value, and then swaps fields around using similar offset manipulations and REP MOVSB instructions.
There are at least two possible good fixes:
- Remove the incorrect move of DS into ES, and continue using the ARP request buffer as the source of data.
- Since you have already copied all of the ARP request into the simulated ARP response, all of the changes could have been made with DS:SI and ES:DI pointing into the simulated ARP response packet.
I chose the path of least resistance and removed the incorrect move of DS into ES. If one wanted to go further one could optimize the code by not copying the entire ARP request packet into the simulated ARP response, as a lot of that is wasted effort. Then again, this code should not run often so it might not be worth the effort.
How did this bug hide for so long? There are a few factors contributing to its age:
- If your program uses the small memory model then ES and DS probably have the same value. When that happens, the bug is masked.
- If the program is not checking the Ethernet header of the ARP response it would not have detected that it was set incorrectly. I doubt that any TCP/IP programs check the Ethernet header in this case, as what they need is actually in the ARP response itself.
- The impact the bug has depends entirely on where the badly constructed target pointer lands in memory. If the bug hits a non-critical area of memory nobody would notice anything wrong.
- Open Watcom has the detection code for NULL pointers. A different compiler might not have that feature.
- DOS machines often crash for all sorts of reasons. If EtherSlip was causing problems people may have just been shrugging their shoulders and moving on.
The fixed source code with an executable can be found at EtherSlip_v11.8.zip. The original source code and the full Crynwr packet driver collection can be found at Crynwr.com.
The lesson? Take every warning seriously, and get to the root cause. If enough bugs like this creep into a system you wind up with unexplained behavior and a lot of people shrugging their shoulders saying "It's DOS ...".
To Russell, Denis, Phil, Joe and the other contributors to EtherSlip - thanks for the code, and we're still happily using it 34+ years later!
Created April 19th, 2026
(C)opyright Michael Brutman, mbbrutman at gmail dot com