iSCSI is a protocol from the era when “the network” meant a rack-scale fibre channel replacement. Initiators and targets trusted each other, CHAP was optional theatre, and a packet from an initiator carried the implicit assumption “we’re on the same L2 segment.”
scsipub serves iSCSI targets to arbitrary clients on the public internet. That’s a different set of assumptions. This post is the decision log — the small choices that add up to “this works and doesn’t break from day one.”
It started as the missing dependency for two adjacent projects of mine — a Raspberry Pi netboot shim and an ESP32-based USB-mass-storage bridge — both of which needed an iSCSI target out on the open internet to point demos at, and there wasn’t one. Building the target turned out to be the biggest of the three problems.
The listener
Both ports get a Ranch 2.x listener — plain TCP on 3260, TLS on 3261.
Scsipub.Target.Listener returns a pair of child specs that the
application supervisor adds at boot:
def child_specs(opts) do
  certfile = opts[:tls_certfile]
  keyfile = opts[:tls_keyfile]
  tcp_spec = tcp_child_spec(opts[:port] || 3260, protocol_opts)

  if certfile && keyfile && File.exists?(certfile) && File.exists?(keyfile) do
    tls_spec = tls_child_spec(opts[:tls_port] || 3261, certfile, keyfile, protocol_opts)
    [tcp_spec, tls_spec]
  else
    [tcp_spec]
  end
end
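Under the hood, tcp_child_spec/2 is presumably a thin wrapper over :ranch.child_spec/5. A sketch of that shape (the listener ref and transport options here are illustrative, not the project's actual values):

defp tcp_child_spec(port, protocol_opts) do
  :ranch.child_spec(
    {:iscsi_tcp, port},            # listener ref: any unique term
    :ranch_tcp,                    # transport; the TLS spec swaps in :ranch_ssl
    %{socket_opts: [port: port]},  # Ranch 2.x transport options
    Scsipub.Target.Session,        # the :ranch_protocol callback module
    protocol_opts
  )
end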
Ranch runs a small acceptor pool in front of a :ranch_protocol
callback. When a connection arrives, Ranch spawns a fresh BEAM
process and hands it the socket. For iSCSI that’s the unit we
want: one process per TCP connection, one TCP connection per
initiator session, one initiator session per user-visible
mountable disk.
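The handoff has one wrinkle worth showing: a GenServer can't call :ranch.handshake/1 from init/1, because Ranch is still blocked waiting for start_link/3 to return. The usual shape defers the handshake (a sketch, not the real Session's plumbing):

def start_link(ref, transport, opts) do
  GenServer.start_link(__MODULE__, {ref, transport, opts})
end

@impl true
def init(arg), do: {:ok, arg, {:continue, :handshake}}

@impl true
def handle_continue(:handshake, {ref, transport, opts}) do
  # Take ownership of the accepted socket from Ranch, then wait for PDUs.
  {:ok, socket} = :ranch.handshake(ref)
  :ok = transport.setopts(socket, active: :once)
  {:noreply,
   %{socket: socket, transport: transport, opts: opts, phase: :security_negotiation}}
end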
“One BEAM process per connection” only works because processes here aren’t OS threads. A BEAM process is ~2.5 KB of initial heap and some bookkeeping — the scheduler happily runs tens of thousands of them on a single core. iSCSI sessions sit idle waiting for SCSI PDUs most of the time, which is the ideal shape for green threads: cheap to park, cheap to wake.
Contrast with the C implementations: target_core_iblock and friends carry a thread pool and a queue, and tuning the pool size is an ongoing concern. We don’t tune anything and the BEAM happily handled 446 req/s in our web-side load test before latency started climbing — and that’s the Phoenix surface with its DB hops, not the iSCSI listener, which has smaller payloads and no SQL in the hot path at all.
One process per session
The protocol module is Scsipub.Target.Session, a plain
GenServer. Its state machine walks through three phases:
phase: :security_negotiation # csg=0, CHAP challenge/response
phase: :operational # csg=1, negotiate parameters
phase: :full_feature # csg=1 transit done, handling SCSI PDUs
Each PDU comes in on the socket, gets parsed into a struct, and routed to a handler. If a handler raises — malformed PDU, unexpected state transition, disk error — the process dies. That’s on purpose. The supervisor doesn’t restart it, because there’s no meaningful recovery; the initiator will notice the TCP close and try to log in again. State doesn’t leak between sessions because state doesn’t leave the process.
This is the standard Erlang story (“let it crash”), but it’s more than a platitude for iSCSI. The real-world alternative — carefully defending every parser branch against every attacker-shaped PDU — is how RFC 7143’s more colourful edge cases turn into CVEs in other implementations. We don’t defend; we fence. One bad PDU kills one session.
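The dispatch shape, roughly (a sketch; parse_pdu!/1 and handle_pdu/3 are illustrative names, and the TLS listener delivers {:ssl, socket, bytes} rather than {:tcp, ...}):

def handle_info({:tcp, socket, bytes}, %{phase: phase} = state) do
  pdu = parse_pdu!(bytes)                # raises on a malformed PDU
  state = handle_pdu(phase, pdu, state)  # raises on an illegal state transition
  :ok = state.transport.setopts(socket, active: :once)
  {:noreply, state}
end
# one handle_pdu/3 clause per phase, and deliberately no catch-all clause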
The Registry (Scsipub.Sessions.Registry, ETS-backed) is how
a session announces itself once it reaches Full Feature Phase:
Registry.set_pid(iqn, self())
The Registry monitors the pid and auto-cleans the entry on
:DOWN. The admin dashboard reads from the same ETS table to
show live connections.
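The monitor-and-clean part is a few lines in the Registry process. Roughly (table name and internals are illustrative):

# inside the Registry GenServer, which owns the ETS table
def handle_call({:set_pid, iqn, pid}, _from, monitors) do
  ref = Process.monitor(pid)
  :ets.insert(:scsipub_sessions, {iqn, pid})
  {:reply, :ok, Map.put(monitors, ref, iqn)}
end

def handle_info({:DOWN, ref, :process, _pid, _reason}, monitors) do
  {iqn, monitors} = Map.pop(monitors, ref)
  if iqn, do: :ets.delete(:scsipub_sessions, iqn)
  {:noreply, monitors}
end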
COW overlays
The base image is a regular file — .img, .iso, or .qcow2
converted to raw on fetch. It’s read-only. Every concurrent
session gets its own overlay file, sparse-allocated to the
same size as the base:
/var/lib/scsipub/overlays/
71a61232479cc467.img ← overlay, sparse
71a61232479cc467.img.bitmap ← 1 bit per sector
The bitmap tracks which 512-byte sectors have been written. Reads check the bit: if set, the overlay has the sector; if clear, fall through to the base image. Writes set the bit and write to the overlay.
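In code the split is small. A sketch under the same assumptions (512-byte sectors, one bit per sector, pread/pwrite on two file descriptors; field and helper names are illustrative):

defp read_sector(cow, lba) do
  fd = if bit_set?(cow.bitmap, lba), do: cow.overlay_fd, else: cow.base_fd
  {:ok, data} = :file.pread(fd, lba * 512, 512)
  data
end

defp write_sector(cow, lba, <<data::binary-size(512)>>) do
  :ok = :file.pwrite(cow.overlay_fd, lba * 512, data)
  set_bit(cow.bitmap, lba)
end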
The layout means:
- The base image is never touched. CI verifies this — we SHA-256 the base before and after an integration run.
- The overlay file is sparse. A session that only writes the MBR costs ~512 bytes on disk, not “the full virtual size of the disk.” Filesystem holes do the work.
- Disconnecting is cheap. Non-persistent tiers delete the overlay on the TCP close; persistent tiers keep it until the session’s TTL elapses or the user destroys it explicitly.
- Writes are counted. Each overlay write bumps a counter against write_limit from the user’s tier config. Hit the limit and the target responds WRITE_PROTECT until the session ends.
The Janitor, a GenServer on a 10-minute tick, sweeps the overlay directory and deletes files that don’t match any live session in the database. That’s how we clean up from the rare case where a process dies before its terminate callback runs.
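The sweep itself is about a dozen lines. Roughly (live_session_ids/0 and session_id_from_filename/1 are hypothetical helpers; the first is the DB query):

@overlay_dir "/var/lib/scsipub/overlays"

def handle_info(:sweep, state) do
  live = MapSet.new(live_session_ids())

  for name <- File.ls!(@overlay_dir),
      not MapSet.member?(live, session_id_from_filename(name)) do
    File.rm(Path.join(@overlay_dir, name))
  end

  Process.send_after(self(), :sweep, :timer.minutes(10))
  {:noreply, state}
end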
Caddy in front, TLS everywhere
Caddy terminates HTTPS on port 443 and reverse-proxies to the
Phoenix app on port 4000. The same Let’s Encrypt certificate
also protects the iSCSI-TLS listener on port 3261 — which is
the interesting part, because the iSCSI listener isn’t behind
Caddy. It binds :ranch_ssl directly.
Caddy writes the ACME-obtained cert to its internal storage
(/var/lib/caddy/.local/share/caddy/...), which the app user
can’t read. The bridge is a tiny systemd service running
inotifywait against that directory and copying the cert into
/var/lib/scsipub/tls/ — owned by a shared group both users
can read — whenever the bytes change.
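The whole bridge is on the order of this (a sketch; the group name and event list are illustrative):

#!/bin/sh
# Copy any changed cert/key out of Caddy's storage into the shared-group dir.
inotifywait -m -r -e close_write,moved_to /var/lib/caddy/.local/share/caddy |
while read -r dir _events file; do
  case "$file" in
    *.crt|*.key)
      install -m 0640 -g scsipub-tls "$dir$file" /var/lib/scsipub/tls/ ;;
  esac
done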
The iSCSI listener picks up rotations without a restart because
its sni_fun re-reads the PEM on every TLS handshake, with
guardrails:
# lib/scsipub/target/tls_certs.ex
def sni_opts(certfile, keyfile) do
  now = System.monotonic_time(:second)

  case :persistent_term.get(cache_key, nil) do
    {_cert_mtime, _key_mtime, loaded_at, opts}
    when now - loaded_at < @min_reload_interval ->
      opts                     # 60s cooldown — serve cache unconditionally

    {cert_mtime, key_mtime, _loaded_at, opts} ->
      if stat_unchanged?(certfile, keyfile, cert_mtime, key_mtime) do
        opts                   # mtime unchanged — still fresh
      else
        reload_and_cache(...)  # rotation happened — re-read PEM
      end

    nil ->
      reload_and_cache(...)    # cold cache — first load
  end
end
Two guards, in order: a 60-second cooldown that serves the
cached opts without any syscall (absorbs a thundering-herd
handshake burst), and an mtime check after the cooldown that
only pays for a fresh PEM read when the files have actually
changed. Both matter — sni_fun is on the hot path for every TLS handshake: without the cooldown every mount pays the two stat syscalls, and without the mtime check the PEM gets re-read even though a rotation only happens every few months.
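stat_unchanged?/4 is nothing more than two File.stat calls. One plausible shape (a sketch; only the mtimes are compared):

defp stat_unchanged?(certfile, keyfile, cert_mtime, key_mtime) do
  with {:ok, %File.Stat{mtime: ^cert_mtime}} <- File.stat(certfile),
       {:ok, %File.Stat{mtime: ^key_mtime}} <- File.stat(keyfile) do
    true
  else
    _ -> false
  end
end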
Things open-iscsi cares about
If you’re building against the open-iscsi initiator that ships in every
Linux distro, the protocol is less “what’s on the wire” and more
“what iscsiadm does with what’s on the wire.” Three concrete
examples that each cost us a day.
/ in the IQN type-name separator
Our first cut of anonymous target names was iqn.2025-01.pub.scsipub:image/ubuntu.
That parses fine as an IQN. iscsiadm even does discovery against it
happily. What it can’t do is log in:
iscsiadm: Could not make /etc/iscsi/nodes/iqn.2025-01.pub.scsipub:image/ubuntu
open-iscsi stores its persistent state in /etc/iscsi/nodes/<iqn>/...
— it uses the IQN verbatim as a filesystem path. Any / in the name
becomes a subdirectory boundary, and the create-if-missing path walk
fails. We switched to . as the type/name separator
(iqn.2025-01.pub.scsipub:image.ubuntu), which parses the same way and
sidesteps the whole problem.
SendTargets has to advertise an address the client can reach
When an initiator does discovery, the target replies with a list of
TargetName + TargetAddress records. The initiator saves that
address as the portal for future logins — even if the discovery
request itself went through a different IP.
In our CI, the target runs inside a CI container and the initiator
inside a QEMU VM. QEMU’s user-mode networking presents the host as 10.0.2.2 from the VM’s perspective. If we let the server advertise whatever
sockname() returns — 127.0.0.1:3260 — iscsiadm dutifully saves
that as the portal, and every subsequent login attempt tries to
reach the runner’s loopback from inside the VM and fails forever.
# lib/scsipub/target/session.ex
defp advertise_address(socket, transport) do
  case Application.get_env(:scsipub, :public_host) do
    host when is_binary(host) -> "#{host}:#{port(socket, transport)}"
    _ -> sockname_string(socket, transport)
  end
end
Pin :public_host (we ship this as PHX_HOST in deploy env) and
SendTargets returns something the client can actually get back to.
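The fallback branch is just the socket's own view of itself. A sketch of sockname_string/2 (Ranch transports, both :ranch_tcp and :ranch_ssl, expose sockname/1):

defp sockname_string(socket, transport) do
  {:ok, {ip, port}} = transport.sockname(socket)
  "#{:inet.ntoa(ip)}:#{port}"
end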
The -o new dance for static logins
Once you’ve been bitten by the SendTargets-saves-the-portal behaviour
enough times, you learn to skip discovery for anything that needs a
non-default portal. For example: iSCSI-over-TLS via stunnel. The
natural flow would be “discover via the tunnel, then log in.” But the
discovery response names the server’s public portal, not
127.0.0.1:3260 where stunnel is terminating, so iscsiadm saves
the wrong portal and logs in plain instead of through the tunnel.
The fix is static login:
IQN=iqn.2025-01.pub.scsipub:blank
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 -o new
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 \
-o update -n node.session.auth.authmethod -v None
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 --login
-o new creates a fresh node record at the portal you specify instead
of using whatever the discovery step saved. Our landing page renders
exactly that command sequence for the TLS path, because the alternative
is an infuriating 30 minutes with iscsiadm --debug=6.
Bonus: stale records retry forever
Once a node record exists under /etc/iscsi/nodes/, iscsid retries
the login indefinitely if the session drops. If the target has been
destroyed server-side, that manifests as a steady 1-every-3-second
stream of “unknown target” login attempts in our server logs. The
cure is on the client:
iscsiadm -m node -T <iqn> -o delete
On the server we throttle the log line (once per (ip, target) per 5
minutes at warning level, debug after that) so a stale initiator
doesn’t bury real warnings under 28,800 lines of the same complaint
per day. See Scsipub.Target.Session.log_unknown_target/2.
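The throttle itself is small. Something like this, assuming a public ETS table created at application start and require Logger in the module (a sketch, not the real log_unknown_target/2):

defp log_unknown_target(ip, iqn) do
  key = {ip, iqn}
  now = System.monotonic_time(:second)

  case :ets.lookup(:scsipub_log_throttle, key) do
    [{^key, last}] when now - last < 300 ->
      Logger.debug("login to unknown target #{iqn} from #{inspect(ip)}")

    _ ->
      :ets.insert(:scsipub_log_throttle, {key, now})
      Logger.warning("login to unknown target #{iqn} from #{inspect(ip)}")
  end
end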
Cluster primitives: PR and multi-LUN
What turns this from “a fancy iSCSI sandbox” into “a target real cluster software can drive” is two SAM-5 / SPC-4 features — multi-LUN sessions and SCSI-3 Persistent Reservations. The wire protocol already supports both; the work is on our side, plumbing them into the Session and into something that survives a BEAM restart.
Multiple LUNs per session
A SCSI Logical Unit Number is the byte in each CDB that selects
which device behind a target the initiator is addressing. Real
storage products expose one target with N LUNs all the time; our
Session struct holds a map keyed by LUN number, and the SCSI
dispatcher routes by pdu.lun:
case Map.get(state.lun_backends, pdu.lun) do
  nil -> {:error, :logical_unit_not_supported}
  cow -> Handler.dispatch(pdu.cdb, pdu.data, cow, ...)
end
There’s an anonymous demo target wired up —
iqn.2025-01.pub.scsipub:multi exposes two LUNs, each backed by a
different image — and the session-creation API on the paid side
takes an images: [...] array. The unglamorous half of the work
was cleanup: multi-LUN sessions write to <sid>.lun0.img,
<sid>.lun1.img, etc., and a terminator that only knew about
state.overlay_path (the single-LUN field) leaked overlays on
disconnect. The fix is a separate cleanup_multi_lun_overlays/1
walker, gated on state.overlay_path == nil so it doesn’t collide with the single-LUN path’s own File.rm on the same overlay.
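The walker is nothing clever; roughly this, with the gate in the function head (the per-LUN backend's field names are illustrative):

defp cleanup_multi_lun_overlays(%{overlay_path: nil, lun_backends: backends}) do
  Enum.each(backends, fn {_lun, cow} ->
    File.rm(cow.overlay_path)
    File.rm(cow.overlay_path <> ".bitmap")
  end)
end

defp cleanup_multi_lun_overlays(_single_lun_state), do: :ok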
Persistent reservations
SCSI-3 PR is the primitive cluster software uses to fence a node
out of shared storage. The per-LUN state is small: a set of
registered initiator keys, plus an optional “reservation” naming
one of them as holder along with a type (Write Exclusive,
Exclusive Access, and four flavours combining “Registrants Only”
and “All Registrants”). Pacemaker, ESXi HA, and Windows MSCS all rely on it; sg_persist is the Linux command-line tool for driving it by hand.
The state machine is Scsipub.Sessions.PR — a pure module, no DB
or process baggage, so it’s tested as a struct. The runtime layer
(SharedLU, one GenServer per (session_id, lun)) wraps it with
write-through to the persistent_reservations Postgres table on
every successful PR OUT. SPC-4 says PR state must survive a target
reboot, and the table is the only honest way to honour that. A
BEAM-restart unit test cycles the SharedLU through stop+restart
and asserts that the registrations and reservation come back identical.
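To give a feel for the pure module, a heavily simplified sketch of the register/reserve half. The real PR OUT service actions (preempt, clear, registrants-only read rules) add more clauses, and the field names here are not Scsipub.Sessions.PR’s:

defmodule PRSketch do
  defstruct registrations: %{},  # %{i_t_nexus => key}
            reservation: nil     # {holder_nexus, type} | nil

  def register(pr, nexus, key),
    do: %{pr | registrations: Map.put(pr.registrations, nexus, key)}

  def reserve(%__MODULE__{reservation: nil} = pr, nexus, key, type) do
    if pr.registrations[nexus] == key,
      do: {:ok, %{pr | reservation: {nexus, type}}},
      else: {:error, :reservation_conflict}
  end

  def reserve(%__MODULE__{reservation: {nexus, _}} = pr, nexus, _key, _type),
    do: {:ok, pr}

  def reserve(_pr, _nexus, _key, _type),
    do: {:error, :reservation_conflict}
end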
Two subtle bits of plumbing.
The I_T nexus identifier is the iSCSI InitiatorName, not the
CHAP user. Two initiators behind the same CHAP credential are
distinct nexuses by design, and trusting CHAP_N would let a
second client write under the first’s reservation. The Session
struct keeps both:
:initiator_name, # CHAP_N for paid sessions
:iscsi_initiator_name, # the InitiatorName from the first login PDU
# — what PR identifies by
The other surprise was that Linux’s open-iscsi doesn’t send PR
OUT parameter lists as immediate data. It uses the R2T (Ready To
Transfer) flow, the same way it does WRITE — which makes sense,
the spec lets it, but the original implementation only handled
the immediate path. sg_persist --register returned Invalid opcode until R2T-driven PR OUT joined the existing two-phase
command machinery that SCSI WRITE already used.
Two-initiator scenario, end to end:
# Initiator A: register a key, reserve Write Exclusive
sg_persist --out --register --param-sark=$KEY_A $DEV_A
sg_persist --out --reserve --param-rk=$KEY_A --prout-type=1 $DEV_A
# Initiator B (different InitiatorName, may share CHAP user):
# READ is allowed, WRITE returns RESERVATION CONFLICT.
dd if=$DEV_B bs=512 count=1 iflag=direct >/dev/null # ok
dd if=/dev/zero of=$DEV_B bs=512 count=1 oflag=direct # EBUSY
# A releases; B's write now succeeds.
sg_persist --out --release --param-rk=$KEY_A --prout-type=1 $DEV_A
The CI integration suite runs that exact sequence. Combined with the restart-resume contract above, that’s enough to back a 2-node failover cluster off a target on the public internet — the BEAM deploy ritual (SIGTERM, wait for sessions to checkpoint, SIGKILL, restart, Resumer wakes the suspended LUs) doesn’t lose reservations along the way.
What we’re not solving
Deliberate omissions, for the record:
- Multi-region. Everything runs in a single datacenter. A multi-region story would turn per-session persistence into a distributed-systems problem; right now it isn’t one, and we like that.
- S3- or NBD-backed base images. Images are local sparse files. Upload via the admin UI or an Ecto run script; that’s the whole ingestion story. Cloud-backed storage changes the read-path latency distribution meaningfully enough that we’d want to think about it rather than bolt it on.
- iSER / RDMA. No. scsipub is a public-internet service; RDMA is a rack-scale protocol. If you need 40 Gbit/s into a block device, the physics say you aren’t on the public internet anyway.
- MPIO. Not yet. The initiator side of multipath works fine, but until we have multi-region there’s nowhere to fail over to.
- Per-session encryption above TLS. The iSCSI protocol has IPsec and a few other approaches for payload secrecy; none are widely deployed, and adding our own on top of TLS would just be framing for framing’s sake.
What comes next
The two projects scsipub originally existed to serve are now both shipped and have their own posts — the Pi netboot shim and how it killed the SD-card shuffle is at Netboot a Pi fleet from iSCSI; the ESP32 USB-mass-storage bridge for lab equipment is at An ESP32 as a network-attached USB stick.
Past that, the interesting question is what happens when a Phoenix app serving iSCSI meets someone who really wants to use it — tens of thousands of sessions, sustained writes, a pathological initiator. We’ve done a load test up to a few hundred concurrent web requests; we haven’t yet found the shape of the BEAM’s failure mode under actual iSCSI load. That’s the next thing to measure.