On latency, measurement, and optimization in algorithmic trading systems

Original link: https://www.architect.co/posts/how-fast-is-it-really

Measuring latency in algorithmic trading systems is complex, requiring careful consideration to avoid introducing measurement overhead that skews results. Initial attempts to measure code execution time with simple timers are flawed, as they miss critical components like network I/O and parsing. A more comprehensive approach involves capturing the entire critical path: from network packet arrival to order transmission. This can be achieved by simulating the exchange and the algorithmic trading system (ATS) itself. The simulator timestamps market trades, and the ATS includes this timestamp in its orders. The simulator then records the order arrival time, allowing for latency calculation. However, this method overestimates latency due to simulator overhead. A baseline measurement using a simple "ping-pong" exchange and ATS simulator can be subtracted to refine the result. The most accurate solution involves advanced network hardware for packet replication and timestamping, but this is a more complex setup. Properly measuring latency is crucial for optimizing algorithmic trading systems, but it presents significant technical challenges.

This Hacker News thread discusses latency, measurement, and optimization in algorithmic trading systems. A key point is the difficulty of accurate performance measurement, especially in environments with garbage collection, JIT compilation, and asynchronous code. Simple timers can be misleading, necessitating specialized benchmarking libraries like BenchmarkDotNet or JMH. Participants emphasize measuring various latency percentiles (p50, p90, p99, max) rather than just the mean, as averages can be skewed by outliers. In HFT, fiber tapping and network card timestamps are used for external latency measurements, while `rdtsc` counters and low-overhead loggers track internal processing. Pinning threads to cores and using invariant TSC are common practices. A model predicting latency based on factors like trading team, exchange, and order volume is also mentioned. A crucial takeaway is that while perfectly measuring latency is difficult, even basic performance testing integrated into CI/CD can quickly detect regressions. Prioritizing the happy path over runtime behavior under stress is cautioned.

Original article

"The speed of light sucks." - John Carmack


Software engineers within the world of low-latency automated trading (colloquially known as "high-frequency trading" or HFT) obsess over speed. From purchasing private bandwidth between microwave towers to analyzing x86 instructions from different compiler versions, those with experience in this industry have seen colossal time and expense committed to the problem of optimizing code and network paths for minimal execution times.

But how does one actually measure how fast a program is? To the uninitiated, it sounds like a simple task. However, there are many layers of complexity in measuring the true latency of a trading system, or even in defining what to measure in the first place. Understanding latencies in algorithmic trading systems can present Heisenberg-esque dilemmas: the more code you write to measure latency, the more overhead you add to your program, and the more inaccurate your measurements become.

At Architect we have been using a combination of techniques to measure the latency of the various codepaths and processes that comprise our institutional trading technology suite. Let's explore some solutions to these problems.

--

Let's say you've written an algorithmic trading strategy. Your strategy reacts to market trades in an instrument, perhaps by computing a proprietary model valuation, and sends an order in that instrument provided certain conditions are met. You would like to measure the time that this reaction takes in a reproducible way, so that you can reduce it as much as possible. Let's use Python-style pseudocode to describe the program (although in practice it is most common to use languages such as C, C++, and Rust where optimal program latencies are required):

def on_market_trade(self, instrument, market_trade):
  model_value = self.compute_model_value(instrument, market_trade)
  order = self.compute_order_decision(instrument, model_value)
  if order is not None:
    self.send_order(order)

A reasonable place to start in understanding the latency of your critical codepath is to wrap timers around the functions doing the heavy lifting:

from datetime import datetime

def on_market_trade(self, instrument, market_trade):
  start_time = datetime.now()
  model_value = self.compute_model_value(instrument, market_trade)
  order = self.compute_order_decision(instrument, model_value)
  end_time = datetime.now()
  self.add_time_sample(end_time - start_time)
  if order is not None:
    self.send_order(order)

The function self.add_time_sample would add the elapsed time to a histogram that you could print statistics for at the end of your program’s lifecycle, or on some regular basis based on time or number of samples observed.
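As a rough illustration, such a sample collector could look like the following sketch (the class and method names here are assumptions; the original add_time_sample implementation is not shown in the post):

```python
class LatencyStats:
    """Collects elapsed-time samples and reports simple summary statistics."""

    def __init__(self):
        self.samples = []  # elapsed times in nanoseconds

    def add_time_sample(self, elapsed_ns):
        self.samples.append(elapsed_ns)

    def report(self):
        # Naive percentile-by-index; fine for a sketch, not for production.
        s = sorted(self.samples)
        n = len(s)
        return {
            "count": n,
            "p50": s[n // 2],
            "p99": s[min(n - 1, (99 * n) // 100)],
            "max": s[-1],
        }

stats = LatencyStats()
for ns in [1200, 1500, 900, 30000, 1100]:
    stats.add_time_sample(ns)
print(stats.report())  # {'count': 5, 'p50': 1200, 'p99': 30000, 'max': 30000}
```

Note that reporting percentiles and the max, rather than just a mean, keeps tail behavior visible; a single 30-microsecond outlier would otherwise be averaged away.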

There are many issues with the above approach:

  1. It measures the time required to compute the order decision, but does not include the time it takes to send the actual order.
  2. It observes computation time on every market trade, rather than just the trades that result in orders — this can bias results because the most interesting times to send orders may be the ones where your program is running the slowest due to volume of market events or other factors.
  3. datetime.now() itself is a slow, expensive function that can impact the runtime speed and memory profile of the code above, which adds up if your program is already operating on a microsecond timescale. The typical way to fix this last issue is to use the native performance counters that most programming languages have primitives to access.

Here’s a new code sample that attempts to fix the above:

import time

def on_market_trade(self, instrument, market_trade):
  start_time = time.perf_counter_ns()
  model_value = self.compute_model_value(instrument, market_trade)
  order = self.compute_order_decision(instrument, model_value)
  if order is not None:
    self.send_order(order)
    end_time = time.perf_counter_ns()
    self.add_time_sample(end_time - start_time)

This is an improvement, but are we really getting at the full latency of the trading system? The above doesn’t include significant elements of the critical path, including the time to parse the market trade update, or anything involving network I/O. Let’s take a step back and trace a market data update through the complete critical path of the automated trading system (ATS):

  1. Network packet containing the market trade hits the network card of the box where the ATS is running (sent from the exchange)
  2. The packet is passed to the runtime of the ATS
  3. The ATS parses the bytes of the packet to pull out necessary fields (such as trade price or trade size)
  4. The ATS computes a model value and makes a decision to send an order
  5. The internal memory representation of the order is converted to the protocol of the exchange that the order is being sent to
  6. The ATS makes function calls to pass the order bytes to the network card of the box for sending
  7. The network card of the box sends the order bytes to the exchange

(There are many details missing from the above, such as the multiple methods of going from steps 1 to 2 and 6 to 7, but we are omitting those for simplicity for now.)

The code sample above is only measuring steps 4, 5, and 6. I have seen many real-world instances where 90% or more of the full latency profile came from steps 1, 2, 6, and 7. A large chunk of latency can also be incurred in step 3 if performed carelessly, or if any order-book building is necessary in steps 3 and 4.

If we wanted to measure only the latency of the live system, we could tag each outbound order with the sequence number of the market data event that triggered it; then, for each order, subtract the NIC hardware timestamp of the inbound market data event from the NIC hardware timestamp of the outbound order.
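The bookkeeping for that correlation can be sketched as follows. The timestamps here are illustrative integers in nanoseconds; in a real system they would come from the NIC hardware (e.g. via the kernel's timestamping facilities), which is not shown, and all the names are hypothetical:

```python
# Map from market data sequence number to inbound NIC hardware timestamp (ns).
inbound_ts = {}

def on_inbound_packet(seq_no, nic_ts_ns):
    """Record the hardware timestamp of an inbound market data event."""
    inbound_ts[seq_no] = nic_ts_ns

def on_outbound_order(tagged_seq_no, nic_ts_ns):
    """Return wire-to-wire latency for an order tagged with the sequence
    number of the market data event that triggered it, if known."""
    start = inbound_ts.get(tagged_seq_no)
    return None if start is None else nic_ts_ns - start

on_inbound_packet(1001, 5_000_000)
latency_ns = on_outbound_order(1001, 5_004_200)
print(latency_ns)  # 4200
```

Because both timestamps are taken at the network card, this measurement spans all seven steps of the critical path without adding instrumentation overhead inside the hot loop.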

To truly capture all seven steps in a reproducible way (so that you can A/B test your code improvements), you can set up the following alternative method for measuring latency:

  • Write a program that simulates the exchange's market data by sending random market trade events on a timer
  • Have that same program simulate the exchange itself by receiving orders in the exchange’s native protocol
  • Have the simulator timestamp the market trades with the current time right before sending
  • Configure the ATS to receive data from the simulator and send orders to the simulator. Have the ATS attach the exchange trade timestamp to the order it sends back, or, if the protocol does not allow that, have it record a mapping from order ID to exchange trade timestamp
  • Have the simulator record the timestamp when it receives orders from the ATS. From either the data on the order itself or the ATS's mapping of order ID to exchange trade timestamp, compute the difference between the market trade send time and the order receive time.
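The handshake above can be sketched as a toy in-process round trip. A real setup would use separate processes and the exchange's native wire protocols; here plain dicts stand in for those messages, and all names are assumptions for illustration:

```python
import time

def simulator_send_trade(seq_no):
    """Simulator stamps the trade with the current time right before sending."""
    return {"seq": seq_no, "sim_send_ns": time.perf_counter_ns()}

def ats_handle_trade(trade):
    """ATS echoes the simulator's timestamp on its outbound order."""
    # ... model valuation and order decision would happen here ...
    return {"seq": trade["seq"], "echoed_send_ns": trade["sim_send_ns"]}

def simulator_receive_order(order):
    """Simulator computes send-to-receive latency when the order arrives."""
    return time.perf_counter_ns() - order["echoed_send_ns"]

trade = simulator_send_trade(1)
order = ats_handle_trade(trade)
latency_ns = simulator_receive_order(order)
assert latency_ns >= 0
```

Because both timestamps are taken by the same simulator process, no clock synchronization between machines is required for this measurement.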

While the above does capture the complete critical path, it overestimates latency: it also captures a similar codepath inside the simulator itself! To get closer to the right answer, you can write another simulated exchange and a simulated ATS that just ping/pong a single timestamp back and forth without any protocol translation, model computation, order sending, etc. This provides a baseline for inter-program latency that can be subtracted from the results of the experiment above.
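The correction step might look like the following sketch, subtracting the ping/pong baseline from the measured end-to-end figures percentile by percentile. The numbers are invented for illustration, and matching percentiles across two distributions is only a rough approximation (distributions do not subtract pointwise in general):

```python
# Per-percentile correction of the full simulator experiment using the
# bare ping/pong baseline run. All values in nanoseconds, made up here.
measured = {"p50": 52_000, "p99": 118_000}   # full simulator experiment
baseline = {"p50": 9_000, "p99": 21_000}     # ping/pong loop only

corrected = {k: measured[k] - baseline[k] for k in measured}
print(corrected)  # {'p50': 43000, 'p99': 97000}
```

For more rigor, one could instead compare the full sample distributions of the two runs, but even this simple subtraction is enough for A/B testing a code change against a fixed baseline.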

The as-close-as-possible-to-perfect solution involves a much more advanced setup, where modern switching hardware is used to replicate packet traffic in and out of the box and the raw network packets are parsed and correlated for timestamping. But I’ll leave those details for a future post.

Writing fast algorithmic trading system code is hard. Measuring it properly is even harder. At Architect we have created institutional-grade low-latency trading software for both regulated derivatives and digital assets, so that you can let us do the work for you. For more information, contact us at hello at architect.co.
