When people talk about low-latency systems, they tend to talk about nanoseconds the way economists talk about basis points: as small, interchangeable units of a scarce resource. You have a budget, you spend it, you try to spend less next quarter.
That framing is useful for planning, but it hides something important. A nanosecond is not an accounting unit. It is a physical thing. In one nanosecond, light travels almost exactly thirty centimeters in vacuum. In a fiber optic cable, closer to twenty. On a PCB trace, about fifteen. A nanosecond is a distance. It is a piece of geography.
[Figure: a nanosecond is a distance. Three pulses launch together: one in vacuum, one in fiber, one on a PCB trace. One nanosecond later, the vacuum pulse has covered about 30 cm, the fiber pulse about 20 cm, the PCB pulse about 15 cm. Same start, same elapsed time; the distance is what the medium spent.]
Once you internalize that, latency stops being a mysterious performance metric and becomes something you can literally point to. The gap between an "800 nanosecond" system and a "600 nanosecond" system is not a number on a spreadsheet. It is, somewhere, a real stretch of fiber, a real queue in a real chip, a real gate whose output is still settling. You can walk up to it.
This article builds that mental model from the ground up. We will start with the speed of light, pick up the propagation constants of the materials signals actually travel through, add gate delays and memory accesses and serial links, and end by walking through a full tick-to-trade path: the kind that shows up in exchange-connected trading systems, but also in any scenario where a packet arrives, a decision is made, and a response goes out. The specifics of the trading context are not the point. The point is the budget: where every nanosecond goes, and why.
The speed of light, and why you care
The speed of light in vacuum is, exactly by definition:

c = 299,792,458 m/s

For engineering purposes, we round this to 3×10⁸ m/s, or equivalently 30 cm/ns. That second form is the one worth memorizing. Light travels thirty centimeters (about a foot) per nanosecond. If you know nothing else about physics, know that.
Now, light in vacuum is the universe's hard upper bound on information propagation. Every signal in your system (every voltage change on a wire, every photon in a fiber, every edge of a clock) travels more slowly than this. How much more slowly depends on the material. We capture that with a number called the velocity factor, usually written VF or VoP, which is simply the ratio of the signal's propagation speed to c.
Three numbers cover almost everything you will encounter in a digital system:
- Vacuum / air: VF ≈ 1.0. Signals travel at nearly c, 30 cm/ns. Relevant for wireless, radar, satellite, and lab setups where signals cross open space.
- Optical fiber (single-mode, 1310 or 1550 nm): VF ≈ 0.67. Light in glass moves at about two-thirds of c, so roughly 20 cm/ns, or equivalently 5 ns/m.
- Copper PCB trace (FR4 substrate, microstrip or stripline): VF ≈ 0.5, give or take. Signals move at about half c, so roughly 15 cm/ns, or about 6.7 ns/m.
These numbers are not vendor-specific tradecraft. They are material properties, taught in undergraduate electromagnetics courses, derivable from the dielectric constants of glass and epoxy resin. They show up in every serious board-design textbook. But the implications of taking them seriously are underappreciated.
Consider a trivial example. You have two servers in the same rack, connected by a meter of fiber. The one-way propagation delay of the cable itself, ignoring every other source of latency, is:

1 m × 5 ns/m = 5 ns
Five nanoseconds. That's the floor. No amount of clever engineering will make a one-meter fiber link faster than about 5 ns one-way, because the signal has to physically traverse the glass and the glass has an index of refraction of about 1.5. You cannot optimize this away any more than you can optimize away the hypotenuse of a right triangle.
Scale this up and the implications get vivid. A cross-country fiber link from New York to Chicago is roughly 1200 km as the crow flies (the actual route is longer, but ignore that for a moment). The theoretical minimum one-way latency is:

1,200,000 m × 5 ns/m = 6,000,000 ns = 6 ms
Six milliseconds, one-way, for photons in glass along the shortest possible path. Every extra mile of routing, every splice, every amplifier, adds to that. This is why, in the late 2000s, people spent enormous sums of money laying straighter fiber routes between financial centers: the crow-flies distance is a physical lower bound on the latency between those two cities, and shaving kilometers off the path is the only way to reduce it. You cannot clever your way past the index of refraction of glass.
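If you want this as executable arithmetic rather than memorized constants, here is a minimal sketch in Python. The velocity factors are the rough planning numbers from the list above, not measured values for any particular cable or board:

```python
C_M_PER_NS = 0.2998  # speed of light in vacuum, meters per nanosecond

VELOCITY_FACTOR = {
    "vacuum": 1.00,
    "fiber": 0.67,  # single-mode glass, refractive index ~1.5
    "pcb": 0.50,    # FR4 microstrip/stripline, give or take
}

def propagation_delay_ns(distance_m: float, medium: str) -> float:
    """One-way propagation delay, in nanoseconds."""
    return distance_m / (C_M_PER_NS * VELOCITY_FACTOR[medium])

print(propagation_delay_ns(1, "fiber"))                # ~5 ns: the one-meter rack link
print(propagation_delay_ns(1_200_000, "fiber") / 1e6)  # ~6 ms: NY to Chicago, crow-flies
```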
So far we have established one thing: distance has a floor, and the floor is set by materials physics. Now let's look at what happens inside the endpoints.
Inside the chip: gates, wires, and cycles
Step inside a digital chip and the length scales collapse. Instead of meters, we are talking about millimeters and micrometers. Instead of cables, we have metal interconnect running between transistors. The speed of light hasn't changed, but the geometry is very different, and a new phenomenon dominates: gate delay.
A logic gate (an AND, an OR, a flip-flop) takes some finite time to change its output after its input changes. The exact value depends on the process node, the cell library, the loading on the output, the voltage, and the temperature, but for a modern FPGA fabric running at nominal conditions you can use a rough-and-ready number: a single lookup table (LUT) takes on the order of 100 to 300 picoseconds to produce a valid output, plus some additional time to drive its output through the interconnect to the next cell. Call it roughly half a nanosecond per logic level, all-in, for planning purposes.
This sets a hard ceiling on clock frequency. If your combinational path between two flip-flops goes through, say, eight levels of logic and some routing, you might be looking at 4 to 5 ns of delay, which means you cannot close timing above about 200 to 250 MHz on that path. Every pipeline register you insert breaks that combinational chain and buys you frequency, at the cost of latency measured in clock cycles.
This is the fundamental tradeoff of synchronous digital design:
- Shorter combinational paths (more pipeline stages): higher clock frequency, more throughput, more latency.
- Longer combinational paths (fewer pipeline stages): lower clock frequency, less throughput, less latency.
Modern CPU cores are deeply pipelined, with twenty-plus stages, because they optimize for single-thread throughput and clock speed. An FPGA design for low-latency packet processing is typically pipelined as shallowly as timing closure allows, because every stage added is another clock period of waiting.
Which brings us to the unit that actually matters for on-chip reasoning: the clock period.
At 250 MHz, one period is 4 ns. At 322 MHz (a common SerDes-derived clock), about 3.1 ns. At 500 MHz (fast for fabric logic, routine for hardened blocks), 2 ns.
Every pipeline stage in your design costs one clock period of latency. If your packet parser is ten stages deep at 322 MHz, you have spent 31 ns getting a byte through the parser, regardless of what the parser actually does. This is why, in latency-critical FPGA design, engineers obsess over stage count the way software engineers obsess over cache misses. A stage is a nanosecond you will never get back.
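The arithmetic here is simple enough to keep in your head, but making it executable keeps you honest. A small sketch of the period and stage-count math used above:

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz

def pipeline_latency_ns(stages: int, freq_mhz: float) -> float:
    """Each pipeline stage costs exactly one clock period of latency."""
    return stages * clock_period_ns(freq_mhz)

print(clock_period_ns(250))          # 4.0 ns
print(clock_period_ns(322))          # ~3.1 ns
print(pipeline_latency_ns(10, 322))  # ~31 ns: the ten-stage parser above
```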
Serial links: where the time really goes
Here is a fact that surprises nearly everyone the first time they encounter it: the dominant latency in modern digital systems is often not logic, not memory, and not even cable propagation. It is serialization and deserialization at the I/O boundary.
Modern high-speed interfaces, including 10/25/100 Gigabit Ethernet, PCIe, Interlaken, and so on, all use SerDes (serializer/deserializer) blocks to move data across a small number of differential pairs at very high bit rates. On the transmit side, parallel data from the chip's fabric is serialized onto the wire; on the receive side, serial bits are recovered, aligned, and reassembled into parallel words. This process has inherent latency that no amount of clever engineering can eliminate, because it involves:
- Clock recovery: the receiver has to lock to the transmitter's clock by observing bit transitions. This takes time, though most of it is amortized at link startup.
- Alignment: the receiver has to figure out where word boundaries fall in the bit stream, typically by looking for a special comma character or alignment pattern. This adds a few byte-times of latency.
- Line coding: most SerDes schemes use 8b/10b or 64b/66b encoding, which means extra bits on the wire for DC balance and clock recovery. This doesn't add much latency on its own, but the encode/decode pipelines do.
- Elastic buffering: to absorb small clock frequency differences between transmitter and receiver, the receive path usually includes a small FIFO. Its depth directly adds latency.
The actual numbers vary dramatically by implementation. A well-optimized 10GbE MAC+PHY in an FPGA might have a receive latency, from first bit on the wire to first byte available in the fabric, somewhere in the range of 50 to 150 ns. A less-optimized implementation can easily be 300 ns or more. The transmit path is typically similar. These are big numbers relative to everything we have discussed so far.
And here is the kicker: you usually cannot eliminate this latency, because it is mostly dictated by the physical layer protocol itself. You can sometimes reduce it by using a cut-through MAC instead of a store-and-forward one, by keeping the elastic buffer shallow, by bypassing optional layers of processing. But there is a floor, and the floor is not low.
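A related number worth having at your fingertips is raw wire time: how long a frame's bits occupy the link at a given line rate. This is distinct from SerDes pipeline latency, but it is the floor that any store-and-forward element must wait out before it can begin forwarding. A quick sketch (the frame size and rates are illustrative):

```python
def wire_time_ns(frame_bytes: int, line_rate_gbps: float) -> float:
    """Time for a frame's bits to cross the wire at a given line rate."""
    # bits divided by gigabits-per-second comes out in nanoseconds
    return frame_bytes * 8 / line_rate_gbps

for rate in (10, 25, 100):
    print(f"{rate:3d} GbE: {wire_time_ns(64, rate):5.1f} ns for a 64-byte frame")
# 10 GbE: 51.2 ns, 25 GbE: 20.5 ns, 100 GbE: 5.1 ns
```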
This is one of the big reasons FPGAs can beat CPUs on end-to-end latency for network-facing workloads. A CPU-based system typically has a NIC doing SerDes and MAC processing, then DMA'ing the data into host memory, then waiting for a CPU thread to pick it up through a kernel (or user-space) driver. Each of those stages adds latency. An FPGA connected directly to the same SerDes can start acting on the data while the CPU-based system is still DMA'ing it into RAM.
Memory: not all RAM is equal
Memory accesses are another place where the length scales change abruptly. We already saw that on-chip, half-nanosecond gate delays and sub-nanosecond interconnect hops dominate. When the data has to leave the chip, everything slows down.
A rough hierarchy, for planning purposes:
- FPGA block RAM (BRAM) or UltraRAM: 1 to 2 clock cycles, i.e., a few nanoseconds at typical fabric frequencies. Basically free, latency-wise.
- On-chip distributed RAM (LUTRAM): essentially combinational read, one cycle to register. Smaller capacity but fastest access.
- High-speed off-chip memory (QDR, RLDRAM, HBM): tens of nanoseconds, depending on technology and access pattern. QDR and RLDRAM exist precisely to trade capacity for low, predictable latency; HBM sits much closer to the die, and while its headline advantage is bandwidth, the short interface keeps it competitive with traditional DDR on latency.
- DDR4/DDR5 SDRAM: 50 to 100+ ns for a random access, dominated by the row-activation and column-access timings of the DRAM protocol itself. Streaming access is much better, because you amortize the row activation.
- Host memory over PCIe: this is the real killer. A PCIe round-trip for a small read (say, an FPGA reading a doorbell word from CPU memory) is easily 500 ns to over a microsecond, depending on root complex topology, bridge depth, and whether the access hits any caches.
The last point is worth pausing on. PCIe is a packetized, layered protocol. A read transaction involves composing a memory-read TLP, serializing it across the PCIe link, having the root complex translate and forward it to the memory controller, waiting for DRAM access, and then sending the data back through the same layers in reverse. It is not a wire with a voltage on it. It is a protocol stack. And the latency of that protocol stack is often the dominant term in systems that involve CPU-to-FPGA interaction on a per-packet basis.
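One way to feel the spread in this hierarchy is to price out a chain of dependent lookups, where each result is the address of the next access, so nothing can overlap. The constants below are the rough planning figures from the list above, not measurements of any specific part:

```python
LATENCY_NS = {
    "bram": 3,          # on-chip block RAM, one fabric cycle
    "hbm": 50,          # high-speed off-chip memory, tens of ns
    "ddr4_random": 80,  # row activation + column access dominate
    "pcie_read": 800,   # round-trip to host memory through the stack
}

def dependent_chain_ns(level: str, lookups: int) -> int:
    # Dependent lookups serialize: each must finish before the next starts.
    return LATENCY_NS[level] * lookups

print(dependent_chain_ns("bram", 4))         # ~12 ns
print(dependent_chain_ns("ddr4_random", 4))  # ~320 ns
print(dependent_chain_ns("pcie_read", 4))    # ~3200 ns: over 3 microseconds
```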
Building a mental model: latency as geography
Let's take stock. We now have a small set of primitives, each with a rough order-of-magnitude latency:
- Fiber propagation: ~5 ns/m.
- Copper/PCB propagation: ~6.7 ns/m.
- FPGA logic (one pipeline stage at ~300 MHz): ~3.3 ns per stage.
- FPGA BRAM read: ~3 ns (one cycle, negligible setup).
- 10GbE SerDes/MAC receive: ~100 ns (50 to 150 in a good implementation).
- 10GbE SerDes/MAC transmit: ~100 ns, similar to receive.
With this toolbox, you can stand at a whiteboard, sketch a block diagram, and estimate the latency of any digital system to within a factor of two, before writing any RTL or code. This is the single most valuable skill in low-latency design, and it comes entirely from internalizing the orders of magnitude above.
The mental move is always the same: trace the signal path from input to output. At every point, ask two questions. First, what physical thing is the signal traversing right now: a fiber, a trace, a stack of gates, a protocol boundary? Second, what is the latency primitive for that thing? Sum up the primitives and you have a budget.
This also tells you where to focus optimization effort. If 80% of your budget is in SerDes, there is no point shaving a pipeline stage in the fabric; you have the wrong target. If 80% is in a PCIe round-trip, moving logic onto the FPGA is the answer, not rewriting your CPU code in SIMD. The budget tells you where the time actually is.
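This whiteboard exercise mechanizes nicely. Here is a minimal sketch that encodes the primitives above and sums them over a path; every constant is an order-of-magnitude planning number, and the example path is invented for illustration:

```python
PRIMITIVE_NS = {
    "fiber_per_m": 5.0,
    "pcb_per_m": 6.7,
    "fpga_stage_300mhz": 3.3,
    "bram_read": 3.3,
    "serdes_mac_rx_10g": 100.0,
    "serdes_mac_tx_10g": 100.0,
}

def budget_ns(path: list[tuple[str, float]]) -> float:
    """Sum primitive latencies; each entry is (primitive, count or meters)."""
    return sum(PRIMITIVE_NS[name] * qty for name, qty in path)

# A toy path: 3 m of fiber in, SerDes/MAC rx, 12 fabric stages,
# SerDes/MAC tx, 3 m of fiber out.
path = [
    ("fiber_per_m", 3),
    ("serdes_mac_rx_10g", 1),
    ("fpga_stage_300mhz", 12),
    ("serdes_mac_tx_10g", 1),
    ("fiber_per_m", 3),
]
print(budget_ns(path))  # ~270 ns: a factor-of-two estimate before any RTL exists
```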
A worked example: the tick-to-trade path
Let's walk through a complete end-to-end path. The scenario is generic: a packet arrives on a network interface, something in the system examines it and decides whether to send a response, and if so, a response packet leaves on a network interface. This pattern of receive, decide, respond appears in exchange-connected trading systems, in network security appliances, in game servers, in industrial control, and in any number of telecom workloads. The trading case is what gave us the term "tick-to-trade," so we will use that framing, but the shape of the problem is general.
Assume the following setup:
- Market data arrives over 10 Gigabit Ethernet on fiber, one hop through a switch, into an FPGA-based NIC.
- The FPGA parses the packet, examines a few fields against some precomputed state, and (if the conditions are met) constructs an order.
- The order goes out the same FPGA on another 10 GbE fiber port back through the switch to the matching engine.
We'll use a 200 MHz fabric clock for the logic stages, chosen entirely because it makes every stage land on a clean 5 ns boundary and I don't have to write "~" in front of every number. Every number that follows is like this: round, teaching-friendly, picked for arithmetic clarity rather than fidelity to any real part. The shape of the budget is what matters; if you want real totals, substitute your own synth report's Fmax and redo the multiplication.
Here is the breakdown, stage by stage.
Stage 1: Exchange gateway to switch (fiber propagation). Assume 10 meters of fiber between the exchange's gateway and the switch serving the co-located trading hardware. At 5 ns/m, this is 50 ns. Not much, but non-zero. If the gateway were 100 m away, it would be 500 ns, and we'd notice.
Stage 2: Switch traversal. Even a good cut-through Ethernet switch adds latency. A high-end cut-through switch might add around 300 ns end-to-end, measured from first bit in to first bit out. Store-and-forward switches add more, proportional to packet size. We'll assume cut-through.
Stage 3: Switch to FPGA NIC (fiber propagation). Another short fiber run, maybe 3 meters. That's 15 ns. Tiny, but a real number on a real stretch of glass.
Stage 4: Optical-to-electrical conversion and SerDes receive. The fiber signal is converted to electrical by the SFP+ module, then recovered and deserialized by the FPGA's high-speed transceiver. Assume a well-tuned receive path: 80 ns from first bit on the wire to the first parallel word appearing in the fabric.
Stage 5: MAC and packet framing. The 10GbE MAC handles preamble stripping, frame delimiting, CRC check initiation, and presents bytes to user logic. A cut-through, latency-optimized MAC can do this in 20 ns or so, assuming it does not wait for the end-of-packet before starting to hand bytes downstream.
Stage 6: Packet parsing. The FPGA parses the relevant protocol headers (Ethernet, IP, UDP, and whatever application-layer framing is in use) and extracts the fields of interest. In a well-designed parser, this is a pipelined state machine operating on a wide bus. Assume 4 pipeline stages at 200 MHz: 20 ns. Note that this includes all the parsing; the parser starts operating on header bytes the moment they appear and is done shortly after the last relevant field arrives.
Stage 7: Decision logic. Compare the extracted fields against precomputed state (thresholds, symbol-specific parameters, position limits, risk checks). This is typically a handful of pipelined comparators and a small amount of arithmetic. Assume 2 stages at 200 MHz: 10 ns.
Stage 8: Order construction. Assemble the response packet with headers, payload, and checksums. Much of this can be precomputed, so the runtime work is inserting a few dynamic fields. Assume 3 stages at 200 MHz: 15 ns.
Stage 9: MAC transmit framing. Frame the outgoing packet: preamble, headers, CRC calculation. A fast transmit MAC is similar to the receive path: ~20 ns before bits start hitting the SerDes.
Stage 10: SerDes transmit and electrical-to-optical. Serialize to the wire and light up the fiber. Assume ~80 ns, symmetric with the receive side.
Stage 11: FPGA to switch (fiber propagation). Same 3 m of fiber back: 15 ns.
Stage 12: Switch traversal. Another ~300 ns through the switch.
Stage 13: Switch to matching engine (fiber propagation). Same 10 m of fiber as on the way in: 50 ns.
Totaling all of that:
| Stage | Latency (ns) |
|---|---|
| 1. Gateway to switch (fiber) | 50 |
| 2. Switch traversal | 300 |
| 3. Switch to FPGA NIC (fiber) | 15 |
| 4. SerDes receive + O/E conversion | 80 |
| 5. MAC receive framing | 20 |
| 6. Packet parsing | 20 |
| 7. Decision logic | 10 |
| 8. Order construction | 15 |
| 9. MAC transmit framing | 20 |
| 10. SerDes transmit + E/O conversion | 80 |
| 11. FPGA to switch (fiber) | 15 |
| 12. Switch traversal | 300 |
| 13. Switch to matching engine (fiber) | 50 |
| Total | 975 |
Call it about a microsecond. That is a plausible tick-to-trade number for a well-engineered FPGA-based path.
Now look at where the time actually goes. The two switch traversals alone are 600 ns, over 60% of the entire budget. The two SerDes crossings are another 160 ns, roughly 16%. The entire FPGA decision-making logic (the parser, the decision, the order construction) is 45 ns, under 5% of the total. Fiber propagation across the whole path is 130 ns, about 13%.
This is the point of the exercise. A beginner's intuition says "make the logic faster." A budget says "the logic is already almost nothing; if you want to halve this number, you need to get rid of a switch hop, not rewrite a comparator." In practice, this is exactly what high-end systems do: they go direct-connect, bypassing the exchange's switch entirely where possible, because the switch is the single largest latency contributor in the path.
It also tells you what a further 2× improvement would look like. You cannot meaningfully reduce fiber propagation (the geography is fixed). You cannot easily reduce SerDes latency (the protocol sets a floor). You can reduce switch traversals to zero via direct connection, and you can shave stages off the fabric pipeline at the cost of tighter timing closure. Everything else is noise. The budget tells you this. The budget is the plan.
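If you prefer the budget as a script, here is the table again in Python, with each stage tagged by category so the percentages fall out directly. The groupings are mine; the numbers are the ones from the table:

```python
STAGES = [
    ("gateway to switch fiber", "fiber", 50),
    ("switch traversal in", "switch", 300),
    ("switch to FPGA fiber", "fiber", 15),
    ("SerDes rx + O/E", "serdes", 80),
    ("MAC rx framing", "mac", 20),
    ("packet parsing", "logic", 20),
    ("decision logic", "logic", 10),
    ("order construction", "logic", 15),
    ("MAC tx framing", "mac", 20),
    ("SerDes tx + E/O", "serdes", 80),
    ("FPGA to switch fiber", "fiber", 15),
    ("switch traversal out", "switch", 300),
    ("switch to engine fiber", "fiber", 50),
]

total = sum(ns for _, _, ns in STAGES)
print(f"total: {total} ns")  # 975 ns: call it about a microsecond

by_category: dict[str, int] = {}
for _, cat, ns in STAGES:
    by_category[cat] = by_category.get(cat, 0) + ns
for cat, ns in sorted(by_category.items(), key=lambda kv: -kv[1]):
    print(f"{cat:>7}: {ns:4d} ns  ({100 * ns / total:.0f}%)")
# switch: 600 ns (62%), serdes: 160 ns (16%), fiber: 130 ns (13%),
# logic: 45 ns (5%), mac: 40 ns (4%)
```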
What to take away
A few durable ideas:
A nanosecond is a distance. Thirty centimeters in vacuum, twenty in fiber, fifteen on a PCB. When you hear a latency number, translate it into the physical thing it represents. The translation will sharpen your intuition for what is possible and what is not.
Every digital system has a latency budget, whether you draw it or not. Drawing it, even roughly, on a whiteboard, with order-of-magnitude numbers, is the single most leveraged thing you can do when designing for low latency. It tells you where the time is, which tells you where optimization effort is actually worth spending.
The floors are set by physics and protocol, not by effort. Fiber velocity factor, SerDes alignment, PCIe round-trip structure, DRAM row-activation timing: these are floors. You cannot engineer past them. You can only work around them by choosing different physical paths, different protocols, different architectures.
Pipeline stages are cheap to add and expensive to remove. Each stage is a clock period you will never recover at runtime. This does not mean never add pipeline stages (you often must, to close timing), but it does mean every stage should earn its keep. Count them. Know the count.
The worst latency is the latency you didn't know was there. PCIe round-trips, kernel context switches, garbage collection pauses, NIC interrupt coalescing: these are the hidden hippos in the latency river. Most "mysterious latency" in real systems is not mysterious; it is a stage that was not on the diagram because nobody thought to draw it.
Internalize these, and the nanosecond stops being a unit of frustration and becomes a unit of geography. You can see it. You can point at it. And once you can point at it, you can decide whether to spend it.