Clock Domain Crossing: Every Way to Get It Wrong

Every experienced digital designer has a story about a clock domain crossing bug. The best ones involve silicon that passed every simulation, every formal check, every lab test, and then, three months after shipping to customers, started failing at a rate of roughly one unit per warehouse, only in products manufactured on Tuesdays, only when the ambient temperature rose above 40°C, and only during the third hour of operation.

CDC bugs are like this because they are probabilistic failures of timing, not logical mistakes in behavior. Your RTL is correct. Your testbench passes. Your gate-level simulation is clean. And yet, somewhere deep in your design, a flip-flop is occasionally being asked to sample a signal at exactly the wrong moment, and occasionally that flip-flop emits a value that is neither 0 nor 1 but something in between, and occasionally (very, very occasionally) downstream logic interprets that intermediate voltage in a way that corrupts state.

This article is a guided tour of that problem and its solutions. We will start with why multi-clock designs exist, derive the core physical phenomenon (metastability) well enough to reason about probability, and then work through the handful of techniques that actually solve CDC correctly. By the end, you should be able to look at any block diagram that has a clock boundary in it and know exactly what needs to be done at the crossing, and why.

The "why" matters, because CDC is one of the most cookbook-taught topics in digital design ("just use a two-flop synchronizer"), and cookbook teaching is exactly how engineers end up applying a single-bit synchronizer to a multi-bit bus and shipping broken silicon. Understanding the physics beats memorizing the recipes.

Why we have more than one clock

The simplest digital design has one clock. Every flip-flop in the design samples its input on the rising edge of that clock, and every signal between any two flip-flops has a full clock period (minus setup and hold margins) to propagate and settle. This is the synchronous design paradigm, and it is how you should structure any part of your design where you can.

A clock domain is the set of flip-flops driven by a single clock signal. Within a domain, every register-to-register path is statically analyzable for timing. Between domains, no fixed phase relationship exists, and that's the entire problem this article is about.

So why does any real system ever have more than one clock? A few reasons, all unavoidable at system scale:

External interfaces run at their own clock rates. A PCIe link runs at a frequency derived from the PCIe reference clock. A DDR4 memory controller has its own training-dependent clock. An Ethernet MAC has clocks locked to the recovered-data rate coming off the wire. When a high-speed serial link recovers its clock from the incoming data stream, that clock is by definition not in phase with your fabric clock, because it came from a different oscillator on a different board across a cable.
Different parts of a chip run at different speeds. A packet parser might want to run at 322 MHz because that's the MAC's natural rate; a compute pipeline might want 500 MHz because its path is short enough to close timing there; a slow housekeeping block might run at 100 MHz because faster is wasted power. These are typically generated by different PLLs with different VCO settings. Even if they nominally have a rational frequency relationship, phase drift between PLLs means they are effectively asynchronous for timing purposes.
Low-power modes gate clocks independently. A block that needs to sleep must have its clock stopped. The wakeup transition, and any communication across a boundary between "always-on" and "gated" clocks, is a CDC problem.
Asynchronous inputs. A button press, a reset pin, an interrupt from another chip: any signal that crosses the chip boundary and was not generated by your fabric clock is, from your design's point of view, a signal in a foreign clock domain. Even if that "domain" has no clock at all.

Given all that, you will have multiple clock domains. The question is not whether, but how to handle the boundaries between them without breaking.

The setup/hold window, and what happens when you miss it

A flip-flop has two timing requirements for reliable operation: the input must be stable for some small window before the rising edge of the clock (the setup time, $t_s$), and for some small window after the rising edge (the hold time, $t_h$). If the input is stable throughout that combined window, the flip-flop's output transitions cleanly to the new value within some specified delay after the edge.

Figure: setup and hold window. Data must be stable for $t_s$ before and $t_h$ after the clock edge. Valid: data is stable across the whole window. Setup violation: data arrives too late. Hold violation: data leaves too early. Either one can leave the flip-flop metastable.

If the input changes during that window, the flip-flop enters a region of undefined behavior.

This is not hand-waving; it is literal physics. A flip-flop, at its heart, is a pair of cross-coupled inverters with a positive feedback loop. During a clock edge, this feedback loop is trying to resolve the input into one of two stable states: 0 or 1. If the input transitions exactly during the sampling window, the initial voltage presented to the feedback loop can be somewhere between the two rails, neither a clear 0 nor a clear 1. The feedback loop will eventually resolve to one rail or the other (the metastable state is an unstable equilibrium, and thermal noise breaks the tie sooner or later), but the time it takes to resolve can be arbitrarily long.

Before going further, it's worth looking at what "between the two rails" actually means in voltage terms. The voltage transfer characteristic of a single digital gate maps input voltage to output voltage; the classic shape is flat at the rails with a steep transition in the middle:

Figure: inverter voltage transfer characteristic (output voltage vs. input voltage for a single digital gate). A digital gate's transfer curve is steep in the middle and flat at the rails. V_IH and V_IL are the input thresholds (the minimum input reliably read as logic 1, and the maximum reliably read as logic 0); V_OH and V_OL are the matching output guarantees (the voltage the gate actually produces when driving high or low). Inputs in the shaded band (between V_IL and V_IH) aren't reliably read as either level, and the output during that band sits between V_OL and V_OH. CMOS is shown, but the shape is general; any inverter has this structure.

A single inverter's curve shows what "indeterminate voltage" means, but it doesn't explain two things we need next: why the flip-flop can get stuck at an indeterminate voltage in the first place, and why the escape is governed by an exponential rather than being merely slow. Both fall out of the cross-coupled inverter pair that lives inside every flop's storage latch:

Figure: cross-coupled inverter pair (latch storage cell). Colors identify the inverters: top = red, bottom = blue, matching the butterfly curves that follow. The small 0/1 digits show the two stable states. In State A (red digits), the top inverter holds 0→1 and the bottom holds 1→0; each output feeds the other's input and reinforces it. State B (blue digits) is the mirror, every value flipped, and is equally stable. A third equilibrium exists; the butterfly diagram reveals it.

The top inverter (red) imposes Y = inv(X) and the bottom inverter (blue) imposes X = inv(Y). Plotted on the same axes (each curve colored to match its inverter), the two are mirror images across the diagonal Y = X and intersect at the equilibria. This is the butterfly diagram:

Figure: butterfly diagram for cross-coupled inverters (two transfer curves overlaid; three intersections, three equilibria). The corners are stable equilibria: perturbations snap back. The center (V_M, V_M) is unstable: both nodes sit in their forbidden bands, neither inverter has railed, and any nudge is amplified until the system resolves to a corner. The closer the starting point is to the center, the longer resolution takes, growing without bound as the initial offset shrinks toward zero. That is metastability.

This is where the $τ$ in the next section comes from. Near the unstable center point, the loop gain through both inverters is greater than one, so a tiny voltage offset $ϵ$ grows exponentially as $ϵ \cdot e^{t / τ}$, where $τ$ is set by the inverter's small-signal gain and the capacitance it drives (tens of picoseconds in a modern silicon flop). The system has "resolved" once that growing offset reaches a full rail, which takes about $τ \cdot \ln(V_{DD} / ϵ)$. Thermal noise picks $ϵ$ but doesn't lower-bound it, so resolution time has no upper bound either, only an exponentially decaying survival probability, $e^{-t / τ}$. That exponential is what makes the MTBF formula in the next section behave the way it does.
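To make the logarithm concrete, here is a minimal Python sketch of the resolution-time estimate $τ \cdot \ln(V_{DD} / ϵ)$. The values of `TAU` (30 ps) and `VDD` (1.0 V) are illustrative assumptions, not figures for any particular process, and `resolution_time` is a hypothetical helper name.

```python
import math

TAU = 30e-12   # assumed metastability time constant, in seconds (illustrative)
VDD = 1.0      # assumed supply voltage, in volts (illustrative)

def resolution_time(epsilon):
    """Time for an initial offset epsilon (volts) to grow to VDD as epsilon * e^(t / TAU)."""
    return TAU * math.log(VDD / epsilon)

for epsilon in (1e-3, 1e-6, 1e-9):
    t_ps = resolution_time(epsilon) * 1e12
    print(f"epsilon = {epsilon:.0e} V -> about {t_ps:.0f} ps to resolve")
```

Under these assumptions, every factor of 1000 closer to the center costs the same extra 207 ps or so. A starting point can always be closer, so there is no hard upper bound, but long resolution times require exponentially unlikely starting offsets.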

This is metastability: the flip-flop is "stuck" in an unresolved state for an unpredictable amount of time before settling. While it is stuck, its output is at some indeterminate voltage, possibly oscillating, possibly just sitting somewhere in the forbidden region between logic thresholds. Any downstream gate that samples this output may interpret it as a 0, as a 1, or (worst of all) might itself go metastable.

In a single-clock design, the static timing analyzer prevents this from ever happening: the tool checks that every path from one flip-flop to the next meets setup and hold, and if it doesn't, the tool reports a violation and you fix it. In a multi-clock design, the relationship between the two clocks is undefined (there is no fixed phase relationship to analyze), and so there is no way to guarantee that a signal in clock domain A will not change during the setup/hold window of a flip-flop in clock domain B. It will, eventually. The question is only how often, and what happens when it does.

Metastability math, briefly

How often does a flip-flop go metastable? And having gone metastable, how long will it take to resolve?

The standard model treats the resolution time as an exponential decay: if a flip-flop enters metastability at time $t = 0$, the probability that it has not yet resolved by time $t$ falls as $e^{-t / τ}$, where $τ$ is a process-and-library-dependent time constant. For modern FPGA flip-flops, $τ$ is on the order of tens of picoseconds.

The mean time between failures (MTBF) for a simple flip-flop sampling an asynchronous input is roughly:

$$\mathrm{MTBF} = \frac{e^{t_r / τ}}{T_0 \cdot f_{clk} \cdot f_{data}}$$

where:

$t_r$ is the resolution time: how long the flip-flop has to resolve out of metastability before its output is consumed (essentially, the slack between the sampling edge and downstream logic),
$τ$ is the metastability time constant (a property of the silicon),
$T_0$ is the metastability window: the size of the time interval around the clock edge during which an input transition can trigger metastability (a property of the silicon),
$f_{clk}$ is the sampling clock frequency, and
$f_{data}$ is the rate at which the asynchronous input changes (its toggle rate).

The key insight is the exponential in the numerator. MTBF is exponentially sensitive to the resolution time $t_r$. Give the flip-flop twice the time to resolve, and MTBF doesn't double; it squares (roughly).

This is why the standard fix works, and works so well. Let's put numbers on it.
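As a rough sketch, the formula can be evaluated directly in Python. Every constant below is an illustrative assumption rather than a datasheet value, and treating the second flop as simply adding one full clock period to $t_r$ is a simplification; `mtbf_seconds` is a hypothetical helper name.

```python
import math

def mtbf_seconds(t_r, tau, t0, f_clk, f_data):
    """MTBF = e^(t_r / tau) / (T0 * f_clk * f_data), all quantities in SI units."""
    return math.exp(t_r / tau) / (t0 * f_clk * f_data)

# Illustrative assumptions only; real values come from device characterization data.
TAU = 50e-12        # metastability time constant
T0 = 100e-12        # metastability window
F_CLK = 250e6       # sampling clock (4 ns period)
F_DATA = 10e6       # toggle rate of the asynchronous input
PERIOD = 1.0 / F_CLK
SLACK = 1.5e-9      # resolution time left when logic directly follows the first flop

SECONDS_PER_YEAR = 365.25 * 24 * 3600

one_flop = mtbf_seconds(SLACK, TAU, T0, F_CLK, F_DATA)
two_flop = mtbf_seconds(SLACK + PERIOD, TAU, T0, F_CLK, F_DATA)  # one extra period to resolve

print(f"one flop : {one_flop / SECONDS_PER_YEAR:.2f} years")
print(f"two flops: {two_flop / SECONDS_PER_YEAR:.2e} years")
print(f"gain     : {two_flop / one_flop:.2e}x")
```

With these assumed numbers the single flop comes out at a little over a year per device, which sounds safe until you multiply by fleet size: ten thousand such units would see a failure somewhere in the fleet roughly every hour. One extra clock period of resolution time multiplies the result by $e^{80}$.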

Two things to take from this:

A single unprotected flip-flop sampling an async signal will go metastable often enough to matter. Not in your lab. Not in your FPGA demo. But in the field, across thousands of units, over months of runtime, absolutely, demonstrably, yes.
Each additional flip-flop in the synchronizer chain multiplies MTBF by a huge factor, because each one buys you another full clock period of resolution time, and MTBF is exponential in resolution time.

This is the entire theoretical justification for the single most common piece of CDC advice: use a two-flop synchronizer.

Single-bit CDC: the two-flop synchronizer

Here is the technique in its simplest form. Suppose you have a single-bit signal sig_a generated in clock domain A (clock clk_a) and you want to sample it in clock domain B (clock clk_b). The two clocks are unrelated.

The correct structure is two back-to-back flip-flops clocked on clk_b, with the signal from domain A connected to the first flip-flop's input:

// Two-flop synchronizer for a single-bit signal
module sync_1bit (
    input  wire clk_b,
    input  wire rst_n,      // active-low reset in domain B
    input  wire sig_a,      // asynchronous input from another domain
    output wire sig_b       // synchronized output in domain B
);
    (* ASYNC_REG = "TRUE" *) reg sync_ff1;
    (* ASYNC_REG = "TRUE" *) reg sync_ff2;

    always @(posedge clk_b or negedge rst_n) begin
        if (!rst_n) begin
            sync_ff1 <= 1'b0;
            sync_ff2 <= 1'b0;
        end else begin
            sync_ff1 <= sig_a;
            sync_ff2 <= sync_ff1;
        end
    end

    assign sig_b = sync_ff2;
endmodule

What's happening here: the first flip-flop (sync_ff1) is the one that actually samples the asynchronous input. It may well go metastable, and that's fine; that's expected. What we've bought it is a full clock period of clk_b to resolve before its output is consumed by the second flip-flop (sync_ff2). By the time sync_ff2 samples sync_ff1, the probability that sync_ff1 is still metastable is astronomically small (that's the exponential in the MTBF formula doing its job). sync_ff2's output is therefore, for all practical purposes, a clean synchronized version of the original signal.

A few important details about this structure:

The ASYNC_REG attribute (Xilinx syntax) tells the synthesis tool that these flip-flops are a synchronizer and should be placed physically adjacent, in the same slice, with no combinational logic between them. The Intel/Altera Quartus equivalent is the SYNCHRONIZER_IDENTIFICATION logic option set to FORCED, applied either via a QSF set_instance_assignment or an inline (* altera_attribute = "-name SYNCHRONIZER_IDENTIFICATION FORCED" *) directive; other tools have their own names. This matters because every picosecond of delay between the two flip-flops eats into the resolution time and reduces MTBF exponentially. Without this attribute, the placer is free to scatter the two flip-flops wherever it likes, and your carefully-designed synchronizer loses most of its MTBF.
No combinational logic between the two flip-flops. None. Not even an inverter. Every gate between them is a picosecond of lost $t_r$, and we already saw how expensive that is.
Register the source signal in domain A before it crosses. The signal feeding sync_ff1 should come out of a domain-A flip-flop, not directly from combinational logic. Combinational signals can glitch as their inputs settle, and a glitch that lands inside sync_ff1's sampling window will be captured as a clean (but spurious) transition in domain B. Registering on the source side guarantees the signal is glitch-free between clock edges.
The reset is in domain B. The synchronizer's flip-flops are reset by domain B's reset, not domain A's; using a foreign domain's reset would just reintroduce the CDC problem we're trying to solve. The negedge rst_n in the sensitivity list makes the reset asynchronous; clean deassertion relies on rst_n having already been run through a reset synchronizer for domain B, which gives the usual async-assert/sync-deassert behavior (see the Reset CDC section below).

The two-flop synchronizer is correct, simple, robust, and universally applicable for single-bit signals that represent levels, and where the receiver is allowed to observe the transition one or two cycles late. That set of caveats matters: short pulses break the "level" assumption (handled next), multi-bit buses break the "single-bit" assumption (the section after that), and latency-sensitive receivers basically can't use a 2-flop sync at all.

Pulse synchronizers

The two-flop synchronizer assumes your input is stable long enough for clk_b to reliably sample it. For a level signal held for many cycles, this is fine. But if sig_a pulses high for a single cycle of clk_a, and clk_a is faster than clk_b, there's no guarantee that clk_b will have a rising edge while the pulse is high. The synchronizer may miss the pulse entirely. This isn't a CDC bug per se (the synchronizer is doing exactly what it promised); it's a sampling problem.
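How often the pulse is missed is easy to estimate with an idealized Python sketch. The clock periods here are assumptions chosen for illustration, `pulse_is_sampled` is a hypothetical helper, and setup/hold windows and metastability are ignored: the model only asks whether any sampling edge lands while the pulse is high.

```python
def pulse_is_sampled(start_ps, width_ps, period_ps, phase_ps=0):
    """True if some sampling edge (phase + k * period) lands while the pulse is high."""
    k = -((phase_ps - start_ps) // period_ps)   # ceil((start - phase) / period)
    first_edge = phase_ps + k * period_ps       # first edge at or after the pulse start
    return first_edge < start_ps + width_ps

SRC_PERIOD = 2000   # ps, assumed 500 MHz source clock; the pulse lasts one source cycle
DST_PERIOD = 5000   # ps, assumed 200 MHz destination clock

starts = range(DST_PERIOD)                      # sweep every 1 ps alignment
caught = sum(pulse_is_sampled(s, SRC_PERIOD, DST_PERIOD) for s in starts)
print(f"pulse sampled in {caught} of {len(starts)} alignments")
```

In this model the destination catches the pulse in only 2000 of the 5000 alignments, which is just the ratio of pulse width to sampling period. No amount of synchronizer depth changes that ratio.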

If you need to transfer pulses from a fast domain to a slow domain, you need a pulse synchronizer (often called a toggle synchronizer): convert the pulse into a level change on the source side, synchronize the level across, and edge-detect on the destination side. The structure looks like this:

// Toggle synchronizer: convert a source-domain pulse to a destination-domain
// pulse by toggling a level on each input, synchronizing the level across,
// and edge-detecting on the far side.
module pulse_sync (
    input  wire clk_src,
    input  wire rst_n_src,
    input  wire pulse_src,    // one-cycle pulse in source domain

    input  wire clk_dst,
    input  wire rst_n_dst,
    output wire pulse_dst     // one-cycle pulse in destination domain
);
    // Source side: toggle the level on each input pulse
    reg toggle_src;
    always @(posedge clk_src or negedge rst_n_src) begin
        if (!rst_n_src)     toggle_src <= 1'b0;
        else if (pulse_src) toggle_src <= ~toggle_src;
    end

    // Two-flop synchronize the level into the destination domain,
    // plus one more register to hold the previous value for edge detection.
    (* ASYNC_REG = "TRUE" *) reg toggle_sync1, toggle_sync2;
    reg toggle_sync3;
    always @(posedge clk_dst or negedge rst_n_dst) begin
        if (!rst_n_dst) begin
            toggle_sync1 <= 1'b0;
            toggle_sync2 <= 1'b0;
            toggle_sync3 <= 1'b0;
        end else begin
            toggle_sync1 <= toggle_src;
            toggle_sync2 <= toggle_sync1;
            toggle_sync3 <= toggle_sync2;
        end
    end

    // Edge detect: pulse whenever the synced toggle has changed value
    assign pulse_dst = toggle_sync2 ^ toggle_sync3;
endmodule

The caveat is that consecutive source pulses must be spaced at least two destination clock periods apart; otherwise toggles arrive faster than the destination can resolve them and you start dropping pulses. For sustained rates anywhere near the destination clock, use an async FIFO instead.
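The spacing caveat can be seen in a small behavioral model of the toggle synchronizer. This is a sketch under simplifying assumptions: the synchronizer flops are ideal (a sample sees either the old or the new level, never a metastable one), the toggle flips exactly on the source edge with the given cycle index, and the clock periods and phases are invented for illustration. `toggle_sync_pulse_count` is a hypothetical helper.

```python
def toggle_sync_pulse_count(pulse_cycles, src_period, dst_period, dst_phase, sim_time):
    """Count destination pulses produced for source pulses at the given source-cycle indices."""
    toggle_times = sorted(c * src_period for c in pulse_cycles)   # when toggle_src flips
    sync1 = sync2 = sync3 = 0
    pulses = 0
    t = dst_phase
    while t < sim_time:
        toggle_src = sum(1 for tt in toggle_times if tt < t) & 1  # level seen at this edge
        sync1, sync2, sync3 = toggle_src, sync1, sync2            # nonblocking-style update
        pulses += sync2 ^ sync3                                   # edge detect
        t += dst_period
    return pulses

SRC_PERIOD, DST_PERIOD = 2000, 5000   # ps, assumed 500 MHz source and 200 MHz destination

spaced = toggle_sync_pulse_count([10, 20, 30], SRC_PERIOD, DST_PERIOD,
                                 dst_phase=1234, sim_time=100_000)
crowded = toggle_sync_pulse_count([10, 11], SRC_PERIOD, DST_PERIOD,
                                  dst_phase=3000, sim_time=100_000)
print(f"well-spaced pulses:  3 sent, {spaced} received")
print(f"back-to-back pulses: 2 sent, {crowded} received")
```

The well-spaced pulses all arrive. The back-to-back pair toggles the level up and back down between two destination edges, so the destination never sees a change and both pulses vanish.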

The multi-bit problem: why you cannot just use more synchronizers

Suppose instead of a single bit, you have an 8-bit count value in domain A, and you need to read it in domain B. A reasonable-but-wrong first instinct is: "I'll just instantiate eight two-flop synchronizers, one per bit."

This is catastrophic. Here's why.

Consider a counter incrementing through the sequence ...00000111 → 00001000 → 00001001.... In a single edge of the source domain's clock, the counter transitions from 00000111 to 00001000. Four bits change simultaneously on the source side.

But "simultaneously" only means simultaneously in the source clock domain. The bus crosses the boundary on eight separate wires, each with its own propagation delay, its own routing, its own loading. Each of the eight synchronizers samples its bit at the destination clock's edge, and each of the four changing bits may arrive slightly before or after that edge, independently of the others.

Figure: source counter increments 7 → 8 (binary 0111 → 1000); each bit propagates with its own delay. The destination clock samples mid-transition. Depending on which bits have settled, the destination sees something like 1011, 1001, or 1101, none of which ever existed in the source domain.

So the destination-side sampler can easily observe an intermediate value that never existed in the source domain. You can get 00000000, 00001111, 00000011, any combination of old and new bits mixed together. For one cycle of clk_b, the counter in domain B appears to take an insane value. The next cycle, it has settled to 00001000. The cycle after that, perhaps to 00001001.
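A short Python sketch enumerates what a per-bit sampler could capture during the 0111 → 1000 transition, assuming each bit independently shows either its old or its new value at the sampling edge. The helper name `observable_values` is hypothetical.

```python
from itertools import product

OLD, NEW, WIDTH = 0b0111, 0b1000, 4

def observable_values(old, new, width):
    """Every value a sampler could capture if each bit independently shows old or new."""
    seen = set()
    for choice in product((0, 1), repeat=width):   # 0 = bit still old, 1 = bit already new
        value = 0
        for bit, use_new in enumerate(choice):
            source = new if use_new else old
            value |= source & (1 << bit)
        seen.add(value)
    return seen

values = observable_values(OLD, NEW, WIDTH)
bogus = sorted(values - {OLD, NEW})
print(f"{len(values)} possible samples, {len(bogus)} of which never existed in the source:")
print(" ".join(f"{v:04b}" for v in bogus))
```

Because all four low bits change at once, all 16 four-bit patterns are reachable, and 14 of them are values the source never held.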

If downstream logic in domain B is making decisions based on the counter value (comparing it to a threshold, using it as a RAM address, driving a state machine transition), those decisions are wrong for that one cycle. The design doesn't crash; it just occasionally produces garbage for a few nanoseconds, in ways that depend on the microsecond-scale phase drift between two PLLs on the chip.

This is the bug that convinces teams they have "sporadic corruption" that "only happens at high speed." It is also the bug that CDC lint tools are most aggressive about, because it is the most common and the most destructive.

The fix is never "more synchronizers." The fix is one of three structural approaches, depending on the shape of your data:

Gray-code the bus if the signal is a counter or something that changes incrementally.
Use a handshake if the signal is a control/data word that only changes occasionally, and both domains can afford to wait.
Use an async FIFO if the signal is a stream of data at high rate.

Each of these is not a synchronizer plus some extra logic; each is a different structural approach that avoids the multi-bit problem by ensuring that whatever gets sampled on the destination side is always a legal value. Let's go through them.

Reconvergence: the trap that survives correct synchronization

The multi-bit problem has a sibling that lint tools flag under "reconvergence" or "data correlation." Suppose two single-bit signals each cross from domain A to domain B through their own correct two-flop synchronizer, and downstream domain-B logic then combines them: an AND, a comparator's two inputs, two conditions in a state-machine transition. On any given event in domain A, the two synchronizer chains have independent sampling latencies: one bit may emerge on the first sample edge, the other on the second. For that one cycle, the downstream combination takes a value the source never produced, and any logic that depended on the correlation between the two bits sees garbage. The fix is the same as for buses: encode the correlated bits into one Gray-coded value, gate them with a single handshake, or push the whole correlated payload through a FIFO. As soon as you have two bits that must be consistent in domain B, they must cross together, not independently.
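The reconvergence hazard can be reproduced with two ideal two-flop synchronizers in Python. This is a sketch: the edge times, change time, and wire delays are invented for illustration, `synced_sequence` is a hypothetical helper, and metastability is ignored so that only the sampling-latency difference remains.

```python
def synced_sequence(change_time, wire_delay, edges):
    """Output of an ideal two-flop synchronizer for a 0 -> 1 step, one value per destination edge."""
    ff1 = ff2 = 0
    out = []
    for edge in edges:
        ff1, ff2 = int(edge > change_time + wire_delay), ff1   # nonblocking-style update
        out.append(ff2)
    return out

edges = [1000 * k for k in range(1, 8)]   # destination clock edges, in ps

# Both bits rise on the same source edge at 2950 ps; only the wire delays differ.
a = synced_sequence(change_time=2950, wire_delay=20, edges=edges)   # arrives before the 3000 ps edge
b = synced_sequence(change_time=2950, wire_delay=80, edges=edges)   # arrives just after it

mismatch_cycles = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print("a:", a)
print("b:", b)
print("cycles where a != b:", mismatch_cycles)
```

In the source domain the two bits are always equal. In the destination domain they disagree for exactly one cycle, which is all it takes to mislead logic that assumes they move together.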

Multi-bit technique 1: Gray code for counters

Gray code is a binary encoding with one defining property: adjacent values differ by exactly one bit. The 3-bit Gray code counts 000 → 001 → 011 → 010 → 110 → 111 → 101 → 100 → 000 → ....

This property is exactly what we need. If a counter is Gray-coded, then every increment changes exactly one bit. When the destination domain samples the bus during a transition, it either sees the old value (the transitioning bit hasn't propagated yet) or the new value (it has), but never a half-updated intermediate, because only one bit is ever in flight at a time.

Figure: Gray code, where exactly one bit changes between adjacent counts. A cross-domain sample reads old or new, never garbage.

Here's a Gray-code counter being crossed from domain A to domain B:

// Gray-coded counter with CDC to another domain
module gray_counter_cdc #(
    parameter WIDTH = 8
) (
    input  wire              clk_a,
    input  wire              rst_n_a,
    input  wire              inc_a,
    input  wire              clk_b,
    input  wire              rst_n_b,
    output wire [WIDTH-1:0]  count_b_binary
);

    // Source domain: maintain binary counter, convert to Gray for crossing
    reg  [WIDTH-1:0] count_a_bin;
    reg  [WIDTH-1:0] count_a_gray;

    // Binary-to-Gray on the next count: gray = bin XOR (bin >> 1)
    wire [WIDTH-1:0] count_a_bin_next  = count_a_bin + inc_a;
    wire [WIDTH-1:0] count_a_gray_next = count_a_bin_next ^ (count_a_bin_next >> 1);

    // Register the Gray value in domain A so the crossing bus is glitch-free.
    // Feeding the synchronizer straight from the XOR gates could glitch on
    // several bits at once and defeat the one-bit-per-step guarantee.
    always @(posedge clk_a or negedge rst_n_a) begin
        if (!rst_n_a) begin
            count_a_bin  <= {WIDTH{1'b0}};
            count_a_gray <= {WIDTH{1'b0}};
        end else begin
            count_a_bin  <= count_a_bin_next;
            count_a_gray <= count_a_gray_next;
        end
    end

    // Two-flop synchronize the Gray-coded bus into domain B
    (* ASYNC_REG = "TRUE" *) reg [WIDTH-1:0] gray_sync1;
    (* ASYNC_REG = "TRUE" *) reg [WIDTH-1:0] gray_sync2;

    always @(posedge clk_b or negedge rst_n_b) begin
        if (!rst_n_b) begin
            gray_sync1 <= {WIDTH{1'b0}};
            gray_sync2 <= {WIDTH{1'b0}};
        end else begin
            gray_sync1 <= count_a_gray;
            gray_sync2 <= gray_sync1;
        end
    end

    // Destination domain: convert Gray back to binary
    // bin[i] = XOR of all gray bits from MSB down to bit i
    genvar i;
    generate
        for (i = 0; i < WIDTH; i = i + 1) begin : gray_to_bin
            assign count_b_binary[i] = ^(gray_sync2 >> i);
        end
    endgenerate

endmodule

The binary-to-Gray conversion gray = bin ^ (bin >> 1) is the classic trick, and it's free: just XOR gates. The Gray-to-binary conversion on the destination side is a chain of XORs from the MSB down.
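Both conversions are easy to sanity-check outside the simulator. The Python sketch below mirrors the two expressions used in the Verilog and checks, for every 8-bit count including the wrap from 255 back to 0, that exactly one Gray bit changes per increment and that the round trip recovers the original value. The function names are hypothetical.

```python
def bin_to_gray(value):
    """gray = bin XOR (bin >> 1)."""
    return value ^ (value >> 1)

def gray_to_bin(gray, width):
    """bin[i] = XOR of gray bits from the MSB down to bit i (a reduction XOR per bit)."""
    result = 0
    for i in range(width):
        parity = bin(gray >> i).count("1") & 1
        result |= parity << i
    return result

WIDTH = 8
for n in range(1 << WIDTH):
    nxt = (n + 1) % (1 << WIDTH)                 # includes the wrap from 255 back to 0
    changed = bin_to_gray(n) ^ bin_to_gray(nxt)
    assert bin(changed).count("1") == 1          # exactly one bit differs
    assert gray_to_bin(bin_to_gray(n), WIDTH) == n

print([f"{bin_to_gray(n):03b}" for n in range(8)])
```

The printed list reproduces the 3-bit sequence given earlier: 000, 001, 011, 010, 110, 111, 101, 100.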

A subtle but critical point: the destination value may be up to two cycles of clk_b stale relative to the source. That's fine for applications where "the counter" is a position that changes over time, such as a FIFO write pointer seen from the read side. It is not fine if you need the destination to know the counter's value at a specific moment in the source domain. In that case, you need a handshake.

Multi-bit technique 2: Handshake for control transfers

Sometimes you need to transfer a multi-bit value that isn't a counter (a command, a configuration register write, a one-off data word) and you need the destination to see the exact value the source sent, at a known time. For this, use a request/acknowledge handshake.

The structure: the source puts data on a bus and asserts a req signal. The destination synchronizes req, samples the data bus on the synchronized request, and asserts ack. The source synchronizes ack, and on seeing it, deasserts req (and is now free to update the data). The destination synchronizes the deasserted req and deasserts ack. Both sides are now back to idle.

The key insight is that the data bus itself is not synchronized; only the single-bit req and ack lines are. The data is held stable on the source side from the moment req is asserted until the moment ack is deasserted, which means it is stable for many cycles of the destination clock when the destination samples it. Under those conditions, sampling a wide bus is safe; there's no "bit in flight" to catch halfway.

// 4-phase handshake CDC for multi-bit control/data transfer
module handshake_cdc #(
    parameter WIDTH = 32
) (
    input  wire              clk_src,
    input  wire              rst_n_src,
    input  wire [WIDTH-1:0]  data_src,
    input  wire              send_src,     // pulse to initiate transfer
    output reg               busy_src,     // high until transfer completes

    input  wire              clk_dst,
    input  wire              rst_n_dst,
    output reg  [WIDTH-1:0]  data_dst,
    output reg               valid_dst     // one-cycle pulse when data_dst is new
);

    // Destination-side ack, declared up front because the source-side
    // synchronizer below reads it (Verilog requires declaration before use).
    wire ack_dst;
    reg  ack_dst_reg;

    // --- Source side ---
    reg              req_src;
    reg  [WIDTH-1:0] data_src_held;

    // Synchronize ack back from destination into source domain
    (* ASYNC_REG = "TRUE" *) reg ack_sync1, ack_sync2;
    always @(posedge clk_src or negedge rst_n_src) begin
        if (!rst_n_src) begin
            ack_sync1 <= 1'b0;
            ack_sync2 <= 1'b0;
        end else begin
            ack_sync1 <= ack_dst;
            ack_sync2 <= ack_sync1;
        end
    end

    always @(posedge clk_src or negedge rst_n_src) begin
        if (!rst_n_src) begin
            req_src       <= 1'b0;
            busy_src      <= 1'b0;
            data_src_held <= {WIDTH{1'b0}};
        end else begin
            if (send_src && !busy_src) begin
                req_src       <= 1'b1;
                busy_src      <= 1'b1;
                data_src_held <= data_src;
            end else if (req_src && ack_sync2) begin
                // destination has acknowledged; drop req
                req_src  <= 1'b0;
            end else if (!req_src && !ack_sync2 && busy_src) begin
                // destination has deasserted ack; handshake complete
                busy_src <= 1'b0;
            end
        end
    end

    // --- Destination side ---

    // Synchronize req into destination domain
    (* ASYNC_REG = "TRUE" *) reg req_sync1, req_sync2;
    always @(posedge clk_dst or negedge rst_n_dst) begin
        if (!rst_n_dst) begin
            req_sync1 <= 1'b0;
            req_sync2 <= 1'b0;
        end else begin
            req_sync1 <= req_src;
            req_sync2 <= req_sync1;
        end
    end

    always @(posedge clk_dst or negedge rst_n_dst) begin
        if (!rst_n_dst) begin
            data_dst    <= {WIDTH{1'b0}};
            valid_dst   <= 1'b0;
            ack_dst_reg <= 1'b0;
        end else begin
            valid_dst <= 1'b0;  // default
            if (req_sync2 && !ack_dst_reg) begin
                // sample the held data bus; it's been stable for many cycles
                data_dst    <= data_src_held;
                valid_dst   <= 1'b1;
                ack_dst_reg <= 1'b1;
            end else if (!req_sync2 && ack_dst_reg) begin
                ack_dst_reg <= 1'b0;
            end
        end
    end

    assign ack_dst = ack_dst_reg;

endmodule

This is more code than the Gray counter, and there's a reason: we're trading latency for exactness. A handshake takes multiple round-trip times (each synchronizer is two cycles on each side, plus the source needs to see ack rise and fall), so transfers can easily cost 6 to 10 cycles of the slower clock. If you're transferring at a rate anywhere close to that, you'll back-pressure the source. For bulk data, use a FIFO instead.
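To get a feel for that cost, here is a rough Python model of one complete 4-phase transfer. It is a sketch that mirrors the req/ack state machines of the Verilog above with ideal synchronizers (no metastability); the clock periods and phase are assumptions chosen for illustration, and `handshake_round_trip` is a hypothetical helper.

```python
def handshake_round_trip(src_period, dst_period, dst_phase):
    """Time (same units as the periods) from send being accepted at t = 0 until busy drops."""
    req = busy = ack = 0
    ack_s1 = ack_s2 = 0            # ack synchronizer, source domain
    req_s1 = req_s2 = 0            # req synchronizer, destination domain
    t_src, t_dst = 0, dst_phase
    started = False
    while True:
        if t_src < t_dst:                          # next event is a source clock edge
            prev_ack_s2 = ack_s2
            ack_s1, ack_s2 = ack, ack_s1
            if not started:
                req, busy, started = 1, 1, True    # send accepted on the first source edge
            elif req and prev_ack_s2:
                req = 0                            # ack seen: drop req
            elif busy and not req and not prev_ack_s2:
                return t_src                       # ack released: transfer complete
            t_src += src_period
        else:                                      # next event is a destination clock edge
            prev_req_s2 = req_s2
            req_s1, req_s2 = req, req_s1
            if prev_req_s2 and not ack:
                ack = 1                            # the held data would be captured here
            elif not prev_req_s2 and ack:
                ack = 0
            t_dst += dst_period

SRC_PERIOD, DST_PERIOD = 2000, 5000   # ps, assumed 500 MHz source and 200 MHz destination
latency = handshake_round_trip(SRC_PERIOD, DST_PERIOD, dst_phase=500)
print(f"round trip: {latency} ps = {latency / DST_PERIOD:.1f} destination cycles")
```

With these assumed clocks the model takes 36 ns, about seven cycles of the slower clock, before the source is free to send again. The exact figure shifts with the clock ratio and phase.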

Multi-bit technique 3: The asynchronous FIFO

An async FIFO is the workhorse of CDC. Any time you have a data stream crossing a clock boundary (packets from a MAC, samples from an ADC, computation results going to an I/O interface), the right answer is almost always an async FIFO. It provides not just synchronization but also buffering, which absorbs small rate mismatches between the two sides.

The classical async FIFO architecture, due to Clifford Cummings, works like this:

A dual-port memory (block RAM) stores the data. One port is written from the write clock; the other port is read from the read clock.
A write pointer increments in the write domain; a read pointer increments in the read domain.
Each pointer is converted to Gray code and synchronized into the other domain for comparison purposes.
The write side compares its own write pointer against the synchronized read pointer to detect "full."
The read side compares its own read pointer against the synchronized write pointer to detect "empty."

The combination of Gray coding and two-flop synchronization on each pointer is what makes this safe. And notice what is and isn't being synchronized: the pointers cross the domain boundary, but the data itself never does. It sits in the dual-port RAM, written by one clock and read by the other, and since RAM is designed to support this kind of access, there's no CDC problem on the data.

One architectural constraint worth flagging: this scheme requires DEPTH to be a power of two. The Gray-coded full-detect trick relies on the binary pointer's MSB flip aligning with the FIFO wrap, which only happens when DEPTH = 2^ADDR_WIDTH. Non-power-of-two depths need additional pointer-management logic that production IP cores include and the bare-bones architecture above does not.
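The full and empty comparisons, and the power-of-two constraint, can be checked exhaustively for a small FIFO in Python. This sketch compares pointers directly and ignores synchronizer latency; the function names are hypothetical, and the depth-6 case at the end is an invented counterexample.

```python
def gray(value):
    return value ^ (value >> 1)

def full_flag(wr_ptr, rd_ptr, addr_width):
    """Gray-domain full test: two MSBs inverted, remaining bits equal."""
    top_two = 0b11 << (addr_width - 1)
    return gray(wr_ptr) == (gray(rd_ptr) ^ top_two)

def empty_flag(wr_ptr, rd_ptr):
    """Gray-domain empty test: pointers identical."""
    return gray(wr_ptr) == gray(rd_ptr)

ADDR_WIDTH = 3
DEPTH = 1 << ADDR_WIDTH
MODULUS = 2 * DEPTH                    # pointers carry one extra wrap bit

for wr in range(MODULUS):
    for rd in range(MODULUS):
        fill = (wr - rd) % MODULUS     # entries currently stored
        assert full_flag(wr, rd, ADDR_WIDTH) == (fill == DEPTH)
        assert empty_flag(wr, rd) == (fill == 0)

# A hypothetical depth-6 pointer wrapping 5 -> 0 breaks the one-bit-per-step property.
bits_changed = bin(gray(5) ^ gray(0)).count("1")
print(f"gray(5) -> gray(0) changes {bits_changed} bits")
```

The exhaustive loop confirms that the Gray comparisons agree with the binary fill level for a depth of 8, and the last line shows three bits changing at once on a non-power-of-two wrap, which is exactly what the synchronizers cannot tolerate.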

// Asynchronous FIFO, with Gray-coded pointer synchronization.
// Simplified for illustration; production versions have more features.
module async_fifo #(
    parameter DATA_WIDTH = 32,
    parameter ADDR_WIDTH = 8
) (
    // Write side
    input  wire                   wr_clk,
    input  wire                   wr_rst_n,
    input  wire                   wr_en,
    input  wire [DATA_WIDTH-1:0]  wr_data,
    output reg                    wr_full,

    // Read side
    input  wire                   rd_clk,
    input  wire                   rd_rst_n,
    input  wire                   rd_en,
    output wire [DATA_WIDTH-1:0]  rd_data,
    output reg                    rd_empty
);

    localparam DEPTH = (1 << ADDR_WIDTH);

    // Dual-port memory shared between write and read clocks
    reg [DATA_WIDTH-1:0] mem [0:DEPTH-1];

    // Pointers are ADDR_WIDTH+1 bits wide, so MSB differentiates
    // "full" (pointers equal but MSBs differ) from "empty" (fully equal).
    reg [ADDR_WIDTH:0] wr_ptr_bin, wr_ptr_gray;
    reg [ADDR_WIDTH:0] rd_ptr_bin, rd_ptr_gray;

    // Pointer sync chains
    (* ASYNC_REG = "TRUE" *) reg [ADDR_WIDTH:0] rd_ptr_gray_sync1, rd_ptr_gray_sync2;
    (* ASYNC_REG = "TRUE" *) reg [ADDR_WIDTH:0] wr_ptr_gray_sync1, wr_ptr_gray_sync2;

    // Next-pointer combinational logic. Uses the *registered* wr_full/rd_empty;
    // driving the flags combinationally below would feed them back into
    // themselves through these expressions, creating a loop.
    wire [ADDR_WIDTH:0] wr_ptr_bin_next  = wr_ptr_bin  + (wr_en && !wr_full);
    wire [ADDR_WIDTH:0] wr_ptr_gray_next = wr_ptr_bin_next ^ (wr_ptr_bin_next >> 1);

    wire [ADDR_WIDTH:0] rd_ptr_bin_next  = rd_ptr_bin  + (rd_en && !rd_empty);
    wire [ADDR_WIDTH:0] rd_ptr_gray_next = rd_ptr_bin_next ^ (rd_ptr_bin_next >> 1);

    // Full / empty next-state (combinational), registered into the flags below.
    // Full: next write Gray pointer equals read pointer with the two MSBs
    // inverted, meaning the write side is one wrap ahead of the read side.
    wire wr_full_next  = (wr_ptr_gray_next == {~rd_ptr_gray_sync2[ADDR_WIDTH:ADDR_WIDTH-1],
                                                rd_ptr_gray_sync2[ADDR_WIDTH-2:0]});
    // Empty: next read Gray pointer equals synchronized write Gray pointer.
    wire rd_empty_next = (rd_ptr_gray_next == wr_ptr_gray_sync2);

    // Write-side: pointer update, RAM write, and full-flag register
    always @(posedge wr_clk or negedge wr_rst_n) begin
        if (!wr_rst_n) begin
            wr_ptr_bin  <= 0;
            wr_ptr_gray <= 0;
            wr_full     <= 1'b0;
        end else begin
            wr_ptr_bin  <= wr_ptr_bin_next;
            wr_ptr_gray <= wr_ptr_gray_next;
            wr_full     <= wr_full_next;
            if (wr_en && !wr_full)
                mem[wr_ptr_bin[ADDR_WIDTH-1:0]] <= wr_data;
        end
    end

    // Read-side: pointer update and empty-flag register (empty asserts on reset)
    always @(posedge rd_clk or negedge rd_rst_n) begin
        if (!rd_rst_n) begin
            rd_ptr_bin  <= 0;
            rd_ptr_gray <= 0;
            rd_empty    <= 1'b1;
        end else begin
            rd_ptr_bin  <= rd_ptr_bin_next;
            rd_ptr_gray <= rd_ptr_gray_next;
            rd_empty    <= rd_empty_next;
        end
    end

    assign rd_data = mem[rd_ptr_bin[ADDR_WIDTH-1:0]];

    // Synchronize read pointer (Gray) into write clock domain
    always @(posedge wr_clk or negedge wr_rst_n) begin
        if (!wr_rst_n) {rd_ptr_gray_sync1, rd_ptr_gray_sync2} <= 0;
        else begin
            rd_ptr_gray_sync1 <= rd_ptr_gray;
            rd_ptr_gray_sync2 <= rd_ptr_gray_sync1;
        end
    end

    // Synchronize write pointer (Gray) into read clock domain
    always @(posedge rd_clk or negedge rd_rst_n) begin
        if (!rd_rst_n) {wr_ptr_gray_sync1, wr_ptr_gray_sync2} <= 0;
        else begin
            wr_ptr_gray_sync1 <= wr_ptr_gray;
            wr_ptr_gray_sync2 <= wr_ptr_gray_sync1;
        end
    end

endmodule

A few things to notice:

The pointers are ADDR_WIDTH+1 bits wide, not ADDR_WIDTH. The extra MSB is what distinguishes "full" (both pointers at the same index, but one wrap ahead) from "empty" (both pointers identical). Without that extra bit, you cannot tell the two conditions apart.
The full condition compares the next write pointer, not the current one. This is because full is an inhibit signal: we need it to go active before the write that would overflow, not after.
The full and empty flags are registered, not driven combinationally. The naïve assign wr_full = (wr_ptr_gray_next == …) would create a feedback loop, because wr_ptr_gray_next itself depends on wr_full through the increment-or-hold. Registering breaks the loop without losing correctness: wr_full_next already looks at the would-be-next pointer, so the flag goes active on the same clock edge as the write that fills the last slot, in time to block the write after it.
The full and empty flags are conservative: because each pointer sync has up to two cycles of latency, the "full" flag on the write side sees a stale read pointer (the read side may have read more than the write side knows), and similarly the "empty" flag on the read side sees a stale write pointer. This is fine. The conservatism means you might refuse a write when the FIFO is briefly nearly-full, or block a read when data has just arrived. You never corrupt data; you just occasionally under-utilize the buffer by one or two slots. That's a trade every real FIFO makes.
The read port shown here is asynchronous (assign rd_data = mem[...] is combinational), which on most FPGAs maps to distributed LUT-RAM rather than block RAM. A BRAM-backed FIFO would register the read output and pay one cycle of read latency. Vendor IP cores expose a related choice: "first-word fall-through" mode, where the head word is visible before rd_en is asserted (the behavior shown here), versus "standard" mode, where data appears the cycle after rd_en.
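The full-detect comparison noted above (invert the two MSBs of the read Gray pointer) can be checked exhaustively for a small pointer width. The following is a minimal simulation-only sketch, not part of the FIFO itself; the module name gray_full_check and the choice of ADDR_WIDTH = 2 are assumptions made for illustration.

```verilog
// Simulation-only sanity check (hypothetical helper, not from the article's FIFO).
// For every read pointer value, place the write pointer exactly DEPTH entries
// ahead and confirm the Gray-coded "full" comparison holds.
module gray_full_check;
    localparam ADDR_WIDTH = 2;               // assumed small width for an exhaustive check
    localparam PTR_W      = ADDR_WIDTH + 1;  // pointers carry one extra wrap bit

    // Binary-to-Gray conversion, same expression as in the FIFO
    function [PTR_W-1:0] bin2gray(input [PTR_W-1:0] b);
        bin2gray = b ^ (b >> 1);
    endfunction

    integer i;
    integer errors;
    reg [PTR_W-1:0] rd_bin, wr_bin, rd_gray, wr_gray;

    initial begin
        errors = 0;
        for (i = 0; i < (1 << PTR_W); i = i + 1) begin
            rd_bin  = i;
            wr_bin  = rd_bin + (1 << ADDR_WIDTH);  // DEPTH ahead; wraps modulo 2^PTR_W
            rd_gray = bin2gray(rd_bin);
            wr_gray = bin2gray(wr_bin);
            // Full condition: write Gray pointer equals read Gray pointer
            // with its two most significant bits inverted.
            if (wr_gray !== {~rd_gray[PTR_W-1:PTR_W-2], rd_gray[PTR_W-3:0]}) begin
                errors = errors + 1;
                $display("mismatch: rd_bin=%0d wr_bin=%0d", rd_bin, wr_bin);
            end
        end
        $display("full-detect check finished, errors=%0d", errors);
        $finish;
    end
endmodule
```

The check works because adding DEPTH to a binary pointer flips only its MSB, and in the Gray encoding b ^ (b >> 1) that flip shows up in exactly the top two bits.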

In practice, every FPGA vendor provides a pre-verified async FIFO IP core, and you should use it rather than rolling your own unless you have a specific reason. The code above is illustrative. Production versions handle power-on reset behavior, partial-full/empty flags, and various corner cases that are easy to get wrong. But the architecture is exactly the one above.

Reset CDC: the one everyone forgets

Resets deserve their own discussion because resets violate a surprising number of the assumptions that make the rest of your design safe.

The first question is whether your reset is synchronous or asynchronous. A synchronous reset is sampled on a clock edge and behaves like any other input: timing analysis handles it, setup/hold applies, and if you get it right in one domain, you're fine. The tradeoff is that synchronous reset can't be used to initialize logic that has no clock yet (for instance, during power-up before PLLs lock), and in many technologies it is implemented as extra gating in front of each flip-flop's D input, which costs area and timing margin in the datapath.

An asynchronous reset, by contrast, goes to a dedicated reset pin on the flip-flop and takes effect immediately, without needing a clock edge. This is what you want for power-on reset and for bringing blocks out of clock-gated states: the reset forces the flip-flop to a known state even before clocks are running or stable.
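In RTL, the difference between the two styles comes down to the sensitivity list. A minimal sketch, with the hypothetical module name reset_styles and an active-low rst_n assumed to match the rest of this article:

```verilog
// Two flip-flops with the same data input, differing only in reset style.
// Module and signal names are illustrative.
module reset_styles (
    input  wire clk,
    input  wire rst_n,    // active-low reset
    input  wire d,
    output reg  q_sync,
    output reg  q_async
);
    // Synchronous reset: rst_n is only examined on a rising clock edge,
    // so nothing happens if the clock is not running.
    always @(posedge clk) begin
        if (!rst_n) q_sync <= 1'b0;
        else        q_sync <= d;
    end

    // Asynchronous reset: the falling edge of rst_n is in the sensitivity
    // list, so the flip-flop clears immediately, with or without a clock.
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) q_async <= 1'b0;
        else        q_async <= d;
    end
endmodule
```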

The problem is that asynchronous reset has its own subtle CDC hazard: asserting it asynchronously is fine, but deasserting it asynchronously is unsafe.

Consider: an async reset arrives and is distributed to every flip-flop in the domain. While it's asserted, every flip-flop is held in reset (good). Now the reset deasserts. If the deassertion happens to occur very close to a clock edge, it can violate the flip-flops' recovery and removal times, and some flip-flops may see "not reset" one clock cycle before their neighbors (because of skew in the reset distribution network relative to the clock tree). The logic then emerges from reset in an inconsistent state, with different flip-flops taking their first post-reset clock edge under different conditions. Downstream logic may latch nonsense.

The standard fix is async assert, sync deassert: allow the reset to assert asynchronously (so it takes effect immediately, no clock required), but gate its deassertion through a synchronizer so that it releases all flip-flops on the same clock cycle.

// Async-assert, sync-deassert reset synchronizer.
// Reset asserts immediately when raw_rst_n goes low,
// but deassertion is synchronous with clk.
module reset_sync (
    input  wire clk,
    input  wire raw_rst_n,   // async, active-low
    output wire sync_rst_n   // async-assert, sync-deassert
);
    (* ASYNC_REG = "TRUE" *) reg rst_ff1, rst_ff2;

    always @(posedge clk or negedge raw_rst_n) begin
        if (!raw_rst_n) begin
            rst_ff1 <= 1'b0;
            rst_ff2 <= 1'b0;
        end else begin
            rst_ff1 <= 1'b1;   // tied to logic-1
            rst_ff2 <= rst_ff1;
        end
    end

    assign sync_rst_n = rst_ff2;
endmodule

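Note the structure: rst_ff1's D input is tied to 1'b1. The flip-flops sit at 0 as long as raw_rst_n is low, because the async reset holds them. When raw_rst_n rises, the flip-flops are released, and they begin shifting the hardwired 1 through on each clock edge. If raw_rst_n is released too close to a clock edge, rst_ff1 may go metastable, but rst_ff2 is unaffected on that edge (its input and output are both still 0), and rst_ff1 has a full cycle to resolve before rst_ff2 samples it. After two clock edges, sync_rst_n goes high. The deassertion of sync_rst_n is therefore precisely aligned with a clock edge in clk's domain: safe.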

One reset synchronizer per clock domain. If your design has three clock domains, you need three reset synchronizers, each producing an async-assert/sync-deassert reset aligned to its own clock. Using the same synchronized reset in multiple domains reintroduces the problem you were trying to avoid.
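As a sketch of what that looks like structurally, the wrapper below instantiates the reset_sync module from above once per domain. The wrapper name reset_per_domain, the three clock names, and the por_n input are assumptions for illustration only.

```verilog
// One reset_sync instance per clock domain, all fed by the same raw reset.
// Names (reset_per_domain, clk_a/clk_b/clk_c, por_n) are hypothetical;
// reset_sync is the module defined earlier in this article.
module reset_per_domain (
    input  wire clk_a,
    input  wire clk_b,
    input  wire clk_c,
    input  wire por_n,     // raw asynchronous reset, active-low
    output wire rst_a_n,   // deasserts synchronously to clk_a
    output wire rst_b_n,   // deasserts synchronously to clk_b
    output wire rst_c_n    // deasserts synchronously to clk_c
);
    reset_sync u_rst_a (.clk(clk_a), .raw_rst_n(por_n), .sync_rst_n(rst_a_n));
    reset_sync u_rst_b (.clk(clk_b), .raw_rst_n(por_n), .sync_rst_n(rst_b_n));
    reset_sync u_rst_c (.clk(clk_c), .raw_rst_n(por_n), .sync_rst_n(rst_c_n));
endmodule
```

Each output should only ever reset flip-flops clocked by its own clock. All three assert together when por_n falls, but they release at slightly different times, each on an edge of its own clock.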

A few further reset pitfalls:

Reset tree skew. The reset signal fans out to potentially thousands of flip-flops. If its routing isn't balanced like a clock tree, some flip-flops see the edge before others. For async-assert/sync-deassert, this mostly matters on the assertion side (where the edge is still asynchronous); for small-to-medium designs, the place-and-route tools handle it adequately, but large designs sometimes need explicit reset-tree balancing.
Reset domains within a clock domain. You may want different parts of a clock domain to come out of reset at different times. For instance, you want a configuration register block to emerge from reset first, and only once it has stable configuration should a datapath block be released. The right answer is typically a small reset-sequencing FSM gating each sub-block's reset, not a chain of synchronizers-of-synchronizers.
Don't mix sync and async reset in the same flip-flop. Every flip-flop should be resettable by exactly one mechanism. Pick sync or async for each block and stick with it. FPGAs specifically tend to prefer synchronous resets for fabric logic (they pack better), with async resets reserved for globally-routed signals like POR.
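The reset-sequencing FSM mentioned in the list above might look something like the following. This is a minimal sketch under stated assumptions: the module name reset_sequencer, the cfg_done handshake, and the two sub-block resets are all hypothetical, and sync_rst_n is assumed to come from a reset synchronizer in the same clock domain.

```verilog
// Releases a configuration block first, then the datapath once configuration
// reports it is stable. All names here are illustrative.
module reset_sequencer (
    input  wire clk,
    input  wire sync_rst_n,  // domain reset, already async-assert/sync-deassert
    input  wire cfg_done,    // hypothetical "configuration is stable" flag
    output reg  cfg_rst_n,   // reset for the configuration register block
    output reg  dp_rst_n     // reset for the datapath block
);
    localparam S_HOLD = 2'd0,  // everything held in reset
               S_CFG  = 2'd1,  // config block running, datapath still in reset
               S_RUN  = 2'd2;  // both blocks released

    reg [1:0] state;

    always @(posedge clk or negedge sync_rst_n) begin
        if (!sync_rst_n) begin
            state     <= S_HOLD;
            cfg_rst_n <= 1'b0;
            dp_rst_n  <= 1'b0;
        end else begin
            case (state)
                S_HOLD: begin
                    cfg_rst_n <= 1'b1;       // release the config block first
                    state     <= S_CFG;
                end
                S_CFG: if (cfg_done) begin
                    dp_rst_n <= 1'b1;        // then release the datapath
                    state    <= S_RUN;
                end
                S_RUN: ;                     // stay here until the next reset
                default: state <= S_HOLD;
            endcase
        end
    end
endmodule
```

Because both outputs are registered in clk's domain, each sub-block's reset still deasserts on a clock edge, and both reassert immediately when sync_rst_n falls.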

Finding CDC bugs before they find you

CDC bugs are the textbook case of "you cannot test your way to correctness." A simulation runs for maybe milliseconds of simulated time; a real system runs for years across thousands of units. Your simulation will not hit the rare timing window that causes a metastability-induced failure, because in simulation your clocks have exact mathematical relationships. There is no phase drift, no jitter, no real-world asynchrony. You can bang on the RTL for a year of wall-clock simulation time and never see the bug that takes down a customer's deployment in its first week.

So you need static analysis. The techniques and tools:

CDC lint. Specialized tools (Synopsys SpyGlass CDC, Cadence Conformal CDC, Questa CDC, and others, with varying trade-offs for FPGA vs. ASIC flows) read your RTL and your clock definitions, identify every signal that crosses a clock boundary, and check that each crossing has an appropriate synchronizer structure. They're good. They catch the vast majority of naive bugs: missing synchronizers, multi-bit buses not using Gray or handshake, synchronizers without ASYNC_REG attributes, combinational logic in the middle of a sync chain. Running CDC lint should be a mandatory sign-off step before any tapeout or deployment.

Constrain your tools. Make sure your SDC (Synopsys Design Constraints) file properly declares clock groups and false paths across asynchronous boundaries. If you forget to declare two clocks as asynchronous, the static timing analyzer will try to close timing across the crossing as if it were synchronous, and depending on the tool it may either report meaningless violations or, worse, silently add logic to "close" paths that should have been CDC-protected. Your clock constraints are part of your design; treat them like code, review them in code review, and don't ship without them being correct.
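For two unrelated clocks, the relevant constraints might look like the fragment below. It is a sketch only: the port names wr_clk and rd_clk and the periods are assumptions, and your tool's documentation is the authority on which variant of these commands it expects.

```tcl
# Hypothetical clocks: names and periods are placeholders for illustration.
create_clock -name wr_clk -period 10.000 [get_ports wr_clk]
create_clock -name rd_clk -period 7.500  [get_ports rd_clk]

# Tell timing analysis that the two clocks have no fixed phase relationship,
# so paths between them are not analyzed as synchronous paths.
set_clock_groups -asynchronous -group {wr_clk} -group {rd_clk}
```

Declaring the groups removes timing analysis from the crossing entirely, which is exactly why the synchronizer structures on those paths have to be correct by construction.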

Simulate with random phase. Parameterize your testbench clock generators so the clocks have randomized initial phase and frequency ratios within their nominal tolerances, and rerun your test suite many times. You still won't hit every timing window, but you'll hit more than you would with fixed phase, and occasionally you'll trip a real bug. Gate-level simulation with SDF annotations helps too, for the same reason.
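A clock generator along these lines could look like the following testbench fragment. It is a sketch under assumptions: the module name, the nominal half-periods, the plusarg name seed, and the size of the period offset are all illustrative (the offset is deliberately exaggerated compared with a real oscillator tolerance so that it survives the 1 ps time precision).

```verilog
`timescale 1ns/1ps
// Two testbench clocks with seed-dependent initial phase and a small
// seed-dependent period offset on clk_b. Names and numbers are illustrative.
module tb_random_phase;
    reg clk_a = 1'b0;
    reg clk_b = 1'b0;

    integer seed;
    integer phase_a_ps, phase_b_ps, offset_b_ps;

    initial begin
        // Run with +seed=<n> to get a different phase relationship per run
        if (!$value$plusargs("seed=%d", seed))
            seed = 1;

        phase_a_ps  = {$random(seed)} % 10000;  // 0 .. 9999 ps
        phase_b_ps  = {$random(seed)} % 7500;   // 0 .. 7499 ps
        offset_b_ps = {$random(seed)} % 11;     // 0 .. 10
        offset_b_ps = offset_b_ps - 5;          // -5 .. +5 ps on the half-period

        $display("seed=%0d phase_a=%0d ps phase_b=%0d ps offset_b=%0d ps",
                 seed, phase_a_ps, phase_b_ps, offset_b_ps);

        fork
            begin
                #(phase_a_ps / 1000.0);
                forever #5.000 clk_a = ~clk_a;                            // ~100 MHz
            end
            begin
                #(phase_b_ps / 1000.0);
                forever #(3.750 + offset_b_ps / 1000.0) clk_b = ~clk_b;   // ~133 MHz
            end
        join
    end

    // DUT instantiation and stimulus would go here.

    initial #100000 $finish;  // bound the run at 100 us of simulated time
endmodule
```

Printing the seed and derived values makes any failing run reproducible, which matters when the whole point is to stumble on a rare alignment.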

Inject metastability in simulation. Some simulation methodologies include "metastability injection" at CDC boundaries: the sampling flip-flop on a CDC path occasionally emits an X (unknown), simulating a metastable event. Downstream logic that relies on specific bit values will then propagate the X, and the failure becomes visible. This is extremely effective at catching logic that only nearly works (for example, logic that assumed a one-cycle pulse could be caught when sometimes it can't), but it requires testbench support and isn't universal.
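One way to get this effect without special tool support is a simulation-only synchronizer model. The sketch below uses a variant of the idea: instead of emitting X, the first stage randomly captures either the old or the new value whenever its input has changed since the previous clock edge, which models the one-cycle uncertainty a real synchronizer has. The module name sync2_metainject is hypothetical, and the model is pessimistic because it treats every input change as if it landed in the metastability window.

```verilog
// Simulation-only two-flop synchronizer with randomized capture.
// Not synthesizable as written ($random, initial values); names are illustrative.
module sync2_metainject (
    input  wire clk,
    input  wire d_async,   // signal arriving from another clock domain
    output reg  q          // synchronized output
);
    reg ff1    = 1'b0;     // first synchronizer stage
    reg d_last = 1'b0;     // input value seen at the previous clock edge

    initial q = 1'b0;

    always @(posedge clk) begin
        if (d_async !== d_last)
            // Input changed since the last edge: the first stage may resolve
            // to either the old or the new value, chosen at random.
            ff1 <= ($random & 1) ? d_async : d_last;
        else
            ff1 <= d_async;

        d_last <= d_async;  // a missed change is always captured one edge later
        q      <= ff1;
    end
endmodule
```

Swapping this in for the normal synchronizer during regression runs makes the latency through each crossing vary by one cycle from event to event, so logic that silently depends on a fixed latency, or on bits of a bus arriving together, tends to fail visibly.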

Code reviews. Have a second pair of eyes look at every CDC boundary in the design, armed with this article or its equivalent. CDC is the kind of problem where the mental model either makes the bug obvious on inspection or renders it invisible. An experienced reviewer will catch things the lint tool misses (false negatives happen, especially with complex waivers).

A checklist for your next design review

Put this on your wall:

List every clock in the design. If you can't name them all, you don't understand the design well enough to review CDC in it.
For each clock pair, are they asynchronous? Declared as such in SDC?
For each signal crossing a clock boundary, identify its shape. Single-bit level? Pulse? Counter? Control word? Data stream?
For each, is the right structure in place? Two-flop sync, pulse synchronizer, Gray-coded bus, handshake, or async FIFO?
Are the synchronizer flip-flops marked with ASYNC_REG or the equivalent attribute?
Is there combinational logic between the flip-flops of any two-flop synchronizer? There should be none.
Is a multi-bit bus being synchronized bit-by-bit without Gray coding or handshake? This is the red-flag bug. Treat with extreme prejudice.
Does each clock domain have its own async-assert/sync-deassert reset synchronizer?
Does CDC lint pass with no waivers, or with every waiver individually justified in writing?
Are clock groups and false paths correctly declared in the SDC?

Most CDC bugs come from violating exactly one of these. Checking all ten, systematically, catches almost everything. The bugs that remain are the ones where someone invented a new shape of CDC problem, and those are interesting enough to go in their own article someday.

Taking it away

Clock domain crossing is, for all its reputation as a dark art, governed by a small set of ideas that fit together cleanly once you see the structure:

Multi-clock designs are unavoidable at system scale.
Any signal crossing an asynchronous clock boundary can cause a sampling flip-flop to go metastable, with a probability that is large enough to matter over millions of hours of operation across thousands of units.
Adding a second synchronizer flip-flop buys enormous MTBF because the probability that a flip-flop is still metastable falls off exponentially with the time it is given to resolve.
Single-bit level signals are solved by two-flop synchronizers.
Short pulses need to be converted to levels (toggle synchronizers) before crossing.
Multi-bit buses cannot be synchronized bit-by-bit. They need Gray coding (for counters), handshakes (for control transfers), or async FIFOs (for data streams).
Resets are a CDC problem of their own, solved by async-assert/sync-deassert synchronization, one per clock domain.
You cannot find CDC bugs by simulation alone. Use lint. Use proper SDC constraints. Review the crossings explicitly.

CDC is one of those rare topics where mastering a finite, learnable set of patterns essentially eliminates a whole class of production bugs. There are maybe five techniques. Learn them, apply them reflexively, and the bugs that do make it to silicon will at least be interesting ones.