Preserving Order in Concurrent Go Apps: Three Approaches Compared

Concurrency is one of Go’s greatest strengths, but it comes with a fundamental trade-off: when multiple goroutines process data simultaneously, the natural ordering gets scrambled. Most of the time, this is fine – unordered processing is enough, and it’s faster and simpler.

But sometimes, order matters.

When Order Matters

Here are three real-world scenarios where preserving order becomes critical:

Real-time Log Enrichment: You’re processing a high-volume log stream, enriching each entry with user metadata from a database or external API. Sequential processing can’t keep up with the incoming rate, but concurrent processing breaks the sequence, making the enriched logs unusable for downstream consumers that depend on chronological order.

Finding the First Match in a File List: You need to download a list of files from cloud storage and find the first one containing a specific string. Concurrent downloads are much faster, but they complete out of order – the 50th file might finish before the 5th file, so you can’t simply return the first match you find without knowing if an earlier file also contains the string.

Time Series Data Processing: This scenario inspired my original implementation. I needed to download 90 days of transaction logs (~600MB each), extract some data, then compare consecutive days for trend analysis. Sequential downloads took hours; concurrent downloads could give an order of magnitude speedup, but would destroy the temporal relationships I needed for comparison.

The challenge is clear: we need the speed benefits of concurrent processing without sacrificing the predictability of ordered results. This isn’t just a theoretical problem – it’s a practical constraint that affects real systems at scale.

In this article, we’ll explore three approaches I’ve developed and used in production Go applications. We’ll build a concurrent OrderedMap function that transforms a channel of inputs into a channel of outputs while preserving order. Through benchmarks of each approach, we’ll understand their trade-offs and discover surprising performance insights along the way.
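
To make the goal concrete, here’s a rough sketch of how such a function is meant to be used. OrderedMap below stands in for any of the OrderedMap variants implemented later in this article, and enrich is a hypothetical slow operation:

in := make(chan string)
go func() {
	defer close(in)
	for _, id := range []string{"a", "b", "c", "d"} {
		in <- id
	}
}()

// Process items with 8 concurrent workers; results keep the input order.
out := OrderedMap(in, 8, enrich) // enrich: a hypothetical slow func(string) string

for result := range out {
	fmt.Println(result) // printed in the same order the inputs were sent
}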

The Problem: Why Concurrency Breaks Order

Why are my results out of order? Concurrency.

Let’s quickly recall why concurrency messes up ordering. One reason is that goroutines process tasks at different speeds. Another is that we can’t predict exactly how the Go runtime will schedule goroutines.

For example, goroutine #2 might finish processing item #50 before goroutine #1 finishes item #10, causing results to arrive out of order. This is the natural behavior of concurrent processing.

If you want to see this in action, here’s a quick demo on the Go playground.
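
Here’s a minimal sketch of the same idea, in case you’d rather read it than run it (the worker count and random delays are arbitrary):

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	in := make(chan int)
	out := make(chan int)

	// Feed items 1..10 into the input channel
	go func() {
		defer close(in)
		for i := 1; i <= 10; i++ {
			in <- i
		}
	}()

	// 4 workers, each taking a random amount of time per item
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range in {
				time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
				out <- i
			}
		}()
	}

	// Close the output once all workers are done
	go func() {
		wg.Wait()
		close(out)
	}()

	for i := range out {
		fmt.Print(i, " ") // e.g. 2 1 4 3 6 5 8 10 7 9 – order is not preserved
	}
}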

Design Philosophy: Backpressure vs Buffering

The classic approach to ordered concurrency uses some sort of reorder buffer or queue. When a worker calculates a result but it’s too early to write it to the output, the result gets stored in that buffer until it can be written in the correct order.

In such designs buffers can typically grow without bound. This happens when:

  • The input is skewed – early items take longer to process than later items
  • Downstream consumers are slow
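
To make the contrast concrete, here’s a rough sketch of what such a reorder buffer typically looks like (illustrative code only – it’s not used later in the article). Out-of-order results are parked in a map until the next expected index shows up, and that map is exactly what can grow without bound:

// indexed pairs a result with its original position in the stream.
type indexed[B any] struct {
	Index int
	Value B
}

// reorder emits values in index order, parking early arrivals in a map.
func reorder[B any](in <-chan indexed[B]) <-chan B {
	out := make(chan B)
	go func() {
		defer close(out)
		pending := make(map[int]B) // the reorder buffer: unbounded by design
		next := 0
		for r := range in {
			pending[r.Index] = r.Value
			// Flush everything that is now in order.
			for v, ok := pending[next]; ok; v, ok = pending[next] {
				out <- v
				delete(pending, next)
				next++
			}
		}
	}()
	return out
}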

The algorithms presented below are backpressure-first. If a worker can’t yet write its result to the output channel, it blocks. This design is memory-bound and preserves the behavior developers expect from Go channels.

Technically speaking, such algorithms also do buffering, but here out-of-order items are held on the stacks of running goroutines. So, to get a larger “buffer” in these algorithms, you can simply increase the concurrency level. This works well in practice since typically when applications need larger buffers they also need higher concurrency levels.

Establishing a Performance Baseline

To understand the true cost of ordering, we first need a baseline to measure against. Let’s implement and benchmark a basic concurrent Map function that doesn’t preserve order — this will show us exactly what overhead the ordering approaches add.

Our Map function transforms an input channel into an output channel using a user-supplied function f. It’s built on top of a simple worker pool, which spawns multiple goroutines to process input items concurrently.

// Map transforms items from the input channel using n goroutines, and the
// provided function f. Returns a new channel with transformed items.
func Map[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
	out := make(chan B)
	Loop(in, n, out, func(a A) {
		out <- f(a)
	})
	return out
}

// Loop is a worker pool implementation. It calls function f for each 
// item from the input channel using n goroutines. This is a non-blocking function 
// that signals completion by closing the done channel when all work is finished.
func Loop[A, B any](in <-chan A, n int, done chan<- B, f func(A)) {
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for a := range in {
				f(a)
			}
		}()
	}

	go func() {
		wg.Wait()
		if done != nil {
			close(done)
		}
	}()
}

// Discard is a non-blocking function that consumes and discards
// all items from the input channel
func Discard[A any](in <-chan A) {
	go func() {
		for range in {
			// Discard the value
		}
	}()
}

func BenchmarkMap(b *testing.B) {
	for _, n := range []int{1, 2, 4, 8, 12, 50} {
		b.Run(fmt.Sprint("n=", n), func(b *testing.B) {
			in := make(chan int)
			defer close(in)
			out := Map(in, n, func(a int) int {
				// Uncomment to simulate real work (used in the second benchmark below):
				//time.Sleep(50 * time.Microsecond)
				return a // no-op: just return the original value
			})
			Discard(out)

			b.ReportAllocs()
			b.ResetTimer()

			for i := 0; i < b.N; i++ {
				in <- 10 // write something to the in chan
			}
		})
	}
}
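
For reference, this benchmark can be run with the standard Go tooling, e.g.:

go test -bench=BenchmarkMap

Allocations are reported automatically here thanks to the b.ReportAllocs() call; otherwise you would also pass -benchmem.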

As you can see, Map leverages Loop to create a worker pool that processes items concurrently, while Loop itself handles the low-level goroutine management and synchronization. This separation of concerns will become important later when we build our ordered variants.

What exactly are we measuring here? We’re measuring throughput – how fast we can push items through the entire pipeline. Since the Map function creates backpressure (blocking when the pipeline is full), the rate at which we can feed items into the input channel acts as an accurate proxy for overall processing speed. Let’s run the benchmark (I used an Apple M2 Max laptop to run it):

Goroutines   Time/op   Allocs/op
2            408.6ns   0
4            445.1ns   0
8            546.4ns   0
12           600.2ns   0
50           1053ns    0

You might wonder: “Shouldn’t higher concurrency increase throughput?” In real applications, absolutely – but only when there’s actual work to parallelize. Here I used a trivial no-op transformation to isolate and benchmark the pure overhead of goroutines, channels, and coordination. As expected, this overhead grows with the number of goroutines.

We’ll use this overhead-focused benchmark for comparisons later in the article, but to demonstrate that concurrency improves performance, let’s run one more benchmark with some work simulated (50μs sleep):

Goroutines   Time/op   Speedup   Allocs/op
1            61656ns   1.0x      0
2            30429ns   2.0x      0
4            15207ns   4.1x      0
8            7524ns    8.2x      0
12           5034ns    12.2x     0
50           1277ns    48.3x     0

Perfect! Here we see the dramatic benefits of concurrency when there’s real work to be done. With 50μs of work per item, increasing concurrency from 1 to 50 goroutines improves performance by nearly 50x. This demonstrates why concurrent processing is so valuable in real applications.

We’re now ready to compare the three approaches and measure exactly what price we pay for adding order preservation.

Approach 1: ReplyTo Channels

This is probably the most Go-native way to implement ordered concurrency. The ReplyTo pattern is well-known in Go (I also used it in my batching article), but somehow this was the hardest approach for me to explain clearly.

Here’s how it works:

  • A packer goroutine creates jobs by attaching a unique replyTo channel to every input item.
  • Workers process jobs concurrently, and send results through those replyTo channels.
  • An unpacker goroutine unpacks the values sent via replyTo channels and writes them to the output.

The following diagram illustrates how this pattern works in more detail:

ReplyTo Pattern for Order Preservation

The left part of this diagram is sequential (packer and unpacker) while the worker pool on the right operates concurrently. Notice that workers can only send results when the unpacker is ready to receive them, because the replyTo channels are unbuffered. This creates natural backpressure and prevents unnecessary buffering.

func OrderedMap1[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
	type Job struct {
		Item    A
		ReplyTo chan B
	}

	// Packer goroutine.
	// `jobs` chan will be processed by the pool
	// `replies` chan will be consumed by unpacker goroutine
	jobs := make(chan Job)
	replies := make(chan chan B, n)
	go func() {
		for item := range in {
			replyTo := make(chan B)
			jobs <- Job{Item: item, ReplyTo: replyTo}
			replies <- replyTo
		}
		close(jobs)
		close(replies)
	}()

	// Worker pool of n goroutines.
	// Sends results back via replyTo channels
	Loop[Job, any](jobs, n, nil, func(job Job) {
		job.ReplyTo <- f(job.Item) // Calculate the result and send it back
		close(job.ReplyTo)
	})

	// Unpacker goroutine.
	// Unpacks replyTo channels in order and sends results to the `out` channel
	out := make(chan B)
	go func() {
		defer close(out)
		for replyTo := range replies {
			result := <-replyTo
			out <- result
		}
	}()
	return out
}
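
Before looking at the numbers, a quick sanity check (a sketch, not part of the original benchmarks) confirms that results really do come back in input order. The same test works for every variant in this article:

func TestOrderedMap1(t *testing.T) {
	in := make(chan int)
	go func() {
		defer close(in)
		for i := 0; i < 1000; i++ {
			in <- i
		}
	}()

	// Square each item using 8 goroutines.
	out := OrderedMap1(in, 8, func(a int) int { return a * a })

	// Results must arrive in the original order.
	i := 0
	for result := range out {
		if result != i*i {
			t.Fatalf("expected %d at position %d, got %d", i*i, i, result)
		}
		i++
	}
	if i != 1000 {
		t.Fatalf("expected 1000 results, got %d", i)
	}
}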

Performance Results:

Goroutines   Time/op   vs Baseline   Allocs/op
2            818.7ns   +410ns        1
4            808.9ns   +364ns        1
8            826.8ns   +280ns        1
12           825.6ns   +225ns        1
50           772.3ns   -281ns        1

This approach introduces up to 410ns of overhead per input item compared to our baseline. Part of this cost comes from allocating a new replyTo channel for every item. Unfortunately, we can’t use a package-level sync.Pool to mitigate this because our function is generic – channels for different types can’t share the same pool.

What’s also interesting about this result is that the overhead brought by ordering becomes smaller as the number of goroutines grows. At some point even an inversion happens – OrderedMap1 becomes faster than Map (-281ns at 50 goroutines).

I haven’t investigated this phenomenon deeply. I believe it can’t be caused by inefficiencies inside Map, since it’s already based on the simplest possible channel-based worker pool. One guess I have is that in Map we have 50 goroutines competing to write into a single output channel, whereas in OrderedMap, despite the additional moving parts, only one goroutine writes to the output.

Let’s now move on to the next approach.

Approach 2: sync.Cond for Turn-Taking

This was the first algorithm I implemented when I needed ordered concurrency, and it’s much easier to explain than the ReplyTo approach.

Here we attach an incremental index to each item and send it to the worker pool. Each worker performs the calculation, then waits its turn to write the result to the output channel.

This conditional waiting is implemented using a shared currentIndex variable protected by sync.Cond, a powerful but underused concurrency primitive from the standard library that allows goroutines to wait for specific conditions and be woken up when those conditions change.
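
If you haven’t used sync.Cond before, here’s the canonical pattern in isolation (a minimal illustrative example, separate from the pipeline code below). Waiters hold the mutex and re-check their condition in a loop around Wait, and whoever changes the shared state wakes them up:

cond := sync.NewCond(new(sync.Mutex))
ready := false

// Waiting goroutine: blocks until ready becomes true.
go func() {
	cond.L.Lock()
	for !ready { // always re-check the condition in a loop
		cond.Wait() // atomically unlocks the mutex, sleeps, re-locks on wake-up
	}
	cond.L.Unlock()
	fmt.Println("it's my turn")
}()

// Another goroutine changes the state and wakes all waiters.
cond.L.Lock()
ready = true
cond.Broadcast()
cond.L.Unlock()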

Here’s how the turn-taking mechanism works:

Turn-taking with sync.Cond. Here, after each write, all workers wake up (via Broadcast) and re-check the “is it my turn?” condition.

func OrderedMap2[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
	type Job struct {
		Item  A
		Index int
	}

	// Indexer goroutine.
	// Assign an index to each item from the input channel
	jobs := make(chan Job)
	go func() {
		i := 0
		for item := range in {
			jobs <- Job{Item: item, Index: i}
			i++
		}
		close(jobs)
	}()

	// Shared state.
	// Index of the next result that must be written to the output channel.
	nextIndex := 0
	cond := sync.NewCond(new(sync.Mutex))

	// Worker pool of n goroutines.
	out := make(chan B)
	Loop(jobs, n, out, func(job Job) {
		result := f(job.Item) // Calculate the result

		// Cond must be used with a locked mutex (see stdlib docs)
		cond.L.Lock()

		// wait until it's our turn to write the result
		for job.Index != nextIndex {
			cond.Wait()
		}

		// Write the result
		out <- result

		// Increment the index and notify all other workers
		nextIndex++
		cond.Broadcast()

		cond.L.Unlock()
	})

	return out
}

Performance Results:

Goroutines   Time/op   vs Baseline   Allocs/op
2            867.7ns   +459ns        0
4            1094ns    +649ns        0
8            1801ns    +1255ns       0
12           2987ns    +2387ns       0
50           16074ns   +15021ns      0

The results are telling – no more per-item allocations, which is excellent for memory efficiency. But there’s a critical flaw: significant performance degradation as goroutine count increases. This happens because of the shared state and the “thundering herd” problem: after each write, all goroutines wake up via cond.Broadcast(), but only one will do useful work.

This inefficiency led me to think: “How can I wake only the goroutine that should write next?” And this is how the 3rd approach was born.

Approach 3: Permission Passing Chain

Here’s the key insight: when is it safe to write output #5? After output #4 was written. Who knows when output #4 was written? The goroutine that wrote it.

In this algorithm, a job must hold the write permission before its worker can send the result to the output channel. We chain jobs together so each one knows exactly which job comes next and can pass the permission to it. This is done by attaching two channels to each job: a canWrite channel to receive the permission, and a nextCanWrite channel to pass the permission on to the next job.

Permission Passing Chain for Order Preservation

This chain structure makes the worker logic remarkably simple:

  • Calculate: Process the job using the provided function
  • Wait: Receive the permission from canWrite channel
  • Write: Send the result to the output channel
  • Pass: Send the permission to the next job via nextCanWrite channel

Here’s the diagram that illustrates the whole flow:

Permission Passing Chain for Order Preservation

The green arrows show how the permission to write is passed from one job to another along the chain. Essentially this is a token-passing algorithm that eliminates the “thundering herd” problem entirely — each goroutine wakes exactly one other goroutine, creating efficient point-to-point signaling rather than expensive broadcasts.

Let’s see how this translates to code. The implementation has two parts: a “linker” goroutine that builds the chain, and workers that follow the calculate-wait-write-pass pattern:

func OrderedMap3[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
	type Job[A any] struct {
		Item         A
		CanWrite     chan struct{}
		NextCanWrite chan struct{} // canWrite channel of the next job
	}

	// Linker goroutine:
	// Builds a chain of jobs where each has a CanWrite channel attached.
	// Additionally, each job knows about the CanWrite channel of the next job in the chain.
	jobs := make(chan Job[A])
	go func() {
		defer close(jobs)

		var canWrite, nextCanWrite chan struct{}
		nextCanWrite = make(chan struct{}, 1)
		close(nextCanWrite) // the first job can write immediately

		for item := range in {
			canWrite, nextCanWrite = nextCanWrite, make(chan struct{}, 1)
			jobs <- Job[A]{item, canWrite, nextCanWrite}
		}
	}()

	// Worker pool of n goroutines.
	// Jobs pass the write permission along the chain.
	out := make(chan B)
	Loop(jobs, n, out, func(job Job[A]) {
		result := f(job.Item) // Calculate the result

		<-job.CanWrite          // Wait for the write permission
		out <- result           // Write to the output channel
		close(job.NextCanWrite) // Pass the permission to the next job
	})

	return out
}

Performance Results:

Goroutines   Time/op   vs Baseline   Allocs/op
2            927.2ns   +519ns        1
4            939.8ns   +495ns        1
8            860.7ns   +314ns        1
12           823.8ns   +224ns        1
50           609.8ns   -443ns        1

Here the result is very similar to what we’ve seen in the ReplyTo approach. Almost the same overhead, the same inversion at higher levels of concurrency, and the same extra allocation per item. But there’s one difference…

Unlike in approach 1, the channel we allocate here is a non-generic chan struct{}. This means we can use a package-level sync.Pool to eliminate those allocations – let’s explore that next.

Approach 3a: Zero-Allocation Permission Passing Chain

Let’s create a pool for canWrite channels. The implementation is straightforward – the pool itself plus make/release helper functions.

// Package-level pool for canWrite channels
type chainedItem[A any] struct {
	Value        A
	CanWrite     chan struct{}
	NextCanWrite chan struct{} // canWrite channel for the next item
}

var canWritePool sync.Pool

func makeCanWriteChan() chan struct{} {
	ch := canWritePool.Get()
	if ch == nil {
		return make(chan struct{}, 1)
	}
	return ch.(chan struct{})
}

func releaseCanWriteChan(ch chan struct{}) {
	canWritePool.Put(ch)
}

Now let’s use the pool in the permission passing algorithm. Since channels are reused, we can no longer signal by closing them. Instead, workers must read and write empty structs from/to these channels.

func OrderedMap3a[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
	type Job[A any] struct {
		Item         A
		CanWrite     chan struct{}
		NextCanWrite chan struct{} // canWrite channel of the next job
	}

	// Linker goroutine:
	// Builds a chain of jobs where each has a CanWrite channel attached.
	// Additionally, each job knows about the CanWrite channel of the next job in the chain.
	jobs := make(chan Job[A])
	go func() {
		defer close(jobs)

		var canWrite, nextCanWrite chan struct{}
		nextCanWrite = makeCanWriteChan()
		nextCanWrite <- struct{}{} // the first job can write immediately

		for item := range in {
			canWrite, nextCanWrite = nextCanWrite, makeCanWriteChan()
			jobs <- Job[A]{item, canWrite, nextCanWrite}
		}
	}()

	// Worker pool of n goroutines.
	// Jobs pass the write permission along the chain.
	out := make(chan B)
	Loop(jobs, n, out, func(job Job[A]) {
		result := f(job.Item) // Calculate the result

		<-job.CanWrite                    // Wait for the write permission
		out <- result                     // Write to the output channel
		releaseCanWriteChan(job.CanWrite) // Release our canWrite channel to the pool
		job.NextCanWrite <- struct{}{}    // Pass the permission to the next job
	})

	return out
}

Performance Results with Pooling:

Goroutines   Time/op   vs Baseline   Allocs/op
2            891.0ns   +482ns        0
4            916.5ns   +471ns        0
8            879.5ns   +333ns        0
12           872.6ns   +272ns        0
50           657.6ns   -395ns        0

Perfect! Zero allocations and good performance, meaning less GC pressure for long-running jobs. But this approach has one more trick up its sleeve…

One more thing: Building Reusable Abstractions

The permission passing approach has another significant advantage over the ReplyTo method: it controls when to write rather than where to write.

I’ll admit it – sometimes I get a bit obsessed with building clean abstractions. When working on rill, I really wanted to extract this ordering logic into something reusable and testable. This “when vs where” distinction was an AHA moment for me.

Since the algorithm doesn’t care where the outputs are written, it’s easy to abstract it into a separate function – OrderedLoop. The API is very similar to the Loop function we used before, but here the user function receives two arguments – an item and a canWrite channel. Importantly, the user function must read from the canWrite channel exactly once – otherwise deadlocks or out-of-order writes can occur.

func OrderedLoop[A, B any](in <-chan A, done chan<- B, n int, f func(a A, canWrite <-chan struct{})) {
	type Job[A any] struct {
		Item         A
		CanWrite     chan struct{}
		NextCanWrite chan struct{} // canWrite channel of the next job
	}

	// Linker goroutine:
	// Builds a chain of jobs where each has a CanWrite channel attached.
	// Additionally, each job knows about the CanWrite channel of the next job in the chain.
	jobs := make(chan Job[A])
	go func() {
		defer close(jobs)

		var canWrite, nextCanWrite chan struct{}
		nextCanWrite = makeCanWriteChan()
		nextCanWrite <- struct{}{} // the first job can write immediately

		for item := range in {
			canWrite, nextCanWrite = nextCanWrite, makeCanWriteChan()
			jobs <- Job[A]{item, canWrite, nextCanWrite}
		}
	}()

	// Worker pool of n goroutines.
	// Jobs pass the write permission along the chain.
	Loop(jobs, n, done, func(job Job[A]) {
		f(job.Item, job.CanWrite) // Do the work

		releaseCanWriteChan(job.CanWrite) // Release item's canWrite channel to the pool
		job.NextCanWrite <- struct{}{}    // Pass the permission to the next job
	})
}

The typical usage looks like:

OrderedLoop(in, out, n, func(a A, canWrite <-chan struct{}) {
	// [Do processing here]
	
	// Everything above this line is executed concurrently,
	// everything below it is executed sequentially and in order
	<-canWrite
	
	// [Write results somewhere]
})

With this abstraction in hand, it’s remarkably simple to build all kinds of ordered operations. For example, OrderedMap becomes just 7 lines of code:

func OrderedMap3b[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
	out := make(chan B)
	OrderedLoop(in, out, n, func(a A, canWrite <-chan struct{}) {
		result := f(a)
		<-canWrite
		out <- result
	})
	return out
}

We can also easily build an OrderedFilter that conditionally writes outputs:

func OrderedFilter[A any](in <-chan A, n int, predicate func(A) bool) <-chan A {
	out := make(chan A)
	OrderedLoop(in, out, n, func(a A, canWrite <-chan struct{}) {
		keep := predicate(a)
		<-canWrite
		if keep {
			out <- a
		}
	})
	return out
}

Or even an OrderedSplit that distributes items to two channels based on a predicate:

func OrderedSplit[A any](in <-chan A, n int, predicate func(A) bool) (<-chan A, <-chan A) {
	outTrue := make(chan A)
	outFalse := make(chan A)
	done := make(chan struct{})
	
	OrderedLoop(in, done, n, func(a A, canWrite <-chan struct{}) {
		shouldGoToTrue := predicate(a)
		<-canWrite
		if shouldGoToTrue {
			outTrue <- a
		} else {
			outFalse <- a
		}
	})
	
	go func() {
		<-done
		close(outTrue)
		close(outFalse)
	}()
	
	return outTrue, outFalse
}

Simply put, this abstraction makes building ordered operations trivial.
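
As an aside, this also closes the loop on the “first match” scenario from the introduction: because results come back in input order, the first match that arrives is guaranteed to be the earliest one. Here’s a hedged sketch – containsTarget is an assumed helper that downloads a file and checks it, and a real implementation would also cancel the remaining downloads (e.g. via a context):

func firstMatch(paths <-chan string, n int, containsTarget func(string) bool) (string, bool) {
	matches := OrderedFilter(paths, n, containsTarget)

	first, ok := <-matches // the earliest matching file, if any
	Discard(matches)       // drain the rest so the workers can finish
	return first, ok
}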

Performance Comparison

Here’s how all approaches perform across different concurrency levels:

Concurrency   Baseline   Approach 1   Approach 2    Approach 3     Approach 3a
                         (ReplyTo)    (sync.Cond)   (Permission)   (+ Pool)
2             408.6ns    818.7ns      867.7ns       927.2ns        891.0ns
4             445.1ns    808.9ns      1094ns        939.8ns        916.5ns
8             546.4ns    826.8ns      1801ns        860.7ns        879.5ns
12            600.2ns    825.6ns      2987ns        823.8ns        872.6ns
50            1053ns     772.3ns      16074ns       609.8ns        657.6ns
Zero allocs   yes        no           yes           no             yes

Key Takeaways

  1. sync.Cond is a no-go for ordered concurrency – while it starts with decent performance at low concurrency, it completely falls apart as the goroutine count increases due to the thundering herd problem.

  2. ReplyTo is a strong contender – it adds at most ~500ns of overhead compared to the baseline, but requires one additional allocation per input item, increasing GC pressure.

  3. Permission Passing emerges as the clear winner – it has it all:

    • Good performance: at most ~500ns of overhead compared to the baseline
    • Zero allocations: Less GC pressure for long-running tasks
    • Clean abstraction: Core synchronization logic can be abstracted away and used to build various concurrent operations.
    • Maintainability: Separation of concerns and the intuitive “calculate → wait → write → pass” pattern make code easy to support and reason about

This exploration shows that ordered concurrency doesn’t have to be expensive. With the right approach, you can have concurrency, ordering and backpressure at the same time. The permission passing pattern, in particular, demonstrates how Go’s channels can be used creatively to solve complex coordination problems.

Finally, these patterns have been battle-tested in production through the rill concurrency toolkit (1.7k 🌟 on GitHub). It implements Map, OrderedMap, and many other concurrent operations. Rill focuses on composability – operations chain together into larger pipelines – while adding comprehensive error handling, context-friendly design, and maintaining over 95% test coverage.