Concurrency is one of Go’s greatest strengths, but it comes with a fundamental trade-off: when multiple goroutines process data simultaneously, the natural ordering gets scrambled. Most of the time this is fine – unordered processing is enough, and it’s faster and simpler.
But sometimes, order matters.
When Order Matters
Here are three real-world scenarios where preserving order becomes critical:
Real-time Log Enrichment: You’re processing a high-volume log stream, enriching each entry with user metadata from a database or external API. Sequential processing can’t keep up with the incoming rate, but concurrent processing breaks the sequence, making the enriched logs unusable for downstream consumers that depend on chronological order.
Finding the First Match in a File List: You need to download a list of files from cloud storage and find the first one containing a specific string. Concurrent downloads are much faster, but they complete out of order – the 50th file might finish before the 5th file, so you can’t simply return the first match you find without knowing if an earlier file also contains the string.
Time Series Data Processing: This scenario inspired my original implementation. I needed to download 90 days of transaction logs (~600MB each), extract some data, then compare consecutive days for trend analysis. Sequential downloads took hours; concurrent downloads could give an order of magnitude speedup, but would destroy the temporal relationships I needed for comparison.
The challenge is clear: we need the speed benefits of concurrent processing without sacrificing the predictability of ordered results. This isn’t just a theoretical problem – it’s a practical constraint that affects real systems at scale.
In this article, we’ll explore three approaches I’ve developed and used in production Go applications. We’ll build a concurrent OrderedMap function that transforms a channel of inputs into a channel of outputs while preserving order. Through benchmarks of each approach, we’ll understand their trade-offs and discover surprising performance insights along the way.
The Problem: Why Concurrency Breaks Order
Let’s quickly recall why concurrency messes up ordering. One reason is that goroutines process tasks at different speeds. Another is that we can’t predict how exactly goroutines will be scheduled by the Go runtime.
For example, goroutine #2 might finish processing item #50 before goroutine #1 finishes item #10, causing results to arrive out of order. This is the natural behavior of concurrent processing.
If you want to see this in action, here’s a quick demo on the Go Playground.
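If you can’t open the playground right now, here’s a minimal standalone sketch along the same lines (not necessarily identical to the linked demo): a plain worker pool where each item takes a random amount of time, so results arrive out of order.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	in := make(chan int)
	out := make(chan int)

	// Plain worker pool: 4 goroutines read from `in` and write to `out`.
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range in {
				// Simulate work that takes a random amount of time.
				time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
				out <- i
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()

	// Feed sequential numbers into the pipeline.
	go func() {
		defer close(in)
		for i := 0; i < 20; i++ {
			in <- i
		}
	}()

	// Results arrive scrambled, e.g. 1 0 3 2 6 4 5 ...
	for i := range out {
		fmt.Print(i, " ")
	}
	fmt.Println()
}
```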
Design Philosophy: Backpressure vs Buffering
The classic approach to ordered concurrency uses some sort of reorder buffer or queue. When a worker calculates a result but it’s too early to write it to the output, the result gets stored in that buffer until it can be written in the correct order.
In such designs buffers can typically grow without bound. This happens when:
- The input is skewed – early items take longer to process than later items
- Downstream consumers are slow
The algorithms presented below are backpressure-first. If a worker can’t yet write its result to the output channel, it blocks. This design is memory-bound and preserves the behavior developers expect from Go channels.
Technically speaking, such algorithms also do buffering, but here out-of-order items are held on the stacks of running goroutines. So, to get a larger “buffer” in these algorithms, you can simply increase the concurrency level. This works well in practice since typically when applications need larger buffers they also need higher concurrency levels.
Establishing a Performance Baseline
To understand the true cost of ordering, we first need a baseline to measure against.
Let’s implement and benchmark a basic concurrent Map function that doesn’t preserve order — this will show us exactly what overhead the ordering approaches add.
Our Map function transforms an input channel into an output channel using a user-supplied function f. It’s built on top of a simple worker pool, which spawns multiple goroutines to process input items concurrently.
// Map transforms items from the input channel using n goroutines, and the
// provided function f. Returns a new channel with transformed items.
func Map[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
out := make(chan B)
Loop(in, n, out, func(a A) {
out <- f(a)
})
return out
}
// Loop is a worker pool implementation. It calls function f for each
// item from the input channel using n goroutines. This is a non-blocking function
// that signals completion by closing the done channel when all work is finished.
func Loop[A, B any](in <-chan A, n int, done chan<- B, f func(A)) {
var wg sync.WaitGroup
for i := 0; i < n; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for a := range in {
f(a)
}
}()
}
go func() {
wg.Wait()
if done != nil {
close(done)
}
}()
}
// Discard is a non-blocking function that consumes and discards
// all items from the input channel
func Discard[A any](in <-chan A) {
go func() {
for range in {
// Discard the value
}
}()
}
func BenchmarkMap(b *testing.B) {
for _, n := range []int{1, 2, 4, 8, 12, 50} {
b.Run(fmt.Sprint("n=", n), func(b *testing.B) {
in := make(chan int)
defer close(in)
out := Map(in, n, func(a int) int {
//time.Sleep(50 * time.Microsecond)
return a // no-op: just return the original value
})
Discard(out)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
in <- 10 // write something to the in chan
}
})
}
}
As you can see, Map leverages Loop to create a worker pool that processes items concurrently, while Loop itself handles the low-level goroutine management and synchronization. This separation of concerns will become important later when we build our ordered variants.
What exactly are we measuring here? We’re measuring throughput – how fast we can push items through the entire pipeline. Since the Map function creates backpressure (blocking when the pipeline is full), the rate at which we can feed items into the input channel acts as an accurate proxy for overall processing speed.
Let’s run the benchmark (I ran it on an Apple M2 Max laptop):
Goroutines | Time /op | Allocs/op |
---|---|---|
2 | 408.6ns | 0 |
4 | 445.1ns | 0 |
8 | 546.4ns | 0 |
12 | 600.2ns | 0 |
50 | 1053ns | 0 |
You might wonder: “Shouldn’t higher concurrency increase throughput?” In real applications, absolutely – but only when there’s actual work to parallelize. Here I used a trivial no-op transformation to isolate and benchmark the pure overhead of goroutines, channels, and coordination. As expected, this overhead grows with the number of goroutines.
We’ll use this overhead-focused benchmark for comparisons later in the article, but to demonstrate that concurrency improves performance, let’s run one more benchmark with some work simulated (50μs sleep):
Goroutines | Time /op | Speedup | Allocs/op |
---|---|---|---|
1 | 61656ns | 1.0x | 0 |
2 | 30429ns | 2.0x | 0 |
4 | 15207ns | 4.1x | 0 |
8 | 7524ns | 8.2x | 0 |
12 | 5034ns | 12.2x | 0 |
50 | 1277ns | 48.3x | 0 |
Perfect! Here we see the dramatic benefits of concurrency when there’s real work to be done. With 50μs of work per item, increasing concurrency from 1 to 50 goroutines improves performance by nearly 50x. This demonstrates why concurrent processing is so valuable in real applications.
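For reference, here’s what that run looks like as code – essentially BenchmarkMap with the sleep enabled (a sketch; the function name is illustrative):

```go
// Same benchmark as BenchmarkMap, but each item costs ~50µs of simulated work.
func BenchmarkMapWithWork(b *testing.B) {
	for _, n := range []int{1, 2, 4, 8, 12, 50} {
		b.Run(fmt.Sprint("n=", n), func(b *testing.B) {
			in := make(chan int)
			defer close(in)
			out := Map(in, n, func(a int) int {
				time.Sleep(50 * time.Microsecond) // simulated work per item
				return a
			})
			Discard(out)
			b.ReportAllocs()
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				in <- 10
			}
		})
	}
}
```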
We’re now ready to compare the 3 approaches and measure exactly what price we pay for adding order preservation.
Approach 1: ReplyTo Channels
This is probably the most Go-native way to implement ordered concurrency. The ReplyTo pattern is well-known in Go (I also used it in my batching article), but somehow this was the hardest approach for me to explain clearly.
Here’s how it works:
- A packer goroutine creates jobs by attaching a unique replyTo channel to every input item.
- Workers process jobs concurrently and send results through those replyTo channels.
- An unpacker goroutine unpacks the values sent via the replyTo channels and writes them to the output.
The following diagram illustrates this pattern in more detail:
The left part of this diagram is sequential (packer and unpacker) while the worker pool on the right operates concurrently. Notice that workers can only send results when the unpacker is ready to receive them, because the replyTo channels are unbuffered. This creates natural backpressure and prevents unnecessary buffering.
func OrderedMap1[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
type Job struct {
Item A
ReplyTo chan B
}
// Packer goroutine.
// `jobs` chan will be processed by the pool
// `replies` chan will be consumed by unpacker goroutine
jobs := make(chan Job)
replies := make(chan chan B, n)
go func() {
for item := range in {
replyTo := make(chan B)
jobs <- Job{Item: item, ReplyTo: replyTo}
replies <- replyTo
}
close(jobs)
close(replies)
}()
// Worker pool of n goroutines.
// Sends results back via replyTo channels
Loop[Job, any](jobs, n, nil, func(job Job) {
job.ReplyTo <- f(job.Item) // Calculate the result and send it back
close(job.ReplyTo)
})
// Unpacker goroutine.
// Unpacks replyTo channels in order and sends results to the `out` channel
out := make(chan B)
go func() {
defer close(out)
for replyTo := range replies {
result := <-replyTo
out <- result
}
}()
return out
}
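The numbers below come from the same overhead-focused harness we used for the baseline. Here’s a minimal sketch of that benchmark for OrderedMap1 (illustrative; the other approaches are measured the same way):

```go
// Same throughput-style harness as BenchmarkMap, with Map swapped for OrderedMap1.
func BenchmarkOrderedMap1(b *testing.B) {
	for _, n := range []int{2, 4, 8, 12, 50} {
		b.Run(fmt.Sprint("n=", n), func(b *testing.B) {
			in := make(chan int)
			defer close(in)
			out := OrderedMap1(in, n, func(a int) int {
				return a // no-op: measure pure coordination overhead
			})
			Discard(out)
			b.ReportAllocs()
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				in <- 10
			}
		})
	}
}
```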
Performance Results:
Goroutines | Time /op | vs Baseline | Allocs/op |
---|---|---|---|
2 | 818.7ns | +410ns | 1 |
4 | 808.9ns | +364ns | 1 |
8 | 826.8ns | +280ns | 1 |
12 | 825.6ns | +225ns | 1 |
50 | 772.3ns | -281ns | 1 |
This approach introduces up to 410ns of overhead per input item compared to our baseline. Part of this cost comes from allocating a new replyTo channel for every item. Unfortunately, we can’t use a package-level sync.Pool to mitigate this because our function is generic – channels for different element types can’t share the same pool.
What’s also interesting about this result is that the overhead introduced by ordering becomes smaller as the number of goroutines grows. At some point an inversion even happens – OrderedMap1 becomes faster than Map (-281ns at 50 goroutines).
I haven’t investigated this phenomenon deeply. I believe it can’t be caused by inefficiencies inside Map, since it’s already based on the simplest possible channel-based worker pool. One guess I have is that in Map we have 50 goroutines competing to write into a single output channel, whereas in OrderedMap1, despite the additional moving parts, only one goroutine writes to the output.
Let’s now move on to the next approach.
Approach 2: sync.Cond for Turn-Taking
This was the first algorithm I implemented when I needed ordered concurrency, and it’s much easier to explain than the ReplyTo approach.
Here we attach an incremental index to each item and send it to the worker pool. Each worker performs the calculation, then waits its turn to write the result to the output channel.
This conditional waiting is implemented using a shared nextIndex variable protected by sync.Cond, a powerful but underused concurrency primitive from the standard library that allows goroutines to wait for specific conditions and be woken up when those conditions change.
Here’s how the turn-taking mechanism works:
Here, after each write, all workers wake up (via a broadcast) and recheck the “is it my turn?” condition.
func OrderedMap2[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
type Job struct {
Item A
Index int
}
// Indexer goroutine.
// Assign an index to each item from the input channel
jobs := make(chan Job)
go func() {
i := 0
for item := range in {
jobs <- Job{Item: item, Index: i}
i++
}
close(jobs)
}()
// Shared state.
// Index of the next result that must be written to the output channel.
nextIndex := 0
cond := sync.NewCond(new(sync.Mutex))
// Worker pool of n goroutines.
out := make(chan B)
Loop(jobs, n, out, func(job Job) {
result := f(job.Item) // Calculate the result
// Cond must be used with a locked mutex (see stdlib docs)
cond.L.Lock()
// wait until it's our turn to write the result
for job.Index != nextIndex {
cond.Wait()
}
// Write the result
out <- result
// Increment the index and notify all other workers
nextIndex++
cond.Broadcast()
cond.L.Unlock()
})
return out
}
Performance Results:
Goroutines | Time /op | vs Baseline | Allocs/op |
---|---|---|---|
2 | 867.7ns | +459ns | 0 |
4 | 1094ns | +649ns | 0 |
8 | 1801ns | +1255ns | 0 |
12 | 2987ns | +2387ns | 0 |
50 | 16074ns | +15021ns | 0 |
The results are telling – no more per-item allocations, which is excellent for memory efficiency. But there’s a critical flaw: significant performance degradation as the goroutine count increases. This happens because of the shared state and the “thundering herd” problem: after each write, all goroutines wake up via cond.Broadcast(), but only one of them will do useful work.
This inefficiency led me to think: “How can I wake only the goroutine that should write next?” And this is how the 3rd approach was born.
Approach 3: Permission Passing Chain
Here’s the key insight: when is it safe to write output #5? After output #4 was written. Who knows when output #4 was written? The goroutine that wrote it.
In this algorithm, each job must hold the write permission before its worker can send its result to the output channel. We chain jobs together so each one knows exactly which job comes next and can pass the permission to it. This is done by attaching two channels to each job: a canWrite channel to receive the permission, and a nextCanWrite channel to pass the permission to the next job.
This chain structure makes the worker logic remarkably simple:
- Calculate: Process the job using the provided function
- Wait: Receive the permission from the canWrite channel
- Write: Send the result to the output channel
- Pass: Send the permission to the next job via the nextCanWrite channel
Here’s the diagram that illustrates the whole flow:
The green arrows show how the permission to write is passed from one job to another along the chain. Essentially this is a token-passing algorithm that eliminates the “thundering herd” problem entirely — each goroutine wakes exactly one other goroutine, creating efficient point-to-point signaling rather than expensive broadcasts.
Let’s see how this translates to code. The implementation has two parts: a “linker” goroutine that builds the chain, and workers that follow the calculate-wait-write-pass pattern:
func OrderedMap3[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
type Job[A any] struct {
Item A
CanWrite chan struct{}
NextCanWrite chan struct{} // canWrite channel of the next job
}
// Linker goroutine:
// Builds a chain of jobs where each has a CanWrite channel attached.
// Additionally, each job knows about the CanWrite channel of the next job in the chain.
jobs := make(chan Job[A])
go func() {
defer close(jobs)
var canWrite, nextCanWrite chan struct{}
nextCanWrite = make(chan struct{}, 1)
close(nextCanWrite) // the first job can write immediately
for item := range in {
canWrite, nextCanWrite = nextCanWrite, make(chan struct{}, 1)
jobs <- Job[A]{item, canWrite, nextCanWrite}
}
}()
// Worker pool of n goroutines.
// Jobs pass the write permission along the chain.
out := make(chan B)
Loop(jobs, n, out, func(job Job[A]) {
result := f(job.Item) // Calculate the result
<-job.CanWrite // Wait for the write permission
out <- result // Write to the output channel
close(job.NextCanWrite) // Pass the permission to the next job
})
return out
}
Performance Results:
Goroutines | Time /op | vs Baseline | Allocs/op |
---|---|---|---|
2 | 927.2ns | +519ns | 1 |
4 | 939.8ns | +495ns | 1 |
8 | 860.7ns | +314ns | 1 |
12 | 823.8ns | +224ns | 1 |
50 | 609.8ns | -443ns | 1 |
Here the result is very similar to what we saw with the ReplyTo approach: almost the same overhead, the same inversion at higher concurrency levels, and the same extra allocation per item. But there’s one difference…
Unlike approach 1, here we’re allocating a non-generic chan struct{}. This means we can use a package-level sync.Pool to eliminate those allocations – let’s explore that next.
Approach 3a: Zero-Allocation Permission Passing Chain
Let’s create a pool for canWrite channels. The implementation is straightforward – the pool itself plus make/release functions.
// Package-level pool for canWrite channels
var canWritePool sync.Pool
func makeCanWriteChan() chan struct{} {
ch := canWritePool.Get()
if ch == nil {
return make(chan struct{}, 1)
}
return ch.(chan struct{})
}
func releaseCanWriteChan(ch chan struct{}) {
canWritePool.Put(ch)
}
Now let’s use the pool in the permission-passing algorithm. Since channels are reused, we can no longer signal by closing them. Instead, workers must read and write empty structs from/to these channels.
func OrderedMap3a[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
type Job[A any] struct {
Item A
CanWrite chan struct{}
NextCanWrite chan struct{} // canWrite channel of the next job
}
// Linker goroutine:
// Builds a chain of jobs where each has a CanWrite channel attached.
// Additionally, each job knows about the CanWrite channel of the next job in the chain.
jobs := make(chan Job[A])
go func() {
defer close(jobs)
var canWrite, nextCanWrite chan struct{}
nextCanWrite = makeCanWriteChan()
nextCanWrite <- struct{}{} // the first job can write immediately
for item := range in {
canWrite, nextCanWrite = nextCanWrite, makeCanWriteChan()
jobs <- Job[A]{item, canWrite, nextCanWrite}
}
}()
// Worker pool of n goroutines.
// Jobs pass the write permission along the chain.
out := make(chan B)
Loop(jobs, n, out, func(job Job[A]) {
result := f(job.Item) // Calculate the result
<-job.CanWrite // Wait for the write permission
out <- result // Write to the output channel
releaseCanWriteChan(job.CanWrite) // Release our canWrite channel to the pool
job.NextCanWrite <- struct{}{} // Pass the permission to the next job
})
return out
}
Performance Results with Pooling:
Goroutines | Time /op | vs Baseline | Allocs/op |
---|---|---|---|
2 | 891.0ns | +482ns | 0 |
4 | 916.5ns | +471ns | 0 |
8 | 879.5ns | +333ns | 0 |
12 | 872.6ns | +272ns | 0 |
50 | 657.6ns | -395ns | 0 |
Perfect! Zero allocations and good performance, meaning less GC pressure for long running jobs. But this approach has one more trick up its sleeve…
One more thing: Building Reusable Abstractions
The permission passing approach has another significant advantage over the ReplyTo method: it controls when to write rather than where to write.
I’ll admit it – sometimes I get a bit obsessed with building clean abstractions. When working on rill, I really wanted to extract this ordering logic into something reusable and testable. This “when vs where” distinction was an AHA moment for me.
Since the algorithm doesn’t care where the outputs are written, it’s easy to abstract it into a separate function – OrderedLoop. The API is very similar to the Loop function we used before, but here the user function receives two arguments – an item and a canWrite channel. It’s important that the user function reads from the canWrite channel exactly once, to avoid deadlocks or undefined behavior.
func OrderedLoop[A, B any](in <-chan A, done chan<- B, n int, f func(a A, canWrite <-chan struct{})) {
type Job[A any] struct {
Item A
CanWrite chan struct{}
NextCanWrite chan struct{} // canWrite channel of the next job
}
// Linker goroutine:
// Builds a chain of jobs where each has a CanWrite channel attached.
// Additionally, each job knows about the CanWrite channel of the next job in the chain.
jobs := make(chan Job[A])
go func() {
defer close(jobs)
var canWrite, nextCanWrite chan struct{}
nextCanWrite = makeCanWriteChan()
nextCanWrite <- struct{}{} // the first job can write immediately
for item := range in {
canWrite, nextCanWrite = nextCanWrite, makeCanWriteChan()
jobs <- Job[A]{item, canWrite, nextCanWrite}
}
}()
// Worker pool of n goroutines.
// Jobs pass the write permission along the chain.
Loop(jobs, n, done, func(job Job[A]) {
f(job.Item, job.CanWrite) // Do the work
releaseCanWriteChan(job.CanWrite) // Release item's canWrite channel to the pool
job.NextCanWrite <- struct{}{} // Pass the permission to the next job
})
}
The typical usage looks like:
OrderedLoop(in, out, n, func(a A, canWrite <-chan struct{}) {
// [Do processing here]
// Everything above this line is executed concurrently,
// everything below it is executed sequentially and in order
<-canWrite
// [Write results somewhere]
})
With this abstraction in hand, it’s remarkably simple to build any ordered operation. For example, OrderedMap becomes just 7 lines of code:
func OrderedMap3b[A, B any](in <-chan A, n int, f func(A) B) <-chan B {
out := make(chan B)
OrderedLoop(in, out, n, func(a A, canWrite <-chan struct{}) {
result := f(a)
<-canWrite
out <- result
})
return out
}
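To see the guarantee end to end, here’s a small usage sketch (illustrative only): squares are computed by four goroutines concurrently, yet the results come out strictly in input order.

```go
func demoOrderedMap() {
	in := make(chan int)
	go func() {
		defer close(in)
		for i := 0; i < 10; i++ {
			in <- i
		}
	}()

	out := OrderedMap3b(in, 4, func(x int) int { return x * x })

	for v := range out {
		fmt.Println(v) // always 0, 1, 4, 9, ... regardless of which goroutine finished first
	}
}
```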
We can also easily build an OrderedFilter that conditionally writes outputs:
func OrderedFilter[A any](in <-chan A, n int, predicate func(A) bool) <-chan A {
out := make(chan A)
OrderedLoop(in, out, n, func(a A, canWrite <-chan struct{}) {
keep := predicate(a)
<-canWrite
if keep {
out <- a
}
})
return out
}
Or even an OrderedSplit that distributes items to two channels based on a predicate:
func OrderedSplit[A any](in <-chan A, n int, predicate func(A) bool) (<-chan A, <-chan A) {
outTrue := make(chan A)
outFalse := make(chan A)
done := make(chan struct{})
OrderedLoop(in, done, n, func(a A, canWrite <-chan struct{}) {
shouldGoToTrue := predicate(a)
<-canWrite
if shouldGoToTrue {
outTrue <- a
} else {
outFalse <- a
}
})
go func() {
<-done
close(outTrue)
close(outFalse)
}()
return outTrue, outFalse
}
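And a quick usage sketch for it (illustrative): each output channel preserves the input order, and both are drained concurrently, since workers block until their result is received.

```go
func demoOrderedSplit() {
	in := make(chan int)
	go func() {
		defer close(in)
		for i := 0; i < 10; i++ {
			in <- i
		}
	}()

	evens, odds := OrderedSplit(in, 4, func(x int) bool { return x%2 == 0 })

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for v := range evens {
			fmt.Println("even:", v) // 0, 2, 4, 6, 8 – in order
		}
	}()
	go func() {
		defer wg.Done()
		for v := range odds {
			fmt.Println("odd:", v) // 1, 3, 5, 7, 9 – in order
		}
	}()
	wg.Wait()
}
```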
Simply put, this abstraction makes building ordered operations trivial.
Performance Comparison
Here’s how all approaches perform across different concurrency levels:
Concurrency | Baseline | Approach 1 (ReplyTo) | Approach 2 (sync.Cond) | Approach 3 (Permission) | Approach 3a (+ Pool) |
---|---|---|---|---|---|
2 | 408.6ns | 818.7ns | 867.7ns | 927.2ns | 891.0ns |
4 | 445.1ns | 808.9ns | 1094ns | 939.8ns | 916.5ns |
8 | 546.4ns | 826.8ns | 1801ns | 860.7ns | 879.5ns |
12 | 600.2ns | 825.6ns | 2987ns | 823.8ns | 872.6ns |
50 | 1053ns | 772.3ns | 16074ns | 609.8ns | 657.6ns |
Zero allocs | ✅ | ❌ | ✅ | ❌ | ✅ |
Key Takeaways
- sync.Cond is a no-go for ordered concurrency – While it starts with decent performance at low concurrency, it completely falls apart as the goroutine count increases, due to the thundering herd problem.
- ReplyTo is a strong contender – It adds at most ~500ns of overhead compared to the baseline, but requires one additional allocation per input item, increasing GC pressure.
- Permission Passing emerges as the clear winner – It has it all:
  - Good performance: at most ~500ns of overhead compared to the baseline
  - Zero allocations: less GC pressure for long-running tasks
  - Clean abstraction: the core synchronization logic can be abstracted away and used to build various concurrent operations
  - Maintainability: separation of concerns and the intuitive “calculate → wait → write → pass” pattern make the code easy to support and reason about
This exploration shows that ordered concurrency doesn’t have to be expensive. With the right approach, you can have concurrency, ordering and backpressure at the same time. The permission passing pattern, in particular, demonstrates how Go’s channels can be used creatively to solve complex coordination problems.
Finally, these patterns have been battle-tested in production through the rill concurrency toolkit (1.7k 🌟 on GitHub). It implements Map, OrderedMap, and many other concurrent operations. Rill focuses on composability – operations chain together into larger pipelines – while adding comprehensive error handling, context-friendly design, and maintaining over 95% test coverage.