It’s common for this kind of work to be scoped as a small feature request: connect the platform to an external accounting system, add a scheduled export job, ship it. If you only count the feature being asked for, that scope might even be accurate.
The problem shows up in the system that already exists. If your platform has grown three or four integrations and each one was built differently — different retry behaviour, different error handling, different config patterns — you’re accumulating something that looks manageable in isolation and becomes a liability at scale. Some integrations silently eat failures. Some have no alerting. None of them share infrastructure with each other. Each one is its own snowflake.
When the next integration request lands, you face a choice: build another snowflake, or stop and ask whether this is the last time you’ll be making that choice.
The cost of ad-hoc integrations
The individual cost of each snowflake integration is low. Connect to the external system, wire up an HTTP call, handle errors roughly, ship it. Done in days.
The systemic cost accrues invisibly. You don’t notice it until you’re debugging a production failure at 10pm and you realize:
- There’s no standard way to find out which integrations are registered and what state they’re in
- One integration retries three times before giving up; another retries forever; a third doesn’t retry at all
- A config change was made in production weeks ago and nobody remembers by whom or why
- An export job failed silently and the operations team only found out when a downstream system complained
Each of these problems is solvable individually. Add logging here, add a retry here, add a config record there. The patches accumulate. The architecture doesn’t improve. The next integration — and there’s always a next integration — inherits the same gaps.
The trigger for taking this seriously is usually a sufficiently complex request: cross-system data sync with an external provider, automated exports on a schedule, configuration that needs to be managed by the operations team without developer involvement. Too complex for a snowflake. Too visible to fail silently.
What building the platform actually means
The first decision is to define the surface area. An integration platform, in this context, means a shared layer that every concrete integration is built on top of — not a system you implement once for one integration.
Here’s what that layer needs to include:
Shared contracts. Every integration implements the same interfaces. The platform doesn’t care what the integration does — it cares that it can register it, invoke it, and observe it consistently. This is the foundation; without it, none of the rest of this is reusable.
In .NET, this typically takes the form of a single handler interface that each integration implements:
```csharp
public interface IIntegrationHandler
{
    string IntegrationName { get; }
    Task<IntegrationResult> ExecuteAsync(IntegrationContext context, CancellationToken ct);
}

// Each integration implements this interface.
// The platform discovers and invokes handlers via DI —
// adding a new integration means registering a new handler.
services.AddIntegration<AccountingExportHandler>("accounting-export");
```
The platform discovers all registered handlers at startup, routes by name, and observes them uniformly. Adding a new integration is registering a new handler — no changes to shared infrastructure code required.
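A sketch of what the platform side of that contract might look like — `AddIntegration`, `IntegrationDispatcher`, `IntegrationContext`, and `IntegrationResult` are illustrative names for this article, not a real library API:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

public static class IntegrationRegistration
{
    public static IServiceCollection AddIntegration<THandler>(
        this IServiceCollection services, string name)
        where THandler : class, IIntegrationHandler
    {
        // Every handler is registered under the shared interface so the
        // dispatcher can resolve them all as one collection. The name
        // could be validated here against the handler's IntegrationName.
        return services.AddSingleton<IIntegrationHandler, THandler>();
    }
}

public class IntegrationDispatcher
{
    private readonly Dictionary<string, IIntegrationHandler> _handlers;

    // DI injects every registered handler; index them by name at startup.
    public IntegrationDispatcher(IEnumerable<IIntegrationHandler> handlers)
        => _handlers = handlers.ToDictionary(h => h.IntegrationName);

    public Task<IntegrationResult> DispatchAsync(
        string name, IntegrationContext context, CancellationToken ct)
        => _handlers[name].ExecuteAsync(context, ct);
}
```

The dispatcher is also the natural seam for observability: one place to log, time, and alert on every integration uniformly.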
Circuit breaker. When an external system is unavailable, stop sending requests. The circuit breaker opens after a threshold of failures and blocks requests until a cooldown period expires. Without this, a temporarily unavailable external system causes cascading load on both sides — the kind of problem that turns a minor outage into a major one.
Polly is the standard .NET library for this. A minimal circuit breaker configuration looks like:
```csharp
// Using Polly (https://github.com/App-vNext/Polly)
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromMinutes(1));
```
After three consecutive failures, the circuit opens for one minute. Calls during that window fail fast rather than piling up against an unavailable system. Tune the threshold and break duration to match your external system’s expected recovery time.
Retry with exponential backoff. Not retry-three-times-immediately. Configurable retry policies where the wait between attempts grows: 30 seconds, then 2 minutes, then 5 minutes. Transient failures become permanent ones when you retry too aggressively. Give downstream systems time to recover.
With Polly, this is a WaitAndRetryAsync policy with explicit delay intervals:
```csharp
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(new[]
    {
        TimeSpan.FromSeconds(30),
        TimeSpan.FromMinutes(2),
        TimeSpan.FromMinutes(5),
    });
```
The delays here are a reasonable starting point for integrations that hit external HTTP APIs. For integrations with longer recovery windows — batch jobs, nightly exports — you may want larger intervals. The key constraint is that the policy lives in one place, configured per integration type, not reimplemented inline.
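The two policies are meant to compose, not live side by side. With Polly's PolicyWrap — assuming the `retryPolicy` and `circuitBreakerPolicy` defined above, and an `httpClient` and `exportUrl` that are placeholders here:

```csharp
// Retry wraps the circuit breaker: each retry attempt passes through the
// breaker, so once the circuit opens, the remaining attempts fail fast
// instead of hammering an unavailable system.
var resilientPolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

var response = await resilientPolicy.ExecuteAsync(
    token => httpClient.GetAsync(exportUrl, token),
    ct);
```

Order matters: outermost policy first, so a wrap of (retry, breaker) retries around the breaker rather than the reverse.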
Failure alerting. When retries are exhausted, the operations team needs to know — not “check the logs,” but an actual notification. Silent failures are a design smell. If the system can detect a failure, it should tell someone.
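One way to wire this in at the platform layer, assuming the `retryPolicy` from earlier and an `IAlertSender` abstraction (Slack, PagerDuty, email — whatever the operations team actually watches):

```csharp
// When the final retry fails, the exception escapes the policy. The
// platform catches it in exactly one place and notifies, rather than
// each integration logging and moving on.
try
{
    await retryPolicy.ExecuteAsync(() => handler.ExecuteAsync(context, ct));
}
catch (Exception ex)
{
    await alertSender.SendAsync(
        $"Integration '{handler.IntegrationName}' failed after all retries: {ex.Message}");
    throw; // still surface the failure to the job infrastructure
}
```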
Config audit logging. Every change to integration configuration — credentials, schedules, field mappings — writes an immutable record: who changed it, when, what it was before, what it is now. This sounds bureaucratic until you’re in a production incident and the first question is “what changed?”
The implementation pattern is an append-only audit table — no updates, no deletes. Columns: ChangedBy, ChangedAt, FieldName, OldValue, NewValue, CorrelationId. Querying for “what changed in the last 24 hours” becomes a single SELECT against a table that can never be retroactively modified.
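In code, the same constraint can be expressed in the persistence surface itself — a sketch, with the interface name assumed; the point is that it offers append and query but no update or delete:

```csharp
// Mirrors the audit table columns described above.
public record ConfigAuditEntry(
    string ChangedBy,
    DateTimeOffset ChangedAt,
    string FieldName,
    string? OldValue,
    string? NewValue,
    Guid CorrelationId);

public interface IConfigAuditLog
{
    // Append-only by construction: no update or delete members exist.
    Task AppendAsync(ConfigAuditEntry entry, CancellationToken ct);

    // "What changed in the last 24 hours?" becomes one call.
    Task<IReadOnlyList<ConfigAuditEntry>> QuerySinceAsync(
        DateTimeOffset since, CancellationToken ct);
}
```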
Manual retry UI. The operations team can retry a failed integration job without filing a ticket and waiting for a developer. This single capability removes an entire category of support escalations.
Dry-run mode. Run the integration against real infrastructure without committing changes. This makes testing against external systems safe — you can verify field mappings, authentication, and data shape before anything actually moves.
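Inside a handler, dry-run can be as simple as a flag on the shared context — `DryRun` and the helper methods below are assumptions for illustration:

```csharp
public async Task<IntegrationResult> ExecuteAsync(
    IntegrationContext context, CancellationToken ct)
{
    // Everything up to the write runs against real infrastructure:
    // authentication, data fetch, field mapping, validation.
    var payload = await BuildExportPayloadAsync(context, ct);

    if (context.DryRun)
    {
        // Report what would have been sent without sending it.
        return IntegrationResult.DryRun(payload.Summary());
    }

    await SendToExternalSystemAsync(payload, ct);
    return IntegrationResult.Success();
}
```

Because the flag lives on the shared context, every integration gets dry-run for free instead of each one inventing its own test mode.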
OAuth2 client credentials flow. If your integrations use OAuth2 — and many external systems require it — implement it once as a reusable provider rather than per-integration. Tokens are managed centrally, with refresh handled transparently.
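A minimal sketch of such a provider, using only `HttpClient` and `System.Text.Json` — the token endpoint, credential handling, and response field names are assumptions and vary by provider:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public class ClientCredentialsTokenProvider
{
    private readonly HttpClient _http;
    private readonly string _tokenEndpoint;
    private readonly string _clientId;
    private readonly string _clientSecret;
    private string? _cachedToken;
    private DateTimeOffset _expiresAt;

    public ClientCredentialsTokenProvider(
        HttpClient http, string tokenEndpoint, string clientId, string clientSecret)
        => (_http, _tokenEndpoint, _clientId, _clientSecret) =
           (http, tokenEndpoint, clientId, clientSecret);

    public async Task<string> GetTokenAsync(CancellationToken ct)
    {
        // Reuse the cached token until shortly before expiry, so an
        // in-flight request never carries a token that dies mid-call.
        if (_cachedToken is not null &&
            DateTimeOffset.UtcNow < _expiresAt - TimeSpan.FromMinutes(1))
            return _cachedToken;

        using var response = await _http.PostAsync(_tokenEndpoint,
            new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["grant_type"] = "client_credentials",
                ["client_id"] = _clientId,
                ["client_secret"] = _clientSecret,
            }), ct);
        response.EnsureSuccessStatusCode();

        using var json = JsonDocument.Parse(
            await response.Content.ReadAsStringAsync(ct));
        _cachedToken = json.RootElement.GetProperty("access_token").GetString()!;
        _expiresAt = DateTimeOffset.UtcNow.AddSeconds(
            json.RootElement.GetProperty("expires_in").GetInt32());
        return _cachedToken;
    }
}
```

Every integration then asks the provider for a token instead of owning its own OAuth2 code, and token refresh happens in one audited, observable place.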
None of these are sophisticated ideas in isolation. Every one of them exists in some form in most serious integrations. The point is to build them once, in a shared layer, before implementing the first concrete integration on top of them.
The honest trade-off
The first integration through a new platform takes longer to deliver than a snowflake would have. Building the shared layer adds real time to the timeline before the first end-to-end integration runs.
That’s the cost you pay, and it’s real. If this were the only integration you’d ever build, the platform would be the wrong call. The snowflake would be faster, cheaper, and entirely sufficient.
But the platform pays back within the same body of work if that work is complex enough. A scheduled export job — one that runs nightly, maps field data through a configuration-driven transform layer, authenticates via OAuth2, and reports failures to the operations team — can be built almost entirely by wiring together what the platform already provides. The circuit breaker and retry logic are already there. The alerting is already there. The audit trail is already there. The OAuth2 provider is already there.
And the next integration after that takes a fraction of what the first one took. Not because the engineers are faster — because the integration surface they need already exists.
When the investment is worth it
The snowflake approach is rational when:
- The integration is genuinely one-off with no likely successors
- The feature is small and the blast radius of failure is limited
- Speed to ship is the dominant constraint and the team will never look at this code again
The platform investment starts paying off when:
- You’re looking at the third or fourth integration and recognizing the same problems being solved again
- Failure modes matter — silent failures or cascading failures have real operational cost
- The operations team needs to manage integrations without developer involvement
- You want to be able to look at the system in two years and understand what’s registered, what’s running, and what’s failing
The specific trigger is usually scope and visibility. An integration that spans scheduling, data transformation, external authentication, and operational tooling is not something you can reasonably bolt onto shared infrastructure that doesn’t exist yet. The cost of doing it wrong is too visible.
The pattern
When you build the platform before the feature, you’re making a bet: that you’ll build enough integrations over a long enough period that the compounding return on shared infrastructure exceeds the upfront cost. That bet pays off more often than you’d think, because integration requests rarely stop at one.
The mistake that costs the most isn’t building the platform too early — it’s building five or six snowflakes before admitting you need a platform. By that point the problem isn’t just the next integration. It’s the ones you already have that all need to be retrofitted or left as permanent debt.
Build the platform when the cost of not having it becomes visible. Not when you’re drowning in snowflakes. Not so early you’re abstracting problems you don’t have yet. When the next integration is complex enough that you can see — clearly, specifically — what you’re going to build a second and third time if you don’t stop and generalize it now.
That’s when you stop building the feature and start building the platform.
Further Reading
- Polly — .NET resilience and transient-fault-handling library; the standard implementation for circuit breaker and retry in the .NET ecosystem
- Martin Fowler: Circuit Breaker — canonical pattern description with state machine diagram
- Microsoft: Cloud Design Patterns — covers circuit breaker, retry, health endpoint monitoring, and related patterns with implementation guidance
- Enterprise Integration Patterns — Hohpe & Woolf’s reference for messaging and integration architecture; the vocabulary most integration platform discussions build on
The views expressed here are my own. Examples and scenarios are composites drawn from broad industry experience and do not represent any specific organization, product, or system.