Five critical API integration errors at launch
Error number 1. No idempotency and a retry storm
Symptoms: duplicate orders/payments, mismatched amounts, disputed refunds, a growing number of DLQ alerts.
Root: repeated delivery of requests/webhooks and network flaps are normal. If a "create/charge" operation is not idempotent, retries multiply the damage.
How to fix:
- Add an Idempotency-Key / 'operation_id' to all unsafe methods (POST/PATCH).
- Put a unique index on 'operation_id' in the database. On a replay, return the previous result.
- Process webhooks through an Inbox table (dedupe by 'event_id + signature'). Outbound events go through an Outbox.
- Retries: at most 1-2, exponential backoff + jitter, only for safe operations.
HTTP convention (example):
```http
POST /v1/payments
Idempotency-Key: ik_f35a2
Content-Type: application/json

{"amount": 5000, "currency": "EUR", "source": "card_..."}
```
SQL protection (simplified):
```sql
ALTER TABLE payments ADD CONSTRAINT uniq_op UNIQUE (operation_id);
```
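To make the "replay returns the previous result" rule concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real one. The `create_payment` function and the `response` column are hypothetical names chosen for illustration; only the unique constraint on `operation_id` comes from the text above.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments ("
    " operation_id TEXT NOT NULL UNIQUE,"  # plays the role of the uniq_op constraint
    " response TEXT NOT NULL)"
)

def create_payment(operation_id: str, amount: int, currency: str) -> dict:
    """Create a payment once per operation_id; a replay returns the stored result."""
    result = {"status": "created", "amount": amount, "currency": currency}
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO payments (operation_id, response) VALUES (?, ?)",
                (operation_id, json.dumps(result)),
            )
        return result
    except sqlite3.IntegrityError:
        # Unique constraint hit: this operation was already processed.
        row = db.execute(
            "SELECT response FROM payments WHERE operation_id = ?",
            (operation_id,),
        ).fetchone()
        return json.loads(row[0])

first = create_payment("ik_f35a2", 5000, "EUR")
replay = create_payment("ik_f35a2", 9999, "EUR")  # retried, even with a different body
```

In this sketch the replay silently gets the first result. A stricter service might instead reject a replay whose body differs from the original; that is a design choice the text does not prescribe.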
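Retries with jitter (pseudocode):
```python
import random
import time

def call_with_retry(payload):
    for i in range(2):  # at most 2 attempts
        try:
            return call_api(payload, timeout=0.6)
        except Timeout:
            # exponential backoff (50 ms, 100 ms) plus up to 50 ms of jitter
            time.sleep(0.05 * 2**i + random.uniform(0, 0.05))
    raise UpstreamUnavailable
```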
Checklist:
- All money-moving/creating logic has an 'operation_id' and a unique index.
- Inbound webhooks go only through the Inbox, with an idempotent worker.
- The client SDK automatically sets the Idempotency-Key.
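The Inbox pattern from this section can be sketched as follows, again with in-memory SQLite as a stand-in. The table layout and the names `receive_webhook` and `process_pending` are assumptions for illustration; the dedupe key ('event_id' + signature) is the one named above.

```python
import sqlite3

inbox = sqlite3.connect(":memory:")
inbox.execute(
    "CREATE TABLE inbox ("
    " event_id TEXT NOT NULL,"
    " signature TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " processed INTEGER NOT NULL DEFAULT 0,"
    " PRIMARY KEY (event_id, signature))"  # dedupe key
)

def receive_webhook(event_id: str, signature: str, payload: str) -> bool:
    """Store the event; return False if it is a duplicate delivery."""
    with inbox:
        cur = inbox.execute(
            "INSERT OR IGNORE INTO inbox (event_id, signature, payload)"
            " VALUES (?, ?, ?)",
            (event_id, signature, payload),
        )
    return cur.rowcount == 1  # 0 rows changed means the key already existed

def process_pending(handler) -> int:
    """Worker step: hand each unprocessed event to handler, then mark it done."""
    rows = inbox.execute(
        "SELECT event_id, signature, payload FROM inbox WHERE processed = 0"
    ).fetchall()
    for event_id, signature, payload in rows:
        handler(payload)
        with inbox:
            inbox.execute(
                "UPDATE inbox SET processed = 1"
                " WHERE event_id = ? AND signature = ?",
                (event_id, signature),
            )
    return len(rows)
```

If the worker crashes between `handler(payload)` and the `UPDATE`, the event is handled again on the next run, which is why the handler itself still has to be idempotent.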
Error number 2. Timeouts/retries vs. SLO: overheating dependencies
Symptoms: p95 suddenly drifts upward, queues grow, the circuit breaker keeps tripping.
Root: the overall response SLO is 400-600 ms, but timeouts to external APIs are 1-2 s, with retries × 3 on top. You take longer than you are allowed to, and you hammer the dependency with repeats.
How to fix:
- Time budget: if the SLO is 400 ms, set the upstream timeout to 250-300 ms; the total request timeout must be ≤ the SLO.
- Limits/backpressure: semaphores or a worker pool for calls to each dependency. When full → 429/503 immediately.
- Circuit breaker: 'open' on timeouts/5xx, 'half-open' lets a measured trickle of requests through.
- Admission control: restrict concurrency (per thread, per endpoint/PSP).
Example (Go):
```go
var sem = make(chan struct{}, 64) // concurrency limit for PSP calls

func callPSP(ctx context.Context, req Req) (Res, error) {
	select {
	case sem <- struct{}{}:
		defer func() { <-sem }()
		c, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
		defer cancel()
		return psp.Do(c, req)
	default:
		return Res{}, ErrBusy // fail immediately instead of queueing forever
	}
}
```
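The circuit breaker from the list above could look roughly like this in Python. It is a deliberately small sketch: the class name, thresholds, and the injectable `clock` are assumptions, and a production breaker would typically admit only a single trial call in the half-open state, while this one admits any call once the cool-down has elapsed.

```python
import time

class CircuitBreaker:
    """Opens after max_failures consecutive failures; once reset_after
    seconds have passed, calls are allowed again as trials (half-open)."""

    def __init__(self, max_failures=5, reset_after=10.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow trial calls once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cool-down
```

A caller would check `allow()` before each upstream call, return 503 right away when it is False, and report the outcome with `record_success()` or `record_failure()` (for timeouts and 5xx).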
Checklist:
- Timeouts are shorter than the SLO; retries ≤ 2; jitter is applied.
- Pools/semaphores for external APIs; circuit breaker with metrics.
- On busy routes, we return 429 with Retry-After rather than holding connections open.
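One way to keep every upstream timeout inside the SLO is to carry a deadline through the request and derive each call's timeout from what is left. The sketch below assumes a hypothetical `Deadline` helper; the 400 ms and 300 ms figures mirror the example numbers in this section.

```python
import time

class Deadline:
    """Tracks how much of the request's time budget is left."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, cap_s: float) -> float:
        """Per-call timeout: the cap, or less if the budget is nearly spent."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("request budget exhausted")
        return min(cap_s, left)

# Usage: a 400 ms SLO with a 300 ms cap per upstream call.
deadline = Deadline(0.4)
first_call_timeout = deadline.timeout_for(0.3)
```

A retry then automatically gets only the time that remains, so the sum of attempts cannot exceed the budget.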
Error number 3. Weak Security: Webhook Signatures, Secrets, TLS
Symptoms: forged ("someone else's") webhooks get through, secrets end up in code/logs, MITM risks.
Root: no signature/freshness check, secrets live in env files, old TLS and weak headers.
How to fix:
- Sign webhooks with HMAC-SHA256 + 'X-Timestamp' (window ≤ 5-10 minutes); compare signatures strictly, in constant time.
- mTLS for critical integrations, or an IP allow-list.
- Rotate secrets via Vault/Cloud KMS; least privilege; audit of secret reads.
- TLS 1.2/1.3 only, HSTS, correct CORS (narrow origin list).
Signature Verification (Python):
```python
import hashlib
import hmac
import time

# secret is bytes, body is the raw request body as str; Expired/BadSig are app-defined
def verify(sig_hdr, ts_hdr, body, secret):
    if abs(time.time() - int(ts_hdr)) > 600: raise Expired()
    calc = hmac.new(secret, (ts_hdr + "." + body).encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(calc, sig_hdr): raise BadSig()
```
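For testing such a check end to end, it helps to have the sender side as well. The sketch below signs `timestamp + "." + body` the same way the verification above does. The `X-Signature` header name and the functions `sign` and `is_valid` are assumptions for illustration, not part of any particular provider's API.

```python
import hashlib
import hmac
import time

def sign(body: str, secret: bytes, now=None) -> dict:
    """Return the headers a sender would attach to a webhook."""
    ts = str(int(time.time() if now is None else now))
    sig = hmac.new(secret, (ts + "." + body).encode(), hashlib.sha256).hexdigest()
    return {"X-Timestamp": ts, "X-Signature": sig}

def is_valid(headers: dict, body: str, secret: bytes, window_s: int = 600) -> bool:
    """Boolean variant of the check: fresh timestamp and matching HMAC."""
    ts = headers.get("X-Timestamp", "")
    try:
        fresh = abs(time.time() - int(ts)) <= window_s
    except ValueError:  # missing or non-numeric timestamp
        return False
    if not fresh:
        return False
    expected = hmac.new(secret, (ts + "." + body).encode(), hashlib.sha256).hexdigest()
    received = headers.get("X-Signature", "")
    # Constant-time comparison on bytes avoids leaking how many characters match.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Because the timestamp is part of the signed message, an attacker cannot replay an old body with a fresh timestamp without knowing the secret.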
Checklist:
- All webhooks are signed and verified; the freshness window is limited.
- Secrets are in KMS/Vault, with rotation and auditing.
- TLS/HSTS enabled; CORS is narrowly scoped; IP allow-list/mTLS where appropriate.
Error number 4. Contract drift: the schema "took on a life of its own"
Symptoms: production broke "only for some clients," 500/422 in the logs, different SDK and API versions disagree.
Root: no strict contract description, backward-incompatible changes, "silent" fields, different meanings for the same names.
How to fix:
- Contract-first: OpenAPI/AsyncAPI + server/client generation; for events - Avro/Protobuf + Schema Registry.
- Versioning: 'v1 → v2' (URI/header), deprecation plan, grace period.
- Backward compatibility: only additive changes in minor releases; nothing may be deleted/renamed without a version bump.
- Contract tests: Pact/Buf - provider/consumer are tested in CI.
Examples:
```yaml
# OpenAPI: explicit type for an amount in minor units
amount_minor:
  type: integer
  minimum: 0
  description: Amount in minor currency units (integer)
```
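The "additive changes only" rule can be illustrated with a toy compatibility check. This is only a sketch over a simplified, made-up schema shape (`{field: {"type", "required"}}`); real contract tooling works on full OpenAPI or Protobuf definitions and covers far more cases.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Compare two {field: {"type": ..., "required": bool}} schemas."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        # A new optional field is additive; a new required one breaks old clients.
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems

v1 = {"amount_minor": {"type": "integer", "required": True}}
additive = {**v1, "note": {"type": "string", "required": False}}
renamed = {"amount": {"type": "number", "required": True}}

compatible = breaking_changes(v1, additive)
incompatible = breaking_changes(v1, renamed)
```

A CI step built on this idea would fail the build whenever the returned list is non-empty for a minor release.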
Checklist:
- Contracts are stored in git; CI validates them and fails the build on incompatible changes.
- Schema registry for events, with backward/forward compatibility.
- Documentation page with the changelog, deprecation dates, and a test bench for partners.
Error number 5. "Blind" launch: no metrics/logs/traces and no sandbox
Symptoms: "nothing is visible," support is swamped, debugging is done by hand in prod.
Root: observability was not built in, there are no synthetic checks, the sandbox was tested "on paper only."
How to fix:
- RED/USE metrics: rate/error/latency on each endpoint, by route/method.
- Correlation: 'trace_id' in all logs and responses; link each request to its webhook.
- Synthetics: health checks (login/deposit in the sandbox), SLA monitoring T+60 for webhooks.
- Sandbox/stage: fully isolated keys/domains, mock PSPs, records excluded from reports.
Response with trace ID:
```http
HTTP/1.1 202 Accepted
Trace-Id: 7f2b3d8e9c1a4
Location: /v1/ops/req_42/status
```
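Getting the same trace ID into every log line and into the response header can be sketched with the standard library alone. The logger name, the `handle_request` function, and the JSON field names are assumptions for illustration; the `StringIO` stream is only a stand-in for stdout so the output is easy to inspect.

```python
import contextvars
import io
import json
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": trace_id_var.get(),  # same value as the Trace-Id header
        })

stream = io.StringIO()  # stand-in for stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("integration")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

def handle_request(incoming_trace_id=None) -> dict:
    """Reuse the caller's trace ID or mint one; return the response headers."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    log.info("payment accepted")
    return {"Trace-Id": trace_id}

headers = handle_request("7f2b3d8e9c1a4")
```

Storing the same ID next to the operation makes it possible to tie a later webhook back to the original request.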
Checklist:
- RED/USE metrics, dashboards, alerts (symptoms + causes).
- End-to-end traces; JSON logs, no PII, with 'trace_id'.
- Synthetics from key regions; a sandbox is mandatory, with separate keys.
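As an illustration of what RED metrics per route/method amount to, here is a small in-memory sketch. The `RedMetrics` class is hypothetical; a real service would export these counters to its metrics system rather than keep raw durations in lists.

```python
import math
from collections import defaultdict

class RedMetrics:
    """In-memory Rate/Errors/Duration data keyed by (method, route)."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.durations_ms = defaultdict(list)

    def observe(self, method: str, route: str, status: int, duration_ms: float):
        key = (method, route)
        self.requests[key] += 1
        if status >= 500:  # count server-side failures as errors
            self.errors[key] += 1
        self.durations_ms[key].append(duration_ms)

    def error_rate(self, method: str, route: str) -> float:
        key = (method, route)
        total = self.requests[key]
        return self.errors[key] / total if total else 0.0

    def p95_ms(self, method: str, route: str) -> float:
        """Nearest-rank 95th percentile of the recorded durations."""
        data = sorted(self.durations_ms[(method, route)])
        if not data:
            return 0.0
        rank = math.ceil(95 * len(data) / 100)
        return data[rank - 1]
```

Alerting on `error_rate` and `p95_ms` per route covers the "symptoms" side of the checklist; cause-level alerts (pool saturation, queue depth) need their own signals.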
Prelaunch Plan (T-7 → T-0)
T-7 days:
- Final contract review: check for incompatible changes; freeze schemas.
- Secrets/certificates: check rotation, access, KMS policies.
- Synthetics run 24×7; alerts are routed to on-call.
- Mini load test (burst of 2-5 minutes): p95/pools/queues in the green zone.
- Webhook dry run (replays, 5xx, jitter), DLQ check.
- "Phone book" of partners: L1/L2 contacts, war-room channel.
- Ramp traffic 5% → 25% → 50% through SLO gates; rollback is ready.
- Kill switch/feature flags for risky features are in place.
- War-room is active, status templates are prepared.
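The SLO-gated traffic ramp from the plan can be expressed as a tiny decision function. This is a sketch under assumptions: the function name, the 1% error threshold, and "return 0 to roll back" are illustrative choices; the 5/25/50 stages and the 400 ms figure echo numbers used earlier in the article.

```python
def next_traffic_share(current: int, p95_ms: float, error_rate: float,
                       slo_p95_ms: float = 400.0, max_error_rate: float = 0.01,
                       stages=(5, 25, 50)) -> int:
    """Return the next traffic share in percent, or 0 to roll back."""
    if p95_ms > slo_p95_ms or error_rate > max_error_rate:
        return 0  # gate failed: send traffic back to the stable version
    later = [s for s in stages if s > current]
    return later[0] if later else current  # stay put at the last stage

share_after_good_run = next_traffic_share(5, p95_ms=310.0, error_rate=0.002)
share_after_slow_run = next_traffic_share(25, p95_ms=450.0, error_rate=0.0)
```

Evaluating the gate on metrics from the current stage, rather than on a fixed schedule, is what keeps the rollback decision mechanical during a tense launch.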
Rollback plan (if something goes wrong)
1. Shift traffic back to the previous stable version/route.
2. Turn off the feature flags for the questionable changes.
3. Stabilize queues/pools; stop the retry storm.
4. Post-incident: collect the timeline, root causes, and tasks (fix-forward/contract fixes).
Launch self-check table (short)
Frequently asked "what if..."
... the provider does not support Idempotency-Key?
Store 'hash(body)' + 'partner_request_id' and implement your own idempotency.
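A minimal sketch of that home-grown idempotency, assuming JSON bodies: the key is derived from the partner request ID plus a hash of the canonicalized body. The function names and the in-memory `_seen` dictionary (a stand-in for a table with a unique index) are hypothetical.

```python
import hashlib
import json

def derive_idempotency_key(partner_request_id: str, body: dict) -> str:
    """Stable key: the same request ID and body always give the same key."""
    # Canonical JSON so that key order and whitespace do not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return f"{partner_request_id}:{digest}"

_seen: dict[str, dict] = {}  # stand-in for a table with a unique index

def send_once(partner_request_id: str, body: dict, send) -> dict:
    """Call send(body) only for keys that have no stored result yet."""
    key = derive_idempotency_key(partner_request_id, body)
    if key not in _seen:
        _seen[key] = send(body)
    return _seen[key]
```

This does not cover the case where `send` times out after the provider has already processed the request; nothing is stored then, so that gap is left to reconciliation.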
... webhooks sometimes arrive "before" the response?
Join on 'operation_id' and temporarily hold an "unknown → reconcile" status; a periodic reconciler will close the discrepancies.
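The periodic reconciler can be sketched as one pass over operations still parked as "unknown". The `reconcile` function and the `fetch_status` callback are assumptions for illustration; in practice `fetch_status` would query the provider's status endpoint by 'operation_id'.

```python
def reconcile(operations: dict, fetch_status) -> list[str]:
    """Resolve operations parked as 'unknown' by asking the provider.

    operations: {operation_id: status}, updated in place.
    fetch_status(operation_id) returns the provider's status, or None if
    the provider does not know the operation yet.
    """
    resolved = []
    for op_id, status in operations.items():
        if status != "unknown":
            continue
        actual = fetch_status(op_id)
        if actual is not None:
            operations[op_id] = actual  # value update only, safe while iterating
            resolved.append(op_id)
    return resolved

ops = {"op_1": "unknown", "op_2": "succeeded", "op_3": "unknown"}
provider_view = {"op_1": "succeeded"}
closed = reconcile(ops, provider_view.get)
```

Operations the provider still does not know about stay "unknown" and are picked up again on the next pass.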
... you need to support both old and new clients?
Version the endpoints ('/v1' and '/v2'), route by header/URI, keep backward compatibility for at least N months.
Summary
Integration failures are almost always about the same things: no idempotency, wrong timeouts and retries, weak webhook signatures, contract drift, and lack of visibility. Lock down contracts in advance, build in observability, set limits/backpressure, sign all external interactions, and run synthetics. Then, even when partners fail, your release stays manageable - without money lost in retries and without a sleepless night for the whole team.