Five critical API integration errors at launch

Error number 1. No idempotency and a retry "storm"

Symptoms: duplicate orders/payments, mismatched amounts, disputed refunds, a growing number of DLQ alerts.

Root: repeated delivery of requests/webhooks and network flaps are normal. If the "create/charge" operation is not idempotent, retries multiply the damage.

How to fix

Idempotency-Key/'operation_id' on all unsafe methods (POST/PATCH).

Unique index in the database on 'operation_id'. On replay, return the previous result.

Webhooks go through the Inbox table (dedupe by 'event_id + signature'). Outbound events go through the Outbox.

Retries: at most 1-2, exponential backoff + jitter, only for operations that are safe to repeat.

HTTP convention (example):

http
POST /v1/payments
Idempotency-Key: ik_f35a2
Content-Type: application/json

{"amount": 5000, "currency": "EUR", "source": "card_..."}

SQL protection (simplified):

sql
ALTER TABLE payments ADD CONSTRAINT uniq_op UNIQUE (operation_id);
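To make the "replay returns the previous result" rule concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the `create_payment` helper are assumptions made for illustration; a production database and schema will differ.

```python
import sqlite3

def create_payment(conn, operation_id, amount, currency):
    """Insert a payment once; on replay, return the row stored the first time."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (operation_id, amount, currency) VALUES (?, ?, ?)",
                (operation_id, amount, currency),
            )
    except sqlite3.IntegrityError:
        pass  # UNIQUE violation: same operation_id, so this is a replay
    return conn.execute(
        "SELECT id, operation_id, amount, currency FROM payments WHERE operation_id = ?",
        (operation_id,),
    ).fetchone()

# Hypothetical in-memory table with the unique constraint on operation_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "id INTEGER PRIMARY KEY, operation_id TEXT NOT NULL UNIQUE, "
    "amount INTEGER NOT NULL, currency TEXT NOT NULL)"
)
first = create_payment(conn, "ik_f35a2", 5000, "EUR")
replay = create_payment(conn, "ik_f35a2", 5000, "EUR")  # same key, no second row
```

The database constraint, not application code, is what rejects the duplicate, so the guarantee still holds when two replays race each other.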

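Retries with jitter (pseudocode):

python
import random, time

def call_with_retry(payload):
    for i in range(2):  # at most two attempts
        try:
            return call_api(payload, timeout=0.6)
        except Timeout:
            time.sleep(0.05 * 2**i + random.uniform(0, 0.05))  # exponential backoff + jitter
    raise UpstreamUnavailable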

Checklist:

All "monetary/creating" logic has an 'operation_id' and a unique index.
Inbound webhooks go only through the Inbox, with an idempotent worker.
The client SDK sets the Idempotency-Key automatically.
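The Inbox item in the checklist can be sketched as follows. This is an illustration only, assuming a hypothetical `inbox` table keyed by ('event_id', 'signature') and a caller-supplied `handler`; the function names are not from any library.

```python
import sqlite3

def open_inbox():
    """Create an in-memory inbox table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inbox ("
        "event_id TEXT NOT NULL, signature TEXT NOT NULL, payload TEXT NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0, "
        "PRIMARY KEY (event_id, signature))"
    )
    return conn

def receive_webhook(conn, event_id, signature, payload):
    """Store the event; return False when the same event was already received."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO inbox (event_id, signature, payload) VALUES (?, ?, ?)",
        (event_id, signature, payload),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means a duplicate delivery

def process_inbox(conn, handler):
    """Run handler on each unprocessed event, then mark the event as processed."""
    rows = conn.execute(
        "SELECT event_id, signature, payload FROM inbox WHERE processed = 0"
    ).fetchall()
    for event_id, signature, payload in rows:
        handler(payload)  # should itself be idempotent: a crash here causes a re-run
        conn.execute(
            "UPDATE inbox SET processed = 1 WHERE event_id = ? AND signature = ?",
            (event_id, signature),
        )
        conn.commit()
    return len(rows)
```

The HTTP endpoint only calls `receive_webhook` and acknowledges immediately; the slower business logic runs in the worker that calls `process_inbox`.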

Error number 2. Timeouts/retries vs. SLO: overheating the dependency

Symptoms: p95 suddenly drifts upward, queues grow, the circuit breaker keeps tripping.

Root: the overall response SLO is 400-600 ms, yet timeouts to external APIs are 1-2 s, with up to 3 retries on top. You take longer than the budget allows, and you hammer the dependency with repeats.

How to fix

Timeout budget: if the SLO is 400 ms, the upstream timeout is 250-300 ms; the total time across all attempts ≤ the request SLO.

Limits/backpressure: semaphores/worker pools for calls to each dependency. When the pool is full → return 429/503 immediately.

Circuit breaker: 'open' on timeouts/5xx, 'half-open' lets through a metered number of probe requests.

Admission control: limit concurrency (per thread, per endpoint/PSP).
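The budget arithmetic above can be sketched in a few lines. The `Deadline` class and `worst_case_latency` helper are hypothetical names introduced for illustration; the idea is that each upstream call receives only the time still left in the request budget.

```python
import time

def worst_case_latency(timeout_s, attempts, backoff_s=0.0):
    """Upper bound on time spent if every attempt runs into its timeout."""
    return attempts * timeout_s + (attempts - 1) * backoff_s

class Deadline:
    """Tracks how much of the request budget is left."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires - self._clock())

    def timeout_for_call(self, cap_s):
        """Per-call timeout: the configured cap, but never more than what is left."""
        return min(cap_s, self.remaining())

# A 2 s timeout with 3 attempts can take 6 s, far beyond a 0.4 s SLO.
too_slow = worst_case_latency(2.0, 3) > 0.4
```

A handler would create `Deadline(0.4)` on entry and pass `deadline.timeout_for_call(0.3)` to each outbound call, skipping a retry when `remaining()` is already zero.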

Example (Go):

go
var sem = make(chan struct{}, 64) // concurrency limit for calls to the PSP

func callPSP(ctx context.Context, req Req) (Res, error) {
	select {
	case sem <- struct{}{}:
		defer func() { <-sem }()
		c, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
		defer cancel()
		return psp.Do(c, req)
	default:
		return Res{}, ErrBusy // fail immediately instead of queueing without bound
	}
}
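The circuit breaker from the list above can be sketched in the same spirit. This is a simplified, single-threaded illustration in Python; the class, the thresholds and the `CircuitOpen` exception are assumptions, not the API of any particular library.

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"  # cool-down elapsed: a probe call is allowed
        return "open"

    def call(self, fn):
        state = self.state()
        if state == "open":
            raise CircuitOpen("dependency is failing, rejecting immediately")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the breaker.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # any success closes the breaker
        return result
```

A concurrent service would additionally need a lock and a cap on simultaneous half-open probes; both are omitted here to keep the state machine visible.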

Checklist:

Timeouts are shorter than the SLO; retries ≤ 2; jitter is enabled.
Pools/semaphores for external APIs; circuit breaker with metrics.
On overloaded routes we return 429/Retry-After instead of holding connections open.

Error number 3. Weak Security: Webhook Signatures, Secrets, TLS

Symptoms: forged ("someone else's") webhooks get through, secrets end up in code/logs, MITM risks.

Root: no signature/freshness check, secrets live in env files, outdated TLS and weak headers.

How to fix

Webhook signatures: HMAC-SHA256 + 'X-Timestamp' (window ≤ 5-10 minutes), constant-time signature comparison.

mTLS for critical integrations or IP allow-list.

Secret rotation via Vault/Cloud KMS; least privilege; audit of secret reads.

TLS 1.2/1.3 only, HSTS, correct CORS (narrow origin list).

Signature verification (Python):

python
import hashlib, hmac, time

def verify(sig_hdr, ts_hdr, body, secret):  # secret: bytes, body: str
    if abs(time.time() - int(ts_hdr)) > 600:  # freshness window: 10 minutes
        raise Expired()
    calc = hmac.new(secret, (ts_hdr + "." + body).encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(calc, sig_hdr):  # constant-time comparison
        raise BadSig()
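For testing the verifier it helps to have the sender's side as well. The sketch below shows a sign/verify round trip with the same "timestamp.body" scheme; the function names, the boolean return style and the injectable `now` parameter are assumptions chosen to make the example self-contained.

```python
import hashlib
import hmac
import time

def sign(secret: bytes, timestamp: str, body: str) -> str:
    """Sender side: HMAC-SHA256 over 'timestamp.body', hex encoded."""
    return hmac.new(secret, f"{timestamp}.{body}".encode(), hashlib.sha256).hexdigest()

def is_valid(secret, sig_hdr, ts_hdr, body, now=None, window_s=600):
    """Receiver side: check freshness first, then compare in constant time."""
    now = time.time() if now is None else now
    try:
        ts = int(ts_hdr)
    except ValueError:
        return False  # malformed timestamp header
    if abs(now - ts) > window_s:
        return False  # stale or future-dated: possible replay
    expected = sign(secret, ts_hdr, body)
    return hmac.compare_digest(expected.encode(), sig_hdr.encode())

secret = b"test-secret"          # illustrative value only, never hard-code real secrets
ts = "1700000000"
body = '{"event_id": "evt_1"}'
signature = sign(secret, ts, body)
```

Signing the timestamp together with the body is what prevents an attacker from replaying an old, validly signed payload under a fresh timestamp.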

Checklist:

All webhooks are signed and verified; the freshness window is limited.
Secrets are in KMS/Vault, with rotation and auditing.
TLS/HSTS enabled; CORS restricted to specific origins; IP allow-list/mTLS where appropriate.

Error number 4. Contract drift: the schema "took on a life of its own"

Symptoms: production broke "only for some clients," 500/422 in the logs, SDK and API versions disagree with each other.

Root: no strict description of contracts, backward-incompatible changes, "silent" fields, different meanings for the same names.

How to fix

Contract-first: OpenAPI/AsyncAPI + server/client generation; for events - Avro/Protobuf + Schema Registry.

Versioning: 'v1 → v2' (URI/header), deprecation plan, grace period.

Backward compatibility: only additive changes in minor releases; nothing may be deleted/renamed without a version bump.

Contract tests: Pact/Buf - provider and consumer are tested in CI.

Example:

yaml
# OpenAPI: explicit type for an amount in minor units
amount_minor:
  type: integer
  minimum: 0
  description: Amount in minor currency units (integer)
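The "additive changes only" rule can be checked mechanically. Real projects would normally rely on a dedicated schema diff tool in CI; the function below is only a simplified sketch of the idea, comparing two property maps shaped like the OpenAPI fragment above. The name `breaking_changes` and the sample schemas are hypothetical.

```python
def breaking_changes(old_props, new_props):
    """List removed fields and changed types between two schema property maps."""
    problems = []
    for name, old in old_props.items():
        new = new_props.get(name)
        if new is None:
            problems.append(f"removed field: {name}")
        elif new.get("type") != old.get("type"):
            problems.append(
                f"type changed: {name} {old.get('type')} -> {new.get('type')}"
            )
    return problems  # fields that exist only in new_props are additive, so allowed

v1 = {"amount_minor": {"type": "integer"}, "currency": {"type": "string"}}
v1_1 = {**v1, "description": {"type": "string"}}                     # additive
v1_bad = {"amount": {"type": "number"}, "currency": {"type": "string"}}  # rename
```

A CI step can fail the build whenever the returned list is non-empty. Note that this sketch ignores other breaking changes, such as a new required request field or a narrowed enum.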

Checklist:

Contracts are stored in git; CI validates them and fails the build on incompatible changes.
Schema registry for events, backward/forward compatibility.
Changelog page, deprecation dates, test environment for partners.

Error number 5. "Blind" launch: no metrics/logs/traces and no sandbox

Symptoms: "nothing is visible," support is flooded, debugging is done by hand in prod.

Root: observability was not built in, there are no synthetics, the sandbox was tested "in words only."

How to fix

RED/USE metrics: rate/errors/latency on each endpoint, by route/method.

Correlation: 'trace_id' in all logs and responses; link each request to its webhook (request ↔ webhook).

Synthetics: health tests (login/deposit in the sandbox), SLA monitoring T + 60 for webhooks.

Sandbox/stage: completely isolated keys/domains, fake PSPs, records marked "not included in reports."

Response with trace ID:

http
HTTP/1.1 202 Accepted
Trace-Id: 7f2b3d8e9c1a4
Location: /v1/ops/req_42/status
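On the logging side, the same trace ID can be attached to every JSON log line. The sketch below uses only the standard logging module; the logger name, the field names and the `handle_request` function are assumptions for illustration.

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

def make_logger(stream):
    logger = logging.getLogger("integration")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace ID if present, otherwise generate a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    logger.info("payment accepted", extra={"trace_id": trace_id})
    return 202, {"Trace-Id": trace_id}

stream = io.StringIO()
status, headers = handle_request(make_logger(stream), "7f2b3d8e9c1a4")
```

Because the response header and the log line share one value, a support engineer can go from a partner's ticket straight to the matching log entries.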

Checklist:

RED/USE metrics, dashboards, alerts (symptoms + causes).
End-to-end traces; JSON logs, no PII, with 'trace_id'.
Synthetics from key regions; a sandbox is mandatory, with separate keys.

Prelaunch Plan (T-7 → T-0)

T-7 days:

Final contract review: check for incompatible changes; freeze the schemas.
Secrets/certificates: check rotation, access rights, KMS policies.
Synthetics run 24 × 7, alerts are routed to on-call.

T-3 days:

Mini load test (2-5 minute burst): p95/pools/queues stay in the green zone.
Webhook dry run (replays, 5xx, jitter), DLQ check.
Partner "phone book": L1/L2 contacts, war-room channel.

T-0:

Ramp traffic 5% → 25% → 50% through SLO gates; rollback is ready.
Kill switch/feature flags are in place for risky features.
War-room is active, status templates are prepared.
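The staged ramp with SLO gates can be expressed as a small control loop. This is a sketch only: `set_traffic_percent` and `slo_ok` stand in for whatever your load balancer API and monitoring query actually are.

```python
def ramp_traffic(stages, set_traffic_percent, slo_ok):
    """Raise traffic stage by stage; roll back to 0% when an SLO gate fails.

    Returns the traffic percentage that is live when the ramp stops.
    """
    live = 0
    for percent in stages:
        set_traffic_percent(percent)
        if not slo_ok(percent):
            set_traffic_percent(0)  # rollback: all traffic returns to the old version
            return 0
        live = percent
    return live

# Example with stand-in callables: the gate fails at the 25% stage.
applied = []
result = ramp_traffic(
    stages=(5, 25, 50),
    set_traffic_percent=applied.append,
    slo_ok=lambda percent: percent < 25,
)
```

In practice `slo_ok` would wait for a soak period and then compare p95 and error rate against the gate thresholds before letting the next stage start.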

Rollback plan (if anything goes wrong)

1. Switch traffic back to the previous stable version/route.

2. Turn off the feature flags for the contested changes.

3. Stabilize queues/pools, stop the retry storm.

4. Post-incident: collect the timeline, root causes and action items (fix-forward/contract fixes).

Launch self-check table (short)

Block	Question	Yes/No
Idempotency	Do all "creating" methods have an Idempotency-Key/'operation_id' and a unique index?
Retries/timeouts	Are timeouts shorter than the SLO, retries ≤ 2, jitter enabled?
Security	Are webhooks signed, the freshness window ≤ 10 minutes, secrets in KMS?
Contracts	Are OpenAPI/AsyncAPI contracts frozen, and does CI catch incompatibilities?
Observability	RED/USE, traces, synthetics T + 60, isolated sandbox?
Rollback	Is there a rollback button/kill switch and a communications plan?

Frequently asked "what if..."

... the provider does not support Idempotency-Key?

Store 'hash(body)' + 'partner_request_id' and implement idempotency on your side.

... webhooks sometimes arrive "before" the response?

Join on 'operation_id' and temporarily keep the "unknown → reconcile" status; a periodic reconciler will close the discrepancies.

... you need to support both old and new clients?

Version the endpoints ('/v1' and '/v2'), route by header/URI, keep backward compatibility for at least N months.
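One possible shape for such client-side idempotency is shown below. The key format, the `OutboundDeduper` class and the in-memory dictionary are all assumptions for illustration; a real system would persist the results in a database with a unique index.

```python
import hashlib
import json

def local_idempotency_key(partner_request_id, body):
    """Combine the partner request ID with a hash of the canonicalized body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return f"{partner_request_id}:{digest}"

class OutboundDeduper:
    """Remembers the result of each request so a repeat is not sent twice."""

    def __init__(self):
        self._results = {}  # in-memory for the sketch; use durable storage in practice

    def send_once(self, partner_request_id, body, send):
        key = local_idempotency_key(partner_request_id, body)
        if key in self._results:
            return self._results[key]  # replay: return the stored result
        result = send(body)
        self._results[key] = result
        return result
```

Canonicalizing the JSON before hashing means that two bodies with the same fields in a different key order map to the same idempotency key.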
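A minimal sketch of that reconcile flow, assuming hypothetical names (`OperationStore`, `parked`) and plain dictionaries instead of a database:

```python
class OperationStore:
    """Parks webhooks for operations we do not know yet and reconciles them later."""

    def __init__(self):
        self.ops = {}     # operation_id -> status of operations created locally
        self.parked = {}  # operation_id -> status from an early webhook ("unknown")

    def on_api_response(self, operation_id):
        """The synchronous API answer arrived: the operation now exists locally."""
        self.ops.setdefault(operation_id, "pending")

    def on_webhook(self, operation_id, status):
        if operation_id in self.ops:
            self.ops[operation_id] = status
        else:
            self.parked[operation_id] = status  # webhook beat the response: park it

    def reconcile(self):
        """Periodic job: apply parked webhooks whose operation has since appeared."""
        applied = 0
        for operation_id in list(self.parked):
            if operation_id in self.ops:
                self.ops[operation_id] = self.parked.pop(operation_id)
                applied += 1
        return applied
```

Webhooks that stay parked for too long are worth alerting on, since they may point to a lost API response or to an event for an operation that was never created.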

Summary

Integration failures are almost always about the same things: no idempotency, wrong timeouts and retries, weak webhook signatures, contract drift and lack of visibility. Fix contracts in advance, enable observability, set limits/backpressure, sign all external interactions and run synthetics. Then, even when partners fail, your release will remain manageable - without money lost to retries, and without a sleepless night for the whole team.