Architect Decisions

Blocking Now — One-way doors that must be set before the first migration or route is written

Blocking Now 01

Database naming conventions and migration patterns

Every engineer will write Drizzle migrations. Without a shared convention established before the first table is created, the schema becomes inconsistent and migrations become a source of ongoing friction. This is a one-way door — renaming columns after data exists requires a migration, downtime coordination, and ORM changes across every query that references them.

-- Recommended conventions (establish in packages/db/CONVENTIONS.md)
table naming: snake_case, plural -- work_orders, asset_categories
column naming: snake_case -- created_at, tenant_id, is_active
UUID strategy: UUIDv7 -- time-sorted, better index locality than v4
timestamps: created_at, updated_at -- on EVERY table, non-nullable
soft deletes: deleted_at TIMESTAMPTZ -- nullable; NULL = active
tenant column: tenant_id UUID NOT NULL -- on every tenant-scoped table
FK naming: fk_{table}_{ref} -- fk_work_orders_tenant
-- WARNING: UUIDv4 primary keys cause index fragmentation at scale.
-- Switch to UUIDv7 from day one; retrofitting requires a table rewrite.

Recommended — Adopt conventions above

Document in packages/db/CONVENTIONS.md, enforce with a Drizzle custom linter in CI, include in the PR template checklist. UUIDv7 is available via the uuidv7 npm package and supported natively in Postgres 17 (use the package for Postgres 16).

Alternative — camelCase columns via Drizzle mapping

Drizzle can map snake_case DB columns to camelCase JS properties automatically. Still recommend snake_case at the DB level for compatibility with raw SQL, psql sessions, and migration scripts.

Blocking Now 02

How is the tenant ID passed to PostgreSQL for Row-Level Security?

RLS policies filter rows using a session variable (app.current_tenant). The method used to set this variable determines which connection pooling mode is compatible. This decision directly constrains the connection pooling choice and must be made before writing any RLS policy or database middleware.

Option A ★ Recommended

SET LOCAL app.current_tenant = $tenantId inside each transaction. Transaction-scoped — the variable resets automatically when the transaction ends. Compatible with PgBouncer transaction mode and RDS Proxy. Requires every DB operation to be wrapped in an explicit transaction, which is good practice anyway.

Option B

SET app.current_tenant = $tenantId per connection (session-scoped). Simpler to implement but requires session-mode pooling or connection pinning. Connection pinning reduces pooling efficiency — effectively one connection per active request at peak load.

Blocking Now 03

What are the OpenAPI contract standards before the first endpoint is written?

API contracts are frozen the moment an external consumer depends on them. Establishing URL structure, pagination shape, error format, and versioning strategy before the first route is written prevents the most expensive kind of breaking change: restructuring a live API that customers have already integrated against (VSXORG-246).

Recommended standards

URL structure: /v1/work-orders (URL versioning, hyphenated resources)
Pagination: Cursor-based with { data: [], nextCursor, hasMore } envelope
Errors: RFC 7807 Problem Details — { type, title, status, detail, instance }
Versioning: URL prefix (/v1/) — explicit, cacheable, easy to route
Auth header: Authorization: Bearer <jwt> on all routes
Timestamps: ISO 8601 with UTC timezone in all responses

Alternative — offset pagination

Offset pagination (?page=2&limit=50) is simpler to implement but breaks with concurrent inserts and is inefficient on large tables. Work order lists will grow large. Use cursor pagination from day one.

Blocking Now 04

What are the monorepo package boundaries?

The boundary between packages defines where business logic lives and who can import what. Blurring these boundaries — especially putting database access in packages/core or HTTP logic in domain packages — creates a tangled codebase that becomes expensive to test and impossible to reuse across apps. This must be defined before the first engineer writes a business rule.

packages/core

Business rules, domain types, Zod validation schemas, service layer interfaces

No DB access. No HTTP. No framework imports.

packages/db

Drizzle schema, migrations, query helpers, RLS middleware, connection factory

No business logic. No HTTP. Exposes typed query functions only.

apps/api

Hono routes, request/response mapping, auth middleware, OpenAPI spec

No business logic inline. Calls packages/core services. No raw SQL.

packages/ui

Shared React/React Native components, design tokens, accessible primitives

No API calls. No business logic. Pure presentation.

Recommended — enforce with ESLint import rules

Use eslint-plugin-import with custom no-restricted-imports rules to prevent packages/core from importing packages/db, and prevent apps/api from importing Drizzle directly. Violations fail CI. Document the boundaries in ARCHITECTURE.md at the monorepo root.

Alternative — loose boundaries, enforced by convention

Rely on code review to enforce package boundaries. Works for small teams early on; breaks down as the team grows or during high-pressure sprint work.

Needed Soon — Resolve before Phase 3 mobile and background job work begins

Needed Soon 05

Offline sync conflict resolution strategy for WatermelonDB

WatermelonDB requires a server-side sync endpoint that accepts a list of local changes and returns a list of server changes. Conflicts occur when the same record is modified both offline on the device and on the server. For MainTrac, the conflict-prone areas are work order status transitions and combination log entries — both have business rules that make simple last-write-wins incorrect.

Field / Record type	Strategy	Rationale
Work order status	Server wins	Status transitions have business rules (Complete cannot revert to In Progress). Server is authoritative.
Work order assignment (staff)	Server wins	Assignments may have changed while the field worker was offline. Server state is correct.
Combination log entries	Last write wins	Log entries are append-only in practice. Conflicts are rare; LWW is acceptable.
Work order notes / description	Last write wins	Free text; the most recent edit is the intended value in almost all cases.
Asset readings / meter values	Last write wins + append	Both readings are valid data points. Preserve both as separate log entries rather than overwriting.

Recommended — implement hybrid strategy above

The WatermelonDB sync endpoint receives change sets. Before applying client changes, check each work order status field against the server value. If the server status is terminal (Complete, Cancelled), reject the client's status change and return the server value. For all other fields, apply last-write-wins. Document rejected changes in a sync_conflicts log table for audit purposes.

Needed Soon 06

Background job architecture: BullMQ, SQS, or in-process cron?

PM plan auto-generation, work order email notifications, and report generation are background tasks that cannot run synchronously in the request/response cycle. The architecture must be decided before any of these features are implemented — the job framework is a shared infrastructure concern used by multiple packages.

Option A ★ Recommended — BullMQ + Redis

BullMQ runs in the API container, backed by a Redis instance (AWS ElastiCache or Upstash). Handles retries, delays, priorities, and job scheduling. Well-documented, strong TypeScript support, no new AWS services to manage beyond Redis. Migrate to SQS + Lambda if job volume exceeds what a single container can handle.

Option B — AWS SQS + Lambda

Serverless, scales to zero, no Redis dependency. Higher cold start latency for time-sensitive notifications, more IAM complexity, harder to develop locally without LocalStack.

Option C — In-process cron (node-cron)

Zero infrastructure, zero cost. No persistence — if the container restarts, in-flight jobs are lost. Not acceptable for PM plan generation or report jobs that may take minutes.

Needed Soon 07

Observability stack: what gets logged, traced, and alerted on?

Observability must be designed before any code goes to production — retrofitting structured logging onto an existing codebase is tedious and inconsistent. The stack also affects the cost model: Datadog is comprehensive but expensive at scale; a self-assembled stack requires more setup but offers better cost control.

Option A ★ Recommended — Layered approach

Error tracking: Sentry (excellent React Native SDK, session replay, source maps)
Structured logs: Pino (fast, JSON output) → CloudWatch Logs
Infrastructure metrics: CloudWatch (ECS CPU/memory, RDS connections, queue depth)
Tracing: Add OpenTelemetry in Phase 2 once the critical paths are identified
Total cost: ~$26/mo Sentry Team + CloudWatch usage. Scales well.

Option B — Datadog

Unified platform: logs, metrics, traces, APM, mobile RUM. Excellent product. ~$200–600/mo at this scale. Justified if VSI already has a Datadog contract.

Option C — OpenTelemetry + Grafana Cloud

Open standard, portable, generous free tier. Requires more setup to configure the collector, dashboards, and alert rules. Strong long-term choice if VSI wants to avoid vendor lock-in.

Needed Soon 08

What custom claims does the JWT carry?

Clerk issues JWTs on every request. What the JWT contains determines whether the API can resolve tenant, user, and permissions from the token alone or must make a database round-trip on every hot-path request. The claims structure is set at the auth provider level and changing it requires a coordinated update of every endpoint that reads claims.

Option A — Minimal claims

sub only — look up tenant, role, and permissions from the database on every request. Simple token, database round-trip on every authenticated call. Adds ~20–50ms latency on every request before any business logic runs.

Option B ★ Recommended — Rich claims

Include tenantId, role, and the 10–15 most-checked permission codes (e.g., work_orders:write, assets:read, reports:generate) in the JWT. The API resolves authorization entirely from the token on hot paths. Permissions are re-issued at next login if they change — acceptable for a B2B product where permission changes are infrequent.

Before Launch — Resolve before production traffic

Before Launch 09

What gets cached in Redis, and what always hits Postgres?

Premature caching adds complexity and introduces cache invalidation bugs. But some data — reference tables, permission lookups — is read thousands of times per hour and changes rarely. Defining the caching boundary before launch prevents both over-caching (which causes stale data bugs) and under-caching (which causes unnecessary DB load at scale).

Option A — No caching in Phase 1

Simplest, easiest to reason about. Revisit after load testing reveals actual hot paths. Risk: if a reference data query is called on every work order list render, it will hit Postgres thousands of times per minute.

Option B ★ Recommended — Cache reference data + permission lookups

Cache (Redis, TTL 1 hour): task codes, asset categories, GL codes, PM schedules — read-heavy, updated by admins only, rarely change.
Cache (Redis, TTL 5 min): resolved permission sets per user (if not using rich JWT claims).
Never cache: work orders, combination logs, asset readings — multi-tenant read-write data; caching requires per-tenant key namespacing and invalidation on every write, which adds more complexity than it saves at this scale.

Before Launch 10

What are the RTO and RPO targets for production?

Disaster recovery targets must be set before launch because they drive infrastructure decisions: Multi-AZ RDS adds ~$150–300/mo but reduces recovery time from hours to minutes. Government customers may have contractual RTO/RPO requirements that are more stringent than the default. Define these now so the infrastructure is built to meet them, not retrofitted after a contract is signed.

Option A — Minimal (RTO 24h / RPO 24h)

Daily automated backups only. Lowest cost. Unacceptable for customers whose operations depend on the system being available during business hours.

Option B ★ Recommended — Standard (RTO 4h / RPO 1h)

RDS automated backups with 1-hour backup window + Multi-AZ standby replica for failover. Covers the vast majority of failure scenarios within a business day. Verify with government/CNIC customers that their contracts do not require tighter targets before finalizing.

Option C — Premium (RTO 1h / RPO 15min)

Multi-AZ RDS + read replica + point-in-time recovery. ~$300–500/mo additional infrastructure cost. Appropriate for the dedicated compliance tier — include in the dedicated tier pricing model (Leadership #7).