Inside the pipeline

3 minutes. Six stages. Built from scratch for boutique ultrasound.

The Boutique Ultrasound pipeline turns a 3D / 4D ultrasound still into a photoreal newborn portrait the studio can hand to a parent. It's not a generic image-to-image tool — every stage was rebuilt for the specific failure modes of fetal ultrasound data. Here's exactly what happens between upload and delivery.

The full path · upload → delivery

01 · Scan ingest · ~3 sec
02 · Vision pipeline · ~15 sec
03 · Conditioning + reference embedding · ~12 sec
04 · Diffusion rendering · ~130 sec
05 · Brand + safety overlay · ~12 sec
06 · Gallery delivery · ~8 sec

Time distribution

~3 min total

Stage 04 (diffusion rendering) dominates: that's where the actual painting happens. Every other stage takes about 15 seconds or less.
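The split is easy to check with a few lines of arithmetic. This is only a sketch using the rounded per-stage estimates quoted on this page; the dictionary and variable names are illustrative and not part of the product.

```python
# Rounded per-stage estimates from this page, in seconds.
STAGE_SECONDS = {
    "Scan ingest": 3,
    "Vision pipeline": 15,
    "Conditioning + reference embedding": 12,
    "Diffusion rendering": 130,
    "Brand + safety overlay": 12,
    "Gallery delivery": 8,
}

total = sum(STAGE_SECONDS.values())
render_share = STAGE_SECONDS["Diffusion rendering"] / total

print(f"Total: {total} s (~{total / 60:.0f} min)")          # Total: 180 s (~3 min)
print(f"Diffusion rendering share: {render_share:.0%}")     # Diffusion rendering share: 72%
```

On these estimates, rendering accounts for a little under three quarters of the wall-clock time.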

The core lift

From a noisy 3D scan to a photoreal portrait.

The scan is what the scanner gives you — low effective resolution, grayscale, heavy speckle noise. The portrait is what the model paints from it, conditioned by the structural cues we extract. Same baby, dramatically different image fidelity.

Input · what the scanner gives you
Source 3D ultrasound still
Effective res · ~300 × 300
Color depth · 8-bit gray
Signal · speckle + shadow
File size · ~80 KB

~47× the pixel count · structural cues preserved

Output · what the studio delivers
AI-rendered photoreal newborn portrait
Resolution · 2048 × 2048
Color depth · 24-bit RGB
Frames · 5 candidates
File size · ~2.4 MB

Artistic interpretation · not a biometric prediction. The model renders in the visual style of a newborn portrait at the pose and proportion the scan dictates.
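The ~47× figure follows directly from the two resolutions quoted above. A quick check, assuming the scan's effective resolution really is about 300 × 300 (the page gives it only as an approximation):

```python
scan_pixels = 300 * 300          # approximate effective resolution of the scan
portrait_pixels = 2048 * 2048    # delivered portrait resolution

ratio = portrait_pixels / scan_pixels
print(f"~{ratio:.0f}x more pixels")  # ~47x more pixels
```

The extra pixels are painted by the renderer, not recovered from the scan, which is why the output is labeled an artistic interpretation.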

What the vision stage sees

Structural landmarks, not invented detail.

The vision pipeline finds anchors on the 3D mesh — nose tip, eye orbits, chin curvature, cheek profile — and uses only those signals to constrain the renderer. Anything blurry or shadowed in the scan stays blurry in the conditioning. The model doesn't paint features the scan didn't see.

Figure · 3D ultrasound with vision pipeline landmark anchors

What stays. What gets discarded.

The vision stage extracts only signals it can verify on the mesh. The signals listed under "Kept" survive into the renderer; the ones under "Discarded" are dropped so the model has nothing to hallucinate from.

Kept
Facial landmark anchors (~30 points)
Surface normals + curvature
Lighting direction
Mesh confidence map (per-pixel)

Discarded
Pixel-level fine features
Speckle noise + acoustic shadow
Anything in low-confidence regions

This is why “the model can't paint what the scan didn't see” — half its inputs are explicitly discarded before rendering starts.
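To make the idea concrete, here is a minimal sketch of confidence masking. It is an illustration only: the 0.6 threshold, the function name, and the use of plain nested lists are all assumptions, and the production pipeline's actual cutoff and data layout are not published.

```python
CONFIDENCE_THRESHOLD = 0.6  # assumed cutoff for illustration; the real value is not published


def mask_low_confidence(values, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Return `values` with every low-confidence cell replaced by None ("don't paint here")."""
    return [
        [v if c >= threshold else None for v, c in zip(v_row, c_row)]
        for v_row, c_row in zip(values, confidence)
    ]


# Toy 2 × 2 curvature map and its per-pixel confidence map.
curvature = [[0.2, 0.8], [0.5, 0.1]]
confidence = [[0.9, 0.3], [0.7, 0.95]]

masked = mask_low_confidence(curvature, confidence)
print(masked)  # [[0.2, None], [0.5, 0.1]]
```

The cell with confidence 0.3 is dropped before anything downstream sees it, so the renderer has no value there to build on.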

Stage detail

Six stages, with the rendering work parallelized.

What happens between the upload button and the delivery email. Runtime estimates are rolling averages from production traffic — actual values fluctuate with GPU queue depth.

Scan ingest

~3 sec

The sonographer uploads the same 3D/4D still they would hand to the parent. We accept JPG, PNG, and MP4 exports from any modern scanner, including GE Voluson, Samsung HERA, Mindray DC-80, and Philips EPIQ. No driver, no plugin, no scanner integration.

Vision pipeline

~15 sec

Our computer-vision pipeline analyzes the 3D mesh: surface normals, facial landmark anchors (nose, eye orbits, chin curvature), lighting direction, mesh occlusions. These structural cues become the conditioning signal for the rendering model: they constrain pose, scale, and approximate proportion. The pipeline doesn't infer features it can't see.

Conditioning + reference embedding

~12 sec

Structural cues, the studio's brand tokens, and a curated reference embedding (drawn from a corpus of photographed newborns, collected with parental consent) are composed into the prompt and ControlNet inputs. This is where the model is told: 'paint a newborn at the pose and proportion the scan dictates, in the visual style of a real-life portrait.'
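As a rough picture of what "composed" means here, the sketch below bundles the three inputs into one object. Everything in it is hypothetical: the `Conditioning` class, the `compose_conditioning` function, the field names, and the example values are made up for illustration and do not describe the production schema.

```python
from dataclasses import dataclass, field


@dataclass
class Conditioning:
    """Hypothetical container for what the renderer receives."""
    prompt: str
    control_inputs: dict = field(default_factory=dict)


def compose_conditioning(structural_cues, brand_tokens, reference_embedding):
    """Combine the three inputs into a text prompt plus ControlNet-style inputs."""
    prompt = ", ".join(["photoreal newborn portrait", *brand_tokens])
    control_inputs = {
        "landmarks": structural_cues["landmarks"],  # pose and proportion anchors
        "normals": structural_cues["normals"],      # surface shape
        "reference": reference_embedding,           # visual style reference
    }
    return Conditioning(prompt=prompt, control_inputs=control_inputs)


cond = compose_conditioning(
    {"landmarks": [(120, 88), (140, 90)], "normals": "normals.png"},
    ["soft studio light", "cream backdrop"],
    [0.12, -0.4, 0.9],
)
print(cond.prompt)  # photoreal newborn portrait, soft studio light, cream backdrop
```

The point of the split is that structure comes only from the scan, while style comes only from the brand tokens and the reference embedding.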

Diffusion rendering

~130 sec

A diffusion model (fine-tuned on a custom newborn-portrait corpus, not a stock SDXL fork) generates 5 candidate frames in parallel. Our scoring head picks the frame with the highest fidelity to the conditioning as the primary and holds the other 4 as alternates in case the studio wants to regenerate. Average end-to-end time stays around 3 minutes because the candidates render concurrently rather than one after another.
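The pick-one-keep-four step can be sketched as a simple ranking. This is an assumption-laden illustration: the `pick_primary` function and the fidelity numbers are invented, and the real scoring head is a learned model, not a lookup table.

```python
def pick_primary(candidates, score):
    """Return (primary, alternates): the top-scoring frame, then the rest, best first."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[1:]


# Made-up fidelity-to-conditioning scores for five candidate frames.
fidelity = {
    "frame_a": 0.81,
    "frame_b": 0.93,
    "frame_c": 0.77,
    "frame_d": 0.88,
    "frame_e": 0.90,
}

primary, alternates = pick_primary(list(fidelity), fidelity.get)
print(primary)     # frame_b
print(alternates)  # ['frame_e', 'frame_d', 'frame_a', 'frame_c']
```

Keeping the alternates ranked means a studio that rejects the primary sees the next-best frame first.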

Brand + safety overlay

~12 sec

A cobranded (Growth) or full white-label (Studio) overlay is applied: frame, logo, color. Every output gets the mandatory "Artistic interpretation · Not a medical prediction" tag baked into both the image metadata and the delivery gallery copy. This isn't a UI toggle the studio can remove.
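One way to guarantee a tag that studio settings cannot remove is to write it last, after all studio-controlled fields have been merged. The sketch below shows that ordering; the `build_delivery_metadata` function and the field names are hypothetical and not the actual overlay code.

```python
DISCLAIMER = "Artistic interpretation · Not a medical prediction"


def build_delivery_metadata(studio_settings):
    """Merge studio branding into the output metadata; the disclaimer always wins."""
    metadata = dict(studio_settings)     # studio-controlled fields (logo, color, ...)
    metadata["disclaimer"] = DISCLAIMER  # written last, so no studio setting can blank or drop it
    return metadata


# Even a studio setting that tries to blank the tag is overridden.
meta = build_delivery_metadata({"logo": "studio.png", "disclaimer": ""})
print(meta["disclaimer"])  # Artistic interpretation · Not a medical prediction
```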

Gallery delivery

~8 sec

The branded gallery URL ships to the parent via email + SMS. They can forward it to grandparents, post to Instagram, or print the favorite. Every share carries the studio's brand and the artistic-interpretation tag.

The hard problems

Why we couldn't just point a stock model at it.

3D ultrasound + newborn portraiture has failure modes that off-the-shelf image AI hits hard. Here's what we had to solve, and how.

3D ultrasound is noisy

Real scans are full of acoustic shadow, partial occlusion, and surface artifacts the diffusion model would otherwise hallucinate into real features. The vision stage explicitly masks low-confidence regions and tells the renderer "don't paint here."

Newborn proportions aren't adult proportions

Out-of-the-box SDXL tends to drift toward adult features when asked for "a baby." Our fine-tune corpus and conditioning prompts enforce neonatal proportion (eye-to-skull ratio, cheek fat, ear placement), and we reject any output that drifts toward adult features.

Skin-tone fidelity matters

Ethnicity / skin tone is parent-supplied at upload — not inferred from the scan (the scan can't see it anyway). The conditioning prompt incorporates this explicitly, which avoids the common AI failure of defaulting to a single ethnicity.

Variance has to be controllable

Studios need consistent output across sessions. Our scoring head + seed pinning let the studio regenerate with an adjusted style without losing the structural conditioning. Same scan + same brand tokens + a new seed → similar but never identical output.
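Seed pinning in general works like this: the random starting noise is drawn from a generator initialized with a fixed seed, so the same seed reproduces the same starting point and a new seed gives a new variation. The sketch below uses Python's standard `random` module as a stand-in; the `initial_noise` function is hypothetical and is not the production sampler.

```python
import random


def initial_noise(seed, size=4):
    """Draw the starting noise for a render from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


pinned = initial_noise(seed=1234)
repeat = initial_noise(seed=1234)   # same seed: identical starting noise
variant = initial_noise(seed=1235)  # new seed: a different variation

print(pinned == repeat)   # True
print(pinned == variant)  # False
```

In a setup like this, the structural conditioning stays fixed across regenerations while the seed controls which variation is drawn.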

Honest narrowness

What this technology is not.

Studios sometimes assume more capability than the system actually has. Here's what the pipeline explicitly does NOT do — so you can frame it correctly to parents.

We don't predict what the baby will look like

The output is an artistic interpretation in the visual style of a newborn portrait. It is not a biometric prediction. Two different scans of the same baby will produce two different portraits.

We don't perform medical analysis

Nothing in the pipeline detects, diagnoses, screens for, or flags fetal anomalies. Every output is labeled non-diagnostic. Clinical questions belong with the parent's OB/GYN.

We don't train on customer scans

Customer scans uploaded through the studio dashboard are processed, then deleted on a 90-day retention cycle. They never enter the training corpus for our model or any third-party model.

Data + security

How we handle the scan you upload.

Encrypted in transit + at rest

TLS 1.3 + AES-256. Scans live in a private storage bucket; only the rendering pipeline has read access.

90-day retention

Scan inputs are deleted on a 90-day rolling window. Generated portraits stay in the studio's dashboard until the studio deletes them.
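A rolling retention window boils down to comparing each scan's upload time against the cutoff. A minimal sketch, assuming deletion is due once a scan is 90 or more days old (the exact boundary and the `is_due_for_deletion` name are assumptions):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)


def is_due_for_deletion(uploaded_at, now=None):
    """True once a scan is at least 90 days old (boundary handling is an assumption)."""
    now = now or datetime.now(timezone.utc)
    return now - uploaded_at >= RETENTION


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_scan = datetime(2025, 1, 1, tzinfo=timezone.utc)     # 151 days old
recent_scan = datetime(2025, 5, 1, tzinfo=timezone.utc)  # 31 days old

print(is_due_for_deletion(old_scan, now))     # True
print(is_due_for_deletion(recent_scan, now))  # False
```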

Never used for training

Customer scans don't enter our training corpus. The reference embeddings the model uses are sourced separately, with parent consent.

See it run on a real scan in 10 minutes.

Pilot is $39 for the month — enough deliveries to ship the workflow end-to-end and decide if the output quality earns its place in your packages. Full refund within 7 days if it doesn't.

Start a 7-day pilot · $39