Blog · Jun 29, 2026

When AI Second Opinions Go Medical: What Builders Should Know About Claude Code and MRI Analysis

How Claude Code's unexpected medical imaging capabilities reveal both the promise and perils of generalist AI in specialized domains.

By Craig Mason · 7 min read

#ai-trends #healthtech #claude-code

A radiologist uploads their patient’s MRI scan to Claude Code, asking for a second opinion on a subtle shadow near the hippocampus. The AI highlights three potential interpretations, none definitive, but all clinically plausible, along with citations to recent papers the doctor hadn’t seen. This isn’t science fiction; it’s one of dozens of similar cases bubbling up on Hacker News this week under the thread “I used Claude Code to get a second opinion on my MRI.”

The short version

Claude Code’s emergent capability to analyze medical imaging, despite no explicit training for it, signals a broader trend: generalist AI models are quietly absorbing domain-specific competencies faster than safety guardrails can adapt. For builders, this creates both opportunity (augmenting experts with AI assistants) and liability (unregulated diagnostic tools). The immediate takeaway? Treat all AI outputs as probabilistic suggestions, never deterministic answers.

Why is this happening now?

Three factors collided this month. First, healthcare’s chronic second-opinion gap (conflicting scan interpretations remain common in radiology) makes AI augmentation irresistible to overwhelmed practitioners. Radiologists routinely face borderline cases where a fresh perspective could catch something subtle. When human specialists disagree, another voice in the room becomes valuable, even if that voice is synthetic.

Second, Claude Code’s large context window lets it ingest entire imaging studies alongside relevant literature. A radiologist can paste not just the scan description but also the patient’s symptom timeline, prior imaging reports, lab values, and current research papers. This mimics how human consultations actually work: messy, context-rich, and rarely based on a single image in isolation. The model can cross-reference patterns across all this information in ways that feel eerily like clinical reasoning.

Third, a critical mass of doctors began experimenting off-label after noticing incidental radiology knowledge during routine coding help. A physician asks Claude to debug their imaging pipeline script and casually mentions what the scan shows. The model responds with medically coherent observations. Word spreads through professional networks. Suddenly, people start testing it deliberately on real cases, and the results are compelling enough to share publicly.

Crucially, this wasn’t designed. Anthropic didn’t market Claude Code as a radiology assistant. It emerged because foundation models trained on vast internet corpora inevitably absorb medical textbooks, radiology atlases, research papers, and clinical case discussions. The capability was latent, waiting for someone to probe it.

How reliable is AI for medical imaging analysis?

Not reliable enough to operate autonomously, but surprisingly useful as a consultative tool. Unlike specialized AI radiology tools that output binary “cancer/no cancer” calls, Claude Code generates differential diagnoses: a list of possibilities ranked by likelihood, much like a human expert would. This matches real-world clinical workflows better than deterministic systems, but introduces new risks when over-trusted.

The distinction matters. A doctor doesn’t want the AI to say “definitely glioblastoma” when dealing with ambiguous imaging. They want “consistent with high-grade glioma, but consider metastasis, lymphoma, or inflammatory lesion given the presentation.” That’s the language of actual clinical reasoning, where certainty is earned through triangulation, not declared upfront.

However, the model’s perceptual grounding is weak. In most of these exchanges it doesn’t “see” the scan the way a radiologist does; it processes descriptions, metadata, and text-based representations of imaging findings. When a user describes a lesion as “hyperintense on T2-weighted images with restricted diffusion,” Claude reasons about what that pattern typically indicates, but it isn’t analyzing raw DICOM files pixel by pixel. Some users do feed in screenshots or converted images, and the model’s vision capabilities engage, but accuracy in that mode is uncharted territory.

Errors tend toward plausibility rather than nonsense. The model won’t suggest a brain tumor for a femur fracture. But it might miss subtle signs a trained eye would catch, or rank a rare diagnosis too high based on incomplete information. It also lacks clinical intuition about patient-specific factors: age, comorbidities, how sick the person looks. Medicine is full of rules that have exceptions (“young patients rarely get this, but…”), and AI struggles with those judgment calls.

What does this mean for builders working with AI?

If you’re building tools in this space, three realities demand attention.

Cost: Running large context windows on high-resolution medical images isn’t cheap. Expect API costs several multiples higher than text-only workflows. A single comprehensive case review might involve multiple back-and-forth exchanges, each with substantial token usage. For a production tool serving even dozens of cases per day, monthly API bills climb fast. This impacts who can afford to experiment and which business models make economic sense. Subscription pricing needs to reflect the underlying compute burden, or margins evaporate.
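To make the cost point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the `CaseProfile` fields, the function name, the token counts, and especially the per-million-token prices, which are placeholders rather than any provider’s actual rates. It also treats every exchange as the same size, which likely understates cost in long conversations where earlier turns are resent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseProfile:
    """Hypothetical token footprint of one case review."""
    input_tokens_per_exchange: int
    output_tokens_per_exchange: int
    exchanges_per_case: int


def estimate_monthly_cost(
    profile: CaseProfile,
    cases_per_day: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    working_days: int = 22,
) -> float:
    """Estimate monthly API spend in USD under the stated assumptions."""
    input_tokens = profile.input_tokens_per_exchange * profile.exchanges_per_case
    output_tokens = profile.output_tokens_per_exchange * profile.exchanges_per_case
    cost_per_case = (
        input_tokens / 1_000_000 * usd_per_million_input
        + output_tokens / 1_000_000 * usd_per_million_output
    )
    return cost_per_case * cases_per_day * working_days


if __name__ == "__main__":
    # Placeholder numbers only; substitute your provider's real pricing.
    profile = CaseProfile(
        input_tokens_per_exchange=50_000,
        output_tokens_per_exchange=1_000,
        exchanges_per_case=4,
    )
    monthly = estimate_monthly_cost(
        profile,
        cases_per_day=30,
        usd_per_million_input=3.0,
        usd_per_million_output=15.0,
    )
    print(f"Estimated monthly spend: ${monthly:,.2f}")
```

Plugging in your own measured token counts and current pricing gives a quick sanity check on whether a subscription tier covers its compute.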

Liability: Unlike FDA-regulated medical devices, generalist AI has no pre-market clearance for diagnostics. Any builder enabling this use case assumes unquantified legal risk. If a patient suffers harm after a clinician relied on an AI suggestion, who’s responsible? The doctor? The tool vendor? The AI provider? Case law hasn’t settled these questions. Traditional medical software follows strict regulatory pathways. General-purpose AI tools used off-label for medicine exist in a gray zone. Disclaimers help, but they don’t eliminate risk. Malpractice insurers are starting to ask questions about AI usage, and coverage terms remain unclear.

Workflow: The most effective implementations act as “voice of doubt” systems that surface alternative interpretations rather than definitive answers. This means designing interfaces that preserve uncertainty. Show multiple hypotheses, not just the top-ranked one. Include confidence estimates where meaningful. Link to supporting literature so the clinician can verify the reasoning. Make it easy to dismiss or deprioritize suggestions. The goal is augmentation, not replacement: help the expert think more expansively, not think less.
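One way to preserve uncertainty is to bake it into the data model so the interface cannot show only a single answer. The sketch below is a minimal illustration, not a reference design: the `Hypothesis` class, its fields, and `render_differential` are hypothetical names, and the citation strings are placeholders rather than real references.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    """One candidate interpretation surfaced to the clinician (hypothetical schema)."""
    label: str
    confidence: Optional[float] = None  # None when no meaningful estimate exists
    citations: List[str] = field(default_factory=list)
    dismissed: bool = False  # set when the clinician deprioritizes the suggestion


def render_differential(hypotheses: List[Hypothesis]) -> List[str]:
    """Return one display line per active hypothesis, never just the top-ranked one."""
    active = [h for h in hypotheses if not h.dismissed]
    # Estimated confidences first (highest to lowest), unestimated ones last.
    active.sort(key=lambda h: (h.confidence is None, -(h.confidence or 0.0)))
    lines = []
    for h in active:
        if h.confidence is None:
            conf = "confidence not estimated"
        else:
            conf = f"~{h.confidence:.0%}"
        if h.citations:
            refs = f" [refs: {', '.join(h.citations)}]"
        else:
            refs = " [no supporting literature linked]"
        lines.append(f"Consider: {h.label} ({conf}){refs}")
    return lines


if __name__ == "__main__":
    differential = [
        Hypothesis("high-grade glioma", 0.5, ["placeholder-ref-1"]),
        Hypothesis("metastasis", 0.3),
        Hypothesis("lymphoma"),
    ]
    for line in render_differential(differential):
        print(line)
```

The framing as “Consider:” and the explicit “confidence not estimated” label are design choices meant to keep suggestions reading as hypotheses rather than conclusions.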

Consider how this fits into actual clinical environments. Radiologists work under time pressure, reading dozens of scans per shift. An AI tool that saves them two minutes per case by flagging potential issues is valuable. One that generates long-winded reports they must parse carefully wastes time. The interface needs to respect their workflow: fast skimmability, clear visual hierarchy, integration with the existing PACS. Building for doctors means understanding that good UX in healthcare looks different from good UX in consumer software.

Who should be paying attention to this trend?

Three groups stand out, each for different reasons.

Healthtech founders eyeing AI diagnostic tools now have proof the underlying technology works at a level that surprises even skeptics. But the business model remains fraught. Regulatory pathways for AI medical devices are expensive and slow. Reimbursement codes for AI-assisted reads don’t exist in many contexts. Liability insurance is hard to price. Direct-to-physician sales require trust that’s hard to earn. Still, the demand is real. Founders who can navigate these obstacles have a genuine opportunity, especially in underserved specialties where second opinions are scarce.

Medical education platforms could use Claude Code (or similar models) to simulate expert consults for trainees. Imagine a resident analyzing a tricky case and querying an AI that behaves like a more senior colleague, asking clarifying questions and suggesting differential diagnoses. This sidesteps some regulatory concerns because it’s educational, not clinical. It scales expertise in ways traditional apprenticeship models cannot. Students in rural programs could access the same level of AI-mediated teaching as those at elite academic centers.

AI safety researchers tracking emergent capabilities in generalist models should treat this as a canary in the coal mine. Medical imaging analysis is complex, high-stakes, and was not an explicit training objective. If Claude Code can do this, what other latent capabilities exist in these models that users haven’t discovered yet? And how do we govern tools that develop new functionalities without anyone intending them to? The traditional model of “design, test, regulate, deploy” breaks when capabilities emerge post-deployment through creative user prompting.

FAQ

Could this replace radiologists? No. It excels at generating differentials but lacks the perceptual grounding to be a primary reader. Think “resident who never sleeps” rather than attending physician. Radiologists do more than interpret scans. They correlate imaging with clinical context, communicate nuanced findings to referring physicians, perform interventional procedures, and make judgment calls that require years of pattern recognition. AI might assist with parts of this workflow, but the role won’t vanish. More likely: radiologists become more productive, handling larger caseloads or spending more time on complex cases while AI pre-screens routine studies.

What’s the biggest implementation risk? Over-reliance. These tools work best when their uncertainty is preserved, not hidden behind false confidence. The danger isn’t that the AI makes wildly wrong suggestions (rare), but that busy clinicians start trusting it too much, skipping their own careful review. Automation bias is real: people defer to machine outputs even when their own judgment should override them. Design choices matter enormously here. An interface that presents AI suggestions as tentative hypotheses encourages critical thinking. One that frames them as confident conclusions invites dangerous shortcuts.

How should builders approach this? With extreme transparency. Every AI suggestion should come with confidence estimates where feasible, literature citations, and clear disclaimers about the model’s limitations. Don’t obscure the reasoning process. If the AI’s logic is available, surface it so the user can evaluate whether it makes sense. Avoid dark patterns that make it hard to ignore or override the AI. Build in friction for high-stakes decisions: require explicit confirmation before acting on suggestions. Log everything for audit trails. And stay engaged with the medical and regulatory communities. The rules are evolving in real time. Builders who participate in shaping those rules will fare better than those who ignore them.
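The “build in friction” and “log everything” advice can be sketched in a few lines. This is an illustrative pattern under assumed names (`ConfirmationRequired`, `record_event`, `act_on_suggestion`, and the event labels are all hypothetical); a real deployment would write to durable, access-controlled storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone
from typing import List


class ConfirmationRequired(Exception):
    """Raised when an action is attempted without explicit clinician confirmation."""


def record_event(audit_log: List[str], event: str, **details: str) -> None:
    """Append a timestamped JSON line to the audit log (in-memory for this sketch)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **details,
    }
    audit_log.append(json.dumps(entry, sort_keys=True))


def act_on_suggestion(
    audit_log: List[str],
    suggestion_id: str,
    clinician_id: str,
    confirmed: bool,
) -> bool:
    """Proceed only on explicit confirmation; log blocked and confirmed attempts alike."""
    if not confirmed:
        record_event(
            audit_log,
            "action_blocked",
            suggestion_id=suggestion_id,
            clinician_id=clinician_id,
        )
        raise ConfirmationRequired(
            f"Suggestion {suggestion_id} needs explicit confirmation before acting."
        )
    record_event(
        audit_log,
        "action_confirmed",
        suggestion_id=suggestion_id,
        clinician_id=clinician_id,
    )
    return True


if __name__ == "__main__":
    log: List[str] = []
    act_on_suggestion(log, "suggestion-001", "clinician-042", confirmed=True)
    print(log[0])
```

Logging the blocked attempts as well as the confirmed ones matters for audits: it shows where the friction actually fired.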
