Boston Children’s Hospital, Harvard University and OpenAI used an AI-assisted workflow to reanalyze 376 previously unsolved rare-disease cases, surfacing leads that were later confirmed as diagnoses in 18 patients.

The work was published on June 18 in NEJM AI (a journal from the New England Journal of Medicine group) and summarized by OpenAI.

Researchers used OpenAI’s o3 Deep Research model to review de-identified clinical and genomic information from cases that had already gone through expert analysis without a diagnosis.

Defining the AI’s role in diagnosis

The model did not diagnose patients or make clinical decisions. Instead, it produced evidence-linked candidate explanations for specialists to examine. OpenAI clarified that a finding counted as a diagnosis only after expert review, additional testing, classification of the variant as pathogenic or likely pathogenic, confirmation by a CLIA-certified laboratory and return of the result by the clinical team.
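The confirmation criteria above amount to a gate that every candidate must clear in full. The sketch below is illustrative only: the `CandidateFinding` record and its field names are hypothetical and do not come from the study.

```python
from dataclasses import dataclass

# Hypothetical record of where a model-suggested candidate stands.
# Field names are assumptions made for this illustration.
@dataclass
class CandidateFinding:
    expert_reviewed: bool
    additional_testing_done: bool
    acmg_class: str  # e.g. "pathogenic", "uncertain significance"
    clia_confirmed: bool
    returned_by_clinical_team: bool

# Only these two classifications qualify, per the criteria in the text.
DIAGNOSTIC_CLASSES = {"pathogenic", "likely pathogenic"}

def counts_as_diagnosis(finding: CandidateFinding) -> bool:
    """Return True only if every confirmation step has been completed."""
    return (
        finding.expert_reviewed
        and finding.additional_testing_done
        and finding.acmg_class.lower() in DIAGNOSTIC_CLASSES
        and finding.clia_confirmed
        and finding.returned_by_clinical_team
    )

confirmed = CandidateFinding(True, True, "Likely pathogenic", True, True)
pending = CandidateFinding(True, True, "uncertain significance", False, False)
print(counts_as_diagnosis(confirmed), counts_as_diagnosis(pending))
```

Under this reading, a model output on its own never clears the gate; it only opens a record that specialists and the laboratory then complete.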

“The bottleneck is time. An expert can devote only so much of their day to any one particular person,” Dr. Catherine Brownstein of Boston Children’s Hospital’s Manton Center for Orphan Disease Research said in OpenAI’s summary of the work.

Brownstein’s comment captures the practical point of the study. The workflow was presented not as a consumer diagnostic tool but as an evidence-synthesis layer over existing genomic pipelines, in which sequencing is only one part of the challenge.

Unresolved cases also have to be kept aligned with changing medical literature, variant databases and gene-disease knowledge.

Structuring the case data for review

For each case, the team assembled a de-identified packet containing Human Phenotype Ontology (HPO) terms, clinician notes where available, age and gender metadata and a filtered variant table. The table covered rarity, predicted protein effect, ClinVar classification and signal quality across available family members.
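One way to picture such a packet is as a small structured record. The following is a minimal sketch; the class names, field names and example values are assumptions made for illustration, not the team's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema: one row of the filtered variant table.
@dataclass
class VariantRow:
    gene: str
    population_frequency: Optional[float]   # rarity
    predicted_protein_effect: str           # e.g. "missense"
    clinvar_classification: Optional[str]   # None if absent from ClinVar
    # Signal quality per available family member, e.g. {"proband": "pass"}
    signal_quality: dict[str, str] = field(default_factory=dict)

# Hypothetical schema: the de-identified packet for one case.
@dataclass
class CasePacket:
    case_id: str                      # de-identified study ID
    hpo_terms: list[str]              # Human Phenotype Ontology IDs
    clinician_notes: Optional[str]    # included only where available
    age_years: Optional[float]
    gender: Optional[str]
    variants: list[VariantRow] = field(default_factory=list)

# Placeholder values only; GENE_X is not a real gene symbol.
packet = CasePacket(
    case_id="CASE-0001",
    hpo_terms=["HP:0001250", "HP:0001263"],  # seizure, global developmental delay
    clinician_notes=None,
    age_years=6.0,
    gender="female",
    variants=[
        VariantRow(
            gene="GENE_X",
            population_frequency=0.00001,
            predicted_protein_effect="missense",
            clinvar_classification=None,
            signal_quality={"proband": "pass", "mother": "pass"},
        )
    ],
)
print(len(packet.hpo_terms), len(packet.variants))
```

Keeping the packet this narrow makes it easier to check that only de-identified, structured fields reach the model.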

The model was then asked to propose the most plausible molecular explanation and provide the evidence and reasoning behind it. Researchers reviewed the outputs using the ACMG/AMP variant-interpretation framework, which recommends standard terms such as pathogenic, likely pathogenic, uncertain significance, likely benign and benign for variants in Mendelian disease. At least two team members reviewed each candidate, with disagreements resolved by consensus.

Breaking down the diagnostic yield

The confirmed diagnoses represented an additional diagnostic yield of 4.8% across the 376 unsolved cases. The highest absolute number came from 100 neurodevelopmental cases, where 10 diagnoses were established.

The workflow also surfaced four diagnoses among 61 neuromuscular cases, two among 200 cases of sudden unexpected death in pediatrics and two among 15 early psychosis cases.
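The cohort figures reported here can be checked with simple arithmetic. The short script below uses only the counts given in this article; the variable names are illustrative.

```python
# (confirmed diagnoses, unsolved cases) per cohort, as reported in the article.
cohorts = {
    "neurodevelopmental": (10, 100),
    "neuromuscular": (4, 61),
    "sudden unexpected death in pediatrics": (2, 200),
    "early psychosis": (2, 15),
}

total_diagnoses = sum(dx for dx, _ in cohorts.values())
total_cases = sum(n for _, n in cohorts.values())
overall_yield = 100 * total_diagnoses / total_cases

# Per-cohort yield as a percentage, rounded to one decimal place.
per_cohort = {name: round(100 * dx / n, 1) for name, (dx, n) in cohorts.items()}

print(total_diagnoses, total_cases, round(overall_yield, 1))  # 18 376 4.8
for name, pct in per_cohort.items():
    print(f"{name}: {pct}%")
```

The per-cohort rates differ widely, from 1.0% in the sudden unexpected death cohort to 13.3% in the small early psychosis cohort, so the 4.8% headline figure depends heavily on how the 376 cases were distributed.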

The study also identified seven “rediscoveries,” where pathogenic or likely pathogenic findings had been established externally but were not available in the local research record at the time of review.

Validating the workflow on known cases

Before applying the workflow to unsolved cases, the team tested it on cases with known diagnoses. OpenAI’s summary of the study said the workflow recovered the correct gene and variant in duplicate runs for 48 of 51 solved cases, returned the correct diagnosis in duplicate runs for 45 of 57 neuromuscular cases and named the correct gene in all 15 cases in a long-read genome set.
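Requiring a match in duplicate runs is a stricter test than a single correct answer. The sketch below shows one possible scoring rule, under the assumption that a case counts as recovered only when both runs match the known gene and variant; the article does not spell out the exact rule, and the toy data and names are hypothetical.

```python
from typing import NamedTuple

class Call(NamedTuple):
    gene: str
    variant: str

def recovered_in_duplicate(truth: Call, run_a: Call, run_b: Call) -> bool:
    """Assumed rule: both independent runs must match the known answer."""
    return run_a == truth and run_b == truth

def recovery_rate(cases: list[tuple[Call, Call, Call]]) -> float:
    """Fraction of cases recovered in both runs."""
    hits = sum(recovered_in_duplicate(t, a, b) for t, a, b in cases)
    return hits / len(cases)

# Toy data with placeholder gene and variant labels (not real findings).
toy_cases = [
    (Call("GENE_A", "var1"), Call("GENE_A", "var1"), Call("GENE_A", "var1")),
    (Call("GENE_B", "var2"), Call("GENE_B", "var2"), Call("GENE_C", "var9")),
]
rate = recovery_rate(toy_cases)
print(rate)  # 0.5: the second case fails because one run disagrees

# Rates implied by the counts reported in the article.
print(round(100 * 48 / 51, 1), round(100 * 45 / 57, 1))  # 94.1 78.9
```

Scored this way, a run that finds the right answer only some of the time is counted as a miss, which guards against crediting a lucky single output.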

Alan Beggs, director of the Manton Center for Orphan Disease Research, put the data problem plainly: “Researchers like Catherine and me can’t possibly keep 8,000 different diseases in our heads. That’s the power of AI.”

The broader diagnostic problem is large, but the study’s result is deliberately narrow. NIH’s Genetic and Rare Diseases Information Center says more than 10,000 rare diseases affect millions of people in the U.S., and patients often face challenges getting a diagnosis.

The OpenAI-Boston Children’s study tested whether expert-led AI reanalysis could surface candidates in cases that had already resisted specialist review.

Acknowledging study limitations and future steps

The limits are as important as the yield. The study was retrospective, the cohorts were heterogeneous and reviewers were not blinded to the model’s confidence scores. Researchers did not measure time saved, cost, clinician effort, false-positive workload or changes in care.

The work also did not systematically evaluate several clinically relevant forms of genetic variation, including repeat expansions, deep-intronic changes and mosaicism.

Those limits narrow the operational takeaway for hospitals, specialty clinics and diagnostic labs. The study describes AI in a governed clinical workflow: structured data in, evidence-linked hypotheses out, expert review in the middle and laboratory confirmation before any result reaches a family.

OpenAI said the Manton Center for Orphan Disease Research will lead further work, supported by an OpenAI Foundation grant, to develop a platform-agnostic, low-cost genetics AI copilot for clinical teams.

OpenAI also said prospective, multi-center studies should compare LLM-assisted reanalysis with standard practice on diagnostic yield, time to candidate, clinician effort, false-positive burden, cost and effects on care.
