Sensitivity analysis of an embryo grading AI model to different focal planes

Presented at: Milan, Italy

Authors: Justina Hyunjii Cho, Camelia Brumar, Paxton Maeder-York, Olesii Barash, Jonas Malmsten, Nikica Zaninovic, Denny Sakkas, Kathleen Miller, Michael Levy, Matthew David VerMilyea, Kevin Loewke

Study question:

What is the sensitivity of an embryo-grading AI model to different focal planes and how do we obtain consistent scores across focal planes?

Summary answer:

Test-time augmentation (TTA) and ensemble modeling reduce the sensitivity of the AI model to different focal planes while maintaining performance.

What is known already:

When prioritizing embryos for transfer, embryologists assess the 3D morphological features under a microscope and assign a score that reflects embryo quality. In comparison, AI-based embryo grading models typically take one 2D focal plane of an embryo and output a score based on that focal plane alone. AI models such as convolutional neural networks (CNNs) are known to be sensitive to perturbations in their input. To reduce sensitivity and generalization error, and thus improve predictive performance, techniques such as ensemble learning and test-time augmentation can be used.

Study design, size, duration:

Historical, de-identified images of blastocyst-stage embryos were collected from 11 IVF clinics in the United States for cycles between 2015 and 2020. A total of 5,100 blastocysts were matched to pregnancy outcomes as determined by fetal heartbeat. A further 2,900 blastocysts were matched to aneuploid PGT-A results and added to the negative training group to reduce selection bias. The data were split into 70% for training and 30% for testing. A set of 10 embryos was used for the focal plane sensitivity analysis.

Participants/materials, setting, methods:

A single model (ResNet18), a three-model ensemble (ResNet18), and a six-model ensemble (ResNet18 and EfficientNet-B1), each with and without TTA, were trained to rank embryos according to their likelihood of reaching clinical pregnancy. TTA involved averaging the scores from 4 flipped and rotated copies of the original input image. Manual grades were mapped to numeric scores for comparison. The AUC was used to evaluate the ability of the models to rank embryos.
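The TTA and ensemble scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the abstract does not say which four flips and rotations were used, so the transform set here (identity, horizontal flip, vertical flip, 180-degree rotation) is assumed, and the function names and toy "models" are hypothetical stand-ins for trained networks.

```python
from statistics import mean

# An image is a list of rows of pixel intensities (toy stand-in for a tensor).

def identity(img):
    """Return the image unchanged."""
    return img

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror the image top to bottom."""
    return img[::-1]

def rot180(img):
    """Rotate the image by 180 degrees."""
    return [row[::-1] for row in img[::-1]]

# Assumed set of four augmented copies; the abstract does not list them.
TTA_TRANSFORMS = (identity, hflip, vflip, rot180)

def tta_score(model, img, transforms=TTA_TRANSFORMS):
    """Average one model's score over the augmented copies of one image."""
    return mean(model(t(img)) for t in transforms)

def ensemble_score(models, img, use_tta=True):
    """Average the (optionally TTA-averaged) scores of several models."""
    if use_tta:
        return mean(tta_score(m, img) for m in models)
    return mean(m(img) for m in models)

# Hypothetical toy models whose output depends on image orientation.
def model_a(img):
    return img[0][0]    # reads the top-left pixel

def model_b(img):
    return img[-1][-1]  # reads the bottom-right pixel

image = [[0.2, 0.4], [0.6, 0.8]]
print(tta_score(model_a, image))
print(ensemble_score([model_a, model_b], image))
```

In this toy setup the TTA-averaged score of `model_a` is the same for `image` and for `hflip(image)`, because the assumed transform set maps onto itself under a flip. Real focal-plane changes are not flips or rotations, so TTA can only be expected to dampen that sensitivity, not remove it.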

Main results and the role of chance:

Focal plane sensitivity was calculated as the range, that is, the difference between the maximum and minimum score, for an embryo across its focal planes. Between 12 and 100 focal plane images were available for each of the 10 embryos. On average, the focal plane range was 0.26 for the single model, 0.22 for the single model with TTA, 0.14 for the three-model ensemble with TTA, and 0.11 for the six-model ensemble with TTA. Relative to the single model, TTA alone reduced the range by 17%, whereas ensembling with TTA reduced it by 46% for the three-model ensemble and 60% for the six-model ensemble. The reduction in range did not compromise performance: the AUC on the test set for all embryos was 0.73 for the single model, 0.74 for the single model with TTA, 0.75 for the three-model ensemble with TTA, and 0.74 for the six-model ensemble with TTA. All models outperformed manual grading, which was estimated to have an AUC of 0.67 for all embryos.
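The range metric and the relative reductions reported above can be illustrated with a short sketch. The function names are hypothetical and the scores below are made-up numbers for two imaginary embryos, not data from the study. The sketch also assumes the reduction is computed on the averaged ranges, which the abstract does not state.

```python
from statistics import mean

def focal_plane_range(scores):
    """Range of one embryo's scores across its focal planes (max minus min)."""
    return max(scores) - min(scores)

def mean_focal_plane_range(scores_by_embryo):
    """Average the per-embryo range over all embryos."""
    return mean(focal_plane_range(s) for s in scores_by_embryo.values())

def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# Made-up scores per focal plane, NOT data from the study.
single_model = {
    "embryo_1": [0.30, 0.55, 0.62, 0.41],
    "embryo_2": [0.70, 0.52, 0.80],
}
ensemble_tta = {
    "embryo_1": [0.45, 0.55, 0.58, 0.50],
    "embryo_2": [0.68, 0.62, 0.75],
}

baseline = mean_focal_plane_range(single_model)
improved = mean_focal_plane_range(ensemble_tta)
print(f"mean range: {baseline:.2f} -> {improved:.2f}")
print(f"relative reduction: {relative_reduction(baseline, improved):.0%}")
```

Applying `relative_reduction` to the rounded averages reported above (0.26 versus 0.22, 0.14, and 0.11) gives roughly 15%, 46%, and 58%. The small differences from the reported 17% and 60% are consistent with the averages having been rounded to two decimals before reporting.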

Limitations, reasons for caution:

Our analysis of focal plane sensitivity was limited to a small sample of 10 embryos, so more samples will be needed to confirm our findings.

Wider implications of the findings:

Test-time augmentation and ensemble techniques can be used to reduce sensitivity while maintaining model performance. By reducing sensitivity to different focal planes, an AI model can produce one reliable score for a single embryo, as is currently done in practice with manual grading.
