Using Natural Language Processing to Generate Text Summaries of Genomic Findings in Precision Oncology

DNA test infographic

Creating a knowledge database of accurate cancer data to improve models related to precision oncology

Project summary

Cancer is a collection of diseases caused by mutations in our DNA that result in tumours. Cancers are grouped by the location where they start and by the mutations they have in common. However, every tumour is specific to the individual it arises in. Research has shown value in approaching cancer treatment on a case-by-case basis. While this process has shown success in clinical trials, it is time-consuming and requires a high level of expertise. To make such treatment standard-of-care, it needs to be scaled to meet demand. Analyzing and interpreting the tumour data is one of the largest bottlenecks, owing to both the large number of mutations and the ever-growing body of literature. Thus far, work to address this has focused on reducing the time spent on literature review by creating databases, termed knowledge bases (KBs), to store information about cancer mutations. These KBs are then used to determine the impact of mutations when they are observed in a tumour. Researchers have turned to crowd-sourcing as a way to speed up the creation of such resources. While this allows us to enter content faster, reliable systems still require a secondary review, which leaves many entries pending. Additionally, as the knowledge we have collected for each mutation increases, so does the review burden on the expert responsible for interpreting and prioritizing the findings for each case.

With advances in computing architecture, we are able to create large natural language processing (NLP) models (e.g., GPT-4) that can answer questions, summarize long articles, or generate new text. These models are commonly referred to as large language models (LLMs). Current state-of-the-art LLMs are able to produce fluent text, often indistinguishable from human-written text. However, these LLMs are prone to errors, which makes fact-checking essential, especially in fields like medicine. We plan to use the latest developments in NLP and tap into the knowledge of highly skilled experts at both BC Cancer and the University of Washington. Our goal is to build a dataset containing verified facts about cancer. This dataset will be used to develop a fact-checking NLP model designed specifically for cancer-related information. This model will be used to address the review bottleneck in curating content for KBs and, along with the use of LLMs, the summarization bottleneck faced by analysts reviewing individual cases.

Quotes

"I am delighted to receive the Marathon of Hope Health Informatics & Data Science Award. This recognition underscores the importance of our research in advancing data analysis in precision oncology using natural language processing. I am grateful for the opportunity and look forward to continuing my research with the support of the Marathon of Hope Cancer Centres Network."

  • Caralyn Reisle, HI&DS Awards

"This research seeks to enhance efficiency and precision in cancer treatment by automating the interpretation and reporting of genomic data using artificial intelligence approaches. By leveraging natural language processing (NLP) and collaborating with experts in precision oncology, this project aims to overcome current bottlenecks in data analysis and reporting, potentially leading to faster turnaround times for test results and broader access to personalized treatment options for cancer patients."

  • Dr. Steven Jones, mentor